Document Classifier

`POST /v1/pdf/classifier`

Document Classifier automatically determines the class of an input PDF, JPG, or PNG document by analyzing its content with the built-in AI or with custom classification rules.

The best way to develop, test, and maintain classification rules is to use the Classifier Tester Tool from the PDF.co Document Classifier UI. Use this tool to quickly edit and test rules on single PDFs and on folders.

Attributes

Attributes are case-sensitive and must be passed inside the JSON body of the POST request. For example: { "url": "https://example.com/file1.pdf" }

Attribute	Type	Required	Default	Description
`url`	string	Yes	-	URL to the source file.
`callback`	string	No	-	The callback URL (or Webhook) used to receive the POST data. See Webhooks & Callbacks. This is only applicable when `async` is set to `true`.
`httpusername`	string	No	-	HTTP auth user name if required to access source URL.
`httppassword`	string	No	-	HTTP auth password if required to access source URL.
`inline`	boolean	No	`false`	Set to `true` to return results inside the response. Otherwise, the endpoint returns a URL to the generated output file.
`password`	string	No	-	Password for the PDF file.
`async`	boolean	No	`false`	Set `async` to `true` to run long processes in the background. The API then returns a `jobId` that you can use with the Background Job Check endpoint. See also Webhooks & Callbacks.
`name`	string	No	-	File name for the generated output. The value must be a string.
`expiration`	integer	No	`60`	Set the expiration time for the output link in minutes. After this specified duration, any generated output file(s) will be automatically deleted from PDF.co Temporary Files Storage. The maximum duration for link expiration varies based on your current subscription plan. To store permanent input files (e.g. re-usable images, pdf templates, documents) consider using PDF.co Built-In Files Storage.
`caseSensitive`	boolean	No	`true`	Set to `false` to make the search case-insensitive.
`rulescsv`	string	No	-	Define custom classification rules in CSV format. See the `rulescsv` section below.
`rulescsvurl`	string	No	-	URL to the CSV file with classification rules. For the format, see the `rulescsv` parameter above.
`profiles`	object	No	-	See Profiles for more information.
`RenderTextObjects`	boolean	No	`true`	Whether to render text objects.
`RenderVectorObjects`	boolean	No	`true`	Whether to render vector objects.
`RenderImageObjects`	boolean	No	`true`	Whether to render image objects.
`TIFFCompression`	string	No	`LZW`	TIFF compression algorithm. The options are: `None`, `LZW`, `CCITT3`, `CCITT4`, `RLE`
`OCRMode`	string	No	`Auto`	Specifies how OCR (Optical Character Recognition) should process input content, offering various modes to tailor text extraction based on content type such as images, fonts, and vector graphics. For more information, see OCR Extraction Modes.
`OCRResolution`	integer	No	`300`	Use this parameter to change the OCR resolution from the default 300 dpi. The range is from `72` to `1200` dpi.
`RotationAngle`	integer	No	-	Use manual rotation to handle PDFs with vertically drawn text. Normally, OCR automatically detects page rotation in PDFs and extracts text accurately. However, in some cases the page itself is not rotated; the text is simply drawn vertically. In such scenarios, auto-detection may fail, and you can use this parameter to set the page rotation manually. The available angles are: `0`, `1`, `2`, `3`.
`LineGroupingMode`	string	No	`None`	Controls line grouping in PDF text extraction. Modes: `None` (no grouping), `GroupByRows` (merge rows if all cells align), `GroupByColumns` (merge cells by column), `JoinOrphanedRows` (merge single-cell rows to above if no separator).
`ConsiderFontColors`	boolean	No	`false`	Controls whether font colors are considered when detecting table structure and merging text objects during PDF extraction. Set to `true` to consider font colors.
`DetectNewColumnBySpacesRatio`	string	No	`1.2`	Controls how spaces between words are interpreted for column detection in PDF text extraction. It defines the ratio of space width that determines when text should be treated as being in separate columns.
`AutoAlignColumnsToHeader`	boolean	No	`true`	Controls how columns are detected and aligned during table extraction from PDF documents. It affects both table structure detection and text extraction with formatting preservation. When set to `true` (default), the row with the most columns is used as the header, and all other rows are aligned to this structure, which is ideal for well-structured tables. When set to `false`, columns are analyzed independently across all rows to build the structure, which works better for inconsistent or irregular tables.
`OCRImagePreprocessingFilters.AddGammaCorrection()`	array[string (float format)]	No	`["1.4"]`	Adds a gamma correction filter to the image preprocessing pipeline used during OCR (Optical Character Recognition). This filter adjusts the brightness and contrast of an image by applying a non-linear gamma correction to improve text recognition quality.
`OCRImagePreprocessingFilters.AddGrayscale()`	boolean	No	`false`	Set to `true` to add a preprocessing filter that converts a colored document or image to grayscale before performing OCR.
`DataEncryptionAlgorithm`	string	No	-	Controls the encryption algorithm used for data encryption. See User-Controlled Encryption for more information. The available algorithms are: `AES128`, `AES192`, `AES256`.
`DataEncryptionKey`	string	No	-	Controls the encryption key used for data encryption. See User-Controlled Encryption for more information.
`DataEncryptionIV`	string	No	-	Controls the encryption IV used for data encryption. See User-Controlled Encryption for more information.
`DataDecryptionAlgorithm`	string	No	-	Controls the decryption algorithm used for data decryption. See User-Controlled Encryption for more information. The available algorithms are: `AES128`, `AES192`, `AES256`.
`DataDecryptionKey`	string	No	-	Controls the decryption key used for data decryption. See User-Controlled Encryption for more information.
`DataDecryptionIV`	string	No	-	Controls the decryption IV used for data decryption. See User-Controlled Encryption for more information.
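
The attributes above go into the JSON body of the POST request. The following is a minimal Python sketch of assembling such a request with the standard library only. The base URL `https://api.pdf.co`, the `x-api-key` header, and the helper name `build_classifier_request` are assumptions for illustration and are not stated in this section; check them against your account documentation. The sketch builds the request without sending it.

```python
import json
import urllib.request

# Assumptions (not stated in this section): the API host and the
# authentication header name. Adjust both to match your account.
API_BASE = "https://api.pdf.co"
ENDPOINT = "/v1/pdf/classifier"


def build_classifier_request(api_key, source_url, **options):
    """Build (but do not send) a POST request for the classifier endpoint.

    `options` holds any optional attributes from the table above.
    Attribute names are case-sensitive, so pass them exactly as listed,
    for example `caseSensitive`, not `casesensitive`.
    """
    payload = {"url": source_url}
    payload.update(options)
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_BASE + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_classifier_request(
    "YOUR_API_KEY",  # placeholder, replace with a real key
    "https://example.com/file1.pdf",
    inline=True,
    caseSensitive=False,
)
print(request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Keeping the payload construction separate from the network call makes it easy to inspect the exact JSON before it is sent.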

`rulescsv`

Rules are in CSV format. Each row contains a class name, the logic (`AND` or `OR`; `OR` is the default), and one or more keywords, all separated by commas. Rows are separated by the `\n` symbol. Keywords can be regular expressions written as `/keyword or regexp/i`, where `i` is the case-insensitive flag. Because the rules are sent inside JSON, every `\` symbol must be escaped with another `\`, so `\d` becomes `\\d` and so on.
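
The escaping requirement is easiest to satisfy by letting a JSON library do it. The sketch below is one possible way to assemble a `rulescsv` value in Python; the helper `build_rules_csv` and the class names and keywords are hypothetical, chosen only to show a regular expression that contains backslashes.

```python
import json


def build_rules_csv(rules):
    """Join (class_name, logic, keywords) tuples into a rulescsv string.

    Each rule becomes one comma-separated row; rows are joined with "\n".
    """
    rows = []
    for class_name, logic, keywords in rules:
        rows.append(",".join([class_name, logic, *keywords]))
    return "\n".join(rows)


# Hypothetical rules for illustration. Raw strings keep the regular
# expression backslashes exactly as typed.
rules = [
    ("Invoice", "AND", [r"/Invoice No\.? \d+/i", "Total Due"]),
    ("Receipt", "OR", ["Receipt", "Paid in full"]),
]

rules_csv = build_rules_csv(rules)

# json.dumps escapes each backslash and the row separator, so the
# serialized body contains \\d and \n as the JSON format requires.
payload = json.dumps({"url": "https://example.com/file1.pdf", "rulescsv": rules_csv})
print(payload)
```

Writing the rules as plain Python strings and serializing once avoids double-escaping mistakes that tend to appear when the JSON body is typed by hand.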

Custom Rules Example 1 for rulescsv (for more examples please check the Document Classifier Usage Guide)
Amazon AWS, OR, Amazon Web Services Invoice, Amazon CloudFront\nDigital Ocean, OR,DigitalOcean, DOInvoice\nACME,OR, ACME Inc.,1540 Long Street

Custom Rules Example 2 (with regular expressions, for more examples please check the Document Classifier Usage Guide)
Medical Report,AND,/Instructing Party|Medical Report|Date Of Injury|Med Agency Ref/i\r\nInjured Claimant,OR, Injured Claimant, Injured Patient ID
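
To see how a rules string breaks down into rows, here is a small Python sketch that splits Example 1 into class name, logic, and keywords. It is an illustration of the row layout described above, not the service's own parser: the function name `parse_rules_csv` is hypothetical, it assumes the logic field is always present in the second position, and it would mis-split a regular expression keyword that itself contains a comma.

```python
def parse_rules_csv(rules_csv):
    """Split a rulescsv string into (class_name, logic, keywords) tuples."""
    rows = []
    # splitlines() accepts both "\n" and "\r\n" row separators.
    for line in rules_csv.splitlines():
        if not line.strip():
            continue  # skip empty rows
        fields = [field.strip() for field in line.split(",")]
        class_name, logic, keywords = fields[0], fields[1].upper(), fields[2:]
        rows.append((class_name, logic, keywords))
    return rows


# Example 1 from above, with real newlines between rows.
example_1 = (
    "Amazon AWS, OR, Amazon Web Services Invoice, Amazon CloudFront\n"
    "Digital Ocean, OR,DigitalOcean, DOInvoice\n"
    "ACME,OR, ACME Inc.,1540 Long Street"
)

parsed = parse_rules_csv(example_1)
for class_name, logic, keywords in parsed:
    print(class_name, logic, keywords)
```

A quick local check like this can catch a missing comma or a misplaced logic field before the rules are sent to the endpoint.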