# Document Classifier

> Use this endpoint to automatically detect the class of a document based on keyword rules. For example, you can define rules that identify which vendor issued a document and then choose the template to apply accordingly.

## `POST /v1/pdf/classifier`

Document Classifier automatically determines the class of an input PDF, JPG, or PNG document by analyzing its content with the built-in AI or with custom classification rules.

The best way to **develop**, **test**, and **maintain** classification rules is to use the `Classifier Tester Tool` in the PDF.co [Document Classifier UI](https://app.pdf.co/document-classifier). Use this tool to quickly edit and test rules on single PDFs and on folders.

## Attributes

<Note>Attributes are case-sensitive and must be passed inside the JSON body of the POST request, for example: `{ "url": "https://example.com/file1.pdf" }`</Note>

| Attribute                                               | Type                          | Required | Default   | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| ------------------------------------------------------- | ----------------------------- | -------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `url`                                                   | string                        | *Yes*    | -         | URL to the source file. See the [`url` attribute](/api-reference/url-input-and-request-limits#supported-file-sources) for supported file sources. |
| `callback`                                              | string                        | *No*     | -         | The callback URL (or Webhook) used to receive the POST data. See [Webhooks & Callbacks](/api-reference/webhooks). This is only applicable when `async` is set to `true`. |
| `httpusername`                                          | string                        | *No*     | -         | HTTP auth user name if required to access source URL.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| `httppassword`                                          | string                        | *No*     | -         | HTTP auth password if required to access source URL.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| `inline`                                                | boolean                       | *No*     | `false`   | Set to true to return results inside the response. Otherwise, the endpoint will return a URL to the output file generated.                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| `password`                                              | string                        | *No*     | -         | Password for the PDF file.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| `async`                                                 | boolean                       | *No*     | `false`   | Set `async` to `true` to run long processes in the background. The API then returns a `jobId`, which you can use with the [Background Job Check endpoint](/api-reference/job-check). Also see [Webhooks & Callbacks](/api-reference/webhooks). |
| `name`                                                  | string                        | *No*     | -         | File name for the generated output. The value must be a string. |
| `expiration`                                            | integer                       | *No*     | `60`      | Set the expiration time for the output link in minutes. After this specified duration, any generated output file(s) will be automatically deleted from [PDF.co Temporary Files Storage](/api-reference/file-upload/overview). The maximum duration for link expiration varies based on your current subscription plan. To store permanent input files (e.g. re-usable images, pdf templates, documents) consider using [PDF.co Built-In Files Storage](https://app.pdf.co/tools/files).                                                                                            |
| `caseSensitive`                                         | boolean                       | *No*     | `true`    | Set to `false` to make keyword search case-insensitive. |
| `rulescsv`                                              | string                        | *No*     | -         | Define custom classification rules in CSV format. See [rulescsv](#rulescsv). |
| `rulescsvurl`                                           | string                        | *No*     | -         | URL to a CSV file with classification rules. For the format, see the `rulescsv` parameter above. |
| `profiles`                                              | object                        | *No*     | -         | See [Profiles](/api-reference/profiles) for more information.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|     `RenderTextObjects`                                 | boolean                       | *No*     | `true`    | Render text objects or not                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|     `RenderVectorObjects`                               | boolean                       | *No*     | `true`    | Render vector objects or not                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|     `RenderImageObjects`                                | boolean                       | *No*     | `true`    | Render image objects or not                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|     `TIFFCompression`                                   | string                        | *No*     | `LZW`     | TIFF compression algorithm. The options are: `None`, `LZW`, `CCITT3`, `CCITT4`, `RLE`                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|     `OCRMode`                                           | string                        | *No*     | `Auto`    | Specifies how OCR (Optical Character Recognition) should process input content, offering various modes to tailor text extraction based on content type such as images, fonts, and vector graphics. For more information, see [OCR Extraction Modes](/api-reference/profiles#ocr-extraction-modes).                                                                                                                                                                                                                                                                                 |
|     `OCRResolution`                                     | integer                       | *No*     | `300`     | Use this parameter to change the OCR resolution from the default 300 dpi. The range is from `72` to `1200` dpi.                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|     `RotationAngle`                                     | integer                       | *No*     | -         | Use manual rotation to handle PDFs with vertically drawn text. Normally, OCR automatically detects page rotation in PDFs and extracts text accurately. However, in some cases the page itself is not rotated; rather, the text is drawn vertically. In such scenarios, auto-detection may fail. You can use this parameter to manually set the page rotation. The available angles are: `0`, `1`, `2`, `3`. |
|     `LineGroupingMode`                                  | string                        | *No*     | `None`    | Controls line grouping in PDF text extraction. Modes: `None` (no grouping), `GroupByRows` (merge rows if all cells align), `GroupByColumns` (merge cells by column), `JoinOrphanedRows` (merge single-cell rows to above if no separator).                                                                                                                                                                                                                                                                                                                                         |
|     `ConsiderFontColors`                                | boolean                       | *No*     | `false`   | Controls whether font colors should be considered when detecting table structure and merging text objects during PDF extraction. Set to true to consider font colors.                                                                                                                                                                                                                                                                                                                                                                                                              |
|     `DetectNewColumnBySpacesRatio`                      | string                        | *No*     | `1.2`     | Controls how spaces between words are interpreted for column detection in PDF text extraction. It defines the ratio of space width that determines when text should be treated as being in separate columns.                                                                                                                                                                                                                                                                                                                                                                       |
|     `AutoAlignColumnsToHeader`                          | boolean                       | *No*     | `true`    | Controls how columns are detected and aligned during table extraction from PDF documents. It affects both table structure detection and text extraction with formatting preservation. When set to `true` (default), the row with the most columns is used as the header and all other rows are aligned to this structure, which is ideal for well-structured tables. When set to `false`, columns are analyzed independently across all rows to build the structure, which works better for inconsistent or irregular tables. |
|     `OCRImagePreprocessingFilters.AddGammaCorrection()` | array\[string (float format)] | *No*     | `["1.4"]` | Adds a gamma correction filter to the image preprocessing pipeline used during OCR (Optical Character Recognition). This filter adjusts the brightness and contrast of an image by applying a non-linear gamma correction to improve text recognition quality.                                                                                                                                                                                                                                                                                                                     |
|     `OCRImagePreprocessingFilters.AddGrayscale()`       | boolean                       | *No*     | `false`   | Set to `true` to apply a preprocessing filter that converts a color document or image to grayscale before performing OCR. |
|     `OCRAutoModeMinExistingTextLength`                  | integer                       | *No*     | `8`       | The minimum number of characters a page must have to skip OCR. If a page has fewer, OCR will run. For example, if set to 8, OCR is skipped on pages with more than 8 characters.                                                                                                                                                                                                                                                                                                                                                                                                   |
|     `DataEncryptionAlgorithm`                           | string                        | *No*     | -         | Controls the encryption algorithm used for data encryption. See [User-Controlled Encryption](/knowledgebase/user-controlled-encryption) for more information. The available algorithms are: `AES128`, `AES192`, `AES256`.                                                                                                                                                                                                                                                                                                                                                          |
|     `DataEncryptionKey`                                 | string                        | *No*     | -         | Controls the encryption key used for data encryption. See [User-Controlled Encryption](/knowledgebase/user-controlled-encryption) for more information.                                                                                                                                                                                                                                                                                                                                                                                                                            |
|     `DataEncryptionIV`                                  | string                        | *No*     | -         | Controls the encryption IV used for data encryption. See [User-Controlled Encryption](/knowledgebase/user-controlled-encryption) for more information.                                                                                                                                                                                                                                                                                                                                                                                                                             |
|     `DataDecryptionAlgorithm`                           | string                        | *No*     | -         | Controls the decryption algorithm used for data decryption. See [User-Controlled Encryption](/knowledgebase/user-controlled-encryption) for more information. The available algorithms are: `AES128`, `AES192`, `AES256`.                                                                                                                                                                                                                                                                                                                                                          |
|     `DataDecryptionKey`                                 | string                        | *No*     | -         | Controls the decryption key used for data decryption. See [User-Controlled Encryption](/knowledgebase/user-controlled-encryption) for more information.                                                                                                                                                                                                                                                                                                                                                                                                                            |
|     `DataDecryptionIV`                                  | string                        | *No*     | -         | Controls the decryption IV used for data decryption. See [User-Controlled Encryption](/knowledgebase/user-controlled-encryption) for more information.                                                                                                                                                                                                                                                                                                                                                                                                                             |

### `rulescsv`

Rules are in CSV format. Each row contains a class name, a logic type (`AND` or `OR`; `OR` is the default), and one or more keywords, all separated by commas. Rows are separated by the `\n` symbol. Keywords can be regular expressions, written as `/keyword or regexp/i`, where the trailing `i` is the case-insensitive flag. Because the rules are passed inside JSON, every `\` must be escaped with another `\`, so `\d` becomes `\\d`, and so on.

> **Custom Rules Example 1** for `rulescsv`.
>
> ```
> Amazon AWS, OR, Amazon Web Services Invoice, Amazon CloudFront\nDigital Ocean, OR,DigitalOcean, DOInvoice\nACME,OR, ACME Inc.,1540 Long Street
> ```

> **Custom Rules Example 2**.
>
> ```
> Medical Report,AND,/Instructing Party|Medical Report|Date Of Injury|Med Agency Ref/i\r\nInjured Claimant,OR, Injured Claimant, Injured Patient ID
> ```
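
The escaping rule above is easy to get wrong by hand. As a minimal sketch (not an official client), the Python snippet below builds a `rulescsv` string with the standard `csv` module and lets `json.dumps` do the JSON escaping. The helper name `build_rulescsv`, the sample rules, and the placeholder URL are assumptions made for illustration.

```python
import csv
import io
import json

# Illustrative rules: (class name, logic, keywords). These are assumptions,
# not rules shipped with the service.
rules = [
    ("Invoice", "OR", ["Invoice Number", r"/Invoice\s+No/i"]),
    ("Has Digits", "AND", [r"/\d+/"]),
]

def build_rulescsv(rules):
    """Serialize rules as CSV, one rule per line, lines separated by a newline."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for name, logic, keywords in rules:
        # Fields that contain a comma are quoted automatically by csv.writer.
        writer.writerow([name, logic, *keywords])
    return buf.getvalue().rstrip("\n")

rulescsv = build_rulescsv(rules)

payload = {
    "url": "https://example.com/file1.pdf",  # placeholder URL
    "rulescsv": rulescsv,
}

# json.dumps turns each backslash into two and the newline into the two
# characters backslash + n, which is the form the JSON body requires.
body = json.dumps(payload)
print(body)
```

Building the string in code and serializing it once avoids double-escaping mistakes that tend to appear when the JSON body is typed manually.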

## Document Classifier Usage Guide

The Document Classifier checks the content of an input PDF, JPG, PNG, or TIFF file. It uses AI to automatically determine the class of the document (e.g., `finance`, `invoice`) and returns the result to the user. Custom-defined classification rules can also be used.

Use this Document Classifier to quickly build a workflow for sorting input documents and PDF files.

### How to Create and Test Custom Classification Rules

Classification rules are stored in CSV format, one line per class, using the following layout:

```
className, logicType, keyword1, keyword2, keyword3 ...
```

Where:

* `className` – The name of the class. It will be returned if rules from this class match the document.
* `logicType` – (Optional) Logic to use for keywords. Can be `OR` (default) or `AND`. `OR` means the class is identified if one or more keywords match. `AND` means **all** keywords must match. If not specified, `OR` is assumed.
* `keyword1`, `keyword2`, `keyword3` – Keywords or phrases to check. Can include regular expressions, e.g., `/\d+/` or `/Medical Report|Med Report/i`.
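
To make the `OR` and `AND` semantics above concrete, here is a minimal local sketch in Python. It only approximates the behavior described in this section and is not the service's implementation. The helper names `keyword_matches` and `rule_matches` are hypothetical, and the way `case_sensitive` interacts with regex keywords is an assumption.

```python
import re

def keyword_matches(keyword, text, case_sensitive=True):
    """Check one keyword: either a plain phrase or a /regex/ with an optional i flag."""
    keyword = keyword.strip()
    regex = re.fullmatch(r"/(.+)/(i?)", keyword, flags=re.DOTALL)
    if regex:
        # Assumption: the i flag or case_sensitive=False makes the regex case-insensitive.
        flags = re.IGNORECASE if (regex.group(2) or not case_sensitive) else 0
        return re.search(regex.group(1), text, flags) is not None
    if case_sensitive:
        return keyword in text
    return keyword.lower() in text.lower()

def rule_matches(logic, keywords, text, case_sensitive=True):
    """OR: at least one keyword matches. AND: every keyword matches."""
    results = [keyword_matches(k, text, case_sensitive) for k in keywords if k.strip()]
    if not results:
        return False
    if logic.strip().upper() == "AND":
        return all(results)
    return any(results)

text = "Invoice # 123 from ACME Inc."
print(rule_matches("OR", ["Invoice Number", "Invoice #"], text))   # True
print(rule_matches("AND", ["Invoice Number", "Invoice #"], text))  # False
```

A sketch like this can be handy for quick offline checks, but the `Classifier Tester Tool` remains the reference for how rules behave on real documents.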

### Sample Rules

```
Invoice,OR,Invoice Number,Invoice #,Invoice No,Tax Invoice,,
Purchase Order,OR,PO Number,Order Number,Order No,,,
Bill,OR,Bill Date,Billing Period,Bill Number,,,
Bank Statement,OR,/Account Statement/i,/Statement of Account/i,Business Checking,Accounts Payable,/Statement No/i,
Income Statement,OR,/Income Statement/i,,,,,
Has US Number,OR,"/\b-?(\d+,?)+(\.\d\d)\b/",,,,,
Medical Report,AND,/Medical Report|Med Report/i
```
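
The sample rows above contain trailing empty cells and a quoted regular expression with a comma inside it. The following Python sketch shows one way to read such rows with the standard `csv` module. The function name `parse_rules` is hypothetical, and treating an empty or missing logic cell as `OR` is an assumption based on the default described earlier.

```python
import csv
import io

# Three of the sample rows, kept verbatim in a raw string.
SAMPLE_RULES = r'''Invoice,OR,Invoice Number,Invoice #,Invoice No,Tax Invoice,,
Has US Number,OR,"/\b-?(\d+,?)+(\.\d\d)\b/",,,,,
Medical Report,AND,/Medical Report|Med Report/i
'''

def parse_rules(text):
    """Return a list of (class name, logic, keywords) tuples."""
    rules = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue  # skip blank lines
        name, rest = cells[0], cells[1:]
        logic = "OR"  # assumed default when the logic cell is empty or absent
        if rest and rest[0].upper() in ("AND", "OR", ""):
            logic = rest[0].upper() or "OR"
            rest = rest[1:]
        keywords = [k for k in rest if k]  # drop empty trailing cells
        rules.append((name, logic, keywords))
    return rules

for name, logic, keywords in parse_rules(SAMPLE_RULES):
    print(name, logic, keywords)
```

Using a CSV parser instead of `str.split(",")` keeps the quoted regex in the `Has US Number` row intact as a single keyword.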

## Query parameters

*No query parameters accepted.*

## Responses

| Parameter          | Type    | Description                                                                                                                                                                                              |
| ------------------ | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `body`             | object  | Response body.                                                                                                                                                                                           |
| `pageCount`        | integer | Number of pages in the PDF document.                                                                                                                                                                     |
| `error`            | boolean | Indicates whether an error occurred (`false` means success)                                                                                                                                              |
| `status`           | string  | Status code of the request (200, 404, 500, etc.). For more information, see [Response Codes](/api-reference/response-codes). |
| `credits`          | integer | Number of credits consumed by the request                                                                                                                                                                |
| `remainingCredits` | integer | Number of credits remaining in the account                                                                                                                                                               |
| `duration`         | integer | Time taken for the operation in milliseconds                                                                                                                                                             |

## `Example` Payload

<Note>To see the request size limits, please refer to the [Request Size Limits](/api-reference/url-input-and-request-limits#pdf-co-request-size).</Note>

```json theme={null}
{
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
  "async": false,
  "inline": "true",
  "password": "",
  "profiles": ""
}
```

## `Example` Response

<Note>To see the main response codes, please refer to the [Response Codes](/api-reference/response-codes) page.</Note>

```json theme={null}
{
  "body": {
    "classes": [
      {
        "class": "invoice"
      },
      {
        "class": "finance"
      },
      {
        "class": "documents"
      }
    ]
  },
  "pageCount": 1,
  "error": false,
  "status": 200,
  "credits": 42,
  "duration": 353,
  "remainingCredits": 98019328
}
```
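
As a rough sketch of how a client might consume this response, the Python snippet below pulls the class names out of the JSON text. It assumes the inline response shape shown above (`body.classes[].class`); the function name `extract_classes` is hypothetical.

```python
import json

def extract_classes(response_text):
    """Return the list of class names from a classifier response."""
    data = json.loads(response_text)
    if data.get("error"):
        # The error flag and status fields are described in the Responses table.
        raise RuntimeError(f"Classifier request failed with status {data.get('status')}")
    body = data.get("body") or {}
    return [item["class"] for item in body.get("classes", [])]

sample = """
{
  "body": {"classes": [{"class": "invoice"}, {"class": "finance"}, {"class": "documents"}]},
  "pageCount": 1,
  "error": false,
  "status": 200
}
"""
print(extract_classes(sample))  # ['invoice', 'finance', 'documents']
```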

<Note>
  **Inconsistent URL Encoding in cURL Output:** When you make API requests with cURL, the output JSON may show URL characters as Unicode escape sequences. For example, the ampersand character (`&`) may appear as `\u0026`. This is normal JSON encoding and does not affect the validity of the URL: any JSON parser decodes these escape sequences automatically, so the URL works correctly once the response is parsed.
</Note>
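
A short Python illustration of the note above: the raw text contains `\u0026`, and `json.loads` returns the URL with a plain `&`. The URL is a placeholder used only for this example.

```python
import json

# Raw response text as it might appear in cURL output (placeholder URL).
raw = r'{"url": "https://example.com/file.pdf?a=1\u0026b=2"}'

decoded = json.loads(raw)
print(decoded["url"])  # https://example.com/file.pdf?a=1&b=2
```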

## Code Samples

<Tabs>
  <Tab title="CURL">
    ```bash theme={null}
    curl --location --request POST 'https://api.pdf.co/v1/pdf/classifier' \
    --header 'Content-Type: application/json' \
    --header 'x-api-key: *******************' \
    --data-raw '{
    "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
    "async": false,
    "inline": "true",
    "password": "",
    "profiles": ""
    } '
    ```
  </Tab>

  <Tab title="JavaScript/Node.js">
    ```javascript theme={null}
    var request = require('request');
    var options = {
      'method': 'POST',
      'url': 'https://api.pdf.co/v1/pdf/classifier',
      'headers': {
        'Content-Type': 'application/json',
        'x-api-key': 'YOUR_PDFCO_API_KEY'
      },
      body: JSON.stringify({
        "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
        "async": false,
        "encrypt": "false",
        "inline": "true",
        "password": "",
        "profiles": ""
      })
    };
    request(options, function (error, response) {
      if (error) throw new Error(error);
      console.log(response.body);
    });
    ```
  </Tab>

  <Tab title="C#">
    ```csharp theme={null}
    using System;
    using RestSharp;

    namespace HelloWorldApplication {
        class HelloWorld {
            static void Main(string[] args) {
                // Uses the pre-v107 RestSharp API (IRestResponse, Method.POST).
                var client = new RestClient("https://api.pdf.co/v1/pdf/classifier");
                client.Timeout = -1;
                var request = new RestRequest(Method.POST);
                request.AddHeader("Content-Type", "application/json");
                request.AddHeader("x-api-key", "YOUR_PDFCO_API_KEY");

                // JSON request body.
                var body = @"{" + "\n" +
                    @"    ""url"": ""https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf""," + "\n" +
                    @"    ""async"": false," + "\n" +
                    @"    ""encrypt"": ""false""," + "\n" +
                    @"    ""inline"": ""true""," + "\n" +
                    @"    ""password"": """"," + "\n" +
                    @"    ""profiles"": """"" + "\n" +
                    @"}";
                request.AddParameter("application/json", body, ParameterType.RequestBody);

                IRestResponse response = client.Execute(request);
                Console.WriteLine(response.Content);
            }
        }
    }
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={null}
    import java.io.*;
    import okhttp3.*;

    public class Main {
        public static void main(String[] args) throws IOException {
            OkHttpClient client = new OkHttpClient().newBuilder()
                .build();
            MediaType mediaType = MediaType.parse("application/json");

            // JSON request body.
            String json = "{\n"
                + "    \"url\": \"https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf\",\n"
                + "    \"async\": false,\n"
                + "    \"encrypt\": \"false\",\n"
                + "    \"inline\": \"true\",\n"
                + "    \"password\": \"\",\n"
                + "    \"profiles\": \"\"\n"
                + "}";
            RequestBody body = RequestBody.create(mediaType, json);

            Request request = new Request.Builder()
                .url("https://api.pdf.co/v1/pdf/classifier")
                .method("POST", body)
                .addHeader("Content-Type", "application/json")
                .addHeader("x-api-key", "YOUR_PDFCO_API_KEY")
                .build();

            Response response = client.newCall(request).execute();
            System.out.println(response.body().string());
        }
    }
    ```
  </Tab>

  <Tab title="PHP">
    ```php theme={null}
    <?php

    $curl = curl_init();

    curl_setopt_array($curl, array(
        CURLOPT_URL => 'https://api.pdf.co/v1/pdf/classifier',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING => '',
        CURLOPT_MAXREDIRS => 10,
        CURLOPT_TIMEOUT => 0,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
        CURLOPT_CUSTOMREQUEST => 'POST',
        CURLOPT_POSTFIELDS => '{
            "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf",
            "async": false,
            "encrypt": "false",
            "inline": "true",
            "password": "",
            "profiles": ""
        }',
        CURLOPT_HTTPHEADER => array(
            'Content-Type: application/json',
            'x-api-key: YOUR_PDFCO_API_KEY'
        ),
    ));

    $response = json_decode(curl_exec($curl));

    curl_close($curl);
    echo "<h2>Output:</h2><pre>", var_export($response, true), "</pre>";

    ?>
    ```
  </Tab>
</Tabs>
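
The `request` package used in the JavaScript sample is deprecated, so the following is a hedged sketch of the same call using the global `fetch()` available in Node.js 18 and later. The helper names `buildClassifierRequest` and `classifyDocument` are hypothetical and not part of any PDF.co SDK; only the `url` and `async` attributes from the samples above are sent.

```javascript
// Builds the fetch() arguments for a classifier call.
// `buildClassifierRequest` and `classifyDocument` are hypothetical helper names.
function buildClassifierRequest(apiKey, fileUrl) {
  return {
    endpoint: "https://api.pdf.co/v1/pdf/classifier",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-api-key": apiKey,
      },
      body: JSON.stringify({ url: fileUrl, async: false }),
    },
  };
}

// Sends the request with the global fetch() (assumes Node.js 18 or later).
async function classifyDocument(apiKey, fileUrl) {
  const { endpoint, options } = buildClassifierRequest(apiKey, fileUrl);
  const response = await fetch(endpoint, options);
  return response.json();
}

// Usage (performs a real network call, so it is left commented out):
// classifyDocument(
//   "YOUR_PDFCO_API_KEY",
//   "https://pdfco-test-files.s3.us-west-2.amazonaws.com/document-parser/sample-invoice.pdf"
// ).then((result) => console.log(result));
```

Separating request construction from the network call lets you inspect or unit-test the payload without spending credits.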
