PDF Find Text

`POST /v1/pdf/find`

Attributes

Attributes are case-sensitive and must be passed inside the JSON body of the POST request, for example: { "url": "https://example.com/file1.pdf" }

When using regular expressions in JSON payloads, ensure that backslashes are properly escaped. For example, a single backslash \ should be written as \\.
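As an illustration of the escaping rule above, the sketch below builds a payload in Python and lets the standard `json` module do the escaping. The pattern and the `https://example.com/file1.pdf` URL are placeholders, not values required by the API.

```python
import json

# Placeholder pattern: "Invoice Date" followed by a d/d/d style date.
# A raw string keeps each backslash as a single literal character.
pattern = r"Invoice Date \d+/\d+/\d+"

payload = {
    "url": "https://example.com/file1.pdf",  # placeholder source URL
    "searchString": pattern,
    "regexSearch": True,
}

# json.dumps doubles every backslash, which is the form the JSON body needs.
body = json.dumps(payload)
print(body)
```

Serializing with a JSON library instead of concatenating strings by hand avoids having to double the backslashes manually.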

Attribute	Type	Required	Default	Description
`url`	string	Yes	-	URL of the source PDF file.
`callback`	string	No	-	The callback URL (or webhook) that receives the POST data. See Webhooks & Callbacks. Only applicable when `async` is set to `true`.
`httpusername`	string	No	-	HTTP auth user name if required to access source URL.
`httppassword`	string	No	-	HTTP auth password if required to access source URL.
`pages`	string	No	all pages	Page indices to process, as comma-separated values or ranges (e.g. "0, 1, 2-" or "1, 2, 3-7"). The first page index is 0. Use "!" before a number for inverted page numbers (e.g. "!0" for the last page). If not specified, all pages are processed. The value must be a string.
`inline`	boolean	No	`false`	Set to true to return results inside the response. Otherwise, the endpoint will return a URL to the output file generated.
`password`	string	No	-	Password for the PDF file.
`async`	boolean	No	`false`	Set `async` to `true` to run long processes in the background. The API then returns a `jobId` that you can use with the Background Job Check endpoint. See also Webhooks & Callbacks.
`searchString`	string	Yes	-	Text to search for. Supports regular expressions when `regexSearch` is set to `true`.
`wordMatchingMode`	string	No	None	WordMatchingMode defines how search terms match PDF text. Modes: `None` (exact string match only), `SmartMatch` (default; flexible word boundary match, includes letters/digits/punctuation), `ExactMatch` (strict word boundaries, whole-word match only).
`regexSearch`	boolean	No	`false`	Set to `true` to enable regular expression search for the `searchString` parameter.
`profiles`	object	No	-	See Profiles for more information.
`ColumnDetectionMode`	string	No	ContentGroupsAndBorders	Controls column detection/alignment in PDF table extraction. Modes: `ContentGroupsAndBorders` (default; text + lines), `ContentGroups` (text grouping only), `Borders` (lines only), `BorderedTables` (OCR-based for bordered tables), `ContentGroupsAI` (AI for dense/complex layouts).
`DetectionMinNumberOfRows`	integer	No	1	Minimum number of rows to detect in a table
`DetectionMinNumberOfColumns`	integer	No	1	Minimum number of columns to detect in a table
`DetectionMaxNumberOfInvalidSubsequentRowsAllowed`	integer	No	`0`	Maximum number of invalid subsequent rows allowed in a table
`DetectionMinNumberOfLineBreaksBetweenTables`	integer	No	`0`	Minimum number of line breaks between tables
`EnhanceTableBorders`	boolean	No	`true`	Whether to enhance table borders.
`OCRDetectPageRotation`	boolean	No	`false`	Controls whether page rotation is detected when OCR is applied. Set to `true` to detect page rotation. See Support page rotation for more information.
`DataEncryptionAlgorithm`	string	No	-	Controls the encryption algorithm used for data encryption. See User-Controlled Encryption for more information. The available algorithms are: `AES128`, `AES192`, `AES256`.
`DataEncryptionKey`	string	No	-	Controls the encryption key used for data encryption. See User-Controlled Encryption for more information.
`DataEncryptionIV`	string	No	-	Controls the encryption IV used for data encryption. See User-Controlled Encryption for more information.
`DataDecryptionAlgorithm`	string	No	-	Controls the decryption algorithm used for data decryption. See User-Controlled Encryption for more information. The available algorithms are: `AES128`, `AES192`, `AES256`.
`DataDecryptionKey`	string	No	-	Controls the decryption key used for data decryption. See User-Controlled Encryption for more information.
`DataDecryptionIV`	string	No	-	Controls the decryption IV used for data decryption. See User-Controlled Encryption for more information.
`requestParametersDocument`	string	No	-
`responseParameters`	object	No	-	-
`error`	boolean	No	-	Indicates whether an error occurred (`false` means success)
`status`	string	No	-	Status code of the request (200, 404, 500, etc.). For more information, see Response Codes.
`message`	string	No	-	Message of the request
`credits`	integer	No	-	Number of credits consumed by the request
`remainingCredits`	integer	No	-	Number of credits remaining in the account
`duration`	integer	No	-	Time taken for the operation in milliseconds
`errorCode`	integer	No	-	Error code of the request (400, 401, 402, 403, 404, 500, etc.)
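To make the `pages` syntax concrete, here is a minimal client-side sketch that expands a `pages` string into zero-based indices. `expand_pages` is a hypothetical helper, not part of the API; it assumes ranges are inclusive and that "!n" counts back from the last page, which goes slightly beyond the single "!0" case documented above.

```python
def expand_pages(spec: str, page_count: int) -> list[int]:
    """Expand a `pages` string into sorted zero-based page indices."""
    if not spec.strip():
        return list(range(page_count))  # empty string: all pages

    def resolve(token: str) -> int:
        token = token.strip()
        if token.startswith("!"):
            # Assumption: "!n" counts from the end, so "!0" is the last page.
            return page_count - 1 - int(token[1:])
        return int(token)

    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = resolve(start_text)
            # An open range such as "2-" runs to the last page.
            end = resolve(end_text) if end_text.strip() else page_count - 1
            selected.update(range(start, end + 1))  # assumed inclusive
        else:
            selected.add(resolve(part))
    # Drop indices that fall outside the document.
    return sorted(i for i in selected if 0 <= i < page_count)


print(expand_pages("0, 1, 2-", 5))
print(expand_pages("!0", 4))
```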

Support page rotation

This endpoint supports PDF page rotation as follows:

{
 "profiles": "{ 'OCRDetectPageRotation': true }"
}

Find only bordered tables

You can limit search to bordered tables only by enabling the legacy table search mode with the following profiles config:

{
 "profiles": "{ 'Mode': 'Legacy',
 'ColumnDetectionMode': 'BorderedTables',
 'DetectionMinNumberOfRows': 1,
 'DetectionMinNumberOfColumns': 1,
 'DetectionMaxNumberOfInvalidSubsequentRowsAllowed': 0,
 'DetectionMinNumberOfLineBreaksBetweenTables': 0,
 'EnhanceTableBorders': false
 }"
}
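The `profiles` value in the examples above is a string containing a single-quoted object. One way to produce that string from Python is sketched below; `build_profiles` is a hypothetical helper, and the quote swap assumes no option name or value contains a quote character.

```python
import json


def build_profiles(options: dict) -> str:
    """Serialize options into the single-quoted string form shown above."""
    # json.dumps emits lowercase true/false; the replace call then swaps
    # double quotes for single quotes (assumes no embedded quotes).
    return json.dumps(options).replace('"', "'")


payload = {
    "url": "https://example.com/file1.pdf",  # placeholder source URL
    "searchString": "Invoice",               # placeholder search text
    "profiles": build_profiles({
        "Mode": "Legacy",
        "ColumnDetectionMode": "BorderedTables",
        "EnhanceTableBorders": False,
    }),
}

print(json.dumps(payload, indent=2))
```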

Example Payload

To see the request size limits, please refer to the Request Size Limits.

{
  "async": false,
  "url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
  "searchString": "Invoice Date \\d+/\\d+/\\d+",
  "regexSearch": true,
  "name": "output",
  "pages": "0-",
  "inline": true,
  "wordMatchingMode": "",
  "password": ""
}

Example Response

To see the main response codes, please refer to the Response Codes page.

{
  "body": [
    {
      "text": "Invoice Date 01/01/2016",
      "left": 436.5400085449219,
      "top": 130.4599995137751,
      "width": 122.85311957550027,
      "height": 11.040000486224898,
      "pageIndex": 0,
      "bounds": {
        "location": {
          "isEmpty": false,
          "x": 436.54,
          "y": 130.46
        },
        "size": "122.853119, 11.0400009",
        "x": 436.54,
        "y": 130.46,
        "width": 122.853119,
        "height": 11.0400009,
        "left": 436.54,
        "top": 130.46,
        "right": 559.3931,
        "bottom": 141.5,
        "isEmpty": false
      },
      "elementCount": 1,
      "elements": [
        {
          "index": 0,
          "left": 436.5400085449219,
          "top": 130.4599995137751,
          "width": 122.85311957550027,
          "height": 11.040000486224898,
          "angle": 0,
          "text": "Invoice Date 01/01/2016",
          "isNewLine": true,
          "fontIsBold": true,
          "fontIsItalic": false,
          "fontName": "Helvetica-Bold",
          "fontSize": 11,
          "fontColor": "0, 0, 0",
          "fontColorAsOleColor": 0,
          "fontColorAsHtmlColor": "#000000",
          "bounds": {
            "location": {
              "isEmpty": false,
              "x": 436.54,
              "y": 130.46
            },
            "size": "122.853119, 11.0400009",
            "x": 436.54,
            "y": 130.46,
            "width": 122.853119,
            "height": 11.0400009,
            "left": 436.54,
            "top": 130.46,
            "right": 559.3931,
            "bottom": 141.5,
            "isEmpty": false
          }
        }
      ]
    }
  ],
  "pageCount": 1,
  "error": false,
  "status": 200,
  "name": "output",
  "remainingCredits": 59970
}
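A response shaped like the one above can be reduced to the fields most callers need. The sketch below is one possible approach; `summarize_matches` is a hypothetical helper, and `sample` is a trimmed copy of the example response rather than live API output.

```python
def summarize_matches(response: dict) -> list[dict]:
    """Return text, page index, and bounding box for each match."""
    if response.get("error"):
        # On failure the service reports details in "message".
        raise RuntimeError(response.get("message", "request failed"))

    matches = []
    for item in response.get("body", []):
        left, top = item["left"], item["top"]
        matches.append({
            "text": item["text"],
            "page": item["pageIndex"],
            # (left, top, right, bottom) derived from position and size
            "box": (left, top, left + item["width"], top + item["height"]),
        })
    return matches


# Trimmed copy of the example response above.
sample = {
    "body": [
        {
            "text": "Invoice Date 01/01/2016",
            "left": 436.5400085449219,
            "top": 130.4599995137751,
            "width": 122.85311957550027,
            "height": 11.040000486224898,
            "pageIndex": 0,
        }
    ],
    "pageCount": 1,
    "error": False,
    "status": 200,
}

for match in summarize_matches(sample):
    print(match["page"], match["text"], match["box"])
```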

Code Samples

curl --location --request POST 'https://api.pdf.co/v1/pdf/find' \
--header 'x-api-key: *******************' \
--header 'Content-Type: application/json' \
--data-raw '{
"async": false,
"url": "https://pdfco-test-files.s3.us-west-2.amazonaws.com/pdf-to-text/sample.pdf",
"searchString": "Invoice Date \\d+/\\d+/\\d+",
"regexSearch": true,
"name": "output",
"pages": "0-",
"inline": true,
"wordMatchingMode": "",
"password": ""
}'

// The `request` module is used to send the HTTP request.
// Install it with "npm install request".
var request = require("request");

// The authentication key (API Key).
// Get your own by registering at https://app.pdf.co
const API_KEY = "***********************************";

// Direct URL of source PDF file.
const SourceFileUrl = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf";

// Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
const Pages = "";

// PDF document password. Leave empty for unprotected documents.
const Password = "";

// Search string.
const SearchString = '[4-9][0-9]\\.[0-9][0-9]'; // Regular expression to find numbers in the format dd.dd, from 40.00 to 99.99 (the dot is escaped so it matches only a literal '.')

// Enable regular expressions (Regex)
const RegexSearch = 'True';

// Prepare URL for PDF text search API call.
// See documentation: https://developer.pdf.co
var query = `https://api.pdf.co/v1/pdf/find`;
let reqOptions = {
    uri: query,
    headers: { "x-api-key": API_KEY },
    formData: {
        password: Password,
        pages: Pages,
        url: SourceFileUrl,
        searchString: SearchString,
        regexSearch: RegexSearch
    }
};

// Send request
request.post(reqOptions, function (error, response, body) {
    if (error) {
        return console.error("Error: ", error);
    }

    // Parse JSON response
    let data = JSON.parse(body);
    for (let index = 0; index < data.body.length; index++) {
        const element = data.body[index];
        console.log("Found text " + element["text"] + " at coordinates " + element["left"] + ", " + element["top"]);
    }

});

import os
import requests # pip install requests

# The authentication key (API Key).
# Get your own by registering at https://app.pdf.co
API_KEY = "******************************************"

# Base URL for PDF.co Web API requests
BASE_URL = "https://api.pdf.co/v1"

# Source PDF file
SourceFile = ".\\sample.pdf"

# Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
Pages = ""

# PDF document password. Leave empty for unprotected documents.
Password = ""

# Search string.
SearchString = r"\d{1,}\.\d\d" # Regular expression to find numbers like '100.00' (raw string keeps the backslashes literal)
                              # Note: do not use `+` char in regex, but use `{1,}` instead.
                              # `+` char is valid for URL and will not be escaped, and it will become a space char on the server side.

# Enable regular expressions (Regex)
RegexSearch = True


def main(args = None):
    uploadedFileUrl = uploadFile(SourceFile)
    if uploadedFileUrl is not None:
        searchTextInPDF(uploadedFileUrl)


def searchTextInPDF(uploadedFileUrl):
    """Search Text using PDF.co Web API"""

    # Prepare requests params as JSON
    # See documentation: https://developer.pdf.co
    parameters = {}
    parameters["password"] = Password
    parameters["pages"] = Pages
    parameters["url"] = uploadedFileUrl
    parameters["searchString"] = SearchString
    parameters["regexSearch"] = RegexSearch

    # Prepare URL for 'PDF Text Search' API request
    url = "{}/pdf/find".format(BASE_URL)

    # Execute request and get response as JSON
    response = requests.post(url, data=parameters, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()

        if json["error"] == False:
            # Display found information
            for item in json["body"]:
                print(f"Found text {item['text']} at coordinates {item['left']}, {item['top']}")
        else:
            # Show service reported error
            print(json["message"])
    else:
        print(f"Request error: {response.status_code} {response.reason}")


def uploadFile(fileName):
    """Uploads file to the cloud"""

    # 1. RETRIEVE PRESIGNED URL TO UPLOAD FILE.

    # Prepare URL for 'Get Presigned URL' API request
    url = "{}/file/upload/get-presigned-url?contenttype=application/octet-stream&name={}".format(
        BASE_URL, os.path.basename(fileName))

    # Execute request and get response as JSON
    response = requests.get(url, headers={ "x-api-key": API_KEY })
    if (response.status_code == 200):
        json = response.json()

        if json["error"] == False:
            # URL to use for file upload
            uploadUrl = json["presignedUrl"]
            # URL for future reference
            uploadedFileUrl = json["url"]

            # 2. UPLOAD FILE TO CLOUD.
            with open(fileName, 'rb') as file:
                requests.put(uploadUrl, data=file, headers={ "x-api-key": API_KEY, "content-type": "application/octet-stream" })

            return uploadedFileUrl
        else:
            # Show service reported error
            print(json["message"])
    else:
        print(f"Request error: {response.status_code} {response.reason}")

    return None


if __name__ == '__main__':
    main()

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

namespace PDFcoApiExample
{
    class Program
    {
        // The authentication key (API Key).
        // Get your own by registering at https://app.pdf.co
        const String API_KEY = "*********************************";

        // Source PDF file
        const string SourceFile = @".\sample.pdf";

        // Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
        const string Pages = "";

        // PDF document password. Leave empty for unprotected documents.
        const string Password = "";

        // Search string.
        const string SearchString = @"\d{1,}\.\d\d"; // Regular expression to find numbers like '100.00'
                                                     // Note: do not use `+` char in regex, but use `{1,}` instead.
                                                     // `+` char is valid for URL and will not be escaped, and it will become a space char on the server side.

        // Enable regular expressions (Regex)
        const bool RegexSearch = true;


        static void Main(string[] args)
        {
            // Create standard .NET web client instance
            WebClient webClient = new WebClient();

            // Set API Key
            webClient.Headers.Add("x-api-key", API_KEY);

            // 1. RETRIEVE THE PRESIGNED URL TO UPLOAD THE FILE.
            // * If you already have a direct file URL, skip to the step 3.

            // Prepare URL for `Get Presigned URL` API call
            string query = Uri.EscapeUriString(string.Format(
                "https://api.pdf.co/v1/file/upload/get-presigned-url?contenttype=application/octet-stream&name={0}",
                Path.GetFileName(SourceFile)));

            try
            {
                // Execute request
                string response = webClient.DownloadString(query);

                // Parse JSON response
                JObject json = JObject.Parse(response);

                if (json["error"].ToObject<bool>() == false)
                {
                    // Get URL to use for the file upload
                    string uploadUrl = json["presignedUrl"].ToString();
                    string uploadedFileUrl = json["url"].ToString();

                    // 2. UPLOAD THE FILE TO CLOUD.
                    webClient.Headers.Add("content-type", "application/octet-stream");
                    webClient.UploadFile(uploadUrl, "PUT", SourceFile); // You can use UploadData() instead if your file is byte[] or Stream

                    // 3. MAKE UPLOADED PDF FILE SEARCHABLE

                    // URL for `PDF Text Search` API call
                    // See documentation: https://developer.pdf.co
                    string url = "https://api.pdf.co/v1/pdf/find";

                    // Prepare requests params as JSON
                    Dictionary<string, object> parameters = new Dictionary<string, object>();
                    parameters.Add("password", Password);
                    parameters.Add("pages", Pages);
                    parameters.Add("url", uploadedFileUrl);
                    parameters.Add("searchString", SearchString);
                    parameters.Add("regexSearch", RegexSearch);

                    // Convert dictionary of params to JSON
                    string jsonPayload = JsonConvert.SerializeObject(parameters);

                    // Execute POST request with JSON payload
                    response = webClient.UploadString(url, jsonPayload);

                    // Parse JSON response
                    json = JObject.Parse(response);

                    if (json["error"].ToObject<bool>() == false)
                    {
                        foreach (JToken item in json["body"])
                        {
                            Console.WriteLine($"Found text \"{item["text"]}\" at coordinates {item["left"]}, {item["top"]}");
                        }
                    }
                    else
                    {
                        Console.WriteLine(json["message"].ToString());
                    }
                }
                else
                {
                    Console.WriteLine(json["message"].ToString());
                }
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.ToString());
            }

            webClient.Dispose();

            Console.WriteLine();
            Console.WriteLine("Press any key...");
            Console.ReadKey();
        }
    }
}

package com.company;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import okhttp3.*;

import java.io.*;
import java.net.*;

public class Main
{
    // The authentication key (API Key).
    // Get your own by registering at https://app.pdf.co
    final static String API_KEY = "***********************************";

    // Direct URL of source PDF file.
    final static String SourceFileURL = "https://bytescout-com.s3.amazonaws.com/files/demo-files/cloud-api/pdf-to-text/sample.pdf";

    // Comma-separated list of page indices (or ranges) to process. Leave empty for all pages. Example: '0,2-5,7-'.
    final static String Pages = "";

    // PDF document password. Leave empty for unprotected documents.
    final static String Password = "";

    // Search string.
    final static String SearchString = "\\d{1,}\\.\\d\\d"; // Regular expression to find numbers like '100.00'
    // Note: do not use `+` char in regex, but use `{1,}` instead.
    // `+` char is valid for URL and will not be escaped, and it will become a space char on the server side.

    // Enable regular expressions (Regex)
    final static boolean RegexSearch = true;

    public static void main(String[] args) throws IOException
    {
        // Create HTTP client instance
        OkHttpClient webClient = new OkHttpClient();

        // Prepare URL for PDF text search API call.
        // See documentation: https://developer.pdf.co
        String query = "https://api.pdf.co/v1/pdf/find";

        // Make correctly escaped (encoded) URL
        URL url = null;
        try
        {
            url = new URI(null, query, null).toURL();
        }
        catch (URISyntaxException e)
        {
            e.printStackTrace();
        }

        // Create JSON payload. Backslashes in the regex are doubled so the payload stays valid JSON.
        String jsonPayload = String.format("{\"password\": \"%s\", \"pages\": \"%s\", \"url\": \"%s\", \"searchString\": \"%s\", \"regexSearch\": %s}",
                Password,
                Pages,
                SourceFileURL,
                SearchString.replace("\\", "\\\\"),
                RegexSearch);

        // Prepare request body
        RequestBody body = RequestBody.create(MediaType.parse("application/json"), jsonPayload);

        // Prepare request
        Request request = new Request.Builder()
            .url(url)
            .addHeader("x-api-key", API_KEY) // (!) Set API Key
            .addHeader("Content-Type", "application/json")
            .post(body)
            .build();

        // Execute request
        Response response = webClient.newCall(request).execute();

        if (response.code() == 200)
        {
            // Parse JSON response
            JsonObject json = new JsonParser().parse(response.body().string()).getAsJsonObject();

            boolean error = json.get("error").getAsBoolean();
            if (!error)
            {
                // Display found items in console
                for (JsonElement element : json.get("body").getAsJsonArray())
                {
                    JsonObject item = (JsonObject) element;
                    System.out.println("Found text " + item.get("text") + " at coordinates " + item.get("left") + ", "+ item.get("top"));
                }
            }
            else
            {
                // Display service reported error
                System.out.println(json.get("message").getAsString());
            }
        }
        else
        {
            // Display request error
            System.out.println(response.code() + " " + response.message());
        }
    }
}
