PDFToTextConverter

Before you can run a search on your PDF files using a deepset Cloud pipeline, you must convert these files into Document objects. Use PDFToTextConverter to convert PDF files to plain text Document objects.

PDFToTextConverter extracts text from PDF files and returns Documents. These Documents are then stored in the DocumentStore. Documents are what the pipeline uses for search.

File conversion happens only once when you deploy your pipeline. Your files are not converted every time you search. If you add a file after you deploy a pipeline, only this file is converted.
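
The convert-once behavior described above can be pictured with a minimal sketch. This is not deepset Cloud's implementation: `index_new_files`, the dictionary used as a stand-in for the DocumentStore, and the `convert` callable are all hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, Iterable, List


def index_new_files(
    files: Iterable[str],
    store: Dict[str, str],
    convert: Callable[[str], str],
) -> List[str]:
    """Convert only the files that are not in the store yet.

    Hypothetical helper: `store` maps a file name to its converted text
    and stands in for a DocumentStore.
    Returns the names of the files converted during this call.
    """
    newly_converted = []
    for name in files:
        if name in store:
            # Already converted earlier, so searching never re-converts it.
            continue
        store[name] = convert(name)
        newly_converted.append(name)
    return newly_converted
```

Calling the helper a second time with one extra file converts only that file, which mirrors what happens when you add a file after deployment.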

PDFToTextConverter takes File as input and produces Document as output.

Basic Information

  • Pipeline type: Used in indexing pipelines.
  • Nodes that can precede it in a pipeline: FileTypeClassifier
  • Nodes that can follow it in a pipeline: PreProcessor
  • Node input: File
  • Node output: Document
  • Available node classes: PDFToTextConverter (uses xpdf to extract text from PDF files)

Usage Example

...
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [PDFConverter]
...

Arguments

You can specify the following arguments for PDFToTextConverter:

remove_numeric_tables
  • Type: Boolean
  • Possible values: True, False (default)
  • Description: Deletes numeric rows from tables (uses a heuristic to remove rows with more than 40% digits and not ending with a period). You can find this useful if your pipeline has a Reader that can't parse tables.
  • Mandatory.

valid_languages
  • Type: A list of strings
  • Possible values: A list of languages in the ISO 639-1 format.
  • Description: Tests for encoding errors for the languages you specify.
  • Optional.

id_hash_keys
  • Type: A list of strings
  • Possible values: -
  • Description: Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to ensure that there are no duplicate documents in your document store, you can modify the metadata of a document by passing: ["content", "meta"] to this field.
  • Optional.

sort_by_position
  • Type: Boolean
  • Possible values: True, False (default)
  • Description: Specifies if the extracted text should be sorted by its location coordinates or by the logical reading order.
      True - Sorts the text first by its vertical position and then by its horizontal position.
      False - Sorts the text according to the logical reading order in the PDF.
  • Mandatory.

ocr
  • Type: Literal
  • Possible values: auto, full. Default: None
  • Description: Specifies if optical character recognition (OCR) should be used to extract text from the images in the PDF.
      auto - Uses OCR only to extract text from images and integrate them into the existing text.
      full - Uses OCR to extract text from the entire PDF.
  • Optional.

ocr_language
  • Type: String
  • Possible values: Check supported languages. Default: eng
  • Description: Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated with a plus ("+"). For example, to use English and German, pass eng+deu.

multiprocessing
  • Type: Boolean
  • Possible values: True (default), False
  • Description: We use multiprocessing to speed up PyMuPDF conversion.
      True - Uses the total number of cores. To specify the number of cores to use, set this value to an integer.
      False - Disables multiprocessing.
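
To make the remove_numeric_tables and sort_by_position descriptions more concrete, here is a minimal sketch of both behaviors in plain Python. It is an illustration under stated assumptions, not the node's actual code: the function names are hypothetical, "40% digits" is assumed to mean the share of whitespace-separated words that contain at least one digit, and text blocks are assumed to be (x, y, text) tuples with y growing downward.

```python
from typing import List, Tuple


def is_numeric_row(line: str, threshold: float = 0.4) -> bool:
    """Return True if a line looks like a numeric table row."""
    words = line.split()
    if not words:
        return False
    # Assumption: a word counts as numeric if it contains at least one digit.
    numeric_words = [w for w in words if any(ch.isdigit() for ch in w)]
    mostly_numeric = len(numeric_words) / len(words) > threshold
    # Lines ending with a period are treated as sentences and kept.
    return mostly_numeric and not line.strip().endswith(".")


def remove_numeric_rows(text: str) -> str:
    """Drop every line that is_numeric_row flags."""
    kept = [line for line in text.splitlines() if not is_numeric_row(line)]
    return "\n".join(kept)


def sort_blocks_by_position(blocks: List[Tuple[float, float, str]]) -> List[str]:
    """Order text blocks top to bottom, then left to right."""
    ordered = sorted(blocks, key=lambda block: (block[1], block[0]))
    return [text for _x, _y, text in ordered]
```

In this sketch, a row such as "2019 120 340" is dropped, while "Growth was 5% in 2020." is kept because it ends with a period and only two of its five words contain digits.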