PDFToTextConverter extracts text from PDF files and returns Documents. These Documents are then stored in the DocumentStore. Documents are what the pipeline uses for search.

File conversion happens only once when you deploy your pipeline. Your files are not converted every time you search. If you add a file after you deploy a pipeline, only this file is converted.

PDFToTextConverter takes File as input and produces Document as output.

Basic Information

Pipeline type: Used in indexing pipelines.
Nodes that can precede it in a pipeline: FileTypeClassifier
Nodes that can follow it in a pipeline: PreProcessor
Node input: File
Node output: Document
Available node classes: PDFToTextConverter (uses PyMuPDF to extract text from PDF files)

Usage Example

...
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [PDFConverter]
...

Parameters

You can specify the following parameters for PDFToTextConverter in the pipeline YAML:

Parameter	Type	Possible Values	Description
`remove_numeric_tables`	Boolean	`True` `False` (default)	Deletes numeric rows from tables (uses a heuristic that removes rows with more than 40% digits that don't end with a period). This is useful if your pipeline has a Reader that can't parse tables. Mandatory.
`valid_languages`	A list of strings	A list of languages in the ISO 639-1 format.	Tests for encoding errors for the languages you specify. Optional.
`id_hash_keys`	A list of strings	-	Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to avoid duplicate documents in your document store when document texts are not unique, you can pass `["content", "meta"]` so that the ID is generated from both the content and the metadata. Optional.
`sort_by_position`	Boolean	`True` `False` (default)	Specifies if the extracted text should be sorted by its location coordinates or by the logical reading order. `True` - Sorts the text first by its vertical position and then by its horizontal position. `False` - Sorts the text according to the logical reading order in the PDF. Mandatory.
`ocr`	Literal	`auto` `full` Default: None	Specifies if optical character recognition (OCR) should be used to extract text from the images in the PDF. `auto` - Uses OCR only to extract text from images and integrates it into the existing text. `full` - Uses OCR to extract text from the entire PDF. Optional.
`ocr_language`	String	Check supported languages. Default: `eng`	Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated with a plus ("+"). For example, to use English and German, pass `eng+deu`.
`multiprocessing`	Boolean or integer	`True` (default) `False` or an integer	Uses multiprocessing to speed up PyMuPDF conversion. `True` - Uses the total number of cores. An integer - Uses that number of cores. `False` - Disables multiprocessing.
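
As a rough illustration of how several of these parameters combine, the fragment below configures the converter for scanned PDFs in two languages. It is only a sketch: the component name `PDFConverter` is carried over from the usage example, and the specific values (`eng+deu`, `multiprocessing: 2`) are assumptions chosen for illustration, not recommended settings.

```yaml
components:
  - name: PDFConverter              # name reused from the usage example
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True   # drop mostly numeric table rows
      sort_by_position: False       # keep the logical reading order
      ocr: full                     # run OCR on the entire PDF
      ocr_language: eng+deu         # English and German, joined with "+"
      multiprocessing: 2            # assumed value: limit conversion to two cores
```

Setting `ocr: auto` instead would limit OCR to the images in the PDF and keep the text that can be extracted directly.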