PDFToTextConverter

Before you can search your PDF files with a deepset Cloud pipeline, you must convert them into Document objects. Documents are passages of plain text that pipelines use for search. Use PDFToTextConverter to convert PDF files into plain-text Document objects.

PDFToTextConverter preprocesses PDF files and returns Documents, which are then stored in the Document Store.

File conversion happens only once, when you deploy your pipeline. Your files are not converted again each time you run a search. If you add a file after you deploy the pipeline, only that file is converted.

PDFToTextConverter takes File as input and produces Document as output.

Usage

Use PDFToTextConverter as the first node of your indexing pipeline. First, define it in the components section of your pipeline definition file:

components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True

Then add it to the nodes of your indexing pipeline:

pipelines:
  - name: indexing
    nodes:
      - name: PDFConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [PDFConverter]
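
Put together, the two snippets above might look like the following sketch. The Preprocessor and DocumentStore definitions are assumptions added for illustration: the component types (PreProcessor, DeepsetCloudDocumentStore) and the split parameters are not taken from this page, and other top-level keys your pipeline file may need are left out.

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore   # assumed Document Store type
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True
  - name: Preprocessor
    type: PreProcessor                # assumed type for the Preprocessor node
    params:
      split_by: word                  # illustrative values only
      split_length: 250

pipelines:
  - name: indexing
    nodes:
      - name: PDFConverter            # first node: receives the uploaded file
        inputs: [File]
      - name: Preprocessor            # receives the converted Documents
        inputs: [PDFConverter]
      - name: DocumentStore           # assumed final node that stores the Documents
        inputs: [Preprocessor]
```

Each node's inputs entry names the node whose output it consumes, which is why PDFConverter lists File and Preprocessor lists PDFConverter.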

Arguments

You can specify the following arguments for PDFToTextConverter:

Argument | Type | Possible Values | Description
remove_numeric_tables | Boolean | True/False | Deletes numeric rows from tables. Uses a heuristic that removes rows containing more than 40% digits and not ending with a period. This may be useful for Readers that can't parse tables.
valid_languages | List of strings | Languages in the ISO 639-1 format | Tests for encoding errors for the languages you specify.
id_hash_keys | List of strings | Document attribute names | Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to avoid duplicate documents in your document store when the document text alone is not unique, include the metadata in the ID by passing ["content", "meta"] to this field.
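
As an illustration, a component definition that sets all three arguments might look like this. The language codes are placeholder values chosen for the example; replace them with the languages of your own files.

```yaml
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True          # drop mostly numeric table rows
      valid_languages: ["en", "de"]        # ISO 639-1 codes, example values only
      id_hash_keys: ["content", "meta"]    # build document IDs from text and metadata
```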