PDFToTextConverter
Before you can run a search on your PDF files using a deepset Cloud pipeline, you must convert these files into Document objects. Use PDFToTextConverter to convert PDF files to plain text Document objects.
PDFToTextConverter extracts text from PDF files and returns Documents. These Documents are then stored in the DocumentStore. Documents are what the pipeline uses for search.
File conversion happens only once when you deploy your pipeline. Your files are not converted every time you search. If you add a file after you deploy a pipeline, only this file is converted.
PDFToTextConverter takes File as input and produces Document as output.
Basic Information
- Pipeline type: Used in indexing pipelines.
- Nodes that can precede it in a pipeline: FileTypeClassifier
- Nodes that can follow it in a pipeline: PreProcessor
- Node input: File
- Node output: Document
- Available node classes: PDFToTextConverter (uses xpdf to extract text from PDF files)
Usage Example
...
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [PDFConverter]
...
Parameters
You can specify the following parameters for PDFToTextConverter in the pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
remove_numeric_tables | Boolean | True, False (default) | Deletes numeric rows from tables. It uses a heuristic that removes rows with more than 40% digits that do not end with a period. This is useful if your pipeline has a Reader that can't parse tables. Mandatory. |
valid_languages | A list of strings | A list of languages in the ISO 639-1 format. | Tests for encoding errors for the languages you specify. Optional. |
id_hash_keys | A list of strings | - | Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to ensure that there are no duplicate documents in your document store when document texts are not unique, you can modify the metadata of a document and pass ["content", "meta"] to this field. Optional. |
sort_by_position | Boolean | True, False (default) | Specifies if the extracted text should be sorted by its location coordinates or by the logical reading order. True - Sorts the text first by its vertical position and then by its horizontal position. False - Sorts the text according to the logical reading order in the PDF. Mandatory. |
ocr | Literal | auto, full. Default: None | Specifies if optical character recognition (OCR) should be used to extract text from the images in the PDF. auto - Uses OCR only to extract text from images and integrate it into the existing text. full - Uses OCR to extract text from the entire PDF. Optional. |
ocr_language | String | Check supported languages. Default: eng | Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated with a plus ("+"). For example, to use English and German, pass eng+deu. |
multiprocessing | Boolean | True (default), False | Uses multiprocessing to speed up PyMuPDF conversion. True - Uses the total number of cores. To specify the number of cores to use, set this value to an integer. False - Disables multiprocessing. |
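As an illustration of how several of these parameters might be combined, here is a minimal sketch of a component definition. It assumes the component name PDFConverter from the usage example. The values are illustrative choices taken from the table above, not recommended settings, and the core count of 4 is hypothetical.

```yaml
components:
  - name: PDFConverter            # assumed name, as in the usage example
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True # drop mostly numeric table rows
      sort_by_position: False     # keep the logical reading order of the PDF
      ocr: auto                   # use OCR only for text inside images
      ocr_language: eng+deu       # English and German, joined with "+"
      multiprocessing: 4          # hypothetical core count; True uses all cores
```

Setting ocr to full instead would run OCR over the entire PDF, which may suit scanned documents that have no embedded text layer.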