PDFToTextConverter Parameters

Check the init and runtime parameters you can specify for the PDFToTextConverter node.

YAML Init Parameters

These are the parameters you can specify in pipeline YAML:

Parameter: remove_numeric_tables
Type: Boolean
Possible values: True, False (default)
Description: Deletes numeric rows from tables. It uses a heuristic that removes rows with more than 40% digits that don't end with a period. This is useful if your pipeline has a Reader that can't parse tables.
Mandatory.

Parameter: valid_languages
Type: A list of strings
Possible values: A list of languages in the ISO 639-1 format.
Description: Tests for encoding errors for the languages you specify.
Optional.

Parameter: id_hash_keys
Type: A list of strings
Possible values: -
Description: Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to make sure there are no duplicate documents in your document store when document texts are not unique, you can include the metadata in the ID by passing ["content", "meta"] to this field.
Optional.

Parameter: sort_by_position
Type: Boolean
Possible values: True, False (default)
Description: Specifies whether the extracted text is sorted by its location coordinates or by the logical reading order.
True - Sorts the text first by its vertical position and then by its horizontal position.
False - Sorts the text according to the logical reading order in the PDF.
Mandatory.

Parameter: ocr
Type: Literal
Possible values: auto, full. Default: None
Description: Specifies whether optical character recognition (OCR) is used to extract text from the images in the PDF.
auto - Uses OCR only to extract text from images and integrates it into the existing text.
full - Uses OCR to extract text from the entire PDF.
Optional.

Parameter: ocr_language
Type: String
Possible values: Check supported languages. Default: eng
Description: Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated by a plus sign ("+"). For example, to use English and German, pass eng+deu.

Parameter: multiprocessing
Type: Boolean
Possible values: True (default), False
Description: Uses multiprocessing to speed up PyMuPDF conversion.
True - Uses the total number of cores. To specify the number of cores to use, set this value to an integer instead.
False - Disables multiprocessing.
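
To make the remove_numeric_tables heuristic more concrete, here is a minimal Python sketch of it. It is an illustration, not the node's actual code. It assumes that the share of digits is measured per word (a word counts as numeric if it contains at least one digit), and the function names is_numeric_row and drop_numeric_rows are hypothetical.

```python
def is_numeric_row(line: str, threshold: float = 0.4) -> bool:
    """Return True if the line looks like a numeric table row.

    Assumption: the share of digits is measured per word, counting every
    whitespace-separated word that contains at least one digit.
    """
    words = line.split()
    if not words:
        return False
    numeric_words = [w for w in words if any(ch.isdigit() for ch in w)]
    mostly_numeric = len(numeric_words) / len(words) > threshold
    ends_with_period = line.strip().endswith(".")
    # A row is dropped only if it is mostly numeric AND is not a sentence.
    return mostly_numeric and not ends_with_period


def drop_numeric_rows(text: str) -> str:
    """Remove every line that is_numeric_row flags."""
    return "\n".join(line for line in text.split("\n") if not is_numeric_row(line))


sample = "Revenue by quarter\n2021 100 200 300\nWe sold 3 units in 2021."
print(drop_numeric_rows(sample))
```

In this sample, only the middle line is removed: all of its words are numeric and it doesn't end with a period. The last line is kept because it ends with a period and only two of its six words contain digits.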
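
The ordering that sort_by_position set to True describes (vertical position first, then horizontal position) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the TextBlock type and the sample data are hypothetical, and the vertical coordinate is assumed to grow from the top of the page downward.

```python
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    x: float  # horizontal position of the block's left edge
    y: float  # vertical position of the block's top edge (assumed to grow downward)
    text: str


def sort_blocks_by_position(blocks: List[TextBlock]) -> List[TextBlock]:
    """Order blocks top to bottom, then left to right."""
    return sorted(blocks, key=lambda block: (block.y, block.x))


# Hypothetical blocks, listed in the order they might be stored in the file.
blocks = [
    TextBlock(x=300.0, y=50.0, text="right column, first line"),
    TextBlock(x=40.0, y=120.0, text="left column, second line"),
    TextBlock(x=40.0, y=50.0, text="left column, first line"),
]

for block in sort_blocks_by_position(blocks):
    print(block.text)
```

Note that on a two-column page this ordering interleaves the columns line by line, which is why the logical reading order (sort_by_position set to False) can be the better choice for multi-column layouts.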
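
The way the multiprocessing value maps to a number of cores can be sketched in Python like this. It is a hedged illustration of the behavior described for this parameter, not the node's implementation. The function name resolve_worker_count is hypothetical, and treating False as a single worker is an assumption.

```python
import os
from typing import Union


def resolve_worker_count(multiprocessing: Union[bool, int]) -> int:
    """Translate the multiprocessing setting into a number of worker processes."""
    # Check for booleans first: in Python, bool is a subclass of int,
    # so True would otherwise be treated as the integer 1.
    if multiprocessing is True:
        return os.cpu_count() or 1  # all available cores
    if multiprocessing is False:
        return 1  # assumption: no multiprocessing means a single worker
    if multiprocessing < 1:
        raise ValueError("The number of cores must be a positive integer.")
    return multiprocessing


print(resolve_worker_count(True))   # depends on the machine
print(resolve_worker_count(False))
print(resolve_worker_count(4))
```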

REST API Runtime Parameters

There are no runtime parameters you can pass to this node when making a request to the Search REST API endpoint.