Check the init and runtime parameters you can specify for the PDFToTextConverter node.
YAML Init Parameters
These are the parameters you can specify in pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
remove_numeric_tables | Boolean | True, False (default) | Deletes numeric rows from tables. It uses a heuristic that removes rows containing more than 40% digits and not ending with a period. This can be useful if your pipeline has a Reader that can't parse tables. Mandatory. |
valid_languages | A list of strings | A list of languages in the ISO 639-1 format. | Tests for encoding errors for the languages you specify. Optional. |
id_hash_keys | A list of strings | - | Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to ensure that there are no duplicate documents in your document store, you can modify the metadata of a document by passing ["content", "meta"] to this field. Optional. |
sort_by_position | Boolean | True, False (default) | Specifies if the extracted text should be sorted by its location coordinates or by the logical reading order. True - Sorts the text first by its vertical position and then by its horizontal position. False - Sorts the text according to the logical reading order in the PDF. Mandatory. |
ocr | Literal | auto, full. Default: None | Specifies if optical character recognition (OCR) should be used to extract text from the images in the PDF. auto - Uses OCR only to extract text from images and integrates it into the existing text. full - Uses OCR to extract text from the entire PDF. Optional. |
ocr_language | String | Check supported languages. Default: eng | Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated with a plus ("+"). For example, to use English and German, pass eng+deu. |
multiprocessing | Boolean | True (default), False | Specifies if multiprocessing is used to speed up PyMuPDF conversion. True - Uses the total number of cores. To specify the number of cores to use, set this value to an integer. False - Disables multiprocessing. |
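
As an illustration of how the parameters in the table fit together, here is a minimal sketch of a component entry in pipeline YAML. It assumes the usual `components` layout with `name`, `type`, and `params` keys. The component name `PDFConverter` and the chosen values are only examples, not recommended settings.

```yaml
components:
  - name: PDFConverter            # hypothetical name, pick any name you like
    type: PDFToTextConverter
    params:
      remove_numeric_tables: false       # keep numeric table rows (default)
      valid_languages: ["en", "de"]      # ISO 639-1 codes to test for encoding errors
      id_hash_keys: ["content", "meta"]  # build the document ID from content and metadata
      sort_by_position: false            # keep the logical reading order of the PDF (default)
      ocr: auto                          # use OCR only for text inside images
      ocr_language: "eng+deu"            # English and German, joined with a plus
      multiprocessing: true              # use all available cores (default)
```

You can leave out the parameters marked as optional in the table. In that case, the node falls back to the defaults listed there.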
REST API Runtime Parameters
There are no runtime parameters you can pass to this node when making a request to the Search REST API endpoint.