PDFToTextConverter
Before you can run a search on your PDF files using a deepset Cloud pipeline, you must convert these files into Document objects. Documents are passages of plain text that pipelines use for search. Use PDFToTextConverter to convert PDF files to plain text Document objects.
PDFToTextConverter preprocesses PDF files and returns documents, which are then stored in the Document Store.
Files are converted only once, when you deploy your pipeline. They are not converted again each time you run a search. If you add a file after deploying the pipeline, only that file is converted.
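The convert-once behavior described above can be pictured with a small conceptual sketch. It is not deepset Cloud code: the class `ConvertOnceIndex`, its methods, and the byte-decoding placeholder used instead of real PDF conversion are all hypothetical and exist only to illustrate the idea.

```python
from typing import Dict, List


class ConvertOnceIndex:
    """Toy stand-in for an index that converts each file a single time."""

    def __init__(self) -> None:
        # Maps file name -> converted text (stand-in for a Document Store).
        self.documents: Dict[str, str] = {}

    def index(self, files: Dict[str, bytes]) -> List[str]:
        """Convert only files not seen before; return the names converted."""
        converted = []
        for name, raw in files.items():
            if name in self.documents:
                continue  # already converted earlier, so skip it
            # Placeholder for real PDF-to-text conversion (assumption).
            self.documents[name] = raw.decode("utf-8", errors="ignore")
            converted.append(name)
        return converted

    def search(self, query: str) -> List[str]:
        """Search reads stored documents; it never converts files."""
        needle = query.lower()
        return [name for name, text in self.documents.items() if needle in text.lower()]
```

In this sketch, calling `index` a second time with one extra file converts only that file, and `search` works purely on what is already stored.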
PDFToTextConverter takes File as input and produces Document as output.
Usage
You can use it in your indexing pipeline as the first node. First, define PDFToTextConverter in the components section of your pipeline definition file:
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True
Then add it to your pipeline:
pipelines:
  - name: indexing
    nodes:
      - name: PDFConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [PDFConverter]
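As a rough illustration of how the nodes chain together, the sketch below mirrors the indexing pipeline as a Python dictionary and checks that every input is either `File` or a node declared earlier. The helper `check_wiring` is hypothetical and is not part of deepset Cloud; it only shows the ordering rule that the example follows.

```python
from typing import Dict, List

# Mirror of the indexing pipeline definition, written as a Python dict.
pipeline = {
    "name": "indexing",
    "nodes": [
        {"name": "PDFConverter", "inputs": ["File"]},
        {"name": "Preprocessor", "inputs": ["PDFConverter"]},
    ],
}


def check_wiring(pipeline: Dict, root: str = "File") -> List[str]:
    """Return problems for inputs that are neither the root nor an earlier node."""
    known = {root}
    problems = []
    for node in pipeline["nodes"]:
        for source in node["inputs"]:
            if source not in known:
                problems.append(f"{node['name']}: unknown input {source!r}")
        # A node becomes a valid input only for the nodes that follow it.
        known.add(node["name"])
    return problems
```

An empty result means the converter receives `File` first and each later node reads from a node that precedes it.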
Arguments
You can specify the following arguments for PDFToTextConverter:
Argument | Type | Possible Values | Description |
---|---|---|---|
remove_numeric_tables | Boolean | True/False | Deletes numeric rows from tables. Uses a heuristic that removes rows with more than 40% digits that do not end with a period. This can be useful for Readers that cannot parse tables. |
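valid_languages | A list of strings | A list of languages in the ISO 639-1 format. | Tests for encoding errors by checking whether the extracted text is in one of the languages you specify. Text in none of these languages may indicate an encoding error. |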
id_hash_keys | A list of strings | Names of document attributes, such as "content" and "meta". | Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to ensure that there are no duplicate documents in your Document Store, pass ["content", "meta"] to this field so that the ID is generated from both the content and the metadata of the document. |
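To make two of these arguments more concrete, here is a minimal sketch of the ideas behind remove_numeric_tables and id_hash_keys. It is not the converter's actual implementation. The function names `drop_numeric_rows` and `make_document_id` are hypothetical, the sketch assumes that "digits" means whitespace-separated tokens containing at least one digit, and SHA-256 is only a stand-in for whatever hash function the real component uses.

```python
import hashlib
from typing import Dict, List


def drop_numeric_rows(text: str, threshold: float = 0.4) -> str:
    """Drop lines that look like numeric table rows (illustrative heuristic)."""
    kept = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: a token counts as numeric if it contains any digit.
        numeric = [t for t in tokens if any(ch.isdigit() for ch in t)]
        is_numeric_row = (
            bool(tokens)
            and len(numeric) / len(tokens) > threshold
            and not line.strip().endswith(".")
        )
        if not is_numeric_row:
            kept.append(line)
    return "\n".join(kept)


def make_document_id(doc: Dict[str, object], id_hash_keys: List[str]) -> str:
    """Hash only the listed attributes; SHA-256 is a stand-in hash function."""
    payload = ":".join(str(doc[key]) for key in id_hash_keys)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this sketch, two documents that share the same content but come from different files get the same ID when only "content" is hashed, and different IDs when "meta" is included as well.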