Converters
Converters are used in indexing pipelines to extract text from files in various formats and transform it into the document format. There are several converters available.
Pipelines search through documents stored in the document store, and documents are passages of plain text. Use converters to transform your files into these searchable documents.
If you add a converter to your indexing pipeline, the conversion happens only once, when you deploy the pipeline. Your files are not converted every time you run a search.
Basic Information
- Pipeline type: Converters are used in indexing pipelines.
- Position in a pipeline: Either at the very beginning or after a FileTypeClassifier (see the sketch after this list).
- Nodes that can precede converters in a pipeline: FileTypeClassifier
- Nodes that can follow converters in a pipeline: PreProcessor
- Node input: File paths
- Node output: Documents
- Supported types:
  - CNAzureConverter
  - DocxToTextConverter
  - MarkdownConverter
  - PDFToTextConverter
  - PptxConverter
  - TextConverter
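To show how these pieces fit together, here's a minimal sketch of an indexing pipeline with a converter. The node names are illustrative; the routing comment reflects FileTypeClassifier's default behavior described in the usage examples below:

```yaml
components:
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: TextFileConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]                         # file paths enter the pipeline here
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]  # the converter turns files into documents
      - name: Preprocessor
        inputs: [TextFileConverter]            # PreProcessor cleans and splits the documents
```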
Converters Overview
CNAzureConverter
CNAzureConverter extracts text and tables from files and converts them into documents you can store in the document store and use in your pipelines. It uses the Document Intelligence service by Microsoft Azure. It can extract content from the following file types:
- PDF
- JPEG
- PNG
- BMP
- TIFF
You must have an active Azure account and a Document Intelligence (formerly Form Recognizer) or Cognitive Services resource. For information on how to set it up, see the Microsoft Azure documentation.
For PDF files, the extracted text is not available in the PDF view in deepset Cloud. When you search with your pipeline and choose View File under an answer, the PDF that opens doesn't show the extracted text. This is because the node runs in the indexing pipeline, which stores the file contents in the document store that the query pipeline then searches.
DocxToTextConverter
Extracts text from DOCX files.
MarkdownConverter
MarkdownConverter converts Markdown files into plain text documents, removing all structured information, like bullet point lists or code block formatting.
Preprocessing Markdown Files for RAG
If you use an LLM in the query pipeline, TextConverter is more effective for preprocessing Markdown files than MarkdownConverter. LLMs are particularly adept at understanding the Markdown structure that TextConverter retains, which is why we recommend using TextConverter to process these files.
PDFToTextConverter
The PDFToTextConverter is a fast and lightweight PDF converter that converts PDF files to plain text. It works well with most digitally created or searchable PDFs containing a text layer. It can also work with image-based PDFs (for example, scanned documents).
This converter doesn't extract tables as separate documents of type `table` but treats them as plain text. You can discard numeric tables by setting the `remove_numeric_tables` parameter to `True`.
PptxConverter
This converter extracts text from PPTX files. Because PPTX files don't contain page information, PptxConverter returns a list with the text of each slide in the file.
TextConverter
TextConverter converts plain text files to document objects that pipelines can use for search.
In RAG pipelines, we recommend using TextConverter to preprocess Markdown. LLMs are good at understanding structural information in Markdown files, which can help generate the right answer. Unlike MarkdownConverter, TextConverter preserves this information.
Usage Examples
CNAzureConverter

```yaml
...
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Document Intelligence or Cognitive Services endpoint>
      credential_key: <Document Intelligence or Cognitive Services key>
      model_id: prebuilt-read
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: AzureConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [AzureConverter]
...
```
DocxToTextConverter

```yaml
...
components:
  - name: DOCXConverter
    type: DocxConverter
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: DOCXConverter
        inputs: [FileTypeClassifier.output_4] # output_4 is where DOCX files are routed
      - name: Preprocessor
        inputs: [DOCXConverter]
...
```
MarkdownConverter

```yaml
...
components:
  - name: MarkdownConverter
    type: MarkdownConverter
    params:
      remove_code_snippets: False
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: MarkdownConverter
        inputs: [FileTypeClassifier.output_3] # output_3 is where Markdown files are routed
      - name: Preprocessor
        inputs: [MarkdownConverter]
...
```
PDFToTextConverter

```yaml
...
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [PDFConverter]
...
```
PptxConverter

```yaml
...
components:
  - name: PPTXConverter
    type: PptxConverter
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: PPTXConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [PPTXConverter]
...
```
TextConverter

```yaml
...
components:
  - name: TextFileConverter
    type: TextConverter
...
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1] # output_1 is where text files are routed
      - name: Preprocessor
        inputs: [TextFileConverter]
```

To use TextConverter for preprocessing Markdown files in pipelines containing FileTypeClassifier, add output_3 as an input for TextConverter, like this:

```yaml
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1, FileTypeClassifier.output_3] # output_3 is where Markdown files are routed
      - name: Preprocessor
        inputs: [TextFileConverter]
```
Parameters
CNAzureConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`endpoint` | String | | Your Document Intelligence or Cognitive Services resource's endpoint. Mandatory. |
`credential_key` | String | | Your Document Intelligence or Cognitive Services resource's subscription key. Mandatory. |
`model_id` | String | Default: `prebuilt-read` | The identifier of the model you want to use to extract information from your file. For a list of available models, see Azure Documentation. Mandatory. |
`save_json` | Boolean | `True`, `False`. Default: `False` | Saves the output as a JSON file. Mandatory. |
`preceding_context_len` | Integer | Default: `3` | Specifies the number of lines that precede a table to extract as preceding context. It's returned as metadata. Mandatory. |
`following_context_len` | Integer | Default: `3` | Specifies the number of lines after a table to extract as subsequent context. It's returned as metadata. Mandatory. |
`merge_multiple_column_headers` | Boolean | `True`, `False`. Default: `True` | If a table contains more than one row as a column header, this parameter lets you merge these rows into a single row. Mandatory. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To make sure there are no duplicate documents in your document store when document texts are the same, you can modify the metadata of a document and then pass `["content", "meta"]` to this field to generate IDs based on the document content and the defined metadata. Optional. |
`page_layout` | Literal | `natural`, `single_column`. Default: `natural` | The type of reading order to follow. `natural` uses the natural reading order determined by Azure; `single_column` groups all lines on the page with the same height together based on the threshold specified in `threshold_y`. Mandatory. |
`threshold_y` | Float | Default: `0.05` | The threshold to determine if two elements in a PDF should be grouped into a single line, specified in inches. This is especially relevant for section headers or numbers which may be spatially separated from the remaining text on the horizontal axis. Only relevant if `page_layout=single_column`. Optional. |
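For example, a CNAzureConverter that groups same-height lines into single lines might be configured like this. The endpoint and key are placeholders, and `threshold_y: 0.1` is an illustrative value, not a recommendation:

```yaml
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Document Intelligence or Cognitive Services endpoint>
      credential_key: <Document Intelligence or Cognitive Services key>
      model_id: prebuilt-read
      page_layout: single_column  # group lines with the same height into one line
      threshold_y: 0.1            # grouping threshold in inches (illustrative)
```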
DocxConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`remove_numeric_tables` | Boolean | `True`, `False`. Default: `False` | Uses heuristics to remove numeric rows from tables in the files. Retains table rows containing strings that may be candidates for searching for answers. Mandatory. |
`valid_languages` | List of strings | Language ISO 639-1 codes. Default: `None` | Validates languages specified in the ISO 639-1 format. You can use this option to add tests for encoding errors. If the extracted text is not in one of the valid languages, there's a chance an encoding error resulted in garbled text. Optional. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass `"meta"` to this field (for example: `["content", "meta"]`). In such a case, the ID is generated using the content and the defined metadata. Optional. |
`progress_bar` | Boolean | `True`, `False`. Default: `True` | Shows a progress bar for the conversion. Mandatory. |
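To illustrate, a DocxConverter that checks the extracted text against expected languages might be configured like this (the language list and node name are examples):

```yaml
components:
  - name: DOCXConverter
    type: DocxConverter
    params:
      valid_languages: [en, de]  # flag possible encoding errors if text isn't English or German
      progress_bar: False        # hide the conversion progress bar
```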
MarkdownConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass `"meta"` to this field (for example: `["content", "meta"]`). In such a case, the ID is generated using the content and the defined metadata. Optional. |
`progress_bar` | Boolean | `True`, `False`. Default: `True` | Shows a progress bar during the conversion process. Optional. |
`remove_code_snippets` | Boolean | `True`, `False`. Default: `True` | Removes code snippets from the content. Optional. |
`extract_headlines` | Boolean | `True`, `False`. Default: `False` | Extracts headlines from the content. Optional. |
`add_frontmatter_to_meta` | Boolean | `True`, `False`. Default: `False` | Adds the contents of the frontmatter to the document's metadata. Optional. |
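As a sketch, a MarkdownConverter that keeps code snippets, extracts headlines, and copies frontmatter into metadata could be configured like this:

```yaml
components:
  - name: MarkdownConverter
    type: MarkdownConverter
    params:
      remove_code_snippets: False   # keep code blocks in the converted text
      extract_headlines: True       # extract headlines from the content
      add_frontmatter_to_meta: True # copy frontmatter into the document's metadata
```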
PDFToTextConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`remove_numeric_tables` | Boolean | `True`, `False`. Default: `False` | Deletes numeric rows from tables, using a heuristic that removes rows with more than 40% digits that don't end with a period. This is useful if your pipeline has a Reader that can't parse tables. Mandatory. |
`valid_languages` | List of strings | Language ISO 639-1 codes. Default: `None` | Tests for encoding errors for the languages you specify. Optional. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to ensure that there are no duplicate documents in your document store, you can modify the metadata of a document and pass `["content", "meta"]` to this field. Optional. |
`sort_by_position` | Boolean | `True`, `False`. Default: `False` | Specifies if the extracted text should be sorted by its location coordinates or by the logical reading order. `True` sorts the text first by its vertical position and then by its horizontal position; `False` sorts the text according to the logical reading order in the PDF. Mandatory. |
`ocr` | Literal | `auto`, `full`. Default: `None` | Specifies if optical character recognition (OCR) should be used to extract text from the images in the PDF. `auto` uses OCR only to extract text from images and integrate it into the existing text; `full` uses OCR to extract text from the entire PDF. Optional. |
`ocr_language` | String | Check supported languages. Default: `eng` | Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated by a plus sign ("+"). For example, to use English and German, pass `eng+deu`. |
`multiprocessing` | Boolean or Integer | `True`, `False`, or an integer. Default: `True` | Uses multiprocessing to speed up PyMuPDF conversion. `True` uses the total number of cores; to specify the number of cores, set this value to an integer instead. `False` disables multiprocessing. |
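For instance, to run OCR only on images embedded in PDFs that mix English and German text, you could configure the converter like this (a sketch based on the parameters above):

```yaml
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      ocr: auto                # OCR only the images, keep the existing text layer
      ocr_language: eng+deu    # combine languages with a plus sign
      sort_by_position: False  # keep the logical reading order
```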
PptxConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`remove_numeric_tables` | Boolean | `True`, `False`. Default: `False` | Uses heuristics to remove numeric rows from tables in the files. Retains table rows containing strings that may be candidates for searching for answers. Mandatory. |
`valid_languages` | List of strings | Language ISO 639-1 codes. Default: `None` | Validates languages specified in the ISO 639-1 format. You can use this option to add tests for encoding errors. If the extracted text is not in one of the valid languages, there's a chance an encoding error resulted in garbled text. Optional. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass `"meta"` to this field (for example: `["content", "meta"]`). In such a case, the ID is generated using the content and the defined metadata. Optional. |
`progress_bar` | Boolean | `True`, `False`. Default: `True` | Shows a progress bar for the conversion. Mandatory. |
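A PptxConverter that drops numeric table rows and validates the extracted text as English might look like this (illustrative values):

```yaml
components:
  - name: PPTXConverter
    type: PptxConverter
    params:
      remove_numeric_tables: True  # remove table rows that are mostly digits
      valid_languages: [en]        # flag possible encoding errors for non-English text
```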
TextConverter Parameters
There are no parameters for TextConverter that you can configure.