Converters

Converters are used in indexing pipelines to extract text from files in various formats and transform it into the document format. There are several converters available.

Pipelines search through documents stored in the document store. Documents are passages of plain text. Use converters to transform files into searchable documents.

If you add a converter to your indexing pipeline, the conversion only happens once when you deploy the pipeline. Your files are not converted every time you run a search.

Basic Information

  • Pipeline type: Converters are used in indexing pipelines.
  • Position in a pipeline: Either at the very beginning or after a FileTypeClassifier.
  • Nodes that can precede converters in a pipeline: FileTypeClassifier
  • Nodes that can follow converters in a pipeline: PreProcessor
  • Node input: File paths
  • Node output: Documents
  • Supported types:
    • CNAzureConverter
    • DocxToTextConverter
    • MarkdownConverter
    • PDFToTextConverter
    • PptxConverter
    • TextConverter

Converters Overview

CNAzureConverter

CNAzureConverter extracts text and tables from files and converts them into documents you can store in the document store and use in your pipelines. It uses the Form Recognizer service by Microsoft Azure. It can extract content from the following file types:

  • PDF
  • JPEG
  • PNG
  • BMP
  • TIFF

You must have an active Azure account and a Form Recognizer or Cognitive Services resource. For information on how to set it up, see Microsoft Azure documentation.

📘

For PDF files, the extracted text is not available in the PDF view in deepset Cloud. When you search with your pipeline and choose View File under an answer, the PDF that opens doesn't show the extracted text. This is because the node runs in the indexing pipeline, which stores the contents of the files in the document store that the query pipeline then searches.

DocxToTextConverter

Extracts text from DOCX files.

MarkdownConverter

MarkdownConverter converts Markdown files into plain text documents, removing all structured information, like bullet point lists or code block formatting.

📘

Preprocessing Markdown Files for RAG

If you use an LLM in the query pipeline, TextConverter is more effective for preprocessing Markdown files than MarkdownConverter. LLMs are particularly adept at understanding the Markdown structure that TextConverter retains, which is why we recommend using TextConverter for these files.

PDFToTextConverter

The PDFToTextConverter is a fast and lightweight PDF converter that converts PDF files to plain text. It works well with most digitally created or searchable PDFs containing a text layer. It can also work with image-based PDFs (for example, scanned documents).

This converter doesn't extract tables as separate documents of type 'table' but treats them as plain text. You can discard numerical tables by setting the remove_numeric_tables parameter to True.

PptxConverter

This converter extracts text from PPTX files. Because PPTX files don't contain page information, PptxConverter returns a list with the text of each slide in the file.

TextConverter

TextConverter converts plain text files to document objects that pipelines can use for search.

In RAG pipelines, we recommend using TextConverter to preprocess Markdown. LLMs are good at understanding structural information in Markdown files, which can help generate the right answer. Unlike MarkdownConverter, TextConverter preserves this information.

Usage Examples

...
components:
  - name: AzureConverter
    type: CNAzureConverter
    params: 
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <FormRecognizer or Cognitive Services key>
      model_id: prebuilt-read
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: AzureConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [AzureConverter]
...
...
components:
  - name: DOCXConverter
    type: DocxToTextConverter
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileClassifier
        inputs: [File] 
      - name: DOCXConverter
        inputs: [FileClassifier.output_4]
      - name: PreProcessor
        inputs: [DOCXConverter]
...
components:
  - name: MarkdownConverter
    type: MarkdownConverter
    params:
      remove_code_snippets: False
...
pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: MarkdownConverter
        inputs: [FileTypeClassifier.output_3] # output_3 is where Markdown files are routed
      - name: Preprocessor
        inputs: [MarkdownConverter]
...
...
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params: 
      remove_numeric_tables: True
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [PDFConverter]
...
...
components:
  - name: PPTXConverter
    type: PptxConverter
  
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: PPTXConverter
        inputs: [File] 
      - name: Preprocessor
        inputs: [PPTXConverter]
...
...
components:
  - name: TextFileConverter
    type: TextConverter
...
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1] # This is where text is routed
      - name: Preprocessor
        inputs: [TextFileConverter]
        
# To use TextConverter for preprocessing Markdown files in pipelines 
# containing FileTypeClassifier, add output_3 as the input for TextConverter, like this:
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1, FileTypeClassifier.output_3] # Output_3 is where MD files are routed
      - name: Preprocessor
        inputs: [TextFileConverter]

Parameters

CNAzureConverter Parameters

| Parameter | Type | Possible Values | Description |
| --- | --- | --- | --- |
| endpoint | String | | Your Form Recognizer or Cognitive Services resource's endpoint. Mandatory. |
| credential_key | String | | Your Form Recognizer or Cognitive Services resource's subscription key. Mandatory. |
| model_id | String | Default: prebuilt-read | The identifier of the model you want to use to extract information from your file. For a list of available models, see Azure Documentation. Mandatory. |
| save_json | Boolean | True, False. Default: False | Saves the output as a JSON file. Mandatory. |
| preceding_context_len | Integer | Default: 3 | Specifies the number of lines that precede a table to extract as preceding context. It's returned as metadata. Mandatory. |
| following_context_len | Integer | Default: 3 | Specifies the number of lines after a table to extract as subsequent context. It's returned as metadata. Mandatory. |
| merge_multiple_column_headers | Boolean | True, False. Default: True | If a table contains more than one row as a column header, this parameter lets you merge these rows into a single row. Mandatory. |
| id_hash_keys | List of strings | Default: None | Generates the document ID from a custom list of strings that refer to the document's attributes. To make sure there are no duplicate documents in your document store if document texts are the same, you can modify the metadata of a document and then pass ["content", "metadata"] to this field to generate IDs based on the document content and the defined metadata. Optional. |
| page_layout | Literal | natural, single_column. Default: natural | The type of reading order to follow. natural: Uses the natural reading order determined by Azure. single_column: Groups all lines on the page with the same height together based on the threshold specified in threshold_y. Mandatory. |
| threshold_y | Float | Default: 0.05 | The threshold to determine if two elements in a PDF should be grouped into a single line. This is especially relevant for section headers or numbers which may be spatially separated on the horizontal axis from the remaining text. The threshold is specified in inches. This is only relevant if page_layout=single_column. Optional. |
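
As a rough illustration of how the layout parameters fit together, the fragment below configures CNAzureConverter for single-column reading order. It is a sketch, not a recommended setup: the component name AzureConverter is arbitrary, the threshold_y value of 0.1 and the context lengths of 2 are illustrative assumptions, and the endpoint and key placeholders stand in for your own values.

```yaml
components:
  - name: AzureConverter  # arbitrary name chosen for this sketch
    type: CNAzureConverter
    params:
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <Form Recognizer or Cognitive Services key>
      model_id: prebuilt-read
      page_layout: single_column  # threshold_y only applies with this layout
      threshold_y: 0.1            # illustrative value, in inches (default is 0.05)
      preceding_context_len: 2    # illustrative value (default is 3)
      following_context_len: 2    # illustrative value (default is 3)
```

Because threshold_y is only relevant when page_layout is single_column, setting it together with the natural layout should have no effect.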

DocxToTextConverter Parameters

| Parameter | Type | Possible Values | Description |
| --- | --- | --- | --- |
| remove_numeric_tables | Boolean | True, False. Default: False | Uses heuristics to remove numeric rows from tables in the files. Retains table rows containing strings that may be candidates for searching for answers. Required. |
| valid_languages | List of strings | Language ISO 639-1 code. Default: None | Validates languages specified in the ISO 639-1 format. You can use this option to add tests for encoding errors. If the extracted text is not one of the valid languages, it means there's a chance of an encoding error resulting in garbled text. Optional. |
| id_hash_keys | List of strings | Default: None | Generates document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass "meta" to this field (for example: ["content", "meta"]). In such a case, the ID is generated using the content and the defined metadata. Optional. |
| progress_bar | Boolean | True, False. Default: True | Shows a progress bar for the conversion. Required. |
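
The fragment below sketches how the language check and custom document IDs might be set for DOCX files. The component name DOCXConverter is arbitrary, and the language codes "en" and "de" are assumptions chosen only for illustration.

```yaml
components:
  - name: DOCXConverter  # arbitrary name chosen for this sketch
    type: DocxToTextConverter
    params:
      valid_languages: ["en", "de"]       # illustrative ISO 639-1 codes
      id_hash_keys: ["content", "meta"]   # build IDs from content plus metadata
```

With id_hash_keys set this way, two files with identical text but different metadata should receive different document IDs instead of being treated as duplicates.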


MarkdownConverter Parameters

| Parameter | Type | Possible Values | Description |
| --- | --- | --- | --- |
| id_hash_keys | List of strings | Default: None | Generates document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass "meta" to this field (for example: ["content", "meta"]). In such a case, the ID is generated using the content and the defined metadata. Optional. |
| progress_bar | Boolean | True, False. Default: True | Shows a progress bar during the conversion process. Optional. |
| remove_code_snippets | Boolean | True, False. Default: True | Removes code snippets from the content. Optional. |
| extract_headlines | Boolean | True, False. Default: False | Whether to extract headlines from the content. Optional. |
| add_frontmatter_to_meta | Boolean | True, False. Default: False | Adds the contents of the frontmatter to the document's metadata. Optional. |
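
As a hedged example of combining these options, the fragment below keeps code snippets, extracts headlines, and copies the frontmatter into the metadata. The component name is arbitrary, and whether these settings suit your files is an assumption to verify against your own data.

```yaml
components:
  - name: MarkdownConverter  # arbitrary name chosen for this sketch
    type: MarkdownConverter
    params:
      remove_code_snippets: False    # keep code snippets in the content
      extract_headlines: True        # off by default
      add_frontmatter_to_meta: True  # off by default
```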

PDFToTextConverter Parameters

| Parameter | Type | Possible Values | Description |
| --- | --- | --- | --- |
| remove_numeric_tables | Boolean | True, False (default) | Deletes numeric rows from tables. Uses a heuristic to remove rows that contain more than 40% digits and don't end with a period. This is useful if your pipeline has a Reader that can't parse tables. Mandatory. |
| valid_languages | List of strings | A list of languages in the ISO 639-1 format. | Tests for encoding errors for the languages you specify. Optional. |
| id_hash_keys | List of strings | - | Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to ensure that there are no duplicate documents in your document store, you can modify the metadata of a document and pass ["content", "meta"] to this field. Optional. |
| sort_by_position | Boolean | True, False (default) | Specifies if the extracted text should be sorted by its location coordinates or by the logical reading order. True: Sorts the text first by its vertical position and then by its horizontal position. False: Sorts the text according to the logical reading order in the PDF. Mandatory. |
| ocr | Literal | auto, full. Default: None | Specifies if optical character recognition (OCR) should be used to extract text from the images in the PDF. auto: Uses OCR only to extract text from images and integrates it into the existing text. full: Uses OCR to extract text from the entire PDF. Optional. |
| ocr_language | String | Check supported languages. Default: eng | Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated with a plus ("+"). For example, to use English and German, pass eng+deu. |
| multiprocessing | Boolean | True (default), False | Uses multiprocessing to speed up PyMuPDF conversion. True: Uses the total number of cores. To specify the number of cores to use, set this value to an integer. False: Disables multiprocessing. |
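
To show how the OCR parameters relate to each other, the fragment below sketches a converter for PDFs that mix a text layer with scanned images in English and German. The component name PDFConverter is arbitrary and the choice of languages is an illustrative assumption.

```yaml
components:
  - name: PDFConverter  # arbitrary name chosen for this sketch
    type: PDFToTextConverter
    params:
      ocr: auto               # OCR only the images, keep the existing text layer
      ocr_language: eng+deu   # English and German, joined with a plus sign
      sort_by_position: False # keep the logical reading order of the PDF
      remove_numeric_tables: True
```

For fully scanned documents without a text layer, setting ocr to full is the option the table describes for running OCR on the entire PDF.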

PptxConverter Parameters

| Parameter | Type | Possible Values | Description |
| --- | --- | --- | --- |
| remove_numeric_tables | Boolean | True, False. Default: False | Uses heuristics to remove numeric rows from tables in the files. Retains table rows containing strings that may be candidates for searching for answers. Required. |
| valid_languages | List of strings | Language ISO 639-1 code. Default: None | Validates languages specified in the ISO 639-1 format. You can use this option to add tests for encoding errors. If the extracted text is not one of the valid languages, it means there's a chance of an encoding error resulting in garbled text. Optional. |
| id_hash_keys | List of strings | Default: None | Generates document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass "meta" to this field (for example: ["content", "meta"]). In such a case, the ID is generated using the content and the defined metadata. Optional. |
| progress_bar | Boolean | True, False. Default: True | Shows a progress bar for the conversion. Required. |

TextConverter Parameters

There are no parameters for TextConverter that you can configure.