Converters
Converters are used in indexing pipelines to extract text from files in various formats and transform it into the document format. There are several converters available.
Pipelines search through documents stored in the document store, and documents are passages of plain text. Use converters to transform your files into these searchable documents.
If you add a converter to your indexing pipeline, the conversion happens only once, when you deploy the pipeline. Your files are not converted every time you run a search.
Basic Information
- Pipeline type: Converters are used in indexing pipelines.
- Position in a pipeline: Either at the very beginning or after a FileTypeClassifier (see the sketch after this list).
- Nodes that can precede converters in a pipeline: FileTypeClassifier
- Nodes that can follow converters in a pipeline: PreProcessor
- Node input: File paths
- Node output: Documents
- Supported types:
  - CNAzureConverter
  - DocxToTextConverter
  - MarkdownConverter
  - PDFToTextConverter
  - PptxConverter
  - TextConverter
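To show how these pieces fit together, here's a minimal sketch of an indexing pipeline with a converter. The node names are illustrative; the routing comment reflects FileTypeClassifier's default behavior described in the usage examples below:

```yaml
components:
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: TextFileConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]                         # file paths enter the pipeline here
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]  # the converter turns files into documents
      - name: Preprocessor
        inputs: [TextFileConverter]            # PreProcessor cleans and splits the documents
```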
Converters Overview
CNAzureConverter
CNAzureConverter extracts text and tables from files and converts them into documents you can store in the document store and use in your pipelines. It uses the Document Intelligence service by Microsoft Azure. It can extract content from the following file types:
- PDF
- JPEG
- PNG
- BMP
- TIFF
You must have an active Azure account and a Document Intelligence (formerly Form Recognizer) or Cognitive Services resource. For information on how to set it up, see the Microsoft Azure documentation.
For PDF files, the extracted text is not available in the PDF view in deepset Cloud. When you search with your pipeline and choose View File under an answer, the PDF that opens doesn't show the extracted text. This is because the node runs in the indexing pipeline, which stores the file contents in the document store that the query pipeline then searches.
DocxToTextConverter
Extracts text from DOCX files.
MarkdownConverter
MarkdownConverter converts Markdown files into plain text documents, removing all structured information, like bullet point lists or code block formatting.
Preprocessing Markdown Files for RAG
If you use an LLM in the query pipeline, TextConverter is more effective for preprocessing Markdown files than MarkdownConverter. LLMs are particularly adept at understanding the Markdown structure that TextConverter retains, which is why we recommend using TextConverter to process these files.
PDFToTextConverter
The PDFToTextConverter is a fast and lightweight PDF converter that converts PDF files to plain text. It works well with most digitally created or searchable PDFs containing a text layer. It can also work with image-based PDFs (for example, scanned documents).
This converter doesn't extract tables as separate documents of type `table` but treats them as plain text. You can discard numeric tables by setting the `remove_numeric_tables` parameter to `True`.
PptxConverter
This converter extracts text from PPTX files. Because PPTX files don't contain page information, PptxConverter returns a list with the text of each slide in the file.
TextConverter
TextConverter converts plain text files to document objects that pipelines can use for search.
In RAG pipelines, we recommend using TextConverter to preprocess Markdown. LLMs are good at understanding structural information in Markdown files, which can help generate the right answer. Unlike MarkdownConverter, TextConverter preserves this information.
Usage Examples
CNAzureConverter

```yaml
...
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Document Intelligence or Cognitive Services endpoint>
      credential_key: <Document Intelligence or Cognitive Services key>
      model_id: prebuilt-read
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: AzureConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [AzureConverter]
...
```
DocxToTextConverter

```yaml
...
components:
  - name: DOCXConverter
    type: DocxConverter
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: DOCXConverter
        inputs: [FileTypeClassifier.output_4] # output_4 is where DOCX files are routed
      - name: Preprocessor
        inputs: [DOCXConverter]
...
```
MarkdownConverter

```yaml
...
components:
  - name: MarkdownConverter
    type: MarkdownConverter
    params:
      remove_code_snippets: False
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: MarkdownConverter
        inputs: [FileTypeClassifier.output_3] # output_3 is where Markdown files are routed
      - name: Preprocessor
        inputs: [MarkdownConverter]
...
```
PDFToTextConverter

```yaml
...
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      remove_numeric_tables: True
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [PDFConverter]
...
```
PptxConverter

```yaml
...
components:
  - name: PPTXConverter
    type: PptxConverter
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: PPTXConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [PPTXConverter]
...
```
TextConverter

```yaml
...
components:
  - name: TextFileConverter
    type: TextConverter
...
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1] # output_1 is where text files are routed
      - name: Preprocessor
        inputs: [TextFileConverter]
```

To use TextConverter for preprocessing Markdown files in pipelines containing FileTypeClassifier, add output_3 as an input for TextConverter, like this:

```yaml
pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1, FileTypeClassifier.output_3] # output_3 is where Markdown files are routed
      - name: Preprocessor
        inputs: [TextFileConverter]
```
Parameters
CNAzureConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`endpoint` | String | | Your Document Intelligence or Cognitive Services resource's endpoint. Mandatory. |
`credential_key` | String | | Your Document Intelligence or Cognitive Services resource's subscription key. Mandatory. |
`model_id` | String | Default: `prebuilt-read` | The identifier of the model you want to use to extract information from your file. For a list of available models, see Azure Documentation. Mandatory. |
`save_json` | Boolean | `True`, `False`. Default: `False` | Saves the output as a JSON file. Mandatory. |
`preceding_context_len` | Integer | Default: `3` | Specifies the number of lines that precede a table to extract as preceding context. It's returned as metadata. Mandatory. |
`following_context_len` | Integer | Default: `3` | Specifies the number of lines after a table to extract as subsequent context. It's returned as metadata. Mandatory. |
`merge_multiple_column_headers` | Boolean | `True`, `False`. Default: `True` | If a table contains more than one row as a column header, this parameter lets you merge these rows into a single row. Mandatory. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To make sure there are no duplicate documents in your document store when document texts are the same, you can modify the metadata of a document and then pass `["content", "meta"]` to this field to generate IDs based on the document content and the defined metadata. Optional. |
`page_layout` | Literal | `natural`, `single_column`. Default: `natural` | The type of reading order to follow. `natural` uses the natural reading order determined by Azure; `single_column` groups all lines on the page with the same height together based on the threshold specified in `threshold_y`. Mandatory. |
`threshold_y` | Float | Default: `0.05` | The threshold to determine if two elements in a PDF should be grouped into a single line, specified in inches. This is especially relevant for section headers or numbers which may be spatially separated from the remaining text on the horizontal axis. Only relevant if `page_layout=single_column`. Optional. |
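For example, a CNAzureConverter that groups same-height lines into single lines might be configured like this. The endpoint and key are placeholders, and `threshold_y: 0.1` is an illustrative value, not a recommendation:

```yaml
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Document Intelligence or Cognitive Services endpoint>
      credential_key: <Document Intelligence or Cognitive Services key>
      model_id: prebuilt-read
      page_layout: single_column  # group lines with the same height into one line
      threshold_y: 0.1            # grouping threshold in inches (illustrative)
```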
DocxConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`remove_numeric_tables` | Boolean | `True`, `False`. Default: `False` | Uses heuristics to remove numeric rows from tables in the files. Retains table rows containing strings that may be candidates for searching for answers. Mandatory. |
`valid_languages` | List of strings | Language ISO 639-1 codes. Default: `None` | Validates languages specified in the ISO 639-1 format. You can use this option to add tests for encoding errors. If the extracted text is not in one of the valid languages, there's a chance an encoding error resulted in garbled text. Optional. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass `"meta"` to this field (for example: `["content", "meta"]`). In such a case, the ID is generated using the content and the defined metadata. Optional. |
`progress_bar` | Boolean | `True`, `False`. Default: `True` | Shows a progress bar for the conversion. Mandatory. |
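To illustrate, a DocxConverter that checks the extracted text against expected languages might be configured like this (the language list and node name are examples):

```yaml
components:
  - name: DOCXConverter
    type: DocxConverter
    params:
      valid_languages: [en, de]  # flag possible encoding errors if text isn't English or German
      progress_bar: False        # hide the conversion progress bar
```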
MarkdownConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass `"meta"` to this field (for example: `["content", "meta"]`). In such a case, the ID is generated using the content and the defined metadata. Optional. |
`progress_bar` | Boolean | `True`, `False`. Default: `True` | Shows a progress bar during the conversion process. Optional. |
`remove_code_snippets` | Boolean | `True`, `False`. Default: `True` | Removes code snippets from the content. Optional. |
`extract_headlines` | Boolean | `True`, `False`. Default: `False` | Extracts headlines from the content. Optional. |
`add_frontmatter_to_meta` | Boolean | `True`, `False`. Default: `False` | Adds the contents of the frontmatter to the document's metadata. Optional. |
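As a sketch, a MarkdownConverter that keeps code snippets, extracts headlines, and copies frontmatter into metadata could be configured like this:

```yaml
components:
  - name: MarkdownConverter
    type: MarkdownConverter
    params:
      remove_code_snippets: False   # keep code blocks in the converted text
      extract_headlines: True       # extract headlines from the content
      add_frontmatter_to_meta: True # copy frontmatter into the document's metadata
```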
PDFToTextConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`remove_numeric_tables` | Boolean | `True`, `False`. Default: `False` | Deletes numeric rows from tables, using a heuristic that removes rows with more than 40% digits that don't end with a period. This is useful if your pipeline has a Reader that can't parse tables. Mandatory. |
`valid_languages` | List of strings | Language ISO 639-1 codes. Default: `None` | Tests for encoding errors for the languages you specify. Optional. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. For example, to ensure that there are no duplicate documents in your document store, you can modify the metadata of a document and pass `["content", "meta"]` to this field. Optional. |
`sort_by_position` | Boolean | `True`, `False`. Default: `False` | Specifies if the extracted text should be sorted by its location coordinates or by the logical reading order. `True` sorts the text first by its vertical position and then by its horizontal position; `False` sorts the text according to the logical reading order in the PDF. Mandatory. |
`ocr` | Literal | `auto`, `full`. Default: `None` | Specifies if optical character recognition (OCR) should be used to extract text from the images in the PDF. `auto` uses OCR only to extract text from images and integrate it into the existing text; `full` uses OCR to extract text from the entire PDF. Optional. |
`ocr_language` | String | Check supported languages. Default: `eng` | Specifies the language to use for optical character recognition. To combine multiple languages, pass a string with the language codes separated by a plus sign ("+"). For example, to use English and German, pass `eng+deu`. |
`multiprocessing` | Boolean or Integer | `True`, `False`, or an integer. Default: `True` | Uses multiprocessing to speed up PyMuPDF conversion. `True` uses the total number of cores; to specify the number of cores, set this value to an integer instead. `False` disables multiprocessing. |
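For instance, to run OCR only on images embedded in PDFs that mix English and German text, you could configure the converter like this (a sketch based on the parameters above):

```yaml
components:
  - name: PDFConverter
    type: PDFToTextConverter
    params:
      ocr: auto                # OCR only the images, keep the existing text layer
      ocr_language: eng+deu    # combine languages with a plus sign
      sort_by_position: False  # keep the logical reading order
```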
PptxConverter Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
`remove_numeric_tables` | Boolean | `True`, `False`. Default: `False` | Uses heuristics to remove numeric rows from tables in the files. Retains table rows containing strings that may be candidates for searching for answers. Mandatory. |
`valid_languages` | List of strings | Language ISO 639-1 codes. Default: `None` | Validates languages specified in the ISO 639-1 format. You can use this option to add tests for encoding errors. If the extracted text is not in one of the valid languages, there's a chance an encoding error resulted in garbled text. Optional. |
`id_hash_keys` | List of strings | Default: `None` | Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure there aren't duplicate documents in the document store when texts are not unique, modify the metadata and pass `"meta"` to this field (for example: `["content", "meta"]`). In such a case, the ID is generated using the content and the defined metadata. Optional. |
`progress_bar` | Boolean | `True`, `False`. Default: `True` | Shows a progress bar for the conversion. Mandatory. |
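A PptxConverter that drops numeric table rows and validates the extracted text as English might look like this (illustrative values):

```yaml
components:
  - name: PPTXConverter
    type: PptxConverter
    params:
      remove_numeric_tables: True  # remove table rows that are mostly digits
      valid_languages: [en]        # flag possible encoding errors for non-English text
```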
TextConverter Parameters
There are no parameters for TextConverter that you can configure.