CNAzureConverter extracts text and tables from files and converts them into documents you can store in the document store and use in your pipelines. It uses the Form Recognizer service by Microsoft Azure. It can extract content from the following file types:

PDF
JPEG
PNG
BMP
TIFF

You must have an active Azure account and a Form Recognizer or Cognitive Services resource. For information on how to set it up, see Microsoft Azure documentation.

📘
For PDF files, the extracted text is not available in the PDF view in deepset Cloud. When you search with your pipeline and choose View File under an answer, the PDF that opens doesn't show the extracted text. This is because this node runs in the indexing pipeline, which stores the extracted content in the document store. The query pipeline then searches that stored content.

Basic Information

Pipeline type: Used in indexing pipelines.
Nodes that can precede it in a pipeline: FileTypeClassifier
Nodes that can follow it in a pipeline: PreProcessor
Input: File paths
Output: Documents
Available node classes: CNAzureConverter

Usage Example

In this example, CNAzureConverter converts PDF files, so it takes output_2 from FileTypeClassifier, which is where PDF files are routed.

...
components:
  - name: AzureConverter
    type: CNAzureConverter
    params: 
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <Form Recognizer or Cognitive Services key>
      model_id: prebuilt-read
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: AzureConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [AzureConverter]
...

Parameters

Here are the parameters you can pass to this node in the pipeline YAML configuration:

Parameter	Type	Possible Values	Description
`endpoint`	String		Your Document Intelligence or Cognitive Services resource's endpoint. Mandatory.
`credential_key`	String		Your Document Intelligence or Cognitive Services resource's subscription key. Mandatory.
`model_id`	String	Default: `prebuilt-read`	The identifier of the model you want to use to extract information out of your file. For a list of available models, see Azure Documentation. Mandatory.
`save_json`	Boolean	`True` `False` Default: `False`	Saves the output as a JSON file. Mandatory.
`preceding_context_len`	Integer	Default: `3`	Specifies the number of lines that precede a table to extract as preceding context. It's returned as metadata. Mandatory.
`following_context_len`	Integer	Default: `3`	Specifies the number of lines after a table to extract as subsequent context. It's returned as metadata. Mandatory.
`merge_multiple_column_headers`	Boolean	`True` `False` Default: `True`	If a table contains more than one row as a column header, this parameter lets you merge these rows into a single row. Mandatory.
`id_hash_keys`	List of strings	Default: `None`	Generates the document ID from a custom list of strings that refer to the document's attributes. To make sure there are no duplicate documents in your document store if document texts are the same, you can modify the metadata of a document and then pass `["content", "metadata"]` to this field to generate IDs based on the document content and the defined metadata. Optional.
`page_layout`	Literal	`natural` `single_column` Default: `natural`	The type of reading order to follow. Possible options: - natural: Uses the natural reading order determined by Azure. - single_column: Groups all lines on the page with the same height together based on the threshold specified in `threshold_y`. Mandatory.
`threshold_y`	Float	Default: `0.05`	The threshold to determine if two elements in a PDF should be grouped into a single line. This is especially relevant for section headers or numbers which may be spatially separated on the horizontal axis from the remaining text. The threshold is specified in inches. This is only relevant if `page_layout=single_column`. Optional.
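
To illustrate how the table and layout parameters combine, here is a minimal sketch of a converter configured for table-heavy PDFs. All values are illustrative assumptions, not recommendations. The `prebuilt-layout` model ID is an Azure model that returns tables and is used here as an assumption; check the Azure documentation to confirm it's available for your resource. The component name and placeholders follow the usage example.

```yaml
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <Form Recognizer or Cognitive Services key>
      # Assumption: a layout model is used so that tables are extracted.
      model_id: prebuilt-layout
      # Keep five lines before and after each table as context metadata
      # (illustrative values; the default for both is 3).
      preceding_context_len: 5
      following_context_len: 5
      # Merge multi-row column headers into a single header row (the default).
      merge_multiple_column_headers: True
      # Group lines at the same height instead of using Azure's natural order.
      page_layout: single_column
      # In inches; only relevant when page_layout is single_column.
      # 0.1 is an assumed value, looser than the 0.05 default.
      threshold_y: 0.1
```

A larger `threshold_y` groups elements that sit further apart vertically into the same line, which may help with section numbers set slightly off the text baseline. If it's too large, separate lines can merge, so it's worth checking the output on a few sample files.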