CNAzureConverter

Use CNAzureConverter in your indexing pipelines to extract text and tables from files in different formats using Microsoft Azure Form Recognizer.

CNAzureConverter extracts text and tables from files and converts them into documents you can store in the document store and use in your pipelines. It uses the Form Recognizer service by Microsoft Azure. It can extract content from the following file types:

  • PDF
  • JPEG
  • PNG
  • BMP
  • TIFF

You must have an active Azure account and a Form Recognizer or Cognitive Services resource. For information on how to set them up, see the Microsoft Azure documentation.

Note: For PDF files, the extracted text is not available in the PDF view in deepset Cloud. When you search with your pipeline and choose View File under an answer, the PDF file that opens does not show the extracted text. This is because the node runs in the indexing pipeline, which stores the contents of the files in the document store. The query pipeline then searches the document store, not the PDF file itself.

Basic Information

  • Pipeline type: Used in indexing pipelines.
  • Nodes that can precede it in a pipeline: FileTypeClassifier
  • Nodes that can follow it in a pipeline: PreProcessor
  • Input: File paths
  • Output: Documents
  • Available node classes: CNAzureConverter

Usage Example

In this example, CNAzureConverter converts PDF files, so it takes output_2 from FileTypeClassifier, which is where PDF files are routed.

...
components:
  - name: AzureConverter
    type: CNAzureConverter
    params: 
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <Form Recognizer or Cognitive Services key>
      model_id: prebuilt-read
...

pipelines:
  # The query pipeline is omitted from this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: AzureConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [AzureConverter]
...

Parameters

Here are the parameters you can pass to this node in the pipeline YAML configuration:

  • endpoint (String, mandatory): Your Form Recognizer or Cognitive Services resource's endpoint.
  • credential_key (String, mandatory): Your Form Recognizer or Cognitive Services resource's subscription key.
  • model_id (String, mandatory; default: prebuilt-read): The identifier of the model you want to use to extract information from your file. For a list of available models, see the Azure documentation.
  • save_json (Boolean, mandatory; possible values: True, False; default: False): Saves the output as a JSON file.
  • preceding_context_len (Integer, mandatory; default: 3): Specifies the number of lines before a table to extract as preceding context. It's returned as metadata.
  • following_context_len (Integer, mandatory; default: 3): Specifies the number of lines after a table to extract as subsequent context. It's returned as metadata.
  • merge_multiple_column_headers (Boolean, mandatory; possible values: True, False; default: True): If a table contains more than one row as a column header, this parameter lets you merge these rows into a single row.
  • id_hash_keys (List of strings, optional; default: None): Generates the document ID from a custom list of strings that refer to the document's attributes. To make sure there are no duplicate documents in your document store when document texts are the same, you can modify the metadata of a document and then pass ["content", "metadata"] to this field to generate IDs based on the document content and the defined metadata.
  • page_layout (Literal, mandatory; possible values: natural, single_column; default: natural): The type of reading order to follow. Possible options:
      - natural: Uses the natural reading order determined by Azure.
      - single_column: Groups all lines on the page with the same height together based on the threshold specified in threshold_y.
  • threshold_y (Float, optional; default: 0.05): The threshold, specified in inches, for determining whether two elements in a PDF should be grouped into a single line. This is especially relevant for section headers or numbers that may be spatially separated on the horizontal axis from the remaining text. It applies only if page_layout is set to single_column.
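
As a rough sketch of how the layout parameters above fit into the node configuration, the fragment below switches the reading order to single_column and loosens the line-grouping threshold. The threshold_y value of 0.1 is an illustrative assumption, not a recommended setting. The endpoint and key placeholders stand in for your own resource values, as in the usage example.

```yaml
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <Form Recognizer or Cognitive Services key>
      model_id: prebuilt-read
      # Group lines by their vertical position instead of using
      # the natural reading order determined by Azure
      page_layout: single_column
      # In inches; only used when page_layout is single_column.
      # 0.1 is an illustrative value (the default is 0.05)
      threshold_y: 0.1
```

If section headers or numbers end up on separate lines from the text they belong to, raising threshold_y slightly may be worth trying. With page_layout set to natural, the threshold_y setting is not used.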