CNAzureConverter
Use CNAzureConverter in your indexing pipelines to extract text and tables from files in different formats using Microsoft Azure Form Recognizer.
CNAzureConverter extracts text and tables from files and converts them into documents you can store in the document store and use in your pipelines. It uses the Form Recognizer service by Microsoft Azure, which Microsoft has since renamed Azure AI Document Intelligence. It can extract content from the following file types:
- PDF
- JPEG
- PNG
- BMP
- TIFF
You must have an active Azure account and a Form Recognizer or Cognitive Services resource. For information on how to set it up, see the Microsoft Azure documentation.
For PDF files, the extracted text is not shown in the PDF view in deepset Cloud. When you run a search with your pipeline and choose View File under an answer, the PDF file that opens does not contain the extracted text. This is because the node runs in the indexing pipeline, which stores the extracted content in the document store for the query pipeline to search. The extracted text exists only in the document store, not in the original file.
Basic Information
- Pipeline type: Used in indexing pipelines.
- Nodes that can precede it in a pipeline: FileTypeClassifier
- Nodes that can follow it in a pipeline: PreProcessor
- Input: File paths
- Output: Documents
- Available node classes: CNAzureConverter
Usage Example
In this example, CNAzureConverter converts PDF files, so it takes output_2 from FileTypeClassifier, which is where PDF files are routed.
...
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <FormRecognizer or Cognitive Services key>
      model_id: prebuilt-read
...
pipelines:
  # here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: AzureConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [AzureConverter]
...
Parameters
Here are the parameters you can pass to this node in the pipeline YAML configuration:
Parameter | Type | Possible Values | Description |
---|---|---|---|
endpoint | String | | Your Document Intelligence or Cognitive Services resource's endpoint. Mandatory. |
credential_key | String | | Your Document Intelligence or Cognitive Services resource's subscription key. Mandatory. |
model_id | String | Default: prebuilt-read | The identifier of the model you want to use to extract information from your file. For a list of available models, see the Azure documentation. Mandatory. |
save_json | Boolean | True, False. Default: False | Saves the output as a JSON file. Mandatory. |
preceding_context_len | Integer | Default: 3 | Specifies the number of lines that precede a table to extract as preceding context. It's returned as metadata. Mandatory. |
following_context_len | Integer | Default: 3 | Specifies the number of lines after a table to extract as subsequent context. It's returned as metadata. Mandatory. |
merge_multiple_column_headers | Boolean | True, False. Default: True | If a table contains more than one row as a column header, this parameter lets you merge these rows into a single row. Mandatory. |
id_hash_keys | List of strings | Default: None | Generates the document ID from a custom list of strings that refer to the document's attributes. To make sure there are no duplicate documents in your document store if document texts are the same, you can modify the metadata of a document and then pass ["content", "metadata"] to this field to generate IDs based on the document content and the defined metadata. Optional. |
page_layout | Literal | natural, single_column. Default: natural | The reading order to follow. Possible options: natural uses the natural reading order determined by Azure; single_column groups all lines on the page with the same height together based on the threshold specified in threshold_y. Mandatory. |
threshold_y | Float | Default: 0.05 | The threshold, in inches, that determines whether two elements in a PDF are grouped into a single line. This is especially relevant for section headers or numbers that may be spatially separated on the horizontal axis from the remaining text. Only relevant if page_layout is set to single_column. Optional. |
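As a rough sketch of how the optional parameters from the table fit into the component definition, the fragment below sets several of them explicitly. The values chosen for preceding_context_len, following_context_len, and threshold_y are illustrative assumptions, not recommendations, and the endpoint and key remain placeholders you replace with your own.

```yaml
components:
  - name: AzureConverter
    type: CNAzureConverter
    params:
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: <Form Recognizer or Cognitive Services key>
      model_id: prebuilt-read               # default model
      save_json: false                      # default
      preceding_context_len: 5              # illustrative value; default is 3 lines before a table
      following_context_len: 5              # illustrative value; default is 3 lines after a table
      merge_multiple_column_headers: true   # default; merges multi-row column headers into one row
      page_layout: single_column            # default is natural
      threshold_y: 0.1                      # illustrative value in inches; default is 0.05; only applies with single_column
```

Setting page_layout to single_column is what makes threshold_y take effect. With the default natural layout, the threshold is ignored and Azure determines the reading order.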