Use Azure Document Intelligence

Convert files to documents using Azure's Document Intelligence service.

About this Task

Azure Document Intelligence extracts text from files in the following formats:

  • PDF
  • JPEG
  • PNG
  • BMP
  • TIFF
  • DOCX
  • XLSX
  • PPTX
  • HTML
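
Before uploading files, it can help to check which ones the converter is likely to accept. The following is a minimal sketch, not part of deepset Cloud: the helper names `is_supported` and `split_by_support` are hypothetical, and the mapping from format names to file extensions (for example, both `.jpg` and `.jpeg` for JPEG) is an assumption. `.pdf` is included because the usage example on this page routes PDF files to the converter.

```python
from pathlib import Path

# Extensions assumed to correspond to the formats named on this page.
# ".pdf" is included because the usage example routes PDF files to the converter.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".tif",
    ".docx", ".xlsx", ".pptx", ".html",
}


def is_supported(file_name: str) -> bool:
    """Return True if the file extension maps to one of the assumed formats."""
    return Path(file_name).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(file_names):
    """Split file names into (supported, unsupported) lists, preserving order."""
    supported = [name for name in file_names if is_supported(name)]
    unsupported = [name for name in file_names if not is_supported(name)]
    return supported, unsupported
```

For example, `split_by_support(["a.docx", "b.csv"])` puts `a.docx` in the first list and `b.csv` in the second. The check looks only at the extension, not at the file contents.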

For more details on the service's capabilities, see the Azure Document Intelligence website. For a list of models you can use to process your files, see the model overview in the Document Intelligence documentation.

Prerequisites

You need an API key for a Document Intelligence resource in your Azure account. For details, see Get started with Document Intelligence in the Azure documentation.

Use Azure Document Intelligence

First, connect deepset Cloud to Azure Document Intelligence through the Connections page:

  1. Click your name in the top right corner and select Connections.
  2. Click Connect next to Azure Document Intelligence.
  3. Enter your API key and submit it.

Then, add the CNAzureConverter node to your indexing pipeline.

Usage Example

...
components:
  - name: AzureConverter
    type: CNAzureConverter
    params: 
      endpoint: <Form Recognizer or Cognitive Services endpoint>
      credential_key: "" # Leave this field as an empty string
      model_id: prebuilt-read
...

pipelines:
# here comes the query pipeline which we skipped in this example
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: AzureConverter
        inputs: [FileTypeClassifier.output_2] # output_2 is where PDF files are routed
      - name: Preprocessor
        inputs: [AzureConverter]
...
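
The comment in the example says that `output_2` carries PDF files. One way to read this is that the classifier numbers its outgoing edges by the position of each file type in its list of supported types. The sketch below illustrates that reading only. It is not the classifier's implementation: the function name `output_edge` is hypothetical, and the type order `txt, pdf, md, docx, html` is an assumption that this page does not state, so check it against your own FileTypeClassifier settings.

```python
from pathlib import Path

# Assumed order of file types for the classifier; not stated on this page.
ASSUMED_SUPPORTED_TYPES = ["txt", "pdf", "md", "docx", "html"]


def output_edge(file_name: str, supported_types=ASSUMED_SUPPORTED_TYPES) -> str:
    """Return the name of the edge a file would take under the assumed type order."""
    extension = Path(file_name).suffix.lower().lstrip(".")
    if extension not in supported_types:
        raise ValueError(f"File type '{extension}' is not in the assumed type list")
    # Edges are numbered from 1, so the second type maps to output_2.
    return f"output_{supported_types.index(extension) + 1}"
```

Under this assumption, a PDF file maps to `output_2`, which matches the input of the AzureConverter node in the example. To send other file types to the converter, you would connect the corresponding classifier outputs to it as well.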
