AzureOCRDocumentConverter

Convert files to documents using Azure's Document Intelligence service.

Basic Information

  • Type: haystack_integrations.converters.azure.AzureOCRDocumentConverter
  • Components it can connect with:
    • FileTypeRouter: AzureOCRDocumentConverter can receive sources from a FileTypeRouter.
  • DocumentJoiner: AzureOCRDocumentConverter can send the converted documents to a DocumentJoiner that joins documents from all the converters in the pipeline.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] | | List of file paths or ByteStream objects. |
| meta | Optional[List[Dict[str, Any]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents. |
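The broadcast-vs-zip behavior of meta can be sketched in plain Python. This is an illustration of the rule described above, not the component's actual code, and the helper name normalize_meta is made up for this sketch:

```python
# Sketch of how the `meta` input maps onto `sources`: a single dict is
# broadcast to every source, while a list is zipped 1:1 with the sources
# and must match their length.
from typing import Any, Dict, List, Optional, Union


def normalize_meta(
    sources: List[str],
    meta: Union[Dict[str, Any], List[Dict[str, Any]], None],
) -> List[Dict[str, Any]]:
    if meta is None:
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # Same metadata is attached to every produced Document.
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        raise ValueError("The length of the meta list must match the number of sources.")
    # One metadata dict per source, in order.
    return list(meta)


# A single dict is applied to every source:
print(normalize_meta(["a.pdf", "b.pdf"], {"language": "en"}))
# A list is zipped with the sources:
print(normalize_meta(["a.pdf", "b.pdf"], [{"page": 1}, {"page": 2}]))
```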

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | The output documents. |
| raw_azure_response | List[Dict] | List of raw Azure responses used to create the documents. |

Overview

AzureOCRDocumentConverter takes a list of file paths or ByteStream objects and converts them to Haystack Document objects using Azure's Document Intelligence service. Optionally, you can attach metadata to the documents using the meta input. It supports the following file formats: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.
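Because the supported formats include scanned-image types, you can route image MIME types to this converter alongside PDFs. A minimal routing sketch (the component names and endpoint placeholder are illustrative):

```yaml
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - application/pdf
        - image/jpeg
        - image/png
  ocr_converter:
    type: haystack.components.converters.azure.AzureOCRDocumentConverter
    init_parameters:
      endpoint: <your-azure-endpoint>
      model_id: prebuilt-read

connections:
  # FileTypeRouter exposes one output per configured MIME type
  - sender: file_classifier.application/pdf
    receiver: ocr_converter.sources
  - sender: file_classifier.image/jpeg
    receiver: ocr_converter.sources
  - sender: file_classifier.image/png
    receiver: ocr_converter.sources
```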

Authorization

To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see Azure documentation.

Connect deepset to your Azure account on the Integrations page.

Add Workspace-Level Integration

  1. Click your profile icon and choose Settings.
  2. Go to Workspace>Integrations.
  3. Find the provider you want to connect and click Connect next to it.
  4. Enter the API key and any other required details.
  5. Click Connect. You can use this integration in pipelines and indexes in the current workspace.

Add Organization-Level Integration

  1. Click your profile icon and choose Settings.
  2. Go to Organization>Integrations.
  3. Find the provider you want to connect and click Connect next to it.
  4. Enter the API key and any other required details.
  5. Click Connect. You can use this integration in pipelines and indexes in all workspaces in the current organization.

Usage Example

Initializing the Component

components:
  AzureOCRDocumentConverter:
    type: haystack.components.converters.azure.AzureOCRDocumentConverter
    init_parameters:
      endpoint: <your-azure-endpoint>
      api_key: <your-azure-api-key>
      model_id: prebuilt-read
      page_layout: natural

Using the Component in an Index

This example shows how to use AzureOCRDocumentConverter in an index. A FileTypeRouter classifies the incoming files by MIME type and routes PDFs to AzureOCRDocumentConverter, which converts them to documents.

components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - text/csv

  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  ocr_converter:
    type: haystack.components.converters.azure.AzureOCRDocumentConverter
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["AZURE_AI_API_KEY"], "strict": false}
      endpoint: "YOUR-ENDPOINT"
      model_id: "prebuilt-read"
      page_layout: "natural"

  markdown_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
      # For the full list of available arguments, see
      # the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
      extraction_kwargs:
        output_format: markdown # Extract text from HTML. You can also choose "txt"
        target_language: # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
        include_tables: true # If true, includes tables in the output
        include_links: true # If true, keeps links along with their targets

  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters:
      link_format: markdown

  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}

  xlsx_converter:
    type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
    init_parameters: {}

  csv_converter:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8

  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  joiner_xlsx: # merge split documents with non-split xlsx documents
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

connections: # Defines how the components are connected
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.application/pdf
    receiver: ocr_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
    receiver: docx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: file_classifier.text/csv
    receiver: csv_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: ocr_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: docx_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: joiner_xlsx.documents
  - sender: xlsx_converter.documents
    receiver: joiner_xlsx.documents
  - sender: csv_converter.documents
    receiver: joiner_xlsx.documents
  - sender: joiner_xlsx.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents

inputs: # Define the inputs for your pipeline
  files: # This component will receive the files to index as input
    - file_classifier.sources

max_runs_per_component: 100

metadata: {}

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| endpoint | str | | The endpoint of your Azure resource. |
| api_key | Secret | Secret.from_env_var('AZURE_AI_API_KEY') | The API key of your Azure resource. |
| model_id | str | prebuilt-read | The ID of the model you want to use. For a list of available models, see [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature). |
| preceding_context_len | int | 3 | Number of lines before a table to include as preceding context (this is added to the metadata). |
| following_context_len | int | 3 | Number of lines after a table to include as subsequent context (this is added to the metadata). |
| merge_multiple_column_headers | bool | True | If True, merges multiple column header rows into a single row. |
| page_layout | Literal['natural', 'single_column'] | natural | The type of reading order to follow. Possible options: natural: uses the natural reading order determined by Azure; single_column: groups all lines with the same height on the page based on a threshold determined by threshold_y. |
| threshold_y | Optional[float] | 0.05 | Only relevant if page_layout is set to single_column. The threshold, in inches, to determine if two recognized PDF elements are grouped into a single line. This is crucial for section headers or numbers which may be spatially separated from the rest of the text on the horizontal axis. |
| store_full_path | bool | False | If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored. |
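For multi-column PDFs where the natural reading order interleaves text from different columns, switching to single_column grouping can help. A sketch of the relevant init parameters (the endpoint placeholder is illustrative):

```yaml
ocr_converter:
  type: haystack.components.converters.azure.AzureOCRDocumentConverter
  init_parameters:
    endpoint: <your-azure-endpoint>
    model_id: prebuilt-read
    page_layout: single_column
    # Elements whose vertical positions differ by less than 0.05 inches
    # are grouped into a single line.
    threshold_y: 0.05
```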

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] | | List of file paths or ByteStream objects. |
| meta | Optional[List[Dict[str, Any]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents. |