AzureOCRDocumentConverter

Convert files to documents using Azure's Document Intelligence service.

Basic Information

  • Type: haystack_integrations.converters.azure.AzureOCRDocumentConverter
  • Components it can connect with:
    • FileTypeRouter: AzureOCRDocumentConverter can receive sources from a FileTypeRouter.
  • DocumentJoiner: AzureOCRDocumentConverter can send the converted documents to a DocumentJoiner that joins documents from all the converters in the pipeline.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] | | List of file paths or ByteStream objects. |
| meta | Optional[List[Dict[str, Any]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents. |
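The broadcast-vs-zip behavior of meta can be sketched in plain Python. This is an illustration of the rule described above, not the component's actual code, and the helper name normalize_meta is made up for this sketch:

```python
# Sketch of how the `meta` input maps onto `sources`: a single dict is
# broadcast to every source, while a list is zipped 1:1 with the sources
# and must match their length.
from typing import Any, Dict, List, Optional, Union


def normalize_meta(
    sources: List[str],
    meta: Union[Dict[str, Any], List[Dict[str, Any]], None],
) -> List[Dict[str, Any]]:
    if meta is None:
        return [{} for _ in sources]
    if isinstance(meta, dict):
        # Same metadata is attached to every produced Document.
        return [dict(meta) for _ in sources]
    if len(meta) != len(sources):
        raise ValueError("The length of the meta list must match the number of sources.")
    # One metadata dict per source, in order.
    return list(meta)


# A single dict is applied to every source:
print(normalize_meta(["a.pdf", "b.pdf"], {"language": "en"}))
# A list is zipped with the sources:
print(normalize_meta(["a.pdf", "b.pdf"], [{"page": 1}, {"page": 2}]))
```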

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | The output documents. |
| raw_azure_response | List[Dict] | List of raw Azure responses used to create the documents. |

Overview

AzureOCRDocumentConverter takes a list of file paths or ByteStream objects and converts them to Haystack Document objects using Azure's Document Intelligence service. Optionally, you can attach metadata to the documents using the meta input. It supports the following file formats: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.
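Because the supported formats include scanned-image types, you can route image MIME types to this converter alongside PDFs. A minimal routing sketch (the component names and endpoint placeholder are illustrative):

```yaml
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - application/pdf
        - image/jpeg
        - image/png
  ocr_converter:
    type: haystack.components.converters.azure.AzureOCRDocumentConverter
    init_parameters:
      endpoint: <your-azure-endpoint>
      model_id: prebuilt-read

connections:
  # FileTypeRouter exposes one output per configured MIME type
  - sender: file_classifier.application/pdf
    receiver: ocr_converter.sources
  - sender: file_classifier.image/jpeg
    receiver: ocr_converter.sources
  - sender: file_classifier.image/png
    receiver: ocr_converter.sources
```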

Authorization

To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see Azure documentation.

Connect deepset to your Azure account on the Integrations page.

Add Workspace-Level Integration

  1. Click your profile icon and choose Settings.
  2. Go to Workspace>Integrations.
  3. Find the provider you want to connect and click Connect next to it.
  4. Enter the API key and any other required details.
  5. Click Connect. You can use this integration in pipelines and indexes in the current workspace.

Add Organization-Level Integration

  1. Click your profile icon and choose Settings.
  2. Go to Organization>Integrations.
  3. Find the provider you want to connect and click Connect next to it.
  4. Enter the API key and any other required details.
  5. Click Connect. You can use this integration in pipelines and indexes in all workspaces in the current organization.

Usage Example

Initializing the Component

components:
  AzureOCRDocumentConverter:
    type: haystack.components.converters.azure.AzureOCRDocumentConverter
    init_parameters:
      endpoint: <your-azure-endpoint>
      api_key: <your-azure-api-key>
      model_id: prebuilt-read
      page_layout: natural

Using the Component in an Index

This example shows how to use AzureOCRDocumentConverter in an index. A FileTypeRouter classifies the incoming files by MIME type and routes PDFs to AzureOCRDocumentConverter, which converts them to documents.

components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - text/csv

  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  ocr_converter:
    type: haystack.components.converters.azure.AzureOCRDocumentConverter
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["AZURE_AI_API_KEY"], "strict": false}
      endpoint: "YOUR-ENDPOINT"
      model_id: "prebuilt-read"
      page_layout: "natural"

  markdown_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
      # For the full list of available arguments, see
      # the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
      extraction_kwargs:
        output_format: markdown # Extract text from HTML. You can also choose "txt"
        target_language: # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
        include_tables: true # If true, includes tables in the output
        include_links: true # If true, keeps links along with their targets

  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters:
      link_format: markdown

  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}

  xlsx_converter:
    type: deepset_cloud_custom_nodes.converters.xlsx.XLSXToDocument
    init_parameters: {}

  csv_converter:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8

  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  joiner_xlsx: # merge split documents with non-split xlsx documents
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

connections: # Defines how the components are connected
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.application/pdf
    receiver: ocr_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
    receiver: docx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: file_classifier.text/csv
    receiver: csv_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: ocr_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: docx_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: joiner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: joiner_xlsx.documents
  - sender: xlsx_converter.documents
    receiver: joiner_xlsx.documents
  - sender: csv_converter.documents
    receiver: joiner_xlsx.documents
  - sender: joiner_xlsx.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents

inputs: # Define the inputs for your pipeline
  files: # This component will receive the files to index as input
    - file_classifier.sources

max_runs_per_component: 100

metadata: {}

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| endpoint | str | | The endpoint of your Azure resource. |
| api_key | Secret | Secret.from_env_var('AZURE_AI_API_KEY') | The API key of your Azure resource. |
| model_id | str | prebuilt-read | The ID of the model you want to use. For a list of available models, see [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature). |
| preceding_context_len | int | 3 | Number of lines before a table to include as preceding context (this is added to the metadata). |
| following_context_len | int | 3 | Number of lines after a table to include as subsequent context (this is added to the metadata). |
| merge_multiple_column_headers | bool | True | If True, merges multiple column header rows into a single row. |
| page_layout | Literal['natural', 'single_column'] | natural | The type of reading order to follow. Possible options: natural: uses the natural reading order determined by Azure; single_column: groups all lines with the same height on the page based on a threshold determined by threshold_y. |
| threshold_y | Optional[float] | 0.05 | Only relevant if page_layout is set to single_column. The threshold, in inches, to determine if two recognized PDF elements are grouped into a single line. This is crucial for section headers or numbers which may be spatially separated from the rest of the text on the horizontal axis. |
| store_full_path | bool | False | If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored. |
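For multi-column PDFs where the natural reading order interleaves text from different columns, switching to single_column grouping can help. A sketch of the relevant init parameters (the endpoint placeholder is illustrative):

```yaml
ocr_converter:
  type: haystack.components.converters.azure.AzureOCRDocumentConverter
  init_parameters:
    endpoint: <your-azure-endpoint>
    model_id: prebuilt-read
    page_layout: single_column
    # Elements whose vertical positions differ by less than 0.05 inches
    # are grouped into a single line.
    threshold_y: 0.05
```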

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sources | List[Union[str, Path, ByteStream]] | | List of file paths or ByteStream objects. |
| meta | Optional[List[Dict[str, Any]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents. |