AzureOCRDocumentConverter Parameters

Check the init and runtime parameters you can pass to this component.

YAML Init Parameters

You can specify the following parameters for AzureOCRDocumentConverter in the pipeline YAML:

| Parameter | Type | Possible values | Description |
| --- | --- | --- | --- |
| endpoint | String | | The endpoint of your Azure resource. Required. |
| api_key | String | Uses the AZURE_AI_API_KEY environment variable by default. | The API key to connect to your Azure resource. Required. |
| model_id | String | Default: prebuilt-read | The ID of the model you want to use to convert files to documents. For a list of supported models, see the Microsoft documentation. Required. |
| preceding_context_len | Integer | Default: 3 | The number of lines before a table to extract as its preceding context. Required. |
| following_context_len | Integer | Default: 3 | The number of lines after a table to extract as its subsequent context. Required. |
| merge_multiple_column_headers | Boolean | True, False. Default: True | If a table has more than one header row, specifies whether to merge the header rows into a single row. Required. |
| page_layout | Literal | natural, single_column. Default: natural | Specifies the reading order to follow. natural: follows a natural reading order determined by Azure. single_column: groups all lines with the same height on the page, based on the threshold set in threshold_y. Required. |
| threshold_y | Float | Default: 0.05 | The threshold, in inches, that determines whether two recognized elements in a PDF are grouped into a single line. This is especially relevant for section headers or numbers, which may be horizontally separated from the remaining text. Only relevant if page_layout is set to single_column. Optional. |
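
To illustrate how threshold_y affects the single_column layout, here is a minimal sketch of grouping recognized elements into lines by their vertical position. It is not the converter's actual implementation: the Element structure, the group_into_lines function, and the choice to compare top edges only are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A recognized text element (hypothetical structure); coordinates in inches."""
    text: str
    x: float  # left edge
    y: float  # top edge


def group_into_lines(elements, threshold_y=0.05):
    """Group elements whose top edges differ by less than threshold_y inches."""
    lines = []
    for el in sorted(elements, key=lambda e: e.y):
        # Compare against the first element of the line currently being built.
        if lines and abs(el.y - lines[-1][0].y) < threshold_y:
            lines[-1].append(el)
        else:
            lines.append([el])
    # Within each line, read the elements left to right.
    return [" ".join(e.text for e in sorted(line, key=lambda e: e.x)) for line in lines]


elements = [
    Element("Introduction", x=1.50, y=2.02),
    Element("1.", x=1.00, y=2.00),
    Element("Body text starts here.", x=1.00, y=2.40),
]
lines = group_into_lines(elements)
# The section number and its header end up on the same line:
# ["1. Introduction", "Body text starts here."]
```

With a smaller threshold, such as 0.01, the section number and the header would no longer be merged, because their top edges are about 0.02 inches apart.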

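The effect of preceding_context_len and following_context_len can be sketched as slicing the lines around a table. This is an illustration only, assuming the page is available as a list of text lines and that the table's first and last line indices are known. The table_context function and its inputs are hypothetical and not part of the component's API.

```python
def table_context(lines, table_start, table_end,
                  preceding_context_len=3, following_context_len=3):
    """Return the text before and after a table as a (preceding, following) pair.

    lines: all text lines of the page in reading order (hypothetical input).
    table_start, table_end: indices of the first and last line of the table.
    """
    # Clamp at 0 so a table near the top of the page does not wrap around.
    before = lines[max(0, table_start - preceding_context_len):table_start]
    after = lines[table_end + 1:table_end + 1 + following_context_len]
    return "\n".join(before), "\n".join(after)


page_lines = [
    "Annual report",
    "Section 2",
    "Revenue by region",
    "Table 1 shows the totals.",
    "Region | Revenue",   # table starts here (index 4)
    "EMEA | 10",          # table ends here (index 5)
    "Source: internal data.",
    "Next section",
]
preceding, following = table_context(page_lines, table_start=4, table_end=5)
# preceding holds the three lines before the table,
# following holds the two lines that remain after it.
```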

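As an illustration of what merge_multiple_column_headers does conceptually, the sketch below merges several header rows column by column into one row. The merge_header_rows function and the space separator are assumptions for this example; the component may join the header cells differently.

```python
def merge_header_rows(header_rows, separator=" "):
    """Merge multiple header rows into a single row, column by column.

    header_rows: list of rows, each a list of cell strings of equal length.
    Empty cells are skipped so they do not leave stray separators.
    """
    merged = []
    for column_cells in zip(*header_rows):
        parts = [cell.strip() for cell in column_cells if cell and cell.strip()]
        merged.append(separator.join(parts))
    return merged


header_rows = [
    ["Region", "Revenue", "Revenue"],
    ["", "2022", "2023"],
]
merged = merge_header_rows(header_rows)
# merged == ["Region", "Revenue 2022", "Revenue 2023"]
```
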
REST API Runtime Parameters

There are no runtime parameters you can pass to this component when making a request to the Search REST API endpoint.