
AzureOpenAIDocumentEmbedder

Calculate document embeddings using OpenAI models deployed on Azure.

Basic Information

  • Type: haystack.components.embedders.azure_document_embedder.AzureOpenAIDocumentEmbedder
  • Components it can connect with:
    • PreProcessors: AzureOpenAIDocumentEmbedder can receive the documents to embed from a PreProcessor, like DocumentSplitter.
    • DocumentWriter: AzureOpenAIDocumentEmbedder can send the embedded documents to a DocumentWriter, which writes them into a document store.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | Required | A list of documents to embed. |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | A list of documents with embeddings. |
| meta | Dict[str, Any] | Information about the usage of the model, including model name and token usage. |

Overview

You can use AzureOpenAIDocumentEmbedder in your indexes to calculate vector representations (embeddings) of your documents. You need this to perform semantic-based retrieval, where you can search for documents that are similar to the user query. The retriever then compares the documents and query embeddings to find the most relevant documents.
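The comparison the retriever performs can be sketched with plain cosine similarity. This is a minimal illustration with toy three-dimensional vectors, not real embeddings; actual embeddings from a model like text-embedding-ada-002 have 1536 dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model-generated embeddings
query_embedding = [0.1, 0.8, 0.3]
document_embeddings = {
    "doc_about_cats": [0.1, 0.7, 0.4],
    "doc_about_planes": [0.9, 0.1, 0.2],
}

# Rank documents by similarity to the query; the most similar comes first
ranked = sorted(
    document_embeddings.items(),
    key=lambda item: cosine_similarity(query_embedding, item[1]),
    reverse=True,
)
print(ranked[0][0])  # the document closest to the query
```

Because both the documents and the query are mapped into the same vector space, documents whose embeddings point in a similar direction to the query embedding score highest.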

Embedding Models in Query Pipelines and Indexes

The embedding model you use to embed documents in your indexing pipeline must be the same as the embedding model you use to embed the query in your query pipeline.

This means the embedders for your indexing and query pipelines must match. For example, if you use CohereDocumentEmbedder to embed your documents, you should use CohereTextEmbedder with the same model to embed your queries.
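As a sketch in the pipeline YAML, a matching Azure OpenAI pair could look like this (the component names are illustrative; the important part is that both use the same `azure_deployment`):

```yaml
# Indexing pipeline: embeds documents
document_embedder:
  type: haystack.components.embedders.azure_document_embedder.AzureOpenAIDocumentEmbedder
  init_parameters:
    azure_deployment: "text-embedding-ada-002"

# Query pipeline: the text embedder counterpart, with the same model
query_embedder:
  type: haystack.components.embedders.azure_text_embedder.AzureOpenAITextEmbedder
  init_parameters:
    azure_deployment: "text-embedding-ada-002"
```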

Authentication

You need an Azure OpenAI API key to use this component. Connect deepset AI Platform to your Azure OpenAI account. For more information, see Using Azure OpenAI Models.
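If you run the component locally with Haystack rather than on the platform, it reads credentials from environment variables by default (the values below are placeholders):

```shell
# The api_key default is Secret.from_env_var("AZURE_OPENAI_API_KEY", strict=False),
# so exporting the variable is enough for authentication.
export AZURE_OPENAI_API_KEY="your-azure-openai-api-key"                 # placeholder
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"   # placeholder
```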

Usage Example

This is a simple index that uses AzureOpenAIDocumentEmbedder to embed the documents and write them into an OpenSearch document store.

```yaml
components:
  ...
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30

  document_embedder:
    type: haystack.components.embedders.azure_document_embedder.AzureOpenAIDocumentEmbedder
    init_parameters:
      azure_deployment: "text-embedding-ada-002" # the name of the model deployment you want to use

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 1536 # must match the embedding model's output dimension (1536 for text-embedding-ada-002)
          similarity: cosine
      policy: OVERWRITE

connections: # Defines how the components are connected
  ...
  - sender: splitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
```

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| azure_endpoint | Optional[str] | None | The endpoint of the model deployed on Azure. |
| api_version | Optional[str] | 2023-05-15 | The version of the API to use. |
| azure_deployment | str | text-embedding-ada-002 | The name of the model deployed on Azure. |
| dimensions | Optional[int] | None | The number of dimensions of the resulting embeddings. Only supported in text-embedding-3 and later models. |
| api_key | Optional[Secret] | Secret.from_env_var("AZURE_OPENAI_API_KEY", strict=False) | The Azure OpenAI API key. You can set it with the AZURE_OPENAI_API_KEY environment variable or pass it with this parameter during initialization. |
| azure_ad_token | Optional[Secret] | Secret.from_env_var("AZURE_OPENAI_AD_TOKEN", strict=False) | Microsoft Entra ID token (previously called Azure Active Directory); see Microsoft's Entra ID documentation for more information. You can set it with the AZURE_OPENAI_AD_TOKEN environment variable or pass it with this parameter during initialization. |
| organization | Optional[str] | None | Your organization ID. See OpenAI's Setting Up Your Organization for more information. |
| prefix | str | "" | A string to add at the beginning of each text. |
| suffix | str | "" | A string to add at the end of each text. |
| batch_size | int | 32 | Number of documents to embed at once. |
| progress_bar | bool | True | If True, shows a progress bar when running. |
| meta_fields_to_embed | Optional[List[str]] | None | List of metadata fields to embed along with the document text. |
| embedding_separator | str | \n | Separator used to concatenate the metadata fields to the document text. |
| timeout | Optional[float] | None | The timeout for AzureOpenAI client calls, in seconds. If not set, defaults to the OPENAI_TIMEOUT environment variable or 30 seconds. |
| max_retries | Optional[int] | None | Maximum number of retries to contact AzureOpenAI after an internal error. If not set, defaults to the OPENAI_MAX_RETRIES environment variable or 5 retries. |
| default_headers | Optional[Dict[str, str]] | None | Default headers to send to the AzureOpenAI client. |
| azure_ad_token_provider | Optional[AzureADTokenProvider] | None | A function that returns an Azure Active Directory token, invoked on every request. |
| http_client_kwargs | Optional[Dict[str, Any]] | None | A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient. For more information, see the HTTPX documentation. |
| raise_on_failure | bool | False | Whether to raise an exception if the embedding request fails. If False, the component logs the error and continues processing the remaining documents. If True, it raises an exception on failure. |
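As a sketch, several of these parameters set together in a pipeline YAML (the values are illustrative, not recommendations):

```yaml
document_embedder:
  type: haystack.components.embedders.azure_document_embedder.AzureOpenAIDocumentEmbedder
  init_parameters:
    azure_deployment: "text-embedding-3-small"
    dimensions: 768                  # only supported in text-embedding-3 and later models
    batch_size: 16                   # embed 16 documents per API call
    timeout: 60.0                    # seconds before a client call times out
    max_retries: 3
    meta_fields_to_embed: ["title"]  # embed the title metadata field along with the text
    embedding_separator: "\n"        # joins the title and the document text
```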

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | Required | A list of documents to embed. |