OpenSearchEmbeddingRetriever
Retrieve documents from the OpenSearchDocumentStore using a vector similarity metric.
Basic Information
- Type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
- Components it can connect with:
  - Text Embedders: OpenSearchEmbeddingRetriever receives the query embedding from a text embedder such as SentenceTransformersTextEmbedder or OpenAITextEmbedder.
  - PromptBuilder: OpenSearchEmbeddingRetriever can send retrieved documents to PromptBuilder to be used in a prompt.
  - Ranker: OpenSearchEmbeddingRetriever can send retrieved documents to a Ranker to reorder them by relevance.
  - DocumentJoiner: OpenSearchEmbeddingRetriever can send documents to DocumentJoiner to combine them with documents from other retrievers.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| query_embedding | List[float] |  | Embedding of the query. |
| filters | Optional[Dict[str, Any]] | None | Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents. The way runtime filters are applied depends on the filter_policy selected when initializing the Retriever. |
| top_k | Optional[int] | None | Maximum number of documents to return. |
| custom_query | Optional[Dict[str, Any]] | None | A custom OpenSearch query containing a mandatory $query_embedding placeholder and an optional $filters placeholder. An example custom_query: `{"query": {"bool": {"must": [{"knn": {"embedding": {"vector": "$query_embedding", "k": 10000}}}], "filter": "$filters"}}`. An example run() for this custom_query: `retriever.run(query_embedding=embedding, filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})`. |
| efficient_filtering | Optional[bool] | None | If True, the filter will be applied during the approximate kNN search. This is only supported for knn engines "faiss" and "lucene" and does not work with the default "nmslib". |
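To make the placeholder mechanism in custom_query concrete, the sketch below shows how the $query_embedding and $filters placeholders are conceptually substituted into the query at run time. The substitute_placeholders helper is made up for illustration; it is not the integration's actual implementation.

```python
from typing import Any


def substitute_placeholders(node: Any, query_embedding: list, filters: dict) -> Any:
    """Recursively replace the $query_embedding and $filters placeholders
    in a custom_query structure. Illustrative only."""
    if node == "$query_embedding":
        return query_embedding
    if node == "$filters":
        return filters
    if isinstance(node, dict):
        return {k: substitute_placeholders(v, query_embedding, filters) for k, v in node.items()}
    if isinstance(node, list):
        return [substitute_placeholders(v, query_embedding, filters) for v in node]
    return node


custom_query = {
    "query": {
        "bool": {
            "must": [{"knn": {"embedding": {"vector": "$query_embedding", "k": 10000}}}],
            "filter": "$filters",
        }
    }
}

resolved = substitute_placeholders(
    custom_query,
    query_embedding=[0.1, 0.2, 0.3],
    filters={"field": "meta.years", "operator": "==", "value": "2019"},
)
```

After substitution, the knn clause carries the actual query vector and the filter clause carries the filter dictionary, which is the shape OpenSearch expects.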
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of documents similar to the query embedding. |
Overview
OpenSearchEmbeddingRetriever is an embedding-based retriever compatible with the OpenSearchDocumentStore. It compares the query and document embeddings and fetches the documents most relevant to the query from the document store based on vector similarity.
When using OpenSearchEmbeddingRetriever in your pipeline, make sure it has the query and document embeddings available. Add a document embedder to your indexing pipeline and a text embedder to your query pipeline to create these embeddings.
In addition to the query_embedding, the retriever accepts other optional parameters, including top_k (the maximum number of documents to retrieve) and filters to narrow down the search space.
The embedding_dim for storing and retrieving embeddings must be defined when the corresponding OpenSearchDocumentStore is initialized.
If you want exact keyword matching instead of semantic similarity, use the OpenSearchBM25Retriever.
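To illustrate what "fetches the documents most relevant to the query based on vector similarity" means, here is a minimal plain-Python sketch of cosine-similarity top-k ranking. It is not the approximate kNN search that OpenSearch actually performs, and the helper names are invented for illustration.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, documents, top_k=10):
    """Rank (doc_id, embedding) pairs by cosine similarity to the query
    and return the top_k best matches."""
    scored = [(doc_id, cosine_similarity(query_embedding, emb)) for doc_id, emb in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


docs = [
    ("doc1", [1.0, 0.0]),
    ("doc2", [0.7, 0.7]),
    ("doc3", [0.0, 1.0]),
]
results = retrieve_top_k([1.0, 0.1], docs, top_k=2)  # doc1 is closest, then doc2
```

In a real pipeline, the embeddings come from the document embedder (at indexing time) and the text embedder (at query time), and the similarity metric is whatever the OpenSearchDocumentStore was configured with (for example, cosine).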
Usage Example
Using the Component in a Pipeline
This is an example of a semantic search pipeline where OpenSearchEmbeddingRetriever receives the query embedding from a text embedder and retrieves matching documents.
```yaml
components:
  text_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      trust_remote_code: false
  OpenSearchEmbeddingRetriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      top_k: 10
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      efficient_filtering: true
connections:
  - sender: text_embedder.embedding
    receiver: OpenSearchEmbeddingRetriever.query_embedding
max_runs_per_component: 100
metadata: {}
inputs:
  query:
    - text_embedder.text
  filters:
    - OpenSearchEmbeddingRetriever.filters
outputs:
  documents: OpenSearchEmbeddingRetriever.documents
```
Using in a RAG Pipeline
This example shows a RAG pipeline that uses OpenSearchEmbeddingRetriever to find relevant documents, then passes them to a generator to answer a question.
```yaml
components:
  text_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      trust_remote_code: false
  retriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      top_k: 10
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      efficient_filtering: true
  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      required_variables: "*"
      template: |-
        Given the following documents, answer the question.
        Documents:
        {% for document in documents %}
        {{ document.content }}
        {% endfor %}
        Question: {{ question }}
        Answer:
  generator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - OPENAI_API_KEY
        strict: true
      model: gpt-4o-mini
      generation_kwargs:
  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm
connections:
  - sender: text_embedder.embedding
    receiver: retriever.query_embedding
  - sender: retriever.documents
    receiver: prompt_builder.documents
  - sender: prompt_builder.prompt
    receiver: generator.prompt
  - sender: generator.replies
    receiver: answer_builder.replies
  - sender: retriever.documents
    receiver: answer_builder.documents
  - sender: prompt_builder.prompt
    receiver: answer_builder.prompt
max_runs_per_component: 100
metadata: {}
inputs:
  query:
    - text_embedder.text
    - prompt_builder.question
    - answer_builder.query
  filters:
    - retriever.filters
outputs:
  documents: retriever.documents
  answers: answer_builder.answers
```
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | OpenSearchDocumentStore |  | An instance of OpenSearchDocumentStore to use with the Retriever. |
| filters | Optional[Dict[str, Any]] | None | Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents. |
| top_k | int | 10 | Maximum number of documents to return. |
| filter_policy | Union[str, FilterPolicy] | FilterPolicy.REPLACE | Policy to determine how filters are applied. Possible options: - merge: Runtime filters are merged with initialization filters. - replace: Runtime filters replace initialization filters. Use this policy to change the filtering scope. |
| custom_query | Optional[Dict[str, Any]] | None | The custom OpenSearch query containing a mandatory $query_embedding placeholder and an optional $filters placeholder. An example custom_query: `{"query": {"bool": {"must": [{"knn": {"embedding": {"vector": "$query_embedding", "k": 10000}}}], "filter": "$filters"}}`. An example run() for this custom_query: `retriever.run(query_embedding=embedding, filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})`. |
| raise_on_failure | bool | True | If True, raises an exception if the API call fails. If False, logs a warning and returns an empty list. |
| efficient_filtering | bool | False | If True, the filter will be applied during the approximate kNN search. This is only supported for knn engines "faiss" and "lucene" and does not work with the default "nmslib". |
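The filter_policy options (merge and replace) can be illustrated with a small sketch. The apply_filter_policy function below is a simplified stand-in for the library's actual behavior: replace discards the initialization filters when runtime filters are provided, while merge combines both sets of conditions.

```python
from typing import Optional


def apply_filter_policy(
    policy: str,
    init_filters: Optional[dict],
    runtime_filters: Optional[dict],
) -> Optional[dict]:
    """Illustrative sketch of the two filter policies; not the library's actual code."""
    if policy == "replace":
        # Runtime filters win; fall back to init filters if none were given.
        return runtime_filters if runtime_filters is not None else init_filters
    if policy == "merge":
        if init_filters is None:
            return runtime_filters
        if runtime_filters is None:
            return init_filters
        # Combine both filter sets so a document must satisfy both.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown filter policy: {policy}")


init_filters = {"field": "meta.years", "operator": "==", "value": "2019"}
runtime_filters = {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}

replaced = apply_filter_policy("replace", init_filters, runtime_filters)
merged = apply_filter_policy("merge", init_filters, runtime_filters)
```

With replace, only the quarters condition applies at query time; with merge, documents must match both the years and the quarters conditions.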
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| query_embedding | List[float] |  | Embedding of the query. |
| filters | Optional[Dict[str, Any]] | None | Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents. The way runtime filters are applied depends on the filter_policy selected when initializing the Retriever. |
| top_k | Optional[int] | None | Maximum number of documents to return. |
| custom_query | Optional[Dict[str, Any]] | None | A custom OpenSearch query containing a mandatory $query_embedding placeholder and an optional $filters placeholder. An example custom_query: `{"query": {"bool": {"must": [{"knn": {"embedding": {"vector": "$query_embedding", "k": 10000}}}], "filter": "$filters"}}`. An example run() for this custom_query: `retriever.run(query_embedding=embedding, filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})`. |
| efficient_filtering | Optional[bool] | None | If True, the filter will be applied during the approximate kNN search. This is only supported for knn engines "faiss" and "lucene" and does not work with the default "nmslib". |