OpenSearchHybridRetriever
Retrieve documents from OpenSearch using a combination of BM25 keyword search and embedding-based semantic search. The component runs both retrieval methods in parallel, then combines the results using a configurable join strategy. This hybrid approach typically provides better retrieval quality than using either method alone.
Key Features
- Combines BM25 keyword search and embedding-based semantic search in a single component.
- Configurable join strategies: Reciprocal Rank Fusion (RRF), concatenate, merge, and distribution-based rank fusion.
- Separate filter policies for BM25 and embedding retrieval.
- Configurable top-k limits for each retrieval method and for the final combined result.
- Built-in text embedder for converting queries to embeddings.
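As a rough illustration of how a rank-based join strategy can combine the two result lists, here is a minimal sketch of textbook Reciprocal Rank Fusion. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the sample document IDs are assumptions made for this example; the component's internal implementation may differ in its constant, weighting, and score normalization.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=None):
    """Fuse several ranked lists of document IDs into a single ranking.

    Hypothetical helper for illustration only; not the component's API.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; a higher position contributes a larger share.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep the order stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k] if top_k is not None else fused


# Made-up results standing in for the BM25 and embedding retrievers.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
embedding_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_hits, embedding_hits], top_k=3)
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Documents that appear near the top of both lists accumulate the highest fused score, which is why a document ranked second by BM25 and first by embedding search can outrank one that only a single method found.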
Configuration
- Drag the OpenSearchHybridRetriever component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Configure the embedder (a text embedder for semantic search, such as SentenceTransformersTextEmbedder).
  - Set top_k_bm25 and top_k_embedding to control the number of documents retrieved by each method.
  - Set top_k to limit the final number of combined documents returned.
  - Optionally configure filters_bm25 and filters_embedding.
- Go to the Advanced tab to configure fuzziness, scale_score, all_terms_must_match, filter_policy_bm25, filter_policy_embedding, custom_query_bm25, custom_query_embedding, join_mode, weights, and sort_by_score.
Connections
OpenSearchHybridRetriever receives a query string from the Input component. It outputs a documents list. Connect the documents output to a PromptBuilder, a Ranker, or directly to an LLM component.
Source Code
To check this component's source code, open open_search_hybrid_retriever.py in the Haystack Core Integrations repository.
Usage Examples
Basic Configuration
hybrid_retriever:
  type: haystack_integrations.components.retrievers.opensearch.open_search_hybrid_retriever.OpenSearchHybridRetriever
  init_parameters:
    document_store:
      type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
      init_parameters:
        index: default
        max_chunk_bytes: 104857600
        embedding_dim: 768
        return_embedding: false
        create_index: true
    embedder:
      type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
      init_parameters:
        normalize_embeddings: true
        model: intfloat/e5-base-v2
    fuzziness: AUTO
    top_k_bm25: 20
    scale_score: false
    all_terms_must_match: false
    filter_policy_bm25: replace
    top_k_embedding: 20
    filter_policy_embedding: replace
    join_mode: reciprocal_rank_fusion
    top_k: 10
    sort_by_score: true
This is an example RAG pipeline with OpenSearchHybridRetriever combining BM25 and embedding-based retrieval:
components:
  hybrid_retriever:
    type: haystack_integrations.components.retrievers.opensearch.open_search_hybrid_retriever.OpenSearchHybridRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'default'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      embedder:
        type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
        init_parameters:
          normalize_embeddings: true
          model: intfloat/e5-base-v2
      filters_bm25:
      fuzziness: AUTO
      top_k_bm25: 20
      scale_score: false
      all_terms_must_match: false
      filter_policy_bm25: replace
      custom_query_bm25:
      filters_embedding:
      top_k_embedding: 20
      filter_policy_embedding: replace
      custom_query_embedding:
      join_mode: reciprocal_rank_fusion
      weights:
      top_k: 10
      sort_by_score: true
  ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: intfloat/simlm-msmarco-reranker
      top_k: 8
  meta_field_grouping_ranker:
    type: haystack.components.rankers.meta_field_grouping_ranker.MetaFieldGroupingRanker
    init_parameters:
      group_by: file_id
      subgroup_by:
      sort_docs_by: split_id
  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm
  PromptBuilder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: " You are a technical expert.\n You answer questions truthfully based on provided documents.\n If the answer exists in several documents, summarize them.\n Ignore documents that don't contain the answer to the question.\n Only answer based on the documents provided. Don't make things up.\n If no information related to the question can be found in the document, say so.\n Always use references in the form [NUMBER OF DOCUMENT] when using information from a document, e.g. [3] for Document [3] .\n Never name the documents, only enter a number in square brackets as a reference.\n The reference must only refer to the number that comes in square brackets after the document.\n Otherwise, do not use brackets in your answer and reference ONLY the number of the document without mentioning the word document.\n\n These are the documents:\n {%- if documents|length > 0 %}\n {%- for document in documents %}\n Document [{{ loop.index }}] :\n Name of Source File: {{ document.meta.file_name }}\n {{ document.content }}\n {% endfor -%}\n {%- else %}\n No relevant documents found.\n Respond with \"Sorry, no matching documents were found, please adjust the filters or try a different question.\"\n {% endif %}\n\n Question: {{ question }}\n Answer:"
      required_variables:
      variables:
  OpenAIGenerator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - OPENAI_API_KEY
        strict: false
      model: gpt-5-mini
      streaming_callback:
      api_base_url:
      organization:
      system_prompt:
      generation_kwargs:
      timeout:
      max_retries:
      http_client_kwargs:
connections:
  - sender: hybrid_retriever.documents
    receiver: ranker.documents
  - sender: ranker.documents
    receiver: meta_field_grouping_ranker.documents
  - sender: meta_field_grouping_ranker.documents
    receiver: answer_builder.documents
  - sender: meta_field_grouping_ranker.documents
    receiver: PromptBuilder.documents
  - sender: PromptBuilder.prompt
    receiver: OpenAIGenerator.prompt
  - sender: OpenAIGenerator.replies
    receiver: answer_builder.replies
inputs:
  query:
    - "hybrid_retriever.query"
    - "ranker.query"
    - "PromptBuilder.question"
    - "answer_builder.query"
outputs:
  documents: "meta_field_grouping_ranker.documents"
  answers: "answer_builder.answers"
max_runs_per_component: 100
metadata: {}
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | | The query string to search for. |
| filters_bm25 | Optional[Dict[str, Any]] | None | Filters to apply during BM25 retrieval. |
| filters_embedding | Optional[Dict[str, Any]] | None | Filters to apply during embedding retrieval. |
| top_k | Optional[int] | None | Maximum number of documents to return from the combined results. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents retrieved and ranked using hybrid search. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | OpenSearchDocumentStore | | An instance of OpenSearchDocumentStore to use with the retriever. |
| embedder | TextEmbedder | | A text embedder to embed the query for semantic search. |
| filters_bm25 | Optional[Dict[str, Any]] | None | Default filters for BM25 retrieval. |
| fuzziness | Union[int, str] | "AUTO" | The fuzziness setting for BM25 retrieval. |
| top_k_bm25 | int | 10 | Number of documents to return from BM25 retrieval. |
| scale_score | bool | False | Whether to scale the BM25 scores. |
| all_terms_must_match | bool | False | Whether all query terms must match in BM25 retrieval. |
| filter_policy_bm25 | Union[str, FilterPolicy] | "replace" | How to apply runtime filters for BM25. Options: replace, merge. |
| custom_query_bm25 | Optional[Dict[str, Any]] | None | A custom OpenSearch query for BM25 retrieval. |
| filters_embedding | Optional[Dict[str, Any]] | None | Default filters for embedding retrieval. |
| top_k_embedding | int | 10 | Number of documents to return from embedding retrieval. |
| filter_policy_embedding | Union[str, FilterPolicy] | "replace" | How to apply runtime filters for embedding retrieval. Options: replace, merge. |
| custom_query_embedding | Optional[Dict[str, Any]] | None | A custom OpenSearch query for embedding retrieval. |
| join_mode | Union[str, JoinMode] | "reciprocal_rank_fusion" | How to combine results from both retrievers. Options: concatenate, merge, reciprocal_rank_fusion, distribution_based_rank_fusion. |
| weights | Optional[List[float]] | None | Weights for the joiner when combining results. |
| top_k | Optional[int] | None | Final number of documents to return after combining results. |
| sort_by_score | bool | True | Whether to sort the final results by score. |
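To make the difference between the replace and merge filter policies more concrete, the following is a simplified sketch. The helper `apply_policy`, the sample field names, and the choice to combine filters with a plain AND are assumptions for illustration; the library's actual merge logic may handle overlapping fields and nested conditions differently.

```python
from typing import Any, Dict, Optional

Filters = Optional[Dict[str, Any]]


def apply_policy(policy: str, init_filters: Filters, runtime_filters: Filters) -> Filters:
    """Hypothetical helper: decide which filters reach the retriever."""
    if runtime_filters is None:
        # Nothing passed at query time, so the defaults apply unchanged.
        return init_filters
    if policy == "replace" or init_filters is None:
        # Runtime filters take over completely.
        return runtime_filters
    if policy == "merge":
        # Simplification: require both filter sets to match.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown filter policy: {policy}")


# Made-up filters: one configured on the component, one passed at query time.
init = {"field": "meta.language", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2023}

replaced = apply_policy("replace", init, runtime)
merged = apply_policy("merge", init, runtime)
print(replaced)
print(merged)
```

With replace, the language filter configured at initialization is dropped whenever a runtime filter is passed; with merge, both constraints remain in effect.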
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | | The query string to search for. |
| filters_bm25 | Optional[Dict[str, Any]] | None | Filters to apply during BM25 retrieval. The way filters are applied depends on filter_policy_bm25. |
| filters_embedding | Optional[Dict[str, Any]] | None | Filters to apply during embedding retrieval. The way filters are applied depends on filter_policy_embedding. |
| top_k | Optional[int] | None | Maximum number of documents to return. Overrides the value set at initialization. |
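As a sketch of what query-time overrides might look like, the snippet below builds a dictionary of run parameters keyed by component name, which mirrors how a Haystack pipeline accepts per-component inputs. The helper `build_run_params`, the component name `hybrid_retriever` (taken from the example pipeline), and the filter values are assumptions; the exact request envelope expected by the API is not shown here.

```python
import json
from typing import Any, Dict, Optional


def build_run_params(
    component_name: str,
    filters_bm25: Optional[Dict[str, Any]] = None,
    filters_embedding: Optional[Dict[str, Any]] = None,
    top_k: Optional[int] = None,
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: collect only the overrides that were actually set."""
    overrides = {
        "filters_bm25": filters_bm25,
        "filters_embedding": filters_embedding,
        "top_k": top_k,
    }
    return {component_name: {k: v for k, v in overrides.items() if v is not None}}


# Made-up filter restricting both retrieval methods to one category.
category_filter = {"field": "meta.category", "operator": "==", "value": "manual"}

params = build_run_params(
    "hybrid_retriever",
    filters_bm25=category_filter,
    filters_embedding=category_filter,
    top_k=5,
)
print(json.dumps(params, indent=2))
```

Parameters left unset are omitted from the dictionary, so the values configured at initialization stay in effect for them.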