OpenSearchBM25Retriever
Fetch documents from OpenSearchDocumentStore using the keyword-based BM25 algorithm. It determines similarity between documents and the query based on weighted word overlap, making it well-suited for exact matches on names, IDs, and error messages.
Key Features
- Keyword-based retrieval using the BM25 algorithm.
- Supports fuzzy matching to tolerate typos and small spelling variations.
- Filters can be set at initialization and overridden at query time.
- Supports custom OpenSearch queries for advanced use cases.
- Lightweight and fast, often competitive with embedding-based approaches on out-of-domain data.
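To make "weighted word overlap" concrete, here is a minimal sketch of a textbook BM25 scoring function in plain Python. It is not how OpenSearch computes scores internally. The function name bm25_scores, the whitespace tokenizer, and the sample documents are assumptions made for illustration; k1=1.2 and b=0.75 are commonly used default values.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no overlap on this term, no contribution
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "error E1234 raised while parsing config",
    "the parser handles configuration files",
    "unrelated text about cooking pasta",
]
scores = bm25_scores("error E1234", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the first document shares terms with the query, so it is the only one with a non-zero score. This is why keyword retrieval works well for exact identifiers such as error codes.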
Configuration
- Drag the OpenSearchBM25Retriever component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab:
  - Select the document store. The document store determines where documents are retrieved from.
- Go to the Advanced tab to configure top_k, fuzziness, all_terms_must_match, scale_score, custom_query, and filter_policy.
Connections
OpenSearchBM25Retriever accepts a query string as input, along with optional filters, top_k, fuzziness, scale_score, all_terms_must_match, and custom_query. It outputs documents, a list of the retrieved documents that match the query.
Typically, you connect OpenSearchBM25Retriever to a PromptBuilder, Ranker, or DocumentJoiner. If you need semantic similarity matching instead of keyword matching, use OpenSearchEmbeddingRetriever.
Usage Example
Using the Component in a Pipeline
This is an example of a simple keyword search pipeline where OpenSearchBM25Retriever retrieves documents matching the query.
components:
  OpenSearchBM25Retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      fuzziness: AUTO
      top_k: 10
      scale_score: false
      all_terms_must_match: false
      filter_policy: replace
      custom_query:
      raise_on_failure: true
connections: []
max_runs_per_component: 100
metadata: {}
inputs:
  query:
    - OpenSearchBM25Retriever.query
  filters:
    - OpenSearchBM25Retriever.filters
outputs:
  documents: OpenSearchBM25Retriever.documents
Using in a Hybrid Search Pipeline
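This example shows a hybrid search pipeline that combines OpenSearchBM25Retriever (keyword search) with OpenSearchEmbeddingRetriever (semantic search). SentenceTransformersTextEmbedder embeds the query for the embedding retriever, and DocumentJoiner concatenates the results of both retrievers into a single list.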
components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      fuzziness: AUTO
      top_k: 20
      scale_score: false
      all_terms_must_match: false
      filter_policy: replace
      custom_query:
      raise_on_failure: true
  embedding_retriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      top_k: 20
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      efficient_filtering: true
  text_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      trust_remote_code: false
  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
connections:
  - sender: bm25_retriever.documents
    receiver: document_joiner.documents
  - sender: embedding_retriever.documents
    receiver: document_joiner.documents
  - sender: text_embedder.embedding
    receiver: embedding_retriever.query_embedding
max_runs_per_component: 100
metadata: {}
inputs:
  query:
    - bm25_retriever.query
    - text_embedder.text
  filters:
    - bm25_retriever.filters
    - embedding_retriever.filters
outputs:
  documents: document_joiner.documents
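As a rough illustration of what happens when the two result lists meet in the joiner, the following plain-Python sketch concatenates them and, assuming duplicates are resolved by keeping the highest-scored copy, returns one entry per document ID. The Doc class, the concatenate function, and the sample scores are hypothetical stand-ins and not the DocumentJoiner implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    id: str
    content: str
    score: float


def concatenate(*result_lists):
    """Merge result lists, keeping the highest-scored copy of each document ID."""
    best = {}
    for results in result_lists:
        for doc in results:
            if doc.id not in best or doc.score > best[doc.id].score:
                best[doc.id] = doc
    return list(best.values())


# Made-up results: document "b" is found by both retrievers.
bm25_hits = [Doc("a", "error E1234 in parser", 7.2), Doc("b", "parser overview", 3.1)]
embedding_hits = [Doc("b", "parser overview", 0.83), Doc("c", "config loading failures", 0.79)]

joined = concatenate(bm25_hits, embedding_hits)
print([(doc.id, doc.score) for doc in joined])
```

BM25 scores and embedding similarity scores are on different scales, so the joined list is not meaningfully sorted by relevance. Passing it to a Ranker afterwards is one way to get a consistent ordering.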
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | | The query string. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy specified at Retriever's initialization. |
| all_terms_must_match | Optional[bool] | None | If True, all terms in the query string must be present in the retrieved documents. |
| top_k | Optional[int] | None | Maximum number of documents to return. |
| fuzziness | Optional[Union[int, str]] | None | Fuzziness parameter for full-text queries to apply approximate string matching. For more information, see OpenSearch fuzzy query. |
| scale_score | Optional[bool] | None | If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes. |
| custom_query | Optional[Dict[str, Any]] | None | A custom OpenSearch query. It must include a $query placeholder and may optionally include a $filters placeholder. An example custom_query: `{"query": {"bool": {"should": [{"multi_match": {"query": "$query", "type": "most_fields", "fields": ["content", "title"]}}], "filter": "$filters"}}}`, where "$query" is the mandatory query placeholder and "$filters" is the optional filter placeholder. For this custom_query, a sample run() call could be: `retriever.run(query="Why did the revenue increase?", filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})` |
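The interaction between runtime filters and filter_policy can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the helper apply_policy is hypothetical, and it assumes that merging combines both filter sets under an AND operator.

```python
def apply_policy(policy, init_filters, runtime_filters):
    """Hypothetical helper: decide which filters a query is run with."""
    if runtime_filters is None:
        # Nothing passed at query time: fall back to the initialization filters.
        return init_filters
    if policy == "replace" or init_filters is None:
        # Runtime filters take over completely.
        return runtime_filters
    if policy == "merge":
        # Assumption: both filter sets must hold, so combine them with AND.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown filter policy: {policy}")


init_filters = {"field": "meta.years", "operator": "==", "value": "2019"}
runtime_filters = {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}

replaced = apply_policy("replace", init_filters, runtime_filters)
merged = apply_policy("merge", init_filters, runtime_filters)
print(replaced)
print(merged)
```

With replace, the year restriction set at initialization is dropped for this query. With merge, both the year and the quarter conditions apply.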
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of retrieved documents. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | OpenSearchDocumentStore | | An instance of OpenSearchDocumentStore to use with the Retriever. |
| filters | Optional[Dict[str, Any]] | None | Filters to narrow down the search for documents in the Document Store. |
| fuzziness | Union[int, str] | AUTO | Determines how approximate string matching is applied in full-text queries. This parameter sets the maximum number of character edits (insertions, deletions, or substitutions) allowed when matching a query term against an indexed term. For example, the edit distance between the words "wined" and "wind" is 1 because only one edit is needed to turn one into the other. Use "AUTO" (the default) to adjust the allowed number of edits automatically based on term length, which works well in most scenarios. For detailed guidance, refer to the OpenSearch fuzzy query documentation. |
| top_k | int | 10 | Maximum number of documents to return. |
| scale_score | bool | False | If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes. |
| all_terms_must_match | bool | False | If True, all terms in the query string must be present in the retrieved documents. This is useful when searching for short text where even one term can make a difference. |
| filter_policy | Union[str, FilterPolicy] | FilterPolicy.REPLACE | Policy to determine how filters are applied. Possible options are replace and merge. With replace, runtime filters replace initialization filters; use this policy to change the filtering scope for specific queries. With merge, runtime filters are merged with initialization filters. |
| custom_query | Optional[Dict[str, Any]] | None | The query containing a mandatory $query placeholder and an optional $filters placeholder. An example custom_query: `{"query": {"bool": {"should": [{"multi_match": {"query": "$query", "type": "most_fields", "fields": ["content", "title"]}}], "filter": "$filters"}}}`, where "$query" is the mandatory query placeholder and "$filters" is the optional filter placeholder. An example run() call for this custom_query: `retriever.run(query="Why did the revenue increase?", filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})` |
| raise_on_failure | bool | True | Whether to raise an exception if the API call fails. Otherwise log a warning and return an empty list. |
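The edit-distance idea behind fuzziness can be checked with a short, self-contained sketch. It uses plain Levenshtein distance, which is a simplification because OpenSearch can also count a transposition of two adjacent characters as a single edit. The function names are hypothetical, and auto_fuzziness assumes the default "AUTO" length thresholds of 3 and 6.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def auto_fuzziness(term):
    """Edits allowed for a term, assuming the default AUTO thresholds (3 and 6)."""
    if len(term) < 3:
        return 0  # very short terms must match exactly
    if len(term) < 6:
        return 1
    return 2


def within_fuzziness(query_term, indexed_term):
    """True if the indexed term is close enough to the query term to match."""
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


print(edit_distance("wined", "wind"))
print(within_fuzziness("retriver", "retriever"))
print(within_fuzziness("id", "is"))
```

Under these assumptions, a misspelled eight-letter term such as "retriver" still matches "retriever", while a two-letter term such as "id" only matches itself.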
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | | The query string. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy specified at Retriever's initialization. |
| all_terms_must_match | Optional[bool] | None | If True, all terms in the query string must be present in the retrieved documents. |
| top_k | Optional[int] | None | Maximum number of documents to return. |
| fuzziness | Optional[Union[int, str]] | None | Fuzziness parameter for full-text queries to apply approximate string matching. For more information, see OpenSearch fuzzy query. |
| scale_score | Optional[bool] | None | If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes. |
| custom_query | Optional[Dict[str, Any]] | None | A custom OpenSearch query. It must include a $query placeholder and may optionally include a $filters placeholder. An example custom_query: `{"query": {"bool": {"should": [{"multi_match": {"query": "$query", "type": "most_fields", "fields": ["content", "title"]}}], "filter": "$filters"}}}`, where "$query" is the mandatory query placeholder and "$filters" is the optional filter placeholder. For this custom_query, a sample run() call could be: `retriever.run(query="Why did the revenue increase?", filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})` |
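To show how the $query and $filters placeholders in a custom_query relate to the values passed at query time, here is a minimal sketch of placeholder substitution in plain Python. The function fill_placeholders is hypothetical. It also passes the filters through unchanged, whereas the actual component is expected to translate them into OpenSearch filter syntax first, so treat the output as illustrative only.

```python
import json


def fill_placeholders(node, query, filters):
    """Return a copy of the query template with placeholder values replaced."""
    if isinstance(node, dict):
        return {key: fill_placeholders(value, query, filters) for key, value in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(item, query, filters) for item in node]
    if node == "$query":
        return query
    if node == "$filters":
        # Assumption: filters are inserted as-is, without syntax conversion.
        return filters
    return node


custom_query = {
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": "$query",  # mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": "$filters",  # optional filter placeholder
        }
    }
}

rendered = fill_placeholders(
    custom_query,
    query="Why did the revenue increase?",
    filters={"field": "meta.years", "operator": "==", "value": "2019"},
)
print(json.dumps(rendered, indent=2))
```

The template itself is left untouched, so the same custom_query can be reused for every query.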