ElasticsearchBM25Retriever
Retrieve documents from the ElasticsearchDocumentStore using the BM25 algorithm to find keywords matching the user's query.
Key Features
- Keyword-based retrieval using the BM25 algorithm.
- Only compatible with ElasticsearchDocumentStore.
- Calculates weighted word overlap between the query and documents.
- Works well for exact keyword matching, product codes, and proper names.
- Lightweight and performs well on out-of-domain data.
- Can be combined with ElasticsearchEmbeddingRetriever and DocumentJoiner for hybrid retrieval.
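To make "weighted word overlap" concrete, here is a minimal sketch of a textbook BM25 scorer in plain Python. It is an illustration only: Elasticsearch computes the scores internally, and the whitespace tokenizer, the `bm25_scores` helper, and the sample documents are assumptions made for this example. The values `k1=1.2` and `b=0.75` are the usual BM25 defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # Terms missing from the document add nothing.
            # Rare terms get a higher inverse document frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "order SKU-4711 shipped today",
    "the shipment was delayed by weather",
    "SKU-4711 is out of stock",
]
scores = bm25_scores("SKU-4711 stock", docs)
```

The third document matches both query terms and scores highest, the first matches only the product code, and the second shares no exact term with the query, so it scores zero. This is why BM25 suits product codes and proper names but misses paraphrases.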
Configuration
- Drag the ElasticsearchBM25Retriever component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Configure the document_store connection with your Elasticsearch hosts and index name.
  - Set top_k to the maximum number of documents to return.
- Go to the Advanced tab to configure filters, fuzziness, scale_score, and filter_policy.
Connections
ElasticsearchBM25Retriever receives a query string from the Input component. It outputs a list of matching documents through its documents output. You can connect its output to a ranker or DocumentJoiner for hybrid retrieval with ElasticsearchEmbeddingRetriever.
Source Code
To check this component's source code, open bm25_retriever.py in the Haystack Core Integrations repository.
Usage Examples
Basic Configuration
ElasticsearchBM25Retriever:
  type: haystack_integrations.components.retrievers.elasticsearch.bm25_retriever.ElasticsearchBM25Retriever
  init_parameters:
    fuzziness: AUTO
    top_k: 10
    scale_score: false
    filter_policy: replace
    document_store:
      type: haystack_integrations.document_stores.elasticsearch.document_store.ElasticsearchDocumentStore
      init_parameters:
        index: my_index
        embedding_similarity_function: cosine
Using the Component in a Pipeline
This is an example of a document search pipeline that combines keyword-based retrieval with embedding-based retrieval. It uses ElasticsearchBM25Retriever and ElasticsearchEmbeddingRetriever to retrieve documents from the document store. It then joins the results of the two with a DocumentJoiner.
components:
  query_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2
  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
  ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: "intfloat/simlm-msmarco-reranker"
      top_k: 20
  ElasticsearchEmbeddingRetriever:
    type: haystack_integrations.components.retrievers.elasticsearch.embedding_retriever.ElasticsearchEmbeddingRetriever
    init_parameters:
      filters:
      top_k: 10
      num_candidates:
      filter_policy: replace
      document_store:
        type: haystack_integrations.document_stores.elasticsearch.document_store.ElasticsearchDocumentStore
        init_parameters:
          hosts:
          custom_mapping:
          index: 'my_index'
          embedding_similarity_function: cosine
  ElasticsearchBM25Retriever:
    type: haystack_integrations.components.retrievers.elasticsearch.bm25_retriever.ElasticsearchBM25Retriever
    init_parameters:
      filters:
      fuzziness: AUTO
      top_k: 10
      scale_score: false
      filter_policy: replace
      document_store:
        type: haystack_integrations.document_stores.elasticsearch.document_store.ElasticsearchDocumentStore
        init_parameters:
          hosts:
          custom_mapping:
          index: 'my_index'
          embedding_similarity_function: cosine
connections:  # Defines how the components are connected
  - sender: document_joiner.documents
    receiver: ranker.documents
  - sender: query_embedder.embedding
    receiver: ElasticsearchEmbeddingRetriever.query_embedding
  - sender: ElasticsearchEmbeddingRetriever.documents
    receiver: document_joiner.documents
  - sender: ElasticsearchBM25Retriever.documents
    receiver: document_joiner.documents
inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
    - "query_embedder.text"
    - "ranker.query"
    - ElasticsearchBM25Retriever.query
  filters:  # These components will receive a potential query filter as input
    - "ElasticsearchEmbeddingRetriever.filters"
    - "ElasticsearchBM25Retriever.filters"
outputs:  # Defines the output of your pipeline
  documents: "ranker.documents"  # The output of the pipeline is the retrieved documents
max_runs_per_component: 100
metadata: {}
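The join step in this pipeline can be pictured with a short Python sketch. It assumes that the `concatenate` join mode merges the result lists and, when both retrievers return the same document, keeps the copy with the higher score. The `Doc` class and `concatenate` function are hypothetical stand-ins written for this example, not the DocumentJoiner implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    id: str
    score: float


def concatenate(*result_lists):
    """Merge result lists, keeping the best-scoring copy of each document."""
    best = {}
    for results in result_lists:
        for doc in results:
            # Keep a document the first time it appears, or when a later
            # copy of the same document carries a higher score.
            if doc.id not in best or doc.score > best[doc.id].score:
                best[doc.id] = doc
    return list(best.values())


bm25_hits = [Doc("a", 7.2), Doc("b", 3.1)]
embedding_hits = [Doc("b", 0.83), Doc("c", 0.79)]
joined = concatenate(bm25_hits, embedding_hits)
```

Raw BM25 scores and embedding similarity scores are on different scales, so the joined list is not meaningfully ordered by score. That is one reason to send the joiner's output to a ranker, as this pipeline does, before returning documents.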
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | | String to search in the document text. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. For details, check the Init Parameters section. |
| top_k | Optional[int] | None | Maximum number of documents to return. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents that match the query. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | ElasticsearchDocumentStore | | An instance of ElasticsearchDocumentStore to retrieve documents from. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. |
| fuzziness | str | AUTO | Fuzziness parameter passed to Elasticsearch. For details, see Elasticsearch documentation. |
| top_k | int | 10 | Maximum number of documents to return. |
| scale_score | bool | False | If True, scales the documents' scores to a range between 0 and 1. |
| filter_policy | Union[str, FilterPolicy] | FilterPolicy.REPLACE | Policy that determines how filters are applied. Possible options: REPLACE (default) overrides the initialization filters with the filters specified at runtime; use it to change filtering dynamically for specific queries. MERGE combines runtime filters with initialization filters to narrow down the search. |
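Raw BM25 scores are unbounded, which is what scale_score addresses. This page does not say how the scaling is done. A logistic (sigmoid) squashing is one common way to map such scores into the range between 0 and 1, so the sketch below assumes that approach. The function name and the scaling constant of 8 are illustrative assumptions, not values confirmed here.

```python
import math

# Hypothetical constant: controls how quickly scores approach 1.
SCALING_FACTOR = 8.0


def scale_to_unit_interval(raw_score, factor=SCALING_FACTOR):
    """Squash an unbounded, non-negative score into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-raw_score / factor))


low = scale_to_unit_interval(4.0)
high = scale_to_unit_interval(12.0)
```

Under this assumption the ordering of documents is preserved, because the sigmoid is monotonic. Only the numeric range changes, which helps when a downstream step expects scores between 0 and 1.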
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | | String to search in the document text. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy chosen. |
| top_k | Optional[int] | None | Maximum number of documents to return. |
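How runtime filters interact with the filters set at initialization depends on filter_policy. The sketch below shows one plausible reading of the two policies. The `resolve_filters` helper is hypothetical. The sketch also assumes that MERGE joins both filters under a logical AND, and that with no runtime filters the initialization filters apply. The filter dictionaries follow the Haystack filter syntax of `field`, `operator`, and `value`.

```python
def resolve_filters(policy, init_filters=None, runtime_filters=None):
    """Return the filters that would be applied to a query (sketch only)."""
    if not runtime_filters:
        # Nothing passed at query time: fall back to the init filters.
        return init_filters
    if policy == "replace" or not init_filters:
        # REPLACE: runtime filters override the init filters entirely.
        return runtime_filters
    if policy == "merge":
        # MERGE (assumed behavior): both filters must match.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown filter policy: {policy}")


init = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2023}

replaced = resolve_filters("replace", init, runtime)
merged = resolve_filters("merge", init, runtime)
```

With REPLACE, the query above would be filtered by year only. With MERGE, it would be restricted to English documents from 2023 onward, which narrows the search.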