ElasticsearchBM25Retriever

Retrieves documents from the ElasticsearchDocumentStore, using the BM25 algorithm to find documents whose keywords match the user's query.

Basic Information

  • Type: haystack_integrations.components.retrievers.elasticsearch.bm25_retriever.ElasticsearchBM25Retriever
  • Components it can connect with:
    • Query: The Retriever receives the user query and searches for documents based on it.
    • Rankers: The Retriever can send the retrieved documents to a ranker.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | | String to search in the document text. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. For details, check the Init Parameters section. |
| top_k | Optional[int] | None | Maximum number of documents to return. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents that match the query. |

Overview

ElasticsearchBM25Retriever is only compatible with ElasticsearchDocumentStore. It's a keyword-based retriever that uses the BM25 algorithm to find the documents most similar to a user's query. It determines the similarity between the query and a document by calculating the weighted word overlap between the two.
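To make "weighted word overlap" more concrete, here is a minimal, self-contained sketch of a textbook-style BM25 score in Python. It is not the code Elasticsearch runs: the function name `bm25_scores`, the whitespace tokenizer, and the sample documents are assumptions made for illustration, and `k1=1.2` and `b=0.75` are simply commonly used default values.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a simplified BM25 formula.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split, which is far simpler than a real analyzer.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only overlapping words contribute to the score
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Order status for product code AX-4471",
    "Return policy for all products",
    "Shipping times for product code BZ-9000",
]
scores = bm25_scores("product code AX-4471", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

In this sketch, the document containing the exact product code ranks first because the rare term carries the largest weight, while the document that shares no query word gets a score of zero.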

You can use it to find exact matches to names or product codes. It's lightweight and simple and performs well on out-of-domain data.

To combine keyword and embedding-based retrieval, you can use it together with ElasticsearchEmbeddingRetriever and then join the results of the two with a DocumentJoiner.

Usage Example

Using the Component in a Pipeline

This is an example of a document search pipeline that combines keyword-based retrieval with embedding-based retrieval. It uses ElasticsearchBM25Retriever and ElasticsearchEmbeddingRetriever to retrieve documents from the document store. It then joins the results of the two with a DocumentJoiner and sends the joined documents to a ranker.


components:
  query_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: "intfloat/simlm-msmarco-reranker"
      top_k: 20

  ElasticsearchEmbeddingRetriever:
    type: haystack_integrations.components.retrievers.elasticsearch.embedding_retriever.ElasticsearchEmbeddingRetriever
    init_parameters:
      filters:
      top_k: 10
      num_candidates:
      filter_policy: replace
      document_store:
        type: haystack_integrations.document_stores.elasticsearch.document_store.ElasticsearchDocumentStore
        init_parameters:
          hosts:
          custom_mapping:
          index: 'my_index'
          embedding_similarity_function: cosine

  ElasticsearchBM25Retriever:
    type: haystack_integrations.components.retrievers.elasticsearch.bm25_retriever.ElasticsearchBM25Retriever
    init_parameters:
      filters:
      fuzziness: AUTO
      top_k: 10
      scale_score: false
      filter_policy: replace
      document_store:
        type: haystack_integrations.document_stores.elasticsearch.document_store.ElasticsearchDocumentStore
        init_parameters:
          hosts:
          custom_mapping:
          index: 'my_index'
          embedding_similarity_function: cosine

connections:  # Defines how the components are connected
  - sender: document_joiner.documents
    receiver: ranker.documents
  - sender: query_embedder.embedding
    receiver: ElasticsearchEmbeddingRetriever.query_embedding
  - sender: ElasticsearchEmbeddingRetriever.documents
    receiver: document_joiner.documents
  - sender: ElasticsearchBM25Retriever.documents
    receiver: document_joiner.documents

inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
    - "query_embedder.text"
    - "ranker.query"
    - "ElasticsearchBM25Retriever.query"

  filters:  # These components will receive a potential query filter as input
    - "ElasticsearchEmbeddingRetriever.filters"
    - "ElasticsearchBM25Retriever.filters"

outputs:  # Defines the output of your pipeline
  documents: "ranker.documents"  # The output of the pipeline is the retrieved documents

max_runs_per_component: 100

metadata: {}

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document_store | ElasticsearchDocumentStore | | An instance of ElasticsearchDocumentStore to retrieve documents from. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. |
| fuzziness | str | AUTO | Fuzziness parameter passed to Elasticsearch. For details, see Elasticsearch documentation. |
| top_k | int | 10 | Maximum number of documents to return. |
| scale_score | bool | False | If True, scales the documents' scores to a range between 0 and 1. |
| filter_policy | Union[str, FilterPolicy] | FilterPolicy.REPLACE | Policy to determine how filters are applied. Possible options: REPLACE (default) overrides the initialization filters with the filters specified at runtime; use this policy to dynamically change filtering for specific queries. MERGE combines runtime filters with initialization filters to narrow down the search. |
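The difference between the two filter policies can be sketched in a few lines of Python. This is a simplified illustration, not the retriever's actual code: `resolve_filters` is a hypothetical helper, the filter fields `meta.lang` and `meta.year` are made-up examples, and the sketch assumes that merging means joining both filters under a logical AND.

```python
from typing import Any, Dict, Optional

Filters = Optional[Dict[str, Any]]


def resolve_filters(policy: str, init_filters: Filters, runtime_filters: Filters) -> Filters:
    """Hypothetical helper showing which filters end up being applied."""
    if not runtime_filters:
        # Nothing passed at query time: fall back to the initialization filters.
        return init_filters
    if policy == "replace" or not init_filters:
        # REPLACE: runtime filters override the initialization filters.
        return runtime_filters
    if policy == "merge":
        # MERGE (assumed here to be a logical AND): both filters must match.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown filter policy: {policy}")


# Example filters; the field names are assumptions for illustration.
init_filters = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime_filters = {"field": "meta.year", "operator": ">=", "value": 2023}

print(resolve_filters("replace", init_filters, runtime_filters))
print(resolve_filters("merge", init_filters, runtime_filters))
```

With `replace`, only the year filter applies to the query. With `merge`, a document must satisfy both the language and the year condition, which narrows the search.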

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | | String to search in the document text. |
| filters | Optional[Dict[str, Any]] | None | Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy chosen. |
| top_k | Optional[int] | None | Maximum number of documents to return. |