OpenSearchBM25Retriever

Fetch documents from OpenSearchDocumentStore using the keyword-based BM25 algorithm.

Basic Information

Type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
Components it can connect with:
- Input: OpenSearchBM25Retriever receives the query string from Input.
- PromptBuilder: OpenSearchBM25Retriever can send retrieved documents to PromptBuilder to be used in a prompt.
- Ranker: OpenSearchBM25Retriever can send retrieved documents to a Ranker to reorder them by relevance.
- DocumentJoiner: OpenSearchBM25Retriever can send documents to DocumentJoiner to combine with documents from other retrievers.

Inputs

Parameter	Type	Default	Description
query	str		The query string.
filters	Optional[Dict[str, Any]]	None	Filters applied to the retrieved documents. How runtime filters are applied depends on the `filter_policy` set when the Retriever is initialized.
all_terms_must_match	Optional[bool]	None	If `True`, all terms in the query string must be present in the retrieved documents.
top_k	Optional[int]	None	Maximum number of documents to return.
fuzziness	Optional[Union[int, str]]	None	Fuzziness parameter that controls approximate string matching in full-text queries. For more information, see the OpenSearch fuzzy query documentation.
scale_score	Optional[bool]	None	If `True`, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
custom_query	Optional[Dict[str, Any]]	None	A custom OpenSearch query. It must include a `$query` placeholder and may optionally include a `$filters` placeholder. An example custom_query: `{"query": {"bool": {"should": [{"multi_match": {"query": "$query", "type": "most_fields", "fields": ["content", "title"]}}], "filter": "$filters"}}}` Here, `"$query"` is the mandatory query placeholder and `"$filters"` is the optional filter placeholder. For this custom_query, a sample `run()` call could be: `retriever.run(query="Why did the revenue increase?", filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})`
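
To make the placeholder mechanism in `custom_query` more concrete, here is a minimal sketch of how `$query` and `$filters` could be substituted into a query template. The `fill_placeholders` function and the `opensearch_filter` value are hypothetical and only illustrate the idea; they are not the integration's actual code, and the conversion of Haystack filters into OpenSearch filter syntax is not shown.

```python
from typing import Any


def fill_placeholders(template: Any, query: str, filters: Any) -> Any:
    """Return a copy of the template with "$query" and "$filters" values replaced."""
    if isinstance(template, dict):
        return {key: fill_placeholders(value, query, filters) for key, value in template.items()}
    if isinstance(template, list):
        return [fill_placeholders(item, query, filters) for item in template]
    if template == "$query":
        return query
    if template == "$filters":
        return filters
    return template


custom_query = {
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "$query",  # mandatory query placeholder
                        "type": "most_fields",
                        "fields": ["content", "title"],
                    }
                }
            ],
            "filter": "$filters",  # optional filter placeholder
        }
    }
}

# Hypothetical filter clause, already expressed in OpenSearch syntax.
opensearch_filter = {"term": {"meta.years": "2019"}}

body = fill_placeholders(custom_query, "Why did the revenue increase?", opensearch_filter)
print(body["query"]["bool"]["should"][0]["multi_match"]["query"])
```

The template itself is left unchanged, so the same `custom_query` can be reused for every call.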

Outputs

Parameter	Type	Default	Description
documents	List[Document]		List of retrieved documents.

Overview

OpenSearchBM25Retriever is a keyword-based retriever that fetches documents matching a query from an OpenSearchDocumentStore. It determines the similarity between documents and the query based on the BM25 algorithm, which computes a weighted word overlap between the two strings.
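
To illustrate what "weighted word overlap" means, here is a simplified sketch of a textbook BM25 scoring function in plain Python. It is not how OpenSearch implements BM25 internally. The whitespace tokenization, the `bm25_scores` helper, and the `k1` and `b` values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher weight (inverse document frequency).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "error code E1234 disk full",
    "the disk is healthy",
    "network timeout while connecting",
]
scores = bm25_scores("error E1234", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the first document shares terms with the query, so it is the only one with a non-zero score. Documents without any overlapping term score zero, which is why BM25 works well for exact identifiers.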

Since OpenSearchBM25Retriever matches strings based on word overlap, it's often used to find exact matches to names of persons or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple, yet more complex embedding-based approaches often struggle to beat it on out-of-domain data.

In addition to the query, the retriever accepts other optional parameters, including top_k (the maximum number of documents to retrieve) and filters to narrow down the search space. You can control how much approximate (fuzzy) matching is allowed using the fuzziness parameter. You can also require that all terms in the query match using all_terms_must_match.

If you want more flexible matching of a query to documents, use the OpenSearchEmbeddingRetriever, which uses vectors created by embedding models to retrieve relevant information.

Usage Example

Using the Component in a Pipeline

This is an example of a simple keyword search pipeline where OpenSearchBM25Retriever retrieves documents matching the query.

components:
  OpenSearchBM25Retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      fuzziness: AUTO
      top_k: 10
      scale_score: false
      all_terms_must_match: false
      filter_policy: replace
      custom_query:
      raise_on_failure: true

connections: []

max_runs_per_component: 100

metadata: {}

inputs:
  query:
  - OpenSearchBM25Retriever.query
  filters:
  - OpenSearchBM25Retriever.filters

outputs:
  documents: OpenSearchBM25Retriever.documents

Using in a Hybrid Search Pipeline

This example shows a hybrid search pipeline that combines OpenSearchBM25Retriever (keyword search) with OpenSearchEmbeddingRetriever (semantic search). DocumentJoiner combines the results of both retrievers into a single list.

components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      fuzziness: AUTO
      top_k: 20
      scale_score: false
      all_terms_must_match: false
      filter_policy: replace
      custom_query:
      raise_on_failure: true
  embedding_retriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      filters:
      top_k: 20
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      efficient_filtering: true
  text_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      trust_remote_code: false
  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

connections:
- sender: bm25_retriever.documents
  receiver: document_joiner.documents
- sender: embedding_retriever.documents
  receiver: document_joiner.documents
- sender: text_embedder.embedding
  receiver: embedding_retriever.query_embedding

max_runs_per_component: 100

metadata: {}

inputs:
  query:
  - bm25_retriever.query
  - text_embedder.text
  filters:
  - bm25_retriever.filters
  - embedding_retriever.filters

outputs:
  documents: document_joiner.documents
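
The following sketch illustrates what joining the two result lists in `concatenate` mode could look like. It assumes that duplicates are identified by document ID, that the highest-scoring copy is kept, and that the result is sorted by score. The `Doc` class, the `concatenate` function, and the sample scores are hypothetical stand-ins, not the DocumentJoiner implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    content: str
    score: float


def concatenate(*result_lists):
    """Merge result lists, keeping the highest-scoring copy of each document."""
    best = {}
    for docs in result_lists:
        for doc in docs:
            if doc.id not in best or doc.score > best[doc.id].score:
                best[doc.id] = doc
    # Return the unique documents, highest score first.
    return sorted(best.values(), key=lambda doc: doc.score, reverse=True)


bm25_docs = [Doc("a", "Invoice 4711", 7.2), Doc("b", "Invoice policy", 3.1)]
embedding_docs = [Doc("b", "Invoice policy", 0.83), Doc("c", "Billing overview", 0.79)]

joined = concatenate(bm25_docs, embedding_docs)
print([doc.id for doc in joined])
```

Note that BM25 scores and embedding similarity scores are on different scales, so a plain concatenation tends to favor whichever retriever produces larger numbers.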

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
document_store	OpenSearchDocumentStore		An instance of OpenSearchDocumentStore to use with the Retriever.
filters	Optional[Dict[str, Any]]	None	Filters to narrow down the search for documents in the Document Store.
fuzziness	Union[int, str]	AUTO	Determines how approximate string matching is applied in full-text queries. This parameter sets the maximum number of character edits (insertions, deletions, or substitutions) allowed when matching one word to another. For example, the edit distance between the words "wined" and "wind" is 1 because only one edit is needed to turn one into the other. Use "AUTO" (the default) to adjust the allowed number of edits automatically based on term length, which works well for most scenarios. For detailed guidance, refer to the OpenSearch fuzzy query documentation.
top_k	int	10	Maximum number of documents to return.
scale_score	bool	False	If `True`, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
all_terms_must_match	bool	False	If `True`, all terms in the query string must be present in the retrieved documents. This is useful when searching for short text where even one term can make a difference.
filter_policy	Union[str, FilterPolicy]	FilterPolicy.REPLACE	Policy to determine how filters are applied. Possible options: `replace` (runtime filters replace initialization filters; use this policy to change the filtering scope for specific queries) and `merge` (runtime filters are merged with initialization filters).
custom_query	Optional[Dict[str, Any]]	None	A custom OpenSearch query containing a mandatory `$query` placeholder and an optional `$filters` placeholder. An example custom_query: `{"query": {"bool": {"should": [{"multi_match": {"query": "$query", "type": "most_fields", "fields": ["content", "title"]}}], "filter": "$filters"}}}` Here, `"$query"` is the mandatory query placeholder and `"$filters"` is the optional filter placeholder. An example `run()` call for this `custom_query`: `retriever.run(query="Why did the revenue increase?", filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})`
raise_on_failure	bool	True	Whether to raise an exception if the API call fails. If `False`, the Retriever logs a warning and returns an empty list instead.
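
The edit distance that `fuzziness` refers to can be sketched with the classic Levenshtein algorithm. This is a minimal illustration of counting insertions, deletions, and substitutions. The `edit_distance` helper is hypothetical, and OpenSearch's own matching may differ in detail (for example, in how it treats transposed characters).

```python
def edit_distance(source: str, target: str) -> int:
    """Return the number of single-character edits needed to turn source into target."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(
                min(
                    previous[j] + 1,  # delete a character from source
                    current[j - 1] + 1,  # insert a character into source
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("wined", "wind"))  # 1
```

With `fuzziness` set to 1, a query term "wined" could therefore match documents containing "wind".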

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
query	str		The query string.
filters	Optional[Dict[str, Any]]	None	Filters applied to the retrieved documents. How runtime filters are applied depends on the `filter_policy` set when the Retriever is initialized.
all_terms_must_match	Optional[bool]	None	If `True`, all terms in the query string must be present in the retrieved documents.
top_k	Optional[int]	None	Maximum number of documents to return.
fuzziness	Optional[Union[int, str]]	None	Fuzziness parameter that controls approximate string matching in full-text queries. For more information, see the OpenSearch fuzzy query documentation.
scale_score	Optional[bool]	None	If `True`, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
custom_query	Optional[Dict[str, Any]]	None	A custom OpenSearch query. It must include a `$query` placeholder and may optionally include a `$filters` placeholder. An example custom_query: `{"query": {"bool": {"should": [{"multi_match": {"query": "$query", "type": "most_fields", "fields": ["content", "title"]}}], "filter": "$filters"}}}` Here, `"$query"` is the mandatory query placeholder and `"$filters"` is the optional filter placeholder. For this custom_query, a sample `run()` call could be: `retriever.run(query="Why did the revenue increase?", filters={"operator": "AND", "conditions": [{"field": "meta.years", "operator": "==", "value": "2019"}, {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}]})`
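
The interaction between initialization filters, runtime filters, and `filter_policy` can be sketched as follows. This is a simplified illustration, not the library's implementation: the `apply_policy` function is hypothetical, and combining both filters with an `AND` operator in `merge` mode is an assumption.

```python
def apply_policy(policy, init_filters, runtime_filters):
    """Decide which filters to use for a query, given the filter policy."""
    if not runtime_filters:
        # Nothing was passed at query time, so the initialization filters apply.
        return init_filters
    if policy == "replace" or not init_filters:
        return runtime_filters
    if policy == "merge":
        # Assumption: both filters must match, so they are combined with AND.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    raise ValueError(f"Unknown filter policy: {policy}")


init_filters = {"field": "meta.years", "operator": "==", "value": "2019"}
runtime_filters = {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}

replaced = apply_policy("replace", init_filters, runtime_filters)
merged = apply_policy("merge", init_filters, runtime_filters)
print(replaced["field"])
print(merged["operator"])
```

With `replace`, only the runtime filter on `meta.quarters` is applied. With `merge`, the query is restricted by both the year and the quarter.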
