DeepsetCloudDocumentStoreBM25Retriever

Retrieves documents from the deepset AI Platform using keyword search through the deepset Query API.

Basic Information

  • Type: dc_custom_component.components.retrievers.deepsetcloud_bm25.DeepsetCloudDocumentStoreBM25Retriever
  • Components it can connect with:
    • Query: The Retriever receives the user query and searches for documents based on it.
    • Rankers: The Retriever can send the retrieved documents to a ranker.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | | The user query. |
| all_terms_must_match | Optional[bool] | None | Specifies whether all terms in the query must match. |
| custom_query | Optional[dict[str, Any]] | None | A custom query to use for retrieval. |
| filters | Optional[dict[str, Any]] | None | Filters to narrow down the search. |
| top_k | Optional[int] | 10 | The maximum number of documents to retrieve. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | The documents matching the query. |

Overview

DeepsetCloudDocumentStoreBM25Retriever queries documents stored in the deepset AI Platform. It sends a query to the deepset API and retrieves the most relevant documents based on the specified parameters. For details, see the Query Documents endpoint.

DeepsetCloudDocumentStoreBM25Retriever works with DeepsetCloudDocumentStore. You can use it, for example, to query production data from pipelines that are in a different workspace.

Usage Example

Initializing the Component

components:
  DeepsetCloudDocumentStoreBM25Retriever:
    type: dc_custom_component.components.retrievers.deepsetcloud_bm25.DeepsetCloudDocumentStoreBM25Retriever
    init_parameters:
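
A fuller definition usually fills in init_parameters. The following is a minimal sketch, assuming the same workspace (generative), pipeline (test), and DC_TOKEN environment variable that appear in the pipeline example in the next section. Treat these as placeholders and replace them with your own values. The component name bm25_retriever and the all_terms_must_match and top_k values are illustrative only.

```yaml
components:
  bm25_retriever:  # illustrative component name
    type: dc_custom_component.components.retrievers.deepsetcloud_bm25.DeepsetCloudDocumentStoreBM25Retriever
    init_parameters:
      # Document store that points at the workspace and pipeline to query.
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative  # placeholder workspace name
          pipeline_name: test         # placeholder pipeline name
          dc_api_key:
            type: env_var
            env_vars:
            - DC_TOKEN
            strict: false
          timeout: 10
      # Illustrative value: set to true to require every query term to match.
      all_terms_must_match: false
      # Maximum number of documents to retrieve (10 is the documented default).
      top_k: 10
```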

Using the Component in a Pipeline

This is an example of a document search pipeline that uses both DeepsetCloudDocumentStoreBM25Retriever and DeepsetCloudDocumentStoreEmbeddingRetriever to retrieve documents from the workspace called generative using a pipeline called test.

components:
  TransformersSimilarityRanker_1:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: svalabs/cross-electra-ms-marco-german-uncased
      top_k: 15
      model_kwargs:
        torch_dtype: torch.float16
  bm25_retriever:
    type: dc_custom_component.components.retrievers.deepsetcloud_bm25.DeepsetCloudDocumentStoreBM25Retriever
    init_parameters:
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
            - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 30
  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: deepset/e5-multilingual-v1
      progress_bar: false
  embedding_retriever:
    type: dc_custom_component.components.retrievers.deepsetcloud_embedding.DeepsetCloudDocumentStoreEmbeddingRetriever
    init_parameters:
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
            - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 30
  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      top_k: 60
      sort_by_score: false
  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: svalabs/cross-electra-ms-marco-german-uncased
      top_k: 15
      model_kwargs:
        torch_dtype: torch.float16
  RecursiveRetriever:
    type: dc_custom_component.components.retrievers.recursive_retriever.DeepsetCloudRecursiveRetriever
    init_parameters:
      filter_key: passage_id
      relevant_doc_keys:
      - linked_passages
      - backlinked_passages
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative4
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
            - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 100
      depth: 1
      sampling_strategy:
      - rank
      - depth
      - source
      force_keep_original_documents: false
  DocumentJoiner_1:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      weights:
      top_k:
      sort_by_score: true
  DeepsetMetadataGrouper:
    type: haystack.components.rankers.meta_field_grouping_ranker.MetaFieldGroupingRanker
    init_parameters:
      group_by: dokid
      subgroup_by:
      sort_docs_by: tokennr
  DeepsetCNSentenceWindowRetriever_14:
    type: dc_custom_component.components.retrievers.deepsetcn_sentence_window_retriever.DeepsetCNSentenceWindowRetriever
    init_parameters:
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
            - DC_TOKEN
            strict: false
      docs_before: 1
      docs_after: 1
      split_id_field: tokennr
      id_field: dokid
  Recursive_retriever_linked_doks:
    type: dc_custom_component.components.retrievers.recursive_retriever.DeepsetCloudRecursiveRetriever
    init_parameters:
      filter_key: dokid
      relevant_doc_keys:
      - linked_doks
      depth: 1
      sampling_strategy:
      - rank
      - depth
      - source
      force_keep_original_documents: false
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
            - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 100

connections:
- sender: bm25_retriever.documents
  receiver: document_joiner.documents
- sender: query_embedder.embedding
  receiver: embedding_retriever.query_embedding
- sender: embedding_retriever.documents
  receiver: document_joiner.documents
- sender: document_joiner.documents
  receiver: ranker.documents
- sender: DocumentJoiner_1.documents
  receiver: DeepsetMetadataGrouper.documents
- sender: DeepsetCNSentenceWindowRetriever_14.documents
  receiver: DocumentJoiner_1.documents
- sender: RecursiveRetriever.documents
  receiver: TransformersSimilarityRanker_1.documents
- sender: TransformersSimilarityRanker_1.documents
  receiver: DocumentJoiner_1.documents
- sender: ranker.documents
  receiver: RecursiveRetriever.documents
- sender: ranker.documents
  receiver: DeepsetCNSentenceWindowRetriever_14.retrieved_documents
- sender: ranker.documents
  receiver: Recursive_retriever_linked_doks.documents
- sender: Recursive_retriever_linked_doks.documents
  receiver: DocumentJoiner_1.documents

max_runs_per_component: 100

metadata: {}

inputs:
  query:
  - query_embedder.text
  - bm25_retriever.query
  - ranker.query
  - TransformersSimilarityRanker_1.query
  filters:
  - bm25_retriever.filters
  - embedding_retriever.filters

outputs:
  documents: DeepsetMetadataGrouper.documents

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document_store | DeepsetCloudDocumentStore | | The DeepsetCloudDocumentStore instance to use for retrieving documents. |
| all_terms_must_match | bool | None | Whether all terms in the query must match. |
| top_k | int | 10 | The maximum number of documents to retrieve. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | | The user query to use for retrieving documents. |
| all_terms_must_match | bool | None | Specifies whether all terms in the query must match for a document to be retrieved. |
| custom_query | dict[str, Any] | None | A custom query for retrieval. |
| filters | dict[str, Any] | None | Filters to narrow down the search. |
| top_k | int | 10 | The maximum number of documents to fetch. |
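
In a pipeline, the query and filters run parameters typically receive their values from the pipeline inputs, as in the usage example above. The fragment below is a minimal sketch of that wiring for a keyword-only search. It assumes that a retriever component named bm25_retriever is already defined under components and that it is the last component in the pipeline; both are assumptions made for illustration.

```yaml
# Assumes a component named bm25_retriever is defined under `components:`.
inputs:
  query:
  - bm25_retriever.query      # the user query goes to the retriever's query parameter
  filters:
  - bm25_retriever.filters    # query-time filters go to the retriever's filters parameter

outputs:
  documents: bm25_retriever.documents  # return the retrieved documents directly
```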