DeepsetCloudDocumentStoreBM25Retriever
Retrieves documents from the deepset AI Platform using keyword search through the deepset Query API.
Basic Information
- Type: dc_custom_component.components.retrievers.deepsetcloud_bm25.DeepsetCloudDocumentStoreBM25Retriever
- Components it can connect with:
  - Query: The Retriever receives the user query and searches for documents based on it.
  - Rankers: The Retriever can send the retrieved documents to a ranker.
Inputs
Parameter | Type | Default | Description |
---|---|---|---|
query | str | | The user query. |
all_terms_must_match | Optional[bool] | None | Specifies whether all terms in the query must match. |
custom_query | Optional[dict[str, Any]] | None | A custom query to use for retrieval. |
filters | Optional[dict[str, Any]] | None | Filters to narrow down the search. |
top_k | Optional[int] | 10 | The maximum number of documents to retrieve. |
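To make all_terms_must_match concrete, here is a toy Python sketch of the difference between requiring every query term and accepting any one of them. It is only an illustration on in-memory strings: the `matches` helper is hypothetical, and the platform's actual tokenization, matching, and BM25 scoring happen server-side and are not reproduced here.

```python
def matches(query: str, text: str, all_terms_must_match: bool) -> bool:
    """Toy keyword match on lowercased, whitespace-separated terms (illustration only)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    hits = [term in words for term in terms]
    # all(): every query term must appear; any(): a single matching term is enough.
    return all(hits) if all_terms_must_match else any(hits)


docs = ["solar panel installation guide", "wind turbine maintenance guide"]

# Strict matching keeps only documents that contain both "solar" and "guide".
strict = [d for d in docs if matches("solar guide", d, all_terms_must_match=True)]
# Loose matching keeps any document that contains at least one of the terms.
loose = [d for d in docs if matches("solar guide", d, all_terms_must_match=False)]
```

With the two sample strings, strict matching keeps only the first document, while loose matching keeps both because each contains "guide".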
Outputs
Parameter | Type | Default | Description |
---|---|---|---|
documents | List[Document] | | The documents matching the query. |
Overview
DeepsetCloudDocumentStoreBM25Retriever queries documents stored in deepset AI Platform. It sends a query to the deepset API and retrieves the most relevant documents based on the specified parameters. For details, see the Query Documents endpoint.
DeepsetCloudDocumentStoreBM25Retriever works with DeepsetCloudDocumentStore. You can use it, for example, to query production data with pipelines that are in a different workspace.
Usage Example
Initializing the Component
components:
  DeepsetCloudDocumentStoreBM25Retriever:
    type: dc_custom_component.components.retrievers.deepsetcloud_bm25.DeepsetCloudDocumentStoreBM25Retriever
    init_parameters:
Using the Component in a Pipeline
This is an example of a document search pipeline that uses both DeepsetCloudDocumentStoreBM25Retriever and DeepsetCloudDocumentStoreEmbeddingRetriever to retrieve documents from the workspace called generative using a pipeline called test.
components:
  TransformersSimilarityRanker_1:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: svalabs/cross-electra-ms-marco-german-uncased
      top_k: 15
      model_kwargs:
        torch_dtype: torch.float16
  bm25_retriever:
    type: dc_custom_component.components.retrievers.deepsetcloud_bm25.DeepsetCloudDocumentStoreBM25Retriever
    init_parameters:
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
              - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 30
  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: deepset/e5-multilingual-v1
      progress_bar: false
  embedding_retriever:
    type: dc_custom_component.components.retrievers.deepsetcloud_embedding.DeepsetCloudDocumentStoreEmbeddingRetriever
    init_parameters:
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
              - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 30
  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      top_k: 60
      sort_by_score: false
  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: svalabs/cross-electra-ms-marco-german-uncased
      top_k: 15
      model_kwargs:
        torch_dtype: torch.float16
  RecursiveRetriever:
    type: dc_custom_component.components.retrievers.recursive_retriever.DeepsetCloudRecursiveRetriever
    init_parameters:
      filter_key: passage_id
      relevant_doc_keys:
        - linked_passages
        - backlinked_passages
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative4
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
              - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 100
      depth: 1
      sampling_strategy:
        - rank
        - depth
        - source
      force_keep_original_documents: false
  DocumentJoiner_1:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      weights:
      top_k:
      sort_by_score: true
  DeepsetMetadataGrouper:
    type: haystack.components.rankers.meta_field_grouping_ranker.MetaFieldGroupingRanker
    init_parameters:
      group_by: dokid
      subgroup_by:
      sort_docs_by: tokennr
  DeepsetCNSentenceWindowRetriever_14:
    type: dc_custom_component.components.retrievers.deepsetcn_sentence_window_retriever.DeepsetCNSentenceWindowRetriever
    init_parameters:
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
              - DC_TOKEN
            strict: false
      docs_before: 1
      docs_after: 1
      split_id_field: tokennr
      id_field: dokid
  Recursive_retriever_linked_doks:
    type: dc_custom_component.components.retrievers.recursive_retriever.DeepsetCloudRecursiveRetriever
    init_parameters:
      filter_key: dokid
      relevant_doc_keys:
        - linked_doks
      depth: 1
      sampling_strategy:
        - rank
        - depth
        - source
      force_keep_original_documents: false
      document_store:
        type: dc_custom_component.components.document_stores.deepsetcloud.DeepsetCloudDocumentStore
        init_parameters:
          workspace_name: generative
          pipeline_name: test
          dc_api_key:
            type: env_var
            env_vars:
              - DC_TOKEN
            strict: false
          timeout: 10
      top_k: 100
connections:
  - sender: bm25_retriever.documents
    receiver: document_joiner.documents
  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding
  - sender: embedding_retriever.documents
    receiver: document_joiner.documents
  - sender: document_joiner.documents
    receiver: ranker.documents
  - sender: DocumentJoiner_1.documents
    receiver: DeepsetMetadataGrouper.documents
  - sender: DeepsetCNSentenceWindowRetriever_14.documents
    receiver: DocumentJoiner_1.documents
  - sender: RecursiveRetriever.documents
    receiver: TransformersSimilarityRanker_1.documents
  - sender: TransformersSimilarityRanker_1.documents
    receiver: DocumentJoiner_1.documents
  - sender: ranker.documents
    receiver: RecursiveRetriever.documents
  - sender: ranker.documents
    receiver: DeepsetCNSentenceWindowRetriever_14.retrieved_documents
  - sender: ranker.documents
    receiver: Recursive_retriever_linked_doks.documents
  - sender: Recursive_retriever_linked_doks.documents
    receiver: DocumentJoiner_1.documents
max_runs_per_component: 100
metadata: {}
inputs:
  query:
    - query_embedder.text
    - bm25_retriever.query
    - ranker.query
    - TransformersSimilarityRanker_1.query
  filters:
    - bm25_retriever.filters
    - embedding_retriever.filters
outputs:
  documents: DeepsetMetadataGrouper.documents
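In a pipeline with this many connections, a misspelled component name in a sender or receiver is easy to miss. The Python sketch below shows one way to check for that before uploading. The `undefined_endpoints` helper is hypothetical and is not part of deepset AI Platform or Haystack. It assumes the pipeline has already been parsed into a dictionary with the same `components` and `connections` keys as the YAML.

```python
from typing import Any, Dict, List


def undefined_endpoints(pipeline: Dict[str, Any]) -> List[str]:
    """Return connection endpoints whose component name is not listed under `components`."""
    defined = set(pipeline.get("components", {}))
    missing = []
    for connection in pipeline.get("connections", []):
        for endpoint in (connection["sender"], connection["receiver"]):
            # Endpoints have the form "<component>.<socket>"; only the component part is checked.
            component = endpoint.split(".", 1)[0]
            if component not in defined:
                missing.append(endpoint)
    return missing


# A trimmed-down stand-in for the parsed pipeline, with one deliberate typo.
pipeline = {
    "components": {"bm25_retriever": {}, "document_joiner": {}, "ranker": {}},
    "connections": [
        {"sender": "bm25_retriever.documents", "receiver": "document_joiner.documents"},
        {"sender": "document_joiner.documents", "receiver": "rankr.documents"},
    ],
}

problems = undefined_endpoints(pipeline)
```

The check only compares component names. It does not verify that the socket names (such as `documents` or `query_embedding`) exist on the component.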
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
Parameter | Type | Default | Description |
---|---|---|---|
document_store | DeepsetCloudDocumentStore | | The DeepsetCloudDocumentStore instance to use for retrieving documents. |
all_terms_must_match | bool | None | Whether all terms in the query must match. |
top_k | int | 10 | The maximum number of documents to retrieve. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
Parameter | Type | Default | Description |
---|---|---|---|
query | str | | The user query to use for retrieving documents. |
all_terms_must_match | bool | None | Specifies whether retrieved documents must contain all terms in the query. |
custom_query | dict[str, Any] | None | A custom query for retrieval. |
filters | dict[str, Any] | None | Filters to narrow down the search. |
top_k | int | 10 | The maximum number of documents to fetch. |
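As a rough sketch of how these run-time parameters might be grouped for a single query, the Python snippet below builds a plain dictionary keyed by component name. The name `bm25_retriever` is taken from the pipeline example on this page. The `build_retriever_params` helper, the `meta.category` field, and its value are made up for illustration. The filter is written in the Haystack 2.x filter syntax, on the assumption that this retriever accepts it. The request format for sending the dictionary is not shown; see Modify Pipeline Parameters at Query Time for that.

```python
import json
from typing import Any, Dict, Optional


def build_retriever_params(
    top_k: Optional[int] = None,
    filters: Optional[Dict[str, Any]] = None,
    all_terms_must_match: Optional[bool] = None,
) -> Dict[str, Any]:
    """Collect run() overrides, leaving out anything that was not set."""
    candidates = {
        "top_k": top_k,
        "filters": filters,
        "all_terms_must_match": all_terms_must_match,
    }
    return {name: value for name, value in candidates.items() if value is not None}


# Hypothetical metadata filter: keep only documents whose category is "manuals".
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.category", "operator": "==", "value": "manuals"},
    ],
}

# Overrides are grouped under the name the component has in the pipeline.
params = {"bm25_retriever": build_retriever_params(top_k=5, filters=filters)}
print(json.dumps(params, indent=2))
```

Leaving unset values out of the dictionary means the component falls back to whatever was configured in its init parameters.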