Retriever
The Retriever is a lightweight filter that goes through the full DocumentStore and selects a set of candidate documents relevant to the query.
You can combine a Retriever with a Reader in your pipeline to speed up the search. The Retriever passes the Documents it selected on to the Reader. This way, it saves the Reader from doing more work than it needs to.
In a query pipeline, when given a query, the Retriever sifts through the Documents in the DocumentStore, scores each Document for its relevance to the query, and returns the top candidates. Retriever and DocumentStore are tightly coupled. When configuring a Retriever, you must always specify its DocumentStore.
When used in indexing pipelines, vector-based Retrievers (DensePassageRetriever and EmbeddingRetriever) take Documents as input, and for each Document, they calculate its vector representation (embedding). This embedding is stored as part of the Document in the DocumentStore.
If you're using a keyword-based Retriever (BM25Retriever or TfidfRetriever) in your indexing pipeline, no embeddings are calculated, and the Retriever creates a keyword-based index that it then uses to look up the Documents quickly.
Basic Information
- Pipeline type: Used in query pipelines and in indexing pipelines.
- In indexing pipelines:
  - Nodes that can precede it: PreProcessor
  - Nodes that can follow it: DeepsetCloudDocumentStore
  - Node input: Documents
  - Node output: Documents
- In query pipelines:
  - Nodes that can precede it: QueryClassifier
  - Nodes that can follow it: Reader, Ranker, PromptNode, JoinDocuments
  - Node input: Query
  - Node output: Documents
- Available node classes: BM25Retriever, CNStaticFilterRetriever, DensePassageRetriever, EmbeddingRetriever, FilterRetriever, FileSimilarityRetriever, TfidfRetriever
Retrievers Overview
There are two categories of retrievers: vector-based (dense) and keyword-based (sparse) retrievers.
Vector-Based Retrievers
Vector-based retrievers work with document embeddings. They embed both the documents and the query using deep neural networks, and then return the documents most similar to the query as top candidates.
Main features:
- Powerful but more expensive computationally, especially during indexing.
- Trained using labeled datasets.
- Language-specific.
- Use transformer-based encoders that take word order and syntax into account.
- Can build strong semantic representations of text.
- Indexing is done by processing all documents through a neural network and storing the resulting vectors. Requires significant computational power and time.
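
The embed-and-compare step described above can be sketched in a few lines. This is a minimal illustration, not the retriever's actual implementation: the three-dimensional vectors and the `retrieve` helper are made up for the example, and cosine similarity is only one of the similarity functions a document store can use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical 3-dimensional embeddings; a real encoder produces far more dimensions.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
top = retrieve([1.0, 0.0, 0.0], doc_vecs, top_k=2)
```

Here `top` holds the two document ids whose vectors point most nearly in the same direction as the query vector.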
Available vector-based retrievers:
- EmbeddingRetriever
- CNStaticFilterRetriever
- DensePassageRetriever
 | EmbeddingRetriever | DensePassageRetriever | CNStaticFilterRetriever |
---|---|---|---|
Description | Uses one model to encode both the documents and the query. Can use a transformer model. Sentence Transformers models are well suited to this kind of retrieval. | A high-performing retriever that uses two different models: one to embed the query and one to embed the documents. This setup boosts the accuracy of the results it returns. | An enhanced EmbeddingRetriever. It works the same way but lets you add filters at runtime for more targeted retrieval. |
Main Features | Uses a transformer model. Uses one model to handle queries and documents. | Uses one BERT base model to encode documents. Uses one BERT base model to encode queries. Ranks documents by dot product similarity between the query and document embeddings. | Uses a transformer model. Uses one model to handle queries and documents. Allows passing filters to narrow down the retrieval results. |
Keyword-Based Retrievers
Keyword-based retrievers look for keywords shared between the documents and the query.
Main features:
- Simple but effective.
- Don't need to be trained.
- Work on any language.
- Don't take word order and syntax into account.
- Can't handle out-of-vocabulary words.
- Good for use cases where precise wording matters.
- Indexing is done by creating an inverted index, which is faster and less expensive than indexing for vector-based retrievers.
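
To make the inverted index idea concrete, here is a minimal sketch. It assumes whitespace tokenization and lowercasing, which is much simpler than what a real search backend does; the document ids and texts are invented for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


# Hypothetical documents, for illustration only.
docs = {
    "doc_1": "good vegetarian restaurant in Berlin",
    "doc_2": "good coffee in Berlin",
}
index = build_inverted_index(docs)

# Looking up a keyword is a dictionary access, with no scan over all documents.
candidates = index["vegetarian"] | index["coffee"]
```

Because each keyword maps directly to the documents containing it, lookups stay fast and no neural network is involved at indexing time.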
Available keyword-based retrievers:
- TfidfRetriever
- BM25Retriever
- FilterRetriever
 | TfidfRetriever | BM25Retriever | FilterRetriever |
---|---|---|---|
Description | Based on the Term Frequency (TF) - Inverse Document Frequency (IDF) algorithm, which weights a word by how often it occurs in a document and how rare it is across all documents. | A variant of the Term Frequency (TF) - Inverse Document Frequency (IDF) algorithm that saturates term frequency and accounts for document length. | Retrieves all documents that match the given filters. It doesn't use the query to do that, just document metadata. |
Main Features | Favors documents that have more lexical overlap with the query. Favors words that occur in fewer documents over words that occur in many documents. Doesn't need a neural network for indexing. Doesn't need document embeddings. | Doesn't need a neural network for indexing. Saturates term frequency (TF) after a set number of occurrences of the given term in the document. Favors short documents over long documents if they have the same amount of word overlap with the query. | Filters documents using their metadata. Recommended for use with another retriever when you want to filter your documents by certain attributes. |
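
The two BM25 behaviors listed in the table, term-frequency saturation and the preference for shorter documents, both come from the term-frequency part of the standard Okapi BM25 formula. The sketch below shows only that part (without the IDF factor). The values `k1=1.2` and `b=0.75` are commonly used defaults and an assumption here, not necessarily what your search backend is configured with.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of Okapi BM25, without the IDF factor."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Saturation: each additional occurrence adds less; the weight never exceeds k1 + 1.
w1 = bm25_term_weight(1, doc_len=100, avg_doc_len=100)
w5 = bm25_term_weight(5, doc_len=100, avg_doc_len=100)
w50 = bm25_term_weight(50, doc_len=100, avg_doc_len=100)

# Length normalization: same term frequency, but the shorter document scores higher.
short_doc = bm25_term_weight(2, doc_len=50, avg_doc_len=100)
long_doc = bm25_term_weight(2, doc_len=200, avg_doc_len=100)
```

Plain TF-IDF has no such cap, so a document that repeats a term many times keeps gaining weight.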
FileSimilarityRetriever
This retriever is flexible - you can configure it as keyword-based, vector-based, or hybrid. This depends on the primary and secondary retrievers you select for it. It compares the similarity of files by using these files as queries. It then generates a list of the most similar documents from each file query. You can configure the query input to accept the file ID, URL, file name, and more. The documents it returns are ranked in order of their similarity to the query file, with the most similar ones listed first.
To determine file similarity, the FileSimilarityRetriever employs a technique known as reciprocal rank fusion. This method takes each document derived from the query and conducts a separate retrieval operation for it. It then analyzes the results from each document query and assesses the overall file similarity. You can read about a similar approach in the PARM paper by Althammer et al. 2022.
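
Reciprocal rank fusion itself is simple to sketch: every ranked list contributes `1 / (k + rank)` to each item it contains, and items are then sorted by their summed score. The example below is a generic illustration under stated assumptions, not the FileSimilarityRetriever's code: the constant `k=60` is the value commonly used in the literature, and the file names and rankings are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: each id gains 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical results: one ranked list of file names per query document.
rankings = [
    ["file_b", "file_c", "file_d"],
    ["file_c", "file_b"],
    ["file_c", "file_d"],
]
fused = reciprocal_rank_fusion(rankings)
```

A file that appears near the top of many per-document result lists ends up ahead of a file that ranks first only once.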
Choosing the Right Retriever
Here's a question to consider when choosing a retriever:
- Do you have a GPU available?
  - Yes: We recommend using the EmbeddingRetriever.
  - No: We recommend using the BM25Retriever.
Usage Examples
A Retriever always takes the DeepsetCloudDocumentStore as an argument.
...
components:
  - name: DocumentStore # First, configure the DocumentStore for the Retriever
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20
...
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      similarity: cosine
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: BAAI/bge-base-en-v1.5 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: FileSimilarityRetriever
    type: FileSimilarityRetriever
    params:
      document_store: DocumentStore
      primary_retriever: EmbeddingRetriever
      top_k: 4
      file_aggregation_key: file_name
      max_num_queries: 50
....
pipelines:
  - name: query
    nodes:
      - name: FileSimilarityRetriever
        inputs: [Query]
  # Here you'd need to define the indexing pipeline
For the parameters you can specify for each retriever type, see the Parameters section.
Parameters
Here are the parameters each retriever type can take when you configure it in the pipeline YAML.
BM25Retriever Parameters
Parameter | Type | Possible Values | Description |
---|---|---|---|
document_store | String | DeepsetCloudDocumentStore | Specifies the instance of a document store from which the retriever retrieves the documents. deepset Cloud supports DeepsetCloudDocumentStore only. Optional. |
top_k | Integer | Default: 10 | Specifies the number of documents to return for a query. Mandatory. |
all_terms_must_match | Boolean | True, False (default) | Specifies if all terms in the query must match the document. True - Retrieves the document only if all terms from the query are also present in the document. Uses the AND operator implicitly, for example, "good vegetarian restaurant" looks for "good AND vegetarian AND restaurant". False - Retrieves the document if at least one query term exists in the document. Uses the OR operator implicitly, for example, "good vegetarian restaurant" looks for "good OR vegetarian OR restaurant". Mandatory. |
custom_query | String | The query | Specifies the optional OpenSearch query. For more information, see Boosting Retrieval with OpenSearch Queries. Optional. |
scale_score | Boolean | True (default), False | Scales the similarity score calculated to compare the similarity between the query and the documents to a unit interval in the range of 0 to 1, where 1 means extremely relevant. True - Scales similarity scores that naturally have a different value range, such as cosine or dot_product. False - Uses raw similarity scores. Mandatory. |
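
The difference between the two `all_terms_must_match` settings can be sketched as a set comparison. This is a simplified illustration that assumes whitespace tokenization; the `matches` helper is hypothetical and the actual matching happens in the search backend.

```python
def matches(query, document, all_terms_must_match=False):
    """AND semantics when all_terms_must_match is True, OR semantics otherwise."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if all_terms_must_match:
        return query_terms <= doc_terms  # every query term is in the document
    return bool(query_terms & doc_terms)  # at least one query term is in the document


doc = "a good restaurant near the station"
or_match = matches("good vegetarian restaurant", doc)
and_match = matches("good vegetarian restaurant", doc, all_terms_must_match=True)
```

The document lacks the word "vegetarian", so it matches under the default OR behavior but not when all terms must match.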
CNStaticFilterEmbeddingRetriever Parameters
Like EmbeddingRetriever, it requires a model to run. It takes exactly the same parameters as EmbeddingRetriever, plus one additional parameter: filters.
Parameter | Type | Possible Values | Description |
---|---|---|---|
embedding_model | String | Example: sentence-transformers/all-MiniLM-L6-v2 | Specifies the path to the embedding model for handling documents and query. This can be the path to a locally saved model or the model's name in the Hugging Face model hub. Mandatory. |
document_store | String | DeepsetCloudDocumentStore | Specifies the instance of a document store from which the retriever retrieves the documents. deepset Cloud supports DeepsetCloudDocumentStore only. Optional. |
model_version | String | Tag name, branch name, or commit hash | Specifies the version of the model to be used from the Hugging Face model hub. Optional. |
use_gpu | Boolean | True (default), False | Specifies whether to use all available GPUs or the CPU. If no GPU is available, it falls back on the CPU. Mandatory. |
batch_size | Integer | Default: 32 | Specifies the number of documents to encode at once. Mandatory. |
max_seq_len | Integer | Default: 512 | Specifies the maximum number of tokens the document text can have. Longer documents are truncated. Mandatory. |
model_format | String | farm transformers sentence_transformers retribert openai cohere | Specifies the name of the framework used for saving the model or the model type. If you don't provide it, it's inferred from the model configuration files. Optional. |
pooling_strategy | String | cls_token (sentence vector), reduce_mean (default, sentence vector), reduce_max (sentence vector), per_token (individual token vectors) | Specifies the strategy for combining the embeddings from the model. Used for FARM and transformer models only. Mandatory. |
emb_extraction_layer | Integer | Default: -1 (the last layer) | Specifies the index of the layer from which to extract the embeddings. Used for FARM and transformer models only. Mandatory. |
top_k | Integer | Default: 10 | Specifies the number of documents to retrieve. Mandatory. |
progress_bar | Boolean | True (default), False | Shows a tqdm progress bar. Disabling it in production deployments helps to keep the logs clean. Mandatory. |
devices | String | Example: [torch.device('cuda:0'), "mps", "cuda:1"] | Contains a list of GPU devices to limit inference to certain GPUs and not use all available ones. If you set use_gpu=False, this parameter is not used and a single CPU device is used for inference. As multi-GPU training is currently not implemented for EmbeddingRetriever, training only uses the first device provided in this list. Optional. |
use_auth_token | Union[str, bool] | Default: None | The API token for downloading private models from Hugging Face. True - uses the token generated when running transformers-cli login (stored in ~/.huggingface). For more information, see Hugging Face. Optional. |
scale_score | Boolean | True (default), False | Scales the similarity score calculated to measure the similarity between the query and documents to a unit interval in the range of 0 to 1, where 1 means extremely relevant. True - Scales similarity scores that naturally have a different value range, such as cosine or dot_product. False - Uses raw similarity scores. Mandatory. |
embed_meta_fields | List of strings | Default: None | Concatenates the meta fields you specify and the text passage or table to a text pair that is then used to create the embedding. This approach is likely to improve performance if your metadata contain meaningful information for retrieval (for example, topic, entities, and the like). Optional. |
api_key | String | Default: None | The OpenAI API key or the Cohere API key. Required if you want to use OpenAI or Cohere embeddings. For more details, see OpenAI and Cohere documentation. Optional. |
azure_api_version | String | Default: 2022-12-01 | The version of the Azure OpenAI API to use. Mandatory. |
azure_base_url | String | Default: None | The base URL for the Azure OpenAI API. If not supplied, Azure OpenAI API is not used. This parameter is an OpenAI Azure endpoint, usually in the form https://.openai.azure.com Optional. |
azure_deployment_name | String | Default: None | The name of the Azure OpenAI API deployment. If not supplied, Azure OpenAI API is not used. Optional. |
api_base | String | Default: "https://api.openai.com/v1" | The OpenAI API base URL. Required. |
openai_organization | String | Default: None | The OpenAI organization ID. For more details, see OpenAI documentation. Optional. |
filters | Dictionary | Default: None | A list of static filters (metadata fields) that can be overwritten at runtime. Optional. |
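
As a rough illustration of static filters that can be overwritten at runtime, the sketch below filters documents by exact metadata match. It rests on an assumption: runtime filters, when passed, replace the static ones entirely rather than being merged with them. The helper names, documents, and filter values are all hypothetical.

```python
def effective_filters(static_filters, runtime_filters=None):
    """Return the runtime filters if any were passed, else the static ones."""
    return runtime_filters if runtime_filters is not None else static_filters


def filter_documents(documents, filters):
    """Keep documents whose metadata equals every key/value pair in filters."""
    filters = filters or {}
    return [
        doc for doc in documents
        if all(doc["meta"].get(key) == value for key, value in filters.items())
    ]


# Hypothetical documents and filters, for illustration only.
documents = [
    {"content": "Quarterly report", "meta": {"lang": "en"}},
    {"content": "Quartalsbericht", "meta": {"lang": "de"}},
]
static_filters = {"lang": "en"}

default_hits = filter_documents(documents, effective_filters(static_filters))
runtime_hits = filter_documents(
    documents, effective_filters(static_filters, {"lang": "de"})
)
```

Without runtime filters, the statically configured filter applies; passing a filter with the query changes which documents are eligible.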
DensePassageRetriever (DPR) Parameters
You must choose two models for this retriever: one to convert the documents into embeddings and another to convert the query into embeddings.
Choosing the Right Model
For DPR, you need two models - one for the query and one for the documents. The models must be trained on the same data.
The easiest way to start is to go to Hugging Face and search for dpr. You'll get a list of DPR models sorted by Most Downloads, which means that the models at the top of the list are the most popular ones. Choose a ctx_encoder and a question_encoder model. You can also have a look at the list of models that we recommend.
If you want to use a private model hosted on Hugging Face, connect to model providers first.
To use a model, just type its Hugging Face location as the retriever parameter. deepset Cloud takes care of loading the model.
These are the parameters you can specify for DPR in pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
document_store | String | DeepsetCloudDocumentStore | Specifies the instance of a document store from which the retriever retrieves the documents. deepset Cloud supports DeepsetCloudDocumentStore only. Optional. |
query_embedding_model | String | Default: facebook/dpr-question_encoder-single-nq-base | Specifies the path to the embedding model for handling the query. This can be a path to a locally saved model or the name of the model in the Hugging Face model hub. Must be trained on the same data as the passage embedding model. Mandatory. |
passage_embedding_model | String | Default: facebook/dpr-ctx_encoder-single-nq-base | Specifies the path to the embedding model for handling the documents. This can be a path to a locally saved model or the name of the model in the Hugging Face model hub. Must be trained on the same data as the query embedding model. Mandatory. |
model_version | String | Tag name, branch name, or commit hash | Specifies the version of the model to be used from the Hugging Face model hub. Optional. |
max_seq_len_query | Integer | Default: 64 | Specifies the maximum number of tokens the query can have. Longer queries are truncated. Mandatory. |
max_seq_len_passage | Integer | Default: 256 | Specifies the maximum number of tokens the document text can have. Longer documents are truncated. Mandatory. |
top_k | Integer | Default: 10 | Specifies the number of documents to return per query. Mandatory. |
use_gpu | Boolean | True (default), False | Uses all available GPUs or the CPU. Falls back on the CPU if no GPU is available. Mandatory. |
batch_size | Integer | Default: 16 | Specifies the number of questions or passages to encode at once. If there are multiple GPUs, this value is the total batch size. Mandatory. |
embed_title | Boolean | True (default), False | Concatenates the title and the document to a text pair that is then used to create the embedding. This is the approach used in the original paper and is likely to improve performance if your titles contain meaningful information for retrieval. The title is expected to be in doc.meta["name"] and you can provide it in the documents before writing them to the DocumentStore like this: {"text": "my text", "meta": {"name": "my title"}}. Mandatory. |
use_fast_tokenizers | Boolean | True (default), False | Uses fast Rust tokenizers. Mandatory. |
similarity_function | String | dot_product (default), cosine | Specifies the function to apply for calculating the similarity of query and passage embeddings during training. Mandatory. |
global_loss_buffer_size | Integer | Default: 150000 | Specifies the buffer size for all_gather() in DDP. Increase this value if you encounter errors similar to "encoded data exceeds max_size...". Mandatory. |
progress_bar | Boolean | True (default), False | Shows a tqdm progress bar. Disabling it in production deployments helps to keep the logs clean. Mandatory. |
devices | String | A list of GPU devices. Example: [torch.device('cuda:0'), "mps", "cuda:1"] | Contains a list of GPU devices to limit inference to certain GPUs and not use all available GPUs. As multi-GPU training is currently not implemented for DPR, training only uses the first device provided in this list. Optional. |
use_auth_token | Union[str, bool] | | Contains the API token used to download private models from Hugging Face. If set to True, the local token is used. You must first create this token using transformers-cli login. For more information, see Transformers > Models. Optional. |
scale_score | Boolean | True (default), False | Scales the similarity score calculated to compare the similarity of the query and the documents to a unit interval in the range of 0 to 1, where 1 means extremely relevant. True - Scales similarity scores that naturally have a different value range, such as cosine or dot_product. False - Uses raw similarity scores. Mandatory. |
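
To show what scaling a score to the unit interval can look like, here is one possible approach. It is an assumption for illustration, not a statement of the exact formula the retriever uses: cosine scores are shifted from the range -1 to 1, and unbounded dot-product scores are squashed with a logistic function whose divisor of 100 is an arbitrary illustrative choice.

```python
import math


def scale_to_unit_interval(score, similarity):
    """Map a raw similarity score to the range 0..1."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    # Dot products are unbounded; squash them with a logistic function.
    return 1 / (1 + math.exp(-score / 100))


identical = scale_to_unit_interval(1.0, "cosine")
opposite = scale_to_unit_interval(-1.0, "cosine")
neutral_dot = scale_to_unit_interval(0.0, "dot_product")
```

Scaled scores are easier to compare and threshold across retrievers, while raw scores keep the original magnitude of the similarity function.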
EmbeddingRetriever Parameters
You must specify the model you want this retriever to use to embed the query and the documents. The embedding model can be a Hugging Face model, an OpenAI model (like "ada", "babbage", "curie"), a Cohere model ("embed-english-v2.0", "embed-english-light-v2.0", "embed-multilingual-v2.0"), or an AWS Bedrock model ("amazon.titan-embed-text-v1", "cohere.embed-english-v3", "cohere.embed-multilingual-v3").
You can also have a look at the list of models that we recommend.
To use models from a provider, connect to model providers first.
These are the parameters you can pass in the pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
embedding_model | String | Example: sentence-transformers/all-MiniLM-L6-v2 | Specifies the path to the embedding model for handling documents and query. This can be the path to a locally saved model or the model's name. Mandatory. |
document_store | String | DeepsetCloudDocumentStore | Specifies the instance of a document store from which the retriever retrieves the documents. deepset Cloud supports DeepsetCloudDocumentStore only. Optional. |
model_version | String | Tag name, branch name, or commit hash | Specifies the version of the model to be used from the Hugging Face model hub. Optional. |
use_gpu | Boolean | True (default), False | Specifies whether to use all available GPUs or the CPU. If no GPU is available, it falls back on the CPU. Mandatory. |
batch_size | Integer | Default: 32 | Specifies the number of documents to encode at once. Mandatory. |
max_seq_len | Integer | Default: 512 | Specifies the maximum number of tokens the document text can have. Longer documents are truncated. Mandatory. |
model_format | String | farm transformers sentence_transformers retribert openai cohere | Specifies the name of the framework used for saving the model or the model type. If you don't provide it, it's inferred from the model configuration files. Optional. |
query_prompt | String | Default: None | Instructions for the model to embed the text of the query. Optional. |
passage_prompt | String | Default: None | Instructions for the model to embed the text of the documents to be retrieved. Optional. |
pooling_strategy | String | cls_token (sentence vector), reduce_mean (default, sentence vector), reduce_max (sentence vector), per_token (individual token vectors) | Specifies the strategy for combining the embeddings from the model. Used for FARM and transformer models only. Mandatory. |
emb_extraction_layer | Integer | Default: -1 (the last layer) | Specifies the index of the layer from which to extract the embeddings. Used for FARM and transformer models only. Mandatory. |
top_k | Integer | Default: 10 | Specifies the number of documents to retrieve. Mandatory. |
progress_bar | Boolean | True (default), False | Shows a tqdm progress bar. Disabling it in production deployments helps to keep the logs clean. Mandatory. |
devices | String | Example: [torch.device('cuda:0'), "mps", "cuda:1"] | Contains a list of GPU devices to limit inference to certain GPUs and not use all available ones. If you set use_gpu=False, this parameter is not used and a single CPU device is used for inference. As multi-GPU training is currently not implemented for EmbeddingRetriever, training only uses the first device provided in this list. Optional. |
use_auth_token | Union[str, bool] | Default: None | The API token for downloading private models from Hugging Face. True - uses the token generated when running transformers-cli login (stored in ~/.huggingface). For more information, see Hugging Face. Optional. |
scale_score | Boolean | True (default), False | Scales the similarity score calculated to measure the similarity between the query and documents to a unit interval in the range of 0 to 1, where 1 means extremely relevant. True - Scales similarity scores that naturally have a different value range, such as cosine or dot_product. False - Uses raw similarity scores. Mandatory. |
embed_meta_fields | List of strings | Default: None | Concatenates the meta fields you specify and the text passage or table to a text pair that is then used to create the embedding. This approach is likely to improve performance if your metadata contain meaningful information for retrieval (for example, topic, entities, and the like). Optional. |
api_key | String | Default: None | The OpenAI API key or the Cohere API key. Required if you want to use OpenAI or Cohere embeddings. For more details, see OpenAI and Cohere documentation. Optional. |
azure_api_version | String | Default: 2022-12-01 | The version of the Azure OpenAI API to use. Mandatory. |
azure_base_url | String | Default: None | The base URL for the Azure OpenAI API. If not supplied, Azure OpenAI API is not used. This parameter is an OpenAI Azure endpoint, usually in the form https://.openai.azure.com Optional. |
azure_deployment_name | String | Default: None | The name of the Azure OpenAI API deployment. If not supplied, Azure OpenAI API is not used. Optional. |
api_base | String | Default: https://api.openai.com/v1 | The OpenAI API base URL. Required. |
openai_organization | String | Default: None | The OpenAI organization ID. For more details, see OpenAI documentation. Optional. |
aws_config | Dictionary[string, any] | Default: None | The aws_config contains {aws_access_key, aws_secret_key, aws_region, profile_name} to use with the boto3 session for an AWS Bedrock retriever. Optional. |
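
The `embed_meta_fields` behavior, pairing selected metadata with the document text before embedding, might look roughly like the sketch below. The `build_text_pair` helper, the newline separator, and the skipping of empty fields are assumptions made for the example, not the retriever's documented behavior.

```python
def build_text_pair(document, embed_meta_fields):
    """Join the chosen meta fields and pair them with the document content."""
    meta_values = [
        str(document["meta"][field])
        for field in embed_meta_fields
        if document["meta"].get(field)  # skip missing or empty fields
    ]
    return ("\n".join(meta_values), document["content"])


# Hypothetical document, for illustration only.
doc = {
    "content": "BM25 saturates term frequency.",
    "meta": {"topic": "retrieval", "author": ""},
}
pair = build_text_pair(doc, ["topic", "author", "missing"])
```

The resulting pair is what would be passed to the embedding model, so informative metadata such as a topic can influence the document's vector.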
FileSimilarityRetriever Parameters
You can configure the following parameters in pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
document_store | String | Default: KeywordDocumentStore | The instance of DeepsetCloudDocumentStore to retrieve from. Mandatory. |
file_aggregation_key | String | Default: file_id | The metadata key from the file metadata that you want to use to aggregate documents to the file level. This is what you pass as query. For example, if you have a metadata key called "file_name" which contains the name of the file, you can set it as the file_aggregation_key. Then, you pass the file_name value as query and the retriever finds documents similar to this file. Mandatory. |
primary_retriever | String | Default: None | The name of the primary retriever to use. Optional. |
secondary_retriever | String | Default: None | The name of the secondary retriever to use. Optional. |
keep_original_score | String | Default: None | Stores the original score of the returned document in the document's metadata. Replaces the document's score property with the reciprocal rank fusion score. Optional. |
top_k | Integer | Default: 10 | The number of documents to return. Mandatory. |
max_query_len | Integer | Default: 6000 | The number of characters the query document can have. If a document is longer than the specified length, it's cut off. Mandatory. |
max_num_queries | Integer | Default: None | The maximum number of queries that can be run for a single file. If the number of query documents exceeds this limit, the query documents are split into n parts so that n < max_num_queries and every nth document is kept. Optional. |
use_existing_embedding | Boolean | True (default), False | Reuses existing embeddings from the index. To optimize the speed, set this to True. This way, the FileSimilarityRetriever can run on the CPU. Mandatory. |
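
One way to read the `max_num_queries` and `max_query_len` descriptions is sketched below: when a file has too many documents, only every step-th one is kept as a query, and each query text is cut off at a character limit. The `select_query_documents` helper and the way the step is computed are an interpretation for illustration, not the retriever's actual code.

```python
import math


def select_query_documents(documents, max_num_queries=None, max_query_len=6000):
    """Thin out and truncate the documents of a file before using them as queries."""
    if max_num_queries is not None and len(documents) > max_num_queries:
        step = math.ceil(len(documents) / max_num_queries)
        documents = documents[::step]  # keep every step-th document
    return [text[:max_query_len] for text in documents]


# Hypothetical file split into 120 documents.
docs = [f"chunk {i}" for i in range(120)]
queries = select_query_documents(docs, max_num_queries=50)
```

Sampling evenly across the file keeps the number of retrieval calls bounded while still covering the beginning, middle, and end of the file.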
FilterRetriever Parameters
These are the parameters you can specify in pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
document_store | String | DeepsetCloudDocumentStore | Specifies the document store from where the retriever fetches the documents. deepset Cloud supports DeepsetCloudDocumentStore only. Optional. |
top_k | Integer | Default: 10 | The number of documents to fetch. Mandatory. |
all_terms_must_match | Boolean | True, False (default) | Specifies if all terms of the query must match the document. True retrieves the document only if all terms from the query are also present in the document. It uses the AND operator implicitly. For example, "good vegetarian restaurant" looks for "good AND vegetarian AND restaurant". False retrieves the document if at least one query term exists in the document. It uses the OR operator implicitly. For example, "good vegetarian restaurant" looks for "good OR vegetarian OR restaurant". Mandatory. |
custom_query | String | | Specifies the custom OpenSearch query. For more information, see Boosting Retrieval with OpenSearch Queries. Optional. |
scale_score | Boolean | True (default), False | Scales the similarity score calculated for the query and the documents to a unit interval in the range of 0 to 1, where 1 means extremely relevant. True - Scales similarity scores that naturally have a different value range, such as cosine or dot_product. False - Uses raw similarity scores. Mandatory. |
TfidfRetriever Parameters
These are the parameters you can configure in pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
document_store | String | DeepsetCloudDocumentStore | Specifies the document store from which the retriever retrieves the documents. deepset Cloud supports DeepsetCloudDocumentStore only. Optional. |
top_k | Integer | Default: 10 | Specifies the number of documents to return for a query. Mandatory. |
auto_fit | Boolean | True (default), False | Specifies whether to automatically update the TF-IDF matrix by calling the fit() method after new documents are added. Mandatory. |