Retriever

The Retriever is a lightweight filter that goes through the full document store and selects a set of candidate documents relevant to the query.

You can combine a Retriever with a Reader in your pipeline to speed up the search. The Retriever passes the documents it selected on to the Reader. This way, it saves the Reader from doing more work than it needs to.
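This division of labor can be sketched in plain Python. The `retrieve` and `read` functions below are hypothetical stand-ins, not the Haystack API: the cheap retrieval step filters the full store so the expensive reading step only sees a few candidates.

```python
# Toy sketch of the Retriever -> Reader division of labor.
# `retrieve` and `read` are hypothetical stand-ins, not Haystack APIs.

def retrieve(query, document_store, top_k=2):
    """Cheap filter: keep only documents sharing keywords with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc.lower().split())), doc) for doc in document_store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def read(query, candidates):
    """Expensive step: runs only on the candidates, not the full store."""
    return max(candidates, key=len)  # placeholder for a neural Reader

store = ["Paris is the capital of France",
         "The Nile is a river in Africa",
         "France borders Spain and Italy"]

candidates = retrieve("capital of France", store)
answer = read("capital of France", candidates)
```

Here the Reader stand-in processes two candidates instead of the whole store; with a real neural Reader and a large store, that difference dominates query latency.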

If you use a Retriever on its own, it returns whole documents as answers to the query.

Retrievers take a query as input and provide Documents as output.

Sparse and Dense Retrievers

There are two categories of retrievers: dense and sparse. Dense retrievers work with document embeddings. They embed both the documents and the query using deep neural networks, and then return the nearest neighbor documents as top candidates for the query.

Sparse retrievers are keyword-based. They look for shared keywords between documents and the query.
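The two scoring styles can be contrasted with a minimal sketch (the embedding vectors are made up for illustration; real dense retrievers produce them with a neural encoder):

```python
# Toy contrast between sparse (keyword overlap) and dense (embedding
# similarity) scoring. The vectors below are invented for illustration.

def sparse_score(query, doc):
    """Count keywords shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def dense_score(query_vec, doc_vec):
    """Dot product between query and document embeddings."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

# A sparse retriever scores "car" vs. "automobile" as zero overlap,
# while a dense retriever can still rank them close together if their
# (hypothetical) embeddings point in similar directions.
no_overlap = sparse_score("car repair", "automobile maintenance")
similarity = dense_score([0.9, 0.1, 0.3], [0.8, 0.2, 0.4])
```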

This is how sparse and dense retrievers compare:

Sparse retrievers:

  • Retrievers: TfidfRetriever, ElasticsearchRetriever (BM25), ElasticsearchFilterOnlyRetriever
  • Features:
    • Simple but effective
    • Don't need to be trained
    • Work on any language
    • Don't take word order and syntax into account
    • Handle out-of-vocabulary words
  • Indexing: Done by creating an inverted index. Faster and less expensive than in dense retrievers.

Dense retrievers:

  • Retrievers: EmbeddingRetriever, DensePassageRetriever
  • Features:
    • Powerful but computationally more expensive, especially during indexing
    • Trained using labeled datasets
    • Language-specific
    • Use transformer-based encoders, which take word order and syntax into account
    • Have problems with out-of-vocabulary words
    • Can build strong semantic representations of text
  • Indexing: Done by processing all documents through a neural network and storing the resulting vectors. Requires significant computational power and time.

Usage

Retrievers are usually used as pipeline nodes, not on their own. If you want to experiment with a retriever, you can initialize it without adding it to a pipeline. A retriever always takes a document store as an argument. To initialize a retriever:

  1. Import the document store that you want to use with the retriever.
  2. Import the retriever.
  3. Initialize the retriever passing the document store as its argument. Specify all other parameters for the retriever. You can find the details in the sections that follow.
# Import the document store from Haystack:
from haystack.document_stores import DeepsetCloudDocumentStore

# Import the retriever:
from haystack.nodes import TfidfRetriever

# Set the document store:
document_store = DeepsetCloudDocumentStore()

# Initialize the retriever:
retriever = TfidfRetriever(document_store)

Arguments:

  • Retriever types:
    • TfidfRetriever
    • ElasticsearchRetriever
    • ElasticsearchFilterOnlyRetriever
    • EmbeddingRetriever
    • DensePassageRetriever
  • Document stores:
    • DeepsetCloudDocumentStore

Available Retrievers

When choosing the retriever type for your pipeline, first decide between dense and sparse retrievers and then consider the type of data you want to search on and the datastore you want to use.

TfidfRetriever

This retriever is based on the Term Frequency (TF) - Inverse Document Frequency (IDF) algorithm, which weights a word by how often it occurs in a document (TF) and by how rare it is across all documents (IDF).

Main Features

  • Favors documents that have more lexical overlap with the query
  • Favors words that occur in fewer documents over words that occur in many documents
  • Doesn't need a neural network
  • Doesn't need document embeddings
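The weighting behind these features can be sketched in a few lines of plain Python (a simplified illustration, not the retriever's actual implementation):

```python
import math

# Minimal TF-IDF sketch: a word's weight grows with its count in a
# document (TF) and shrinks with the number of documents containing it (IDF).

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quantum computing uses qubits",
]

def tfidf(word, doc, corpus):
    tf = doc.split().count(word)                          # occurrences in this document
    df = sum(1 for d in corpus if word in d.split())      # documents containing the word
    idf = math.log(len(corpus) / df) if df else 0.0       # rarer words score higher
    return tf * idf

# "the" appears in two documents, so it is weighted down;
# "qubits" appears in only one, so it is weighted up.
common = tfidf("the", docs[0], docs)
rare = tfidf("qubits", docs[2], docs)
```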

Document Stores You Can Use It With

DeepsetCloudDocumentStore

Usage

Here are the arguments you can specify for TfidfRetriever:

  • document_store (String): The name of the document store. Specifies the instance of a document store from which the retriever retrieves the documents. Currently, deepset Cloud supports DeepsetCloudDocumentStore only.
  • top_k (Integer): Specifies the number of documents to return for a query.
  • auto_fit (Boolean, True/False): Specifies whether to automatically update the TF-IDF matrix by calling the fit() method after new documents are added.

ElasticsearchRetriever (BM25)

Like TfidfRetriever, this retriever is based on the TF-IDF algorithm. It uses BM25, an improved ranking function built on TF-IDF.

Main Features

  • Doesn't need a neural network for indexing
  • Saturates term frequency (TF) after a set number of occurrences of the given term in the document
  • Favors short documents over long documents if they have the same amount of word overlap with the query
  • Combines with the Elasticsearch document store only.
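The saturation and length-normalization behavior can be illustrated with a toy version of BM25's term-frequency component (a sketch for intuition, not the retriever's actual implementation; k1 and b are the standard BM25 tuning constants):

```python
# Sketch of BM25's term-frequency saturation: unlike raw TF-IDF, a term's
# contribution flattens out as its count in a document grows, and shorter
# documents are favored over longer ones at equal overlap.

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    norm = 1 - b + b * (doc_len / avg_doc_len)   # document-length normalization
    return (tf * (k1 + 1)) / (tf + k1 * norm)

# Going from 1 to 2 occurrences helps a lot; 10 to 11 barely moves the score.
low_gain = bm25_tf(2, 100, 100) - bm25_tf(1, 100, 100)
high_gain = bm25_tf(11, 100, 100) - bm25_tf(10, 100, 100)
```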

Document Stores You Can Use It With

DeepsetCloudDocumentStore

Usage

Here are the arguments you can specify for ElasticsearchRetriever:

  • document_store (String): The name of the document store. Specifies the instance of a document store from which the retriever retrieves the documents. Currently, deepset Cloud supports DeepsetCloudDocumentStore only.
  • custom_query (String): Specifies a query string with a mandatory query placeholder ${query}.
  • top_k (Integer): Specifies the number of documents to return for a query.

ElasticsearchFilterOnlyRetriever

A naive retriever that returns all documents matching the filters that you specify. It is helpful for benchmarking, testing, or QA on a small set of documents.

Main Features

  • Filters documents on metadata
  • Recommended for use with another retriever when you want to be able to filter your documents by certain attributes

Document Stores You Can Use It With

DeepsetCloudDocumentStore

EmbeddingRetriever

With EmbeddingRetriever, you can use a transformer model to encode documents and queries. Sentence-transformers models are well suited to this kind of retrieval. EmbeddingRetriever uses a single embedding model to encode both queries and documents.

Choosing the Right Model

deepset Cloud loads models directly from Hugging Face. If you're new to NLP, choosing the right model may be a difficult task. To make it easier, we suggest searching for a model on Hugging Face:

  1. Go to Hugging Face and click Models on the top menu.
  2. From the Tasks on the left, select Sentence Similarity and filter the models by Most Downloads. You get a list of the most popular models. It's best to start with one of them.

You can also have a look at the list of models that we recommend.

To use a private Hugging Face model, connect deepset Cloud with Hugging Face first.

Main Features

  • Uses a transformer model
  • Uses one model to handle queries and documents

Document Stores You Can Use It With

DeepsetCloudDocumentStore

Usage

Here are the arguments you can specify for EmbeddingRetriever:

  • document_store (String): The name of the document store. Specifies the instance of a document store from which the retriever retrieves the documents. Currently, deepset Cloud supports DeepsetCloudDocumentStore only.
  • embedding_model (String): Local path or the name of the model in the Hugging Face model hub, for example sentence-transformers/all-MiniLM-L6-v2. Specifies the path to the embedding model for handling documents and queries.
  • model_version (String, tag name, branch name, or commit hash): Specifies the version of the model to use from the Hugging Face model hub.
  • use_gpu (Boolean, True/False): Specifies whether to use all available GPUs or the CPU. Falls back on the CPU if no GPU is available.
  • batch_size (Integer): Specifies the number of documents to encode at once.
  • max_seq_len (Integer): Specifies the maximum number of tokens for the document text. Longer documents are truncated.
  • model_format (String, one of: farm, transformers, sentence_transformers): Specifies the name of the framework used for training the model.
  • pooling_strategy (String, one of: cls_token (sentence vector), reduce_mean (sentence vector), reduce_max (sentence vector), per_token (individual token vectors)): Specifies the strategy for combining the embeddings from the model. Used for FARM and transformers models only.
  • emb_extraction_layer (Integer, default: -1, the last layer): Specifies the layer from which to extract the embeddings. Used for FARM and transformers models only.
  • progress_bar (Boolean, True/False): Displays the progress bar during embedding.
  • devices (String, a list of GPU devices): Contains a list of GPU devices to limit inference to certain GPUs instead of using all available ones. Note: Multi-GPU training is not implemented.
  • use_auth_token (Boolean, True/False): Specifies the API token used to download private models from Hugging Face. If set to True, the local token is used.

DensePassageRetriever (DPR)

DPR is a high-performing retriever that uses dense embeddings to calculate document relevance. It is a bi-encoder, which means that it uses two different embedding models: one to embed the query and one to embed the documents. This design boosts the accuracy of the results it returns.

Choosing the Right Model

For DPR, you need to provide two models: one for the query and one for the documents. The two models must be trained on the same data. The easiest way to start is to go to Hugging Face and search for dpr. You get a list of DPR models sorted by Most Downloads, which means that the models at the top of the list are the most popular ones. Choose a ctx_encoder and a question_encoder model. You can also have a look at the list of models that we recommend.

If you want to use a private model hosted on Hugging Face, connect deepset Cloud with Hugging Face first.

To use a model, just type its Hugging Face location as the retriever parameter. deepset Cloud takes care of loading the model.

Main Features

  • Uses one BERT base model to encode documents
  • Uses one BERT base model to encode queries
  • Ranks documents by dot product similarity between query and document embeddings
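The bi-encoder idea can be sketched in plain Python. The two `encode_*` functions below are hypothetical stand-ins for the two BERT-base models; the point is that queries and passages go through different encoders, and ranking is by dot product:

```python
# Toy bi-encoder sketch: two *different* (hypothetical) encoders produce
# embeddings for queries and passages, and passages are ranked by the dot
# product of those embeddings. Real DPR uses two trained BERT-base models.

def encode_query(text):
    # stand-in for the query (question) encoder
    return [text.count("france") + 0.5, len(text) / 100]

def encode_passage(text):
    # stand-in for the passage (context) encoder; note it differs on purpose
    return [text.count("france") + 0.2, len(text) / 50]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

passages = ["france is in europe", "tokyo is in japan"]
q = encode_query("where is france")
ranked = sorted(passages, key=lambda p: dot(q, encode_passage(p)), reverse=True)
```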

Document Stores You Can Use It With

DeepsetCloudDocumentStore

Usage

Here are the arguments you can specify for DensePassageRetriever:

  • document_store (String): The name of the document store. Specifies the instance of a document store from which the retriever retrieves the documents. Currently, deepset Cloud supports DeepsetCloudDocumentStore only.
  • query_embedding_model (String): Local path or the name of the model in the Hugging Face model hub, for example facebook/dpr-question_encoder-single-nq-base. Specifies the path to the embedding model for handling the query.
  • passage_embedding_model (String): Local path or the name of the model in the Hugging Face model hub, for example facebook/dpr-ctx_encoder-single-nq-base. Specifies the path to the embedding model for handling the passages.
  • model_version (String, tag name, branch name, or commit hash): Specifies the version of the model to use from the Hugging Face model hub.
  • max_seq_len_query (Integer): Specifies the maximum number of tokens for the query text. Longer queries are truncated.
  • max_seq_len_passage (Integer): Specifies the maximum number of tokens for the passage text. Longer passages are truncated.
  • top_k (Integer): Specifies the number of documents to return per query.
  • use_gpu (Boolean, True/False): Uses all available GPUs or the CPU. Falls back on the CPU if no GPU is available.
  • batch_size (Integer): Specifies the number of questions or passages to encode at once. If there are multiple GPUs, this value is the total batch size.
  • embed_title (Boolean, True/False): Concatenates the title and the passage into a text pair that is then used to create the embedding. This is the approach used in the original paper and is likely to improve performance if your titles contain meaningful information for retrieval. The title is expected to be in doc.meta["name"], and you can supply it in the documents before writing them to the DocumentStore like this: {"text": "my text", "meta": {"name": "my title"}}.
  • use_fast_tokenizers (Boolean, True/False): Uses fast Rust tokenizers.
  • infer_tokenizer_classes (Boolean, True/False): Infers the tokenizer class from the model configuration or name. If set to False, the class always loads DPRQuestionEncoderTokenizer and DPRContextEncoderTokenizer.
  • similarity_function (String, dot_product (default) or cosine): Specifies the function to apply for calculating the similarity of query and passage embeddings during training.
  • global_loss_buffer_size (Integer): Specifies the buffer size for allgather() in DDP. Increase this value if you encounter errors similar to "encoded data exceeds max_size...".
  • progress_bar (Boolean, True/False): Shows a tqdm progress bar. Disabling it in production deployments helps to keep the logs clean.
  • devices (String, a list of GPU devices): Contains a list of GPU devices to limit inference to certain GPUs instead of using all available GPUs. As multi-GPU training is currently not implemented for DPR, training only uses the first device provided in this list.
  • use_auth_token (String or Boolean): Contains the API token used to download private models from Hugging Face. If set to True, the local token is used. You must first create this token using transformers-cli login. For more information, see Transformers > Models.

Related Links