Improving Your Document Search Pipeline
There are multiple ways to improve your document search pipeline: changing the retrieval type, choosing a different model, adding a ranker, and more. This guide explains each of them to help you choose the one that suits your use case.
Changing the Model
If you’re using a vector-based retriever, such as EmbeddingRetriever, the first thing to try is changing the retriever model. Have a look at the retrieval models we recommend and try one of them.
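Switching models only requires changing the `embedding_model` parameter in your pipeline YAML. In the sketch below, the model name is purely illustrative; substitute any retrieval model you want to try. Keep in mind that after changing the embedding model, the documents in your document store must be re-embedded with the new model:

```yaml
components:
  - name: EmbeddingRetriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      # Swap this for the model you want to evaluate.
      # intfloat/e5-base-v2 is used here purely as an illustration.
      embedding_model: intfloat/e5-base-v2
      model_format: sentence_transformers
      top_k: 20
```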
Using the Hybrid Retrieval Approach
Combining a dense and a sparse retriever is another improvement you can try. Sparse, keyword-based retrievers can handle out-of-domain vocabulary and don’t need any training but are not great at capturing semantic nuances.
Dense, vector-based retrievers understand context and semantics but perform best on the domain they were trained on. By using both retriever types in one pipeline, you take advantage of the strengths of each, improving the relevance of your results.
This is an example of a hybrid retrieval pipeline that uses BM25Retriever and EmbeddingRetriever and then merges the documents they fetch using the JoinDocuments component, which removes duplicates and keeps only unique documents.
```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search, trained on 215M (question, answer) pairs from diverse sources
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Pushes the most relevant documents to the top
  - name: FileTypeClassifier # Routes files to the right converter based on their extension, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDF files into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
  - name: indexing
    nodes:
      # Depending on the file type, the files are routed to a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
```
Adding a Ranker
A ranker determines the relevance and order of documents in response to a query. Its primary goal is to put the most relevant documents at the top of the search results. While working with our customers, we've found that rankers are better at determining the relevance of documents than retrievers, so adding one can significantly improve your results.
For ranking documents by relevance, we recommend CohereRanker or SentenceTransformersRanker. For ranking documents by both relevance and recentness, we recommend RecentnessRanker.
Also, check the ranking models we recommend.
Here's an example hybrid retrieval pipeline with a SentenceTransformersRanker:
```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search, trained on 215M (question, answer) pairs from diverse sources
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/ms-marco-MiniLM-L-6-v2 # Fast model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 30 # Keep this equal to or greater than the sum of the two retrievers' top_k values so all documents are processed at once
  - name: FileTypeClassifier # Routes files to the right converter based on their extension, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDF files into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, the files are routed to a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
```
Prioritizing Documents Based on Their Metadata
You may want to prioritize documents based on some information in their metadata, such as the number of likes a document got, the date it was created, or the author. This is possible with BM25Retriever, EmbeddingRetriever, and with the SentenceTransformersRanker and CohereRanker rankers.
Using BM25Retriever
Retrievers fetch documents from the DeepsetCloudDocumentStore, which is built on OpenSearch. BM25Retriever uses a default query to search OpenSearch for documents. You can customize this query to prioritize certain documents.
A couple of notes about this method:
- It's only possible with the BM25Retriever or in hybrid retrieval if the sparse retriever is BM25Retriever.
- OpenSearch queries are keyword-based, but they can be arbitrarily complex, combining multiple metadata fields and scoring functions across all documents in the document store.
- A custom OpenSearch query is applied at retrieval time, which means the prioritization happens as the documents are fetched from the document store, not afterward.
For more guidance and examples of OpenSearch queries, see Boosting Retrieval with OpenSearch Queries.
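To give a flavor of the idea (the linked guide has complete, tested examples), BM25Retriever accepts a `custom_query` parameter that replaces its default OpenSearch query. The sketch below is a hypothetical query that boosts documents by a numeric `likes` metadata field; the field name and boosting factors are assumptions, and `${query}` is the placeholder that gets replaced with the user's query at runtime:

```yaml
- name: BM25Retriever
  type: BM25Retriever
  params:
    document_store: DocumentStore
    top_k: 20
    # Hypothetical boosting query. Assumes each document has a numeric
    # "likes" metadata field; adjust field name and factors to your data.
    custom_query: >
      {
        "query": {
          "function_score": {
            "query": {
              "match": { "content": ${query} }
            },
            "field_value_factor": {
              "field": "likes",
              "factor": 1.2,
              "modifier": "log1p",
              "missing": 0
            }
          }
        }
      }
```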
Using EmbeddingRetriever or Rankers
EmbeddingRetriever, CohereRanker, and SentenceTransformersRanker have a parameter called `embed_meta_fields` where you can pass the metadata fields you want to prioritize. This example uses the `company` metadata field to enhance embedding retrieval:
```yaml
...
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
      embed_meta_fields: [company]
...
```
To use more than one metadata field, separate them with commas like this:

```yaml
params:
  embed_meta_fields: [field1, field2]
```
Prioritizing Recent Documents
In certain document search systems, it’s important to retrieve the most recent documents. For example, when using a document search system that runs on news articles, you’d probably want the latest news to show first.
You can prioritize the latest documents by:
- Using RecentnessRanker
- Using a custom OpenSearch query
- Using both methods in one pipeline
RecentnessRanker
RecentnessRanker takes into account both the relevance and the recentness of a document. You can prioritize one over the other using the `weight` parameter.
To rank documents by recentness, it uses a metadata field containing the date and sorts the documents newest first. Note that RecentnessRanker ranks documents after the retriever fetches them from the document store. We recommend RecentnessRanker if your pipeline uses EmbeddingRetriever only.
Here’s an example of a pipeline that uses RecentnessRanker:
```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: EmbeddingRetriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: PM-AI/bi-encoder_msmarco_bert-base_german
      model_format: sentence_transformers
      top_k: 30
      scale_score: false
      embed_meta_fields: [ressort, file_name]
  - name: BM25Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 30
  - name: JoinDocuments
    type: JoinDocuments
    params:
      top_k_join: 30
      join_mode: reciprocal_rank_fusion
  - name: Reranker
    type: SentenceTransformersRanker
    params:
      model_name_or_path: svalabs/cross-electra-ms-marco-german-uncased
      top_k: 15
  - name: RecentnessReranker
    type: RecentnessRanker
    params:
      date_meta_field: date_first_released
      top_k: 8
      method: score
  - name: QueryClassifier
    type: TransformersQueryClassifier
    params:
      model_name_or_path: JasperLS/gelectra-base-injection-pt_v1
      labels: ['LEGIT', 'INJECTION']
  - name: PromptNode
    type: PromptNode
    params:
      default_prompt_template: deepset/question-answering
      max_length: 650
      model_kwargs:
        temperature: 0
      model_name_or_path: gpt-3.5-turbo
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: TextConverter
    type: TextConverter
  - name: PDFConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      language: de
      split_by: word
      split_length: 150
      split_overlap: 10
      split_respect_sentence_boundary: true

pipelines:
  - name: query
    nodes:
      - name: QueryClassifier
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [QueryClassifier.output_1]
      - name: BM25Retriever
        inputs: [QueryClassifier.output_1]
      - name: JoinDocuments
        inputs: [EmbeddingRetriever, BM25Retriever]
      - name: Reranker
        inputs: [JoinDocuments]
      - name: RecentnessReranker
        inputs: [Reranker]
      - name: PromptNode
        inputs: [RecentnessReranker]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
```
OpenSearch Query
You can construct an OpenSearch query that prioritizes the most recent documents. To use this approach, your pipeline must use BM25Retriever, either on its own or combined with a vector-based retriever, as only BM25Retriever supports custom OpenSearch queries.
A big advantage of this approach is that, unlike RecentnessRanker, which ranks documents after they're retrieved, BM25Retriever applies the custom query when fetching documents from the document store. This ensures that the documents matching the criteria are actually retrieved from the database.
For an example of such a query, see Boosting Retrieval with OpenSearch Queries.
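As a rough illustration of what such a query can look like (see the linked guide for complete examples), the sketch below adds a Gaussian decay function on the `date_first_released` metadata field, so a document's score decreases the older it is. The field name and decay settings are assumptions; adjust them to your data:

```yaml
- name: BM25Retriever
  type: BM25Retriever
  params:
    document_store: DocumentStore
    top_k: 20
    # Hypothetical recency-boosting query. Assumes documents have a
    # date_first_released metadata field in a date format OpenSearch understands.
    custom_query: >
      {
        "query": {
          "function_score": {
            "query": {
              "match": { "content": ${query} }
            },
            "gauss": {
              "date_first_released": {
                "origin": "now",
                "scale": "180d",
                "decay": 0.5
              }
            }
          }
        }
      }
```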
Combining Both
You can use both RecentnessRanker and a custom OpenSearch query in your document retrieval system. This way, you ensure that the retriever fetches the most recent documents from the document store, and then the ranker orders them by most recent before returning them as results to the user.
Combining both methods makes sense for pipelines using the hybrid retrieval approach (a keyword-based and a vector-based retriever). The custom query makes BM25Retriever return more recent results, but EmbeddingRetriever doesn't consider the date at all. This means the results returned by JoinDocuments, the node that combines the documents from both retrievers, could be worse than those of BM25Retriever alone because the documents retrieved by EmbeddingRetriever would be ranked too high. To overcome this, add a RecentnessRanker after the JoinDocuments node to ensure the final ranking takes recentness into account.
Finding the Optimal PreProcessing Configuration
How your documents are preprocessed and prepared for querying influences the efficiency and accuracy of your app. PreProcessor is the component that chunks and cleans up your files during indexing. You can modify its settings so that the resulting documents are clean, structured, and normalized to ensure efficient querying.
Here are the preprocessor settings you can modify to improve your app:
- `split_by` - This specifies the unit by which you want to chunk your documents. Splitting by word is the safest option in most cases. You can consider splitting by `passage` if your documents have a clear paragraph structure, with each paragraph describing one idea.
- `split_respect_sentence_boundary` - This setting should be `True` unless your source documents contain a lot of bullet point lists, markdown, or other forms of structured text. In those cases, setting `split_respect_sentence_boundary: True` may result in really long documents.
- `split_length` - This setting defines the maximum size of your documents. For example, if you choose to split your documents by word, `split_length` defines the maximum number of words a document can have. This value can be bigger for keyword-based retrievers, but for vector-based retrievers, it must fit within the token limit of the retriever's model. It's best to check how many tokens the model was trained on and then set a value that fits within this limit. In our pipeline templates, we set this value to 250 words, which seems to work best in most cases. If your pipeline uses hybrid retrieval, adjust `split_length` to the token limit of the vector-based retriever's model.
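Putting these settings together, a PreProcessor configured along the lines discussed above could look like this (the values are the ones suggested in this section, not the only valid choices):

```yaml
- name: Preprocessor
  type: PreProcessor
  params:
    split_by: word # Splitting by word is the safest option in most cases
    split_length: 250 # Keep within the token limit of your retriever's model
    split_overlap: 30 # Sliding window so context isn't lost at chunk borders
    split_respect_sentence_boundary: True # Set to False for heavily structured text
    language: en # Helps detect sentence boundaries for that language
```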