Improving Your Document Search Pipeline

There are multiple ways to improve your document search pipeline, such as changing the retrieval type, choosing a different model, or adding a ranker. This guide explains each of them, helping you choose the one that suits your use case.

Changing the Model

If you’re using a vector-based retriever, such as EmbeddingRetriever, the first thing to try is changing the retriever model. Have a look at the retrieval models we recommend and try one of them.

Using the Hybrid Retrieval Approach

Combining a dense and a sparse retriever is another improvement you can try. Sparse, keyword-based retrievers can handle out-of-domain vocabulary and don’t need any training but are not great at capturing semantic nuances.

Dense, vector-based retrievers can understand the context and semantics but perform best on the domain they were trained on. By using both retriever types in one pipeline, you take advantage of the strengths of both, improving the quality of your results.

This is an example of a hybrid retrieval pipeline that uses the BM25Retriever and the EmbeddingRetriever, and then merges the documents they fetch using the JoinDocuments component. JoinDocuments removes duplicates and leaves only unique documents.



components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore 
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Pushes the most relevant documents to the top
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
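The reciprocal_rank_fusion join mode used by JoinDocuments above can be sketched in plain Python. This is a conceptual sketch of the standard RRF formula with the conventional constant k=60, not the exact JoinDocuments implementation:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document IDs.

    Each document's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so documents ranked highly by multiple retrievers
    are pushed to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_a", "doc_b", "doc_c"]
dense_results = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([bm25_results, dense_results]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note how doc_b, which both retrievers rank highly, ends up first, while doc_c, found by only one retriever at a low rank, ends up last.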

Adding a Ranker

A ranker determines the relevance and order of documents in response to a query. Its primary goal is to present the most relevant documents at the top of the search results. While working with our customers, we found that rankers are better at judging the relevance of documents than retrievers, so adding a ranker can significantly improve the results.

For ranking documents by relevance, we recommend using CohereRanker or SentenceTransformersRanker. For ranking documents on both relevance and recentness, we recommend RecentnessRanker.

Also, check the ranking models we recommend.
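Conceptually, a ranker scores each query-document pair and reorders the documents by that score. The sketch below illustrates this with a toy word-overlap scorer standing in for a real cross-encoder model; it is not how SentenceTransformersRanker is implemented:

```python
def rerank(query, documents, score_fn, top_k=20):
    """Conceptual reranker: score every (query, document) pair with
    score_fn and keep the top_k highest-scoring documents."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: count words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["Berlin is the capital of Germany", "Paris is the capital of France"]
print(rerank("capital of Germany", docs, overlap_score, top_k=1))
# → ['Berlin is the capital of Germany']
```

A real cross-encoder such as cross-encoder/ms-marco-MiniLM-L-6-v2 feeds the query and document through the model together, which is why rankers judge relevance more accurately (but more slowly) than retrievers.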

Here's an example hybrid retrieval pipeline with a SentenceTransformersRanker:


 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/ms-marco-MiniLM-L-6-v2 # Fast model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 30  # Try to keep this number equal to or greater than the sum of the top_k of the two retrievers so all docs are processed at once
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]

Prioritizing Documents Based on Their Metadata

You may want to prioritize documents based on information in their metadata, such as the number of likes a document got, the date it was created, or the author. This is possible with BM25Retriever, EmbeddingRetriever, and the SentenceTransformersRanker and CohereRanker rankers.

Using BM25Retriever

Retrievers fetch documents from DeepsetCloudDocumentStore, which is built on OpenSearch. BM25Retriever uses a default query to search OpenSearch for documents, and you can customize this query to prioritize certain documents.

A couple of notes about this method:

  • It's only possible with the BM25Retriever or in hybrid retrieval if the sparse retriever is BM25Retriever.
  • OpenSearch queries are keyword-based, but they can be quite elaborate, taking into account multiple metadata fields and ranking criteria across all documents in the document store.
  • The custom query is applied at retrieval time, which means the prioritization happens while the documents are being fetched from the document store, not afterward.

For more guidance and examples of OpenSearch queries, see Boosting Retrieval with OpenSearch Queries.
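For illustration only, such a custom query could use OpenSearch's function_score to boost documents by a likes metadata field. The field name and boost settings below are hypothetical placeholders, and ${query} is the placeholder the retriever fills in with the user's query:

```json
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "${query}" }
      },
      "field_value_factor": {
        "field": "likes",
        "factor": 1.2,
        "modifier": "log1p",
        "missing": 0
      }
    }
  }
}
```

Here, the BM25 relevance score of each match is multiplied by log(1 + 1.2 × likes), so frequently liked documents rise in the results without completely overriding relevance.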

Using EmbeddingRetriever or Rankers

EmbeddingRetriever, CohereRanker, and SentenceTransformersRanker have a parameter called embed_meta_fields where you can pass the metadata fields you want to prioritize.

This example uses the company metadata field to enhance embedding retrieval:

...
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
      embed_meta_fields:
        [company]
        ...
 

To use more than one metadata field, simply separate them with a comma like this:

params:
	embed_meta_fields:
		[field1, field2]
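Loosely speaking, embed_meta_fields prepends the values of the selected metadata fields to the document text before it is embedded, so the metadata influences the document's vector. A simplified sketch (the exact formatting Haystack uses may differ):

```python
def text_to_embed(doc, embed_meta_fields):
    """Sketch of what embed_meta_fields does conceptually: prepend the
    chosen metadata values to the document content before embedding."""
    meta_values = [str(doc["meta"][f]) for f in embed_meta_fields if f in doc["meta"]]
    return "\n".join(meta_values + [doc["content"]])

doc = {"content": "Quarterly report", "meta": {"company": "ACME"}}
print(text_to_embed(doc, ["company"]))
# → ACME
#   Quarterly report
```

Because the embedding model sees the metadata as part of the text, a query mentioning the company name is more likely to retrieve that company's documents.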

Prioritizing Recent Documents

In certain document search systems, it’s important to retrieve the most recent documents. For example, when using a document search system that runs on news articles, you’d probably want the latest news to show first.

You can prioritize the latest documents by:

  • Using RecentnessRanker
  • Using a custom OpenSearch query
  • Using both methods in one pipeline

RecentnessRanker

RecentnessRanker takes into account the relevance and recentness of the document. You can prioritize one over the other using the weight parameter.
To prioritize documents based on their recentness, it uses a metadata field containing the date and ranks newer documents higher. One thing to note here is that RecentnessRanker ranks documents only after the retriever fetches them from the document store. We recommend RecentnessRanker if your pipeline uses EmbeddingRetriever only.
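To make the weight parameter concrete, here is a loose Python sketch of blending a relevance score with a recency score. This is an illustration only, not the actual RecentnessRanker implementation; the document format and the [0, 1] score scale are assumptions:

```python
from datetime import datetime, timezone

def rank_by_recentness(docs, date_field="date_first_released", weight=0.5):
    """Blend relevance and recency. weight=1.0 ranks purely by relevance,
    weight=0.0 purely by recency. Each doc is assumed to be a dict with a
    'score' in [0, 1] and an ISO-format date in its 'meta'."""
    now = datetime.now(timezone.utc)
    ages = [(now - datetime.fromisoformat(d["meta"][date_field])).total_seconds()
            for d in docs]
    oldest = max(ages) or 1.0
    blended = []
    for doc, age in zip(docs, ages):
        recency = 1.0 - age / oldest  # newest -> 1.0, oldest -> 0.0
        blended.append((weight * doc["score"] + (1 - weight) * recency, doc))
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in blended]
```

With weight=0.0, a fresh but less relevant document outranks an older, more relevant one; with weight=1.0, only relevance counts.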

Here’s an example of a pipeline that uses RecentnessRanker:

components:
- name: DocumentStore
  type: DeepsetCloudDocumentStore
- name: EmbeddingRetriever
  type: EmbeddingRetriever
  params:
    document_store: DocumentStore
    embedding_model: PM-AI/bi-encoder_msmarco_bert-base_german
    model_format: sentence_transformers
    top_k: 30
    scale_score: false
    embed_meta_fields:
      [ressort, file_name]
- name: BM25Retriever
  type: BM25Retriever
  params:
    document_store: DocumentStore
    top_k: 30
- name: JoinDocuments
  type: JoinDocuments
  params:
    top_k_join: 30
    join_mode: reciprocal_rank_fusion
- name: Reranker
  type: SentenceTransformersRanker
  params:
    model_name_or_path: svalabs/cross-electra-ms-marco-german-uncased
    top_k: 15
- name: RecentnessReranker
  type: RecentnessRanker
  params:
    date_meta_field: date_first_released
    top_k: 8
    method: score
- name: QueryClassifier
  type: TransformersQueryClassifier
  params:
    model_name_or_path: JasperLS/gelectra-base-injection-pt_v1
    labels: ['LEGIT','INJECTION']
- name: PromptNode
  type: PromptNode
  params:
    default_prompt_template: deepset/question-answering
    max_length: 650
    model_kwargs:
      temperature: 0
    model_name_or_path: gpt-3.5-turbo
- name: FileTypeClassifier
  type: FileTypeClassifier
- name: TextConverter
  type: TextConverter
- name: PDFConverter
  type: PDFToTextConverter
- name: Preprocessor
  params:
    language: de
    split_by: word
    split_length: 150
    split_overlap: 10
    split_respect_sentence_boundary: true
  type: PreProcessor

pipelines:
- name: query
  nodes:
    - name: QueryClassifier
      inputs: [Query]
    - name: EmbeddingRetriever
      inputs: [QueryClassifier.output_1]
    - name: BM25Retriever
      inputs: [QueryClassifier.output_1]
    - name: JoinDocuments
      inputs: [EmbeddingRetriever, BM25Retriever]
    - name: Reranker
      inputs: [JoinDocuments]
    - name: RecentnessReranker
      inputs: [Reranker]
    - name: PromptNode
      inputs: [RecentnessReranker]

- name: indexing
  nodes:
  - inputs:
    - File
    name: FileTypeClassifier
  - inputs:
    - FileTypeClassifier.output_1
    name: TextConverter
  - inputs:
    - FileTypeClassifier.output_2
    name: PDFConverter
  - inputs:
    - TextConverter
    - PDFConverter
    name: Preprocessor
  - inputs:
    - Preprocessor
    name: EmbeddingRetriever
  - inputs:
    - EmbeddingRetriever
    name: DocumentStore

OpenSearch Query

You can construct an OpenSearch query that prioritizes the most recent documents. To use this approach, your app must use BM25Retriever, either on its own or combined with a vector-based retriever, because only BM25Retriever supports custom OpenSearch queries.

A big advantage of this approach is that, unlike RecentnessRanker, which ranks documents after they have been retrieved, BM25Retriever applies this query while retrieving documents from the document store. This ensures that documents matching the criteria are actually fetched from the database.
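As an illustrative sketch, such a query could use OpenSearch's gauss decay function to score documents lower the older they are. The date_first_released field name and the 30-day scale below are hypothetical placeholders you'd adapt to your metadata, and ${query} is the placeholder the retriever fills in with the user's query:

```json
{
  "query": {
    "function_score": {
      "query": {
        "match": { "content": "${query}" }
      },
      "gauss": {
        "date_first_released": {
          "origin": "now",
          "scale": "30d",
          "decay": 0.5
        }
      }
    }
  }
}
```

With these settings, a document dated 30 days ago has its relevance score halved, and older documents decay further, so recent matches are favored already at retrieval time.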

For an example of such a query, see Boosting Retrieval with OpenSearch Queries.

Combining Both

You can use both RecentnessRanker and a custom OpenSearch query in your document retrieval system. This way, you ensure that the retriever fetches the most recent documents from the document store, and then the ranker orders them by most recent before returning them as results to the user.

Combining both methods makes sense for pipelines using the hybrid retrieval approach (a keyword-based and a vector-based retriever). The custom query makes BM25Retriever return more recent results, but EmbeddingRetriever doesn't consider the date at all. As a result, the documents returned by JoinDocuments, the node that combines the output of both retrievers, could be worse than those from BM25Retriever alone, because documents retrieved by EmbeddingRetriever would be ranked too high. To overcome this, add a RecentnessRanker after the JoinDocuments node to ensure the resulting documents have their recentness taken into account.

Finding the Optimal PreProcessing Configuration

How your documents are preprocessed and prepared for querying influences the efficiency and accuracy of your app. PreProcessor is the component that chunks and cleans up your files during indexing. You can modify its settings so that the resulting documents are clean, structured, and normalized to ensure efficient querying.

Here are the preprocessor settings you can modify to improve your app:

  • split_by - This specifies the unit by which you want to chunk your documents. Splitting by words is the safest option in most cases.
    You can consider splitting by passage if your documents have a clear paragraph structure, with each paragraph describing one idea.
  • split_respect_sentence_boundary - This setting should be True unless your source documents contain a lot of bullet-point lists, Markdown, or other forms of structured text. In those cases, setting split_respect_sentence_boundary to True may result in very long documents.
  • split_length - This setting defines how big your documents can be at maximum. For example, if you choose to split your documents by words, split_length defines the maximum number of words your documents can have.
    This value can be bigger for keyword-based retrievers, but for vector-based retrievers, it must fit within the token limit of the retriever’s model. It’s best to check the maximum sequence length the model was trained with and then set a value that fits within this limit. In our pipeline templates, we set this value to 250 words, which works best in most cases.
    If your pipeline uses hybrid retrieval, adjust split_length to the token limit of the vector-based retriever model.
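The sliding-window behavior of split_length and split_overlap can be sketched as a simplified word splitter (this ignores sentence boundaries and cleaning, which the real PreProcessor also handles):

```python
def split_by_word(text, split_length=250, split_overlap=30):
    """Split text into chunks of up to split_length words, where each
    chunk repeats the last split_overlap words of the previous chunk."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
print(split_by_word(text, split_length=4, split_overlap=1))
# → ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap ensures that a sentence or idea cut at a chunk boundary still appears in full in the neighboring chunk, which helps vector-based retrievers embed complete thoughts.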