Improving Your Document Search Pipeline

There are multiple ways to improve your document search pipeline, such as changing the retrieval type, choosing a different model, or adding a ranker. This guide explains each of them to help you choose the one that suits your use case.

Document search pipelines are the first step of RAG pipelines. It's crucial to ensure they retrieve the correct documents, as these form the basis for the LLM's generated answers.

Changing the Model

Embedders are the pipeline components that turn strings (TextEmbedders) or documents (DocumentEmbedders) into embeddings. TextEmbedders are used before vector retrievers, such as OpenSearchEmbeddingRetriever, in query pipelines, while DocumentEmbedders are used in indexing pipelines. If your pipeline uses a vector retriever with an embedder, the first improvement to try is changing the embedder's model. Check the retrieval models we recommend and try one of them.

It's important to know that the DocumentEmbedder in your indexing pipeline and the TextEmbedder in your query pipeline must use the same model. Otherwise, the query embeddings and the document embeddings end up in different vector spaces, and similarity search returns meaningless results.
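For example, in pipeline YAML, both embedders point to the same model. This is a minimal sketch using Haystack's sentence-transformers embedders; the model name is just an example, and the component names are illustrative:

```yaml
# In the indexing pipeline:
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: "intfloat/e5-base-v2"

# In the query pipeline:
  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: "intfloat/e5-base-v2"
```

If you switch to a different model, change it in both pipelines and reindex your documents so the stored embeddings match the new model.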

Using the Hybrid Retrieval Approach

Another improvement you can try is combining a vector and a keyword retriever. Keyword retrievers, such as OpenSearchBM25Retriever, can handle out-of-domain vocabulary and don’t need any training but are not great at capturing semantic nuances.

Vector retrievers, like OpenSearchEmbeddingRetriever, can understand the context and semantics of a query but perform best in the domain they were trained on. By using both retriever types in one pipeline, you combine their strengths and improve your system's retrieval quality.

This is an example of a hybrid retrieval pipeline that runs OpenSearchBM25Retriever and OpenSearchEmbeddingRetriever in parallel and then merges the documents they fetch using the DocumentJoiner component. With join_mode: concatenate, DocumentJoiner removes duplicates and keeps only unique documents. The example also includes a ranker, a prompt builder, and an LLM so you can see the retrievers in the context of a complete RAG pipeline.

components:
  bm25_retriever: # Retrieves documents using keyword (BM25) matching
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 768
          similarity: cosine
      top_k: 20 # The number of results to return

  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: "intfloat/e5-base-v2"

  embedding_retriever: # Retrieves documents by embedding similarity
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 768
          similarity: cosine
      top_k: 20 # The number of results to return

  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: "intfloat/simlm-msmarco-reranker"
      top_k: 8
      model_kwargs:
        torch_dtype: "torch.float16"

  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are a technical expert.
        You answer questions truthfully based on provided documents.
        Ignore typing errors in the question.
        For each document check whether it is related to the question.
        Only use documents that are related to the question to answer it.
        Ignore documents that are not related to the question.
        If the answer exists in several documents, summarize them.
        Only answer based on the documents provided. Don't make things up.
        Just output the structured, informative and precise answer and nothing else.
        If the documents can't answer the question, say so.
        Always use references in the form [NUMBER OF DOCUMENT] when using information from a document, e.g. [3] for Document[3].
        Never name the documents, only enter a number in square brackets as a reference.
        The reference must only refer to the number that comes in square brackets after the document.
        Otherwise, do not use brackets in your answer and reference ONLY the number of the document without mentioning the word document.
        These are the documents:
        {% for document in documents %}
        Document[{{ loop.index }}]:
        {{ document.content }}
        {% endfor %}

        Question: {{ question }}
        Answer:

  llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": False}
      model: "gpt-4o"
      generation_kwargs:
        max_tokens: 650
        temperature: 0.0
        seed: 0

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm

connections:  # Defines how the components are connected
- sender: bm25_retriever.documents
  receiver: document_joiner.documents
- sender: query_embedder.embedding
  receiver: embedding_retriever.query_embedding
- sender: embedding_retriever.documents
  receiver: document_joiner.documents
- sender: document_joiner.documents
  receiver: ranker.documents
- sender: ranker.documents
  receiver: prompt_builder.documents
- sender: ranker.documents
  receiver: answer_builder.documents
- sender: prompt_builder.prompt
  receiver: llm.prompt
- sender: prompt_builder.prompt
  receiver: answer_builder.prompt
- sender: llm.replies
  receiver: answer_builder.replies

max_loops_allowed: 100

inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
  - "bm25_retriever.query"
  - "query_embedder.text"
  - "ranker.query"
  - "prompt_builder.question"
  - "answer_builder.query"

  filters:  # These components will receive a potential query filter as input
  - "bm25_retriever.filters"
  - "embedding_retriever.filters"

outputs:  # Defines the output of your pipeline
  documents: "ranker.documents"  # The output of the pipeline is the retrieved documents
  answers: "answer_builder.answers"  # The output of the pipeline is the generated answers
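If simple deduplication isn't enough, DocumentJoiner's join_mode is worth experimenting with. Besides concatenate, Haystack's DocumentJoiner also supports merge (combines the scores of duplicate documents) and reciprocal_rank_fusion (re-scores documents based on their rank in each retriever's result list), which often works well for hybrid retrieval. Only the join_mode value changes:

```yaml
  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: reciprocal_rank_fusion
```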

Adding a Ranker

Rankers determine the relevance and order of documents in response to a query. Their goal is to put the most relevant documents at the top of the search results. While working with our customers, we've found that rankers are better at judging document relevance than retrievers, so adding one can significantly improve your results.

For ranking documents by relevance, we recommend using CohereRanker or TransformersSimilarityRanker. Also, check the ranking models we recommend.

The hybrid retrieval example above already includes a TransformersSimilarityRanker, so you can use it as a reference. These are the relevant parts of the configuration. First, the ranker component itself:

  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: "intfloat/simlm-msmarco-reranker"
      top_k: 8
      model_kwargs:
        torch_dtype: "torch.float16"

Then, the connections that place the ranker between the DocumentJoiner and the components that consume the ranked documents:

- sender: document_joiner.documents
  receiver: ranker.documents
- sender: ranker.documents
  receiver: prompt_builder.documents
- sender: ranker.documents
  receiver: answer_builder.documents

Finally, make sure the ranker is listed among the components that receive the query:

inputs:
  query:
  - "ranker.query" # In addition to the other components that take the query

Prioritizing Documents Based on Their Metadata

You can use metadata to customize how your documents are retrieved. For details, see Metadata to Customize Retrieval and Metadata for Ranking.
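As a starting point, the example pipelines above expose a filters input that both retrievers accept at query time. A filter that restricts search to documents with a specific metadata value might look like this, using Haystack 2.x comparison-filter syntax (the field name meta.category is hypothetical):

```yaml
filters:
  operator: "AND"
  conditions:
    - field: "meta.category"
      operator: "=="
      value: "manual"
```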

Finding the Optimal PreProcessing Configuration

How your documents are preprocessed and prepared for querying influences the efficiency and accuracy of your app. DocumentSplitter is the component that chunks your files during indexing. You can modify its settings so that the resulting documents are clean, structured, and normalized to ensure efficient querying.

Here are the DocumentSplitter's settings you can modify to improve your app:

  • split_by - This specifies the unit by which you want to chunk your documents. Splitting by words is the safest option in most cases.
    You can consider splitting by passage if your documents have a clear paragraph structure, with each paragraph describing one idea.
  • split_length - This setting defines the maximum size of a document. For example, if you split your documents by words, split_length defines the maximum number of words a document can have.
    This value can be bigger for keyword-based retrievers, but for vector-based retrievers, it must fit within the token limit of the retriever's model. Check the model's maximum input length in tokens and set a value that fits within it. In our pipeline templates, we set this value to 250 words, which works best in most cases.
    If your pipeline uses hybrid retrieval, adjust split_length to the token limit of the vector-based retriever's model.
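Putting these settings together, a DocumentSplitter in an indexing pipeline might be configured along these lines (the component name splitter is illustrative; adjust the values to your retriever's model):

```yaml
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word      # Chunk by words, the safest default
      split_length: 250   # Maximum number of words per resulting document
```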