Improving Your Document Search Pipeline
There are multiple ways to improve your document search pipeline, such as changing the retrieval type, choosing a different model, adding a ranker, and many more. This guide explains each of them, helping you choose the one suitable for your use case.
Document search pipelines are the first step of RAG pipelines. It's crucial to ensure they retrieve the correct documents, as these form the basis for the LLM's generated answers.
Changing the Model
Embedders are the pipeline components that turn strings (TextEmbedders) or documents (DocumentEmbedders) into embeddings. In query pipelines, TextEmbedders are used before vector retrievers, such as OpenSearchEmbeddingRetriever, while DocumentEmbedders are used in indexes. If your pipeline uses a vector Retriever with an Embedder, you can first try to improve it by changing the Embedder's model. Check the models we recommend for retrieval and try one.
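The DocumentEmbedder in your index and the TextEmbedder in your query pipeline must use the same model. Embeddings produced by different models live in different vector spaces, so comparing a query embedding with document embeddings from another model gives meaningless similarity scores.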
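As a minimal sketch of what "the same model" means in practice, the two fragments below show an index-side DocumentEmbedder and a query-side TextEmbedder configured with an identical model value. The component names document_embedder and query_embedder are illustrative, and intfloat/e5-base-v2 is simply the model used in the examples in this guide. Treat this as a fragment, not a complete index or pipeline.

```yaml
# Index (fragment): embeds documents before they are written to the document store
components:
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2 # Must match the query pipeline's TextEmbedder
---
# Query pipeline (fragment): embeds the query with the same model
components:
  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: intfloat/e5-base-v2 # Must match the index's DocumentEmbedder
```

If you change the model in one place, change it in the other as well. Documents that were already indexed with the old model typically need to be embedded again.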
Using the Hybrid Retrieval Approach
With HybridRetriever
You can use OpenSearchHybridRetriever to combine a vector and a keyword Retriever. It's the easiest way to achieve hybrid retrieval. Keyword retrievers can handle out-of-domain vocabulary and don't need any training, but they can't capture semantic nuances. Semantic retrievers, on the other hand, can understand context and semantics, but perform best in the domain they were trained on. Combining them lets you take advantage of the strengths of both. OpenSearchHybridRetriever has a built-in text embedder, so you just need to connect it to Input and a Ranker or other component that consumes the retrieved documents.
For details, see OpenSearchHybridRetriever.
With Two Retrievers
You can manually combine a vector and a keyword Retriever.
This is an example of a hybrid retrieval pipeline that uses the OpenSearchBM25Retriever and the OpenSearchEmbeddingRetriever and then merges the documents they fetch using the DocumentJoiner component. DocumentJoiner removes duplicates, so each document appears only once.
# haystack-pipeline
components:
  bm25_retriever:
    # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          index: ""
          max_chunk_bytes: 104857600
          return_embedding: false
          settings:
            index.knn: true
          create_index: true
      top_k: 20 # The number of results to return
      fuzziness: 0
  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      device: null
  embedding_retriever:
    # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          index: ""
          max_chunk_bytes: 104857600
          return_embedding: false
          settings:
            index.knn: true
          create_index: true
      top_k: 20 # The number of results to return
  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: intfloat/simlm-msmarco-reranker
      top_k: 20
      device: null
connections:
  - sender: bm25_retriever.documents
    receiver: ranker.documents
  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding
  - sender: embedding_retriever.documents
    receiver: ranker.documents
inputs:
  query:
    - bm25_retriever.query
    - query_embedder.text
    - ranker.query
  filters:
    - bm25_retriever.filters
    - embedding_retriever.filters
outputs:
  documents: ranker.documents
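If you want to merge the two result lists explicitly before ranking, a DocumentJoiner can sit between the retrievers and the ranker. The fragment below is a minimal sketch under stated assumptions, not part of the pipeline above: the component name document_joiner and the join_mode value are chosen for illustration, and the two connections from the retrievers to the joiner would replace their direct connections to the ranker.

```yaml
components:
  document_joiner:
    # Merges the documents from both retrievers into one list
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate # Assumption: keeps one copy of each duplicate document

connections:
  - sender: bm25_retriever.documents
    receiver: document_joiner.documents
  - sender: embedding_retriever.documents
    receiver: document_joiner.documents
  - sender: document_joiner.documents
    receiver: ranker.documents
```

DocumentJoiner also supports other join modes, such as reciprocal_rank_fusion, which can be worth trying if you skip the ranker and rely on the joined order directly.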
Adding a Ranker
Rankers determine the relevance and order of documents in response to a query. Their primary goal is to present the most relevant documents at the top of the search results. While working with our customers, we found that rankers are better at determining the relevance of documents than retrievers. Adding a ranker can significantly improve the results.
For ranking documents by relevance, we recommend using CohereRanker or DeepsetNvidiaRanker. Also, check the ranking models we recommend.
Here's an example hybrid retrieval pipeline with a TransformersSimilarityRanker:
# haystack-pipeline
components:
  bm25_retriever:
    # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          index: Standard-Index-English
          max_chunk_bytes: 104857600
          return_embedding: false
          settings:
            index.knn: true
          create_index: true
      top_k: 20 # The number of results to return
      fuzziness: 0
  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      device: null
  embedding_retriever:
    # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          index: Standard-Index-English
          max_chunk_bytes: 104857600
          return_embedding: false
          settings:
            index.knn: true
          create_index: true
      top_k: 20 # The number of results to return
  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: intfloat/simlm-msmarco-reranker
      top_k: 20
      device: null
  LLM:
    type: haystack.components.generators.chat.llm.LLM
    init_parameters:
      chat_generator:
        init_parameters:
          model: gpt-5.5
        type: haystack.components.generators.chat.openai_responses.OpenAIResponsesChatGenerator
      system_prompt:
      user_prompt: >-
        {% message role="user" %}
        You are a technical expert.
        You answer questions truthfully based on provided documents.
        Ignore typing errors in the question.
        For each document check whether it is related to the question.
        Only use documents that are related to the question to answer it.
        Ignore documents that are not related to the question.
        If the answer exists in several documents, summarize them.
        Only answer based on the documents provided. Don't make things up.
        Just output the structured, informative and precise answer and nothing else.
        If the documents can't answer the question, say so.
        Always use references in the form [NUMBER OF DOCUMENT] when using information from a document, e.g. [3] for Document[3].
        Never name the documents, only enter a number in square brackets as a reference.
        The reference must only refer to the number that comes in square brackets after the document.
        Otherwise, do not use brackets in your answer and reference ONLY the number of the document without mentioning the word document.
        These are the documents:
        {% for document in documents %}
        Document[{{ loop.index }}]:
        {{ document.content }}
        {% endfor %}
        Question: {{ question }}
        Answer:
        {% endmessage %}
      required_variables: "*"
      streaming_callback:
connections:
  - sender: bm25_retriever.documents
    receiver: ranker.documents
  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding
  - sender: embedding_retriever.documents
    receiver: ranker.documents
  - sender: ranker.documents
    receiver: LLM.documents
inputs:
  query:
    - bm25_retriever.query
    - query_embedder.text
    - ranker.query
    - LLM.question
  filters:
    - bm25_retriever.filters
    - embedding_retriever.filters
outputs:
  documents: ranker.documents
  messages: LLM.messages
Prioritizing Documents Based on Their Metadata
You can use metadata to customize how your documents are retrieved. For details, see Metadata for Ranking.
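One way to do this in Haystack is with MetaFieldRanker, which reorders documents by the value of a metadata field. The fragment below is a sketch under stated assumptions and may differ from the setup the linked page describes: the component name meta_ranker and the metadata field rating are hypothetical, and the connection assumes a pipeline with an upstream component called ranker, as in the examples in this guide.

```yaml
components:
  meta_ranker:
    # Reorders documents using the value of a metadata field
    type: haystack.components.rankers.meta_field.MetaFieldRanker
    init_parameters:
      meta_field: rating # Hypothetical numeric metadata field on your documents
      weight: 0.5 # 0 ignores the metadata, 1 ranks by the metadata only
      top_k: 10

connections:
  - sender: ranker.documents
    receiver: meta_ranker.documents
```

A weight around 0.5 balances the relevance order from the previous component against the metadata value. Adjust it based on how much the metadata should influence the final order.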
Finding the Optimal Preprocessing Configuration
How your documents are preprocessed and prepared for querying influences the efficiency and accuracy of your app. DocumentSplitter is the component that chunks your files during indexing. You can modify its settings so that the resulting documents are clean, structured, and normalized to ensure efficient querying.
Here are the DocumentSplitter's settings you can modify to improve your app:
split_by - This specifies the unit by which you want to chunk your documents. Splitting by words is the safest option in most cases. You can consider splitting by passage if your documents have a clear paragraph structure, with each paragraph describing one idea.
split_length - This setting defines the maximum size of your documents. For example, if you choose to split your documents by words, split_length defines the maximum number of words your documents can have. This value can be bigger for keyword-based retrievers, but for vector-based retrievers, it must fit within the token limit of the retriever's model. It's best to check the maximum number of tokens the model can process and then set a value that fits within this limit. In our pipeline templates, we recommend setting this value to 250 words, which works well in most cases. If your pipeline uses hybrid retrieval, you should adjust split_length to the token limit of the vector-based retriever model.
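As a minimal sketch, the settings described above could look like this in an index. The component name splitter is illustrative, and the values follow the recommendations in this section (word-based splitting with a maximum of 250 words). This is a fragment, not a complete index.

```yaml
components:
  splitter:
    # Chunks files into documents during indexing
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word # Use passage for documents with a clear paragraph structure
      split_length: 250 # Maximum number of words per document
```

Keep in mind that 250 words usually translate to more than 250 tokens, so compare the resulting chunk size against the token limit of your embedding model.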