AutoMergingRetriever
Improve search results by returning complete parent documents instead of fragmented chunks when multiple related pieces match a query.
Basic Information
- Type: haystack.components.retrievers.auto_merging_retriever.AutoMergingRetriever
- Components it can connect with:
  - Retrievers: AutoMergingRetriever can receive documents from any retriever that returns hierarchical documents.
  - PromptBuilder, ChatPromptBuilder, AnswerBuilder, or Ranker: AutoMergingRetriever can send documents to these components to be used in the prompt, answer, or ranking process.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of leaf documents that were matched by a retriever |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents (could be a mix of different hierarchy levels) |
Overview
AutoMergingRetriever works with a hierarchical document structure to return parent documents instead of individual chunked documents when the number of matched leaf documents exceeds a certain threshold. This is particularly useful when working with paragraphs split into multiple chunks: when several chunks from the same paragraph match your query, the complete paragraph often provides more context and value than the individual pieces alone.
Here's how this Retriever works:
- It requires documents to be organized in a tree structure. For information on how to create this structure, see the HierarchicalDocumentSplitter documentation.
- When searching, it counts how many chunked documents under the same parent match your query.
- If this count exceeds your defined threshold, it returns the parent document instead of the individual chunks.
For example, if a parent document has three child chunks and you set threshold=0.5, the retriever returns the parent document when at least two of the three chunks are retrieved (2/3 ≈ 0.67, which is greater than 0.5). If only one chunk is retrieved (1/3 ≈ 0.33), the retriever returns that chunk on its own.
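The counting and threshold steps above can be sketched in plain Python. This is a minimal illustration of the decision rule only, not the component's actual implementation: the function name merge_one_level, the (leaf_id, parent_id) tuples, and the children_per_parent mapping are hypothetical, and the sketch handles a single hierarchy level and assumes a strict "greater than" comparison, matching the "exceeds" wording above.

```python
from collections import defaultdict


def merge_one_level(matched_leaves, children_per_parent, threshold=0.5):
    """Decide, per parent, whether to return the parent or its matched chunks.

    matched_leaves: list of (leaf_id, parent_id) pairs (hypothetical shape).
    children_per_parent: total number of children each parent has.
    """
    # Group the matched leaf chunks by the parent they belong to.
    by_parent = defaultdict(list)
    for leaf_id, parent_id in matched_leaves:
        by_parent[parent_id].append(leaf_id)

    merged_parents, kept_leaves = [], []
    for parent_id, leaf_ids in by_parent.items():
        # Share of this parent's children that were matched.
        ratio = len(leaf_ids) / children_per_parent[parent_id]
        if ratio > threshold:
            merged_parents.append(parent_id)  # return the parent instead
        else:
            kept_leaves.extend(leaf_ids)  # keep the individual chunks
    return merged_parents, kept_leaves


# Hypothetical data: two parents with three chunks each.
children_per_parent = {"p1": 3, "p2": 3}
matched = [("c1", "p1"), ("c2", "p1"), ("c4", "p2")]
parents, leaves = merge_one_level(matched, children_per_parent, threshold=0.5)
print(parents, leaves)  # ['p1'] ['c4']
```

Here p1 is merged because two of its three chunks matched, while the single match under p2 stays a chunk, which is why the output can mix hierarchy levels.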
You can use AutoMergingRetriever with the following Document Stores:
Usage Example
This example shows a RAG pipeline that first retrieves leaf-level document chunks using BM25, merges them into higher-level parent documents with AutoMergingRetriever, constructs a prompt, and generates an answer:
components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
          index: leaf_documents
      top_k: 10
  auto_merging_retriever:
    type: haystack.components.retrievers.auto_merging_retriever.AutoMergingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
          index: parent_documents
      threshold: 0.6
  chat_prompt_builder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      template:
        - _content:
            - text: "You are a helpful assistant."
          _role: system
        - _content:
            - text: "Given these documents, answer the question.\nDocuments:\n{% for doc in documents %}{{ doc.content }}{% endfor %}\nQuestion: {{question}}\nAnswer:"
          _role: user
  llm:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    init_parameters:
      model: gpt-5-mini
      generation_kwargs:
        temperature: 0.7
  answer_builder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters: {}
connections:
  - sender: bm25_retriever.documents
    receiver: auto_merging_retriever.documents
  - sender: auto_merging_retriever.documents
    receiver: chat_prompt_builder.documents
  - sender: chat_prompt_builder.prompt
    receiver: llm.messages
  - sender: llm.replies
    receiver: answer_builder.replies
  - sender: auto_merging_retriever.documents
    receiver: answer_builder.documents
max_runs_per_component: 100
inputs:
  query:
    - bm25_retriever.query
    - chat_prompt_builder.question
    - answer_builder.query
outputs:
  answers: answer_builder.answers
metadata: {}
Before using this pipeline, index your documents using HierarchicalDocumentSplitter to create the hierarchical structure. Index leaf documents in one document store index (leaf_documents in this example) and parent documents in another (parent_documents in this example), so that the BM25 retriever searches only leaf chunks and AutoMergingRetriever looks up parents separately.
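The leaf/parent separation described above could be sketched as follows. This is an illustration under stated assumptions rather than the indexing pipeline itself: documents are modeled as plain dictionaries, route_by_hierarchy is a hypothetical helper, and the metadata key names __children_ids and __parent_id are assumed to mirror what HierarchicalDocumentSplitter writes, so check them against the splitter's output before relying on them.

```python
def route_by_hierarchy(docs):
    """Split documents into (leaves, parents).

    A document is treated as a parent when it lists at least one child ID
    under the assumed "__children_ids" metadata key; otherwise it is a leaf.
    """
    leaves, parents = [], []
    for doc in docs:
        if doc["meta"].get("__children_ids"):
            parents.append(doc)
        else:
            leaves.append(doc)
    return leaves, parents


# Hypothetical splitter output: one parent with two leaf chunks.
docs = [
    {"id": "p1", "meta": {"__children_ids": ["c1", "c2"], "__parent_id": None}},
    {"id": "c1", "meta": {"__children_ids": [], "__parent_id": "p1"}},
    {"id": "c2", "meta": {"__children_ids": [], "__parent_id": "p1"}},
]

leaves, parents = route_by_hierarchy(docs)
# leaves would be written to the leaf_documents index,
# parents to the parent_documents index.
print([d["id"] for d in leaves], [d["id"] for d in parents])
```

Keeping the two groups in separate indexes means the first-stage retriever can only ever match leaf chunks, which is the input AutoMergingRetriever expects.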
Parameters
Init parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | DocumentStore | | DocumentStore from which to retrieve the parent documents |
| threshold | float | 0.5 | Fraction of a parent's child documents that must be matched. When the matched fraction exceeds this value, the parent is returned instead of the individual documents |
Run method parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of leaf documents that were matched by a retriever |