AutoMergingRetriever

Improve search results by returning complete parent documents instead of fragmented chunks when multiple related pieces match a query.

Basic Information

  • Type: haystack.components.retrievers.auto_merging_retriever.AutoMergingRetriever
  • Components it can connect with:
    • Retrievers: AutoMergingRetriever can receive documents from any retriever that returns hierarchical documents.
    • PromptBuilder, ChatPromptBuilder, AnswerBuilder, or Ranker: AutoMergingRetriever can send documents to these components to be used in the prompt, answer, or ranking process.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of leaf documents that were matched by a retriever |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents (could be a mix of different hierarchy levels) |

Overview

AutoMergingRetriever works with a hierarchical document structure to return parent documents instead of individual chunked documents when the number of matched leaf documents exceeds a certain threshold. This is particularly useful when working with paragraphs split into multiple chunks: when several chunks from the same paragraph match your query, the complete paragraph often provides more context and value than the individual pieces alone.

Here's how this Retriever works:

  1. It requires documents to be organized in a tree structure. See the HierarchicalDocumentSplitter documentation for how to create this structure.
  2. When searching, it counts how many chunked documents under the same parent match your query.
  3. If this count exceeds your defined threshold, it returns the parent document instead of the individual chunks.

For example, if a parent document has three child chunks and you set threshold=0.5, the retriever returns the parent document when at least two of the three chunks are retrieved (2/3 ≈ 0.67, which is greater than 0.5).
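
The snippet below is a minimal sketch of this behavior, assuming a recent Haystack 2.x release (where HierarchicalDocumentSplitter and AutoMergingRetriever are both available) and an InMemoryDocumentStore holding the parent documents; the __level metadata field is the one the splitter uses to mark each document's depth in the tree.

from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Build a three-level hierarchy: the original document, 10-word parents, 3-word leaves.
docs = [Document(content="The monarch of the wild blue yonder rises from the eastern side of the horizon.")]
splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
all_docs = splitter.run(docs)["documents"]

# The retriever's document store must contain the parent documents (level 1 here);
# the leaf documents (level 2) are what an upstream retriever would normally match.
parent_store = InMemoryDocumentStore()
parent_store.write_documents([d for d in all_docs if d.meta["__level"] == 1])
leaves = [d for d in all_docs if d.meta["__level"] == 2]

retriever = AutoMergingRetriever(document_store=parent_store, threshold=0.5)

# leaves[4:6] are the two chunks under the second parent, so 2/2 of its children
# matched, which exceeds the 0.5 threshold: the parent is returned instead.
result = retriever.run(documents=leaves[4:6])
print(result["documents"])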

You can use AutoMergingRetriever with the following Document Stores:

Usage Example

This example shows a RAG pipeline that first retrieves leaf-level document chunks using BM25, merges them into higher-level parent documents with AutoMergingRetriever, constructs a prompt, and generates an answer:

components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
          index: leaf_documents
      top_k: 10

  auto_merging_retriever:
    type: haystack.components.retrievers.auto_merging_retriever.AutoMergingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
          index: parent_documents
      threshold: 0.6

  chat_prompt_builder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      template:
        - _content:
            - text: "You are a helpful assistant."
          _role: system
        - _content:
            - text: "Given these documents, answer the question.\nDocuments:\n{% for doc in documents %}{{ doc.content }}{% endfor %}\nQuestion: {{question}}\nAnswer:"
          _role: user

  llm:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    init_parameters:
      model: gpt-5-mini
      generation_kwargs:
        temperature: 0.7

  answer_builder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters: {}

connections:
  - sender: bm25_retriever.documents
    receiver: auto_merging_retriever.documents
  - sender: auto_merging_retriever.documents
    receiver: chat_prompt_builder.documents
  - sender: chat_prompt_builder.prompt
    receiver: llm.messages
  - sender: llm.replies
    receiver: answer_builder.replies
  - sender: auto_merging_retriever.documents
    receiver: answer_builder.documents

max_runs_per_component: 100

inputs:
  query:
    - bm25_retriever.query
    - chat_prompt_builder.question
    - answer_builder.query

outputs:
  answers: answer_builder.answers

metadata: {}
Info: Before using this pipeline, index your documents using HierarchicalDocumentSplitter to create the hierarchical structure. Leaf documents should be indexed in one document store (for example, leaf_documents) and parent documents in another (for example, parent_documents).
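
A sketch of that indexing step is shown below. It assumes the OpenSearch integration package is installed and that the same OPENSEARCH_* environment variables as in the YAML above are set; the block_sizes and level numbering are illustrative choices, not requirements.

import os

from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

auth = (os.environ["OPENSEARCH_USER"], os.environ["OPENSEARCH_PASSWORD"])
leaf_store = OpenSearchDocumentStore(
    hosts=[os.environ["OPENSEARCH_HOST"]], index="leaf_documents",
    http_auth=auth, use_ssl=True, verify_certs=False,
)
parent_store = OpenSearchDocumentStore(
    hosts=[os.environ["OPENSEARCH_HOST"]], index="parent_documents",
    http_auth=auth, use_ssl=True, verify_certs=False,
)

raw_docs = [Document(content="Replace this with the text you want to index.")]

# Split into a hierarchy: level 0 is the original document, level 1 the parent
# blocks, level 2 the leaf chunks that the BM25 retriever will match at query time.
splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
all_docs = splitter.run(raw_docs)["documents"]

parent_store.write_documents([d for d in all_docs if d.meta["__level"] == 1])
leaf_store.write_documents([d for d in all_docs if d.meta["__level"] == 2])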

Parameters

Init parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document_store | DocumentStore | | DocumentStore from which to retrieve the parent documents |
| threshold | float | 0.5 | Fraction of a parent's child chunks that must be matched before the parent document is returned instead of the individual chunks |
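
For example, a sketch with an in-memory store (in the YAML above, the same parameters point at the parent_documents OpenSearch index instead):

from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# The store must hold the parent documents; with threshold=0.6, a parent replaces
# its chunks once the matched fraction of its children exceeds 0.6.
retriever = AutoMergingRetriever(document_store=InMemoryDocumentStore(), threshold=0.6)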

Run method parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of leaf documents that were matched by a retriever |