AutoMergingRetriever

A retriever that returns the parent documents of matched leaf documents when the share of matched leaves under a parent exceeds a threshold.

Basic Information

  • Type: haystack.components.retrievers.auto_merging_retriever.AutoMergingRetriever

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of leaf documents that were matched by a retriever |

Outputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of documents (could be a mix of different hierarchy levels) |

Overview

Work in Progress

Bear with us while we're working on adding pipeline examples and most common components connections.

The AutoMergingRetriever assumes you have a hierarchical tree structure of documents, where the leaf nodes are indexed in a document store. See the HierarchicalDocumentSplitter for more information on how to create such a structure. During retrieval, if the fraction of a parent's leaf documents that were matched is higher than a defined threshold, the retriever returns the parent document instead of the individual leaf documents.

The rationale is that when a paragraph is split into multiple chunks represented as leaf documents, and several of those chunks match a given query, the whole paragraph is likely to be more informative than the individual chunks alone.
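The merge rule described above can be sketched in plain Python. This is only an illustration, not the component's implementation: documents are modeled as plain dictionaries, `merge_leaves` and `parents` are hypothetical names, and the strict `>` comparison is an assumption drawn from the wording "higher than a defined threshold".

```python
from collections import defaultdict


def merge_leaves(matched_leaves, parents, threshold=0.5):
    """Replace groups of matched leaves with their parent when enough siblings matched.

    matched_leaves: list of dicts with "id" and "parent_id" keys (hypothetical shape).
    parents: dict mapping a parent id to a dict with "id" and "children_ids" keys.
    """
    # Group the matched leaves by the parent they belong to.
    by_parent = defaultdict(list)
    for leaf in matched_leaves:
        by_parent[leaf["parent_id"]].append(leaf)

    result = []
    for parent_id, leaves in by_parent.items():
        parent = parents[parent_id]
        # Fraction of this parent's children that were matched.
        ratio = len(leaves) / len(parent["children_ids"])
        if ratio > threshold:
            result.append(parent)  # enough siblings matched: return the parent
        else:
            result.extend(leaves)  # too few matched: keep the individual leaves
    return result


parents = {
    "p1": {"id": "p1", "children_ids": ["a", "b", "c"]},
    "p2": {"id": "p2", "children_ids": ["d", "e", "f", "g"]},
}
matched = [
    {"id": "a", "parent_id": "p1"},
    {"id": "b", "parent_id": "p1"},
    {"id": "d", "parent_id": "p2"},
]
merged = merge_leaves(matched, parents, threshold=0.5)
print([doc["id"] for doc in merged])
```

In this sketch, two of the three children of `p1` match (2/3 > 0.5), so `p1` replaces them, while the single match under `p2` (1/4) stays a leaf. This is how the output can mix hierarchy levels.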

Currently the AutoMergingRetriever can only be used by the following DocumentStores:

from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["children_ids"] and doc.meta["level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent; the parent document should be returned,
# since it has 3 children, threshold=0.5, and we retrieved 2 of them (2/3 = 0.66... > 0.5)
leaf_docs = [doc for doc in docs if not doc.meta["children_ids"]]
docs = retriever.run(leaf_docs[4:6])
# >> {'documents': [Document(id=538..),
# >> content: 'warm glow over the trees. Birds began to sing.',
# >> meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
# >> 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}

Usage Example

components:
  AutoMergingRetriever:
    type: haystack.components.retrievers.auto_merging_retriever.AutoMergingRetriever
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | DocumentStore | | DocumentStore from which to retrieve the parent documents |
| threshold | float | 0.5 | Threshold to decide whether the parent instead of the individual documents is returned |
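The effect of different threshold values can be checked with a few lines of arithmetic. The sketch below is illustrative only: `returns_parent` is a hypothetical helper, and the strict comparison (the fraction must be greater than the threshold, not equal to it) is an assumption.

```python
def returns_parent(matched_children, total_children, threshold):
    """Return True when the matched fraction of children exceeds the threshold."""
    return matched_children / total_children > threshold


# Two of a parent's three children were matched (fraction of about 0.67).
for threshold in (0.25, 0.5, 0.75):
    print(threshold, returns_parent(2, 3, threshold))
```

With two of three children matched, thresholds of 0.25 and 0.5 lead to the parent being returned, while 0.75 keeps the individual leaves. Under the strict-comparison assumption, matching exactly half of the children with a threshold of 0.5 does not trigger a merge.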

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of leaf documents that were matched by a retriever |