HierarchicalDocumentSplitter

Split documents into blocks of different sizes, building a hierarchical tree structure for advanced retrieval.

Key Features

  • Splits documents into multiple levels of chunk sizes in a single pass.
  • Builds a parent-child hierarchy where smaller chunks are children of larger parent chunks.
  • Enables auto-merging retrieval: retrieve specific small chunks and expand to parent context when needed.
  • Configurable overlap between chunks at each level.
  • Supports splitting by word, sentence, page, or passage.
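To make the parent-child idea concrete, here is a minimal sketch of hierarchical splitting by word in plain Python. It is not the component's implementation: the `Block` class, `split_words`, and `build_hierarchy` are hypothetical names invented for illustration, and overlap is ignored.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass
class Block:
    """Simplified stand-in for a document chunk (hypothetical, not the Haystack Document class)."""
    id: int
    text: str
    level: int
    parent_id: Optional[int] = None
    children_ids: List[int] = field(default_factory=list)


def split_words(words, size):
    """Cut a list of words into consecutive groups of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_hierarchy(text, block_sizes):
    """Return all blocks, root first, applying the block sizes in descending order."""
    ids = count()
    root = Block(id=next(ids), text=text, level=0)
    blocks = [root]
    frontier = [root]  # blocks created at the previous level
    for level, size in enumerate(sorted(block_sizes, reverse=True), start=1):
        next_frontier = []
        for parent in frontier:
            for piece in split_words(parent.text.split(), size):
                child = Block(next(ids), " ".join(piece), level, parent_id=parent.id)
                parent.children_ids.append(child.id)
                blocks.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return blocks


# Eight words, split into blocks of 4 words and then 2 words.
text = " ".join(f"w{i}" for i in range(8))
blocks = build_hierarchy(text, {4, 2})
leaves = [b for b in blocks if not b.children_ids]
```

In this sketch, the eight-word text yields one root, two 4-word blocks, and four 2-word leaf blocks. Each block records its parent, which is the information a retriever needs to expand a small chunk to its larger context.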

Configuration

  1. Drag the HierarchicalDocumentSplitter component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. On the General tab:
    1. Set block_sizes — a list of chunk sizes for different hierarchy levels, from largest to smallest (for example, [512, 256, 128]).
  4. Go to the Advanced tab to configure the split overlap and split unit (split_overlap, split_by).

Connections

HierarchicalDocumentSplitter accepts a list of documents (documents) as input and outputs a list of hierarchical documents (documents) with parent-child relationships.

Connect converters or DocumentCleaner to the input. Connect the output to document embedders and then DocumentWriter for storage. In query pipelines, use AutoMergingRetriever to merge child documents back to their parent context.
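The merging step at query time can be sketched as follows. This is a simplified, single-level illustration in plain Python, not the AutoMergingRetriever implementation: the `auto_merge` function, the dictionary-based lookups, and the `threshold` value of 0.5 are assumptions made for the example.

```python
from collections import defaultdict


def auto_merge(retrieved_leaf_ids, parent_of, children_of, threshold=0.5):
    """Replace retrieved sibling leaves with their parent when enough of them matched.

    `threshold` is the share of a parent's children that must be retrieved
    before the parent is returned instead (hypothetical parameter).
    """
    hits_per_parent = defaultdict(list)
    for leaf_id in retrieved_leaf_ids:
        hits_per_parent[parent_of[leaf_id]].append(leaf_id)

    merged = []
    for parent_id, hits in hits_per_parent.items():
        share = len(hits) / len(children_of[parent_id])
        if share > threshold:
            merged.append(parent_id)   # enough siblings matched: return the parent
        else:
            merged.extend(hits)        # otherwise keep the individual leaves
    return merged


# Two parents with three leaf chunks each.
children_of = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

# Two of P1's three leaves were retrieved, but only one of P2's.
result = auto_merge(["a", "b", "d"], parent_of, children_of, threshold=0.5)
```

Here `result` contains the parent `P1` in place of leaves `a` and `b`, while leaf `d` is returned unchanged because too few of its siblings were retrieved.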

Usage Example

Using the Component in an Index

This example shows an indexing pipeline using hierarchical splitting with block sizes of 512, 256, and 128 words.

components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
  HierarchicalDocumentSplitter:
    type: deepset_cloud_custom_nodes.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
    init_parameters:
      block_sizes:
        - 512
        - 256
        - 128
      split_overlap: 0
      split_by: word
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: hierarchical-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE

connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: HierarchicalDocumentSplitter.documents
  - sender: HierarchicalDocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - TextFileToDocument.sources
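
Because every level of the hierarchy is embedded and written to the document store in this pipeline, it can help to estimate how many blocks a file produces. The sketch below does that arithmetic for the block sizes used above, assuming word-based splitting, no overlap, and that each parent block is split independently. `blocks_per_level` is a hypothetical helper, not part of the component.

```python
def blocks_per_level(n_words, block_sizes):
    """Count the blocks created at each level for a document of `n_words` words."""
    counts = {}
    parents = [n_words]  # word counts of the blocks at the previous level
    for size in sorted(block_sizes, reverse=True):
        children = []
        for length in parents:
            full, rest = divmod(length, size)
            # `full` complete blocks plus one shorter block for any remainder
            children.extend([size] * full + ([rest] if rest else []))
        counts[size] = len(children)
        parents = children
    return counts


# A 1000-word document with the block sizes from the example pipeline.
counts = blocks_per_level(1000, {512, 256, 128})
total_blocks = sum(counts.values())
```

Under these assumptions, a 1000-word document produces 2 blocks of up to 512 words, 4 blocks of up to 256 words, and 8 blocks of up to 128 words, so 14 blocks in addition to the original document.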

Parameters

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |

Outputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of hierarchical documents with parent-child relationships. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|---|---|---|---|
| block_sizes | Set[int] | | Set of block sizes to split the document into. The blocks are split in descending order. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_by | Literal['word', 'sentence', 'page', 'passage'] | word | The unit for splitting your documents. |
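
To illustrate what split_overlap means, the following sketch splits a list of units (here, words) into windows where each window repeats the last units of the previous one. It is a simplified model under the assumption of a fixed window size, not the component's code, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(units, size, overlap=0):
    """Split `units` into windows of `size`, each sharing `overlap` units with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + size])
        if start + size >= len(units):
            break  # the last window already reached the end
    return chunks


words = "a b c d e f g".split()
no_overlap = split_with_overlap(words, size=3, overlap=0)
with_overlap = split_with_overlap(words, size=3, overlap=1)
```

With no overlap, the seven words become `[a b c]`, `[d e f]`, `[g]`. With an overlap of 1, they become `[a b c]`, `[c d e]`, `[e f g]`, so each chunk starts with the last word of the previous chunk.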

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |