HierarchicalDocumentSplitter
Split documents into different block sizes, building a hierarchical tree structure. Use this component in indexes to enable advanced auto-merging retrieval, where you retrieve small, specific chunks and can expand to larger parent chunks for more context.
Key Features
- Splits documents into multiple block sizes, producing a hierarchy from large to small chunks.
- Builds a parent-child relationship between chunks, where smaller chunks are children of larger ones.
- Supports splitting by word, sentence, page, or passage.
- Supports overlapping splits to preserve context across chunks.
- Works with AutoMergingRetriever in query pipelines to merge child documents back into parent context at query time.
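To make the parent-child idea concrete, here is a minimal pure-Python sketch of hierarchical splitting by word. It is not the Haystack implementation: the `Block` class and the `split_words`, `build_hierarchy`, and `leaves` helpers are hypothetical names used only for illustration, and the sketch assumes no overlap.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Block:
    """Hypothetical stand-in for a hierarchical document (not a Haystack class)."""
    text: str
    level: int
    # Excluded from repr/compare to avoid infinite recursion through the tree.
    parent: Optional["Block"] = field(default=None, repr=False, compare=False)
    children: List["Block"] = field(default_factory=list)


def split_words(text: str, size: int) -> List[str]:
    """Cut text into consecutive chunks of at most `size` words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_hierarchy(text: str, block_sizes: Set[int]) -> Block:
    """Split recursively, largest block size first, linking parents and children."""
    sizes = sorted(block_sizes, reverse=True)
    root = Block(text=text, level=0)

    def descend(node: Block, depth: int) -> None:
        if depth == len(sizes):
            return
        for chunk in split_words(node.text, sizes[depth]):
            child = Block(text=chunk, level=depth + 1, parent=node)
            node.children.append(child)
            descend(child, depth + 1)

    descend(root, 0)
    return root


def leaves(node: Block) -> List[Block]:
    """Collect the smallest chunks, which are the ones typically retrieved."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# Ten words, block sizes 4 and 2: three level-1 blocks, five leaf blocks.
root = build_hierarchy(" ".join(f"w{i}" for i in range(10)), {4, 2})
```

In this sketch, every leaf keeps a reference to its parent, which is what lets a retriever step back up from a small matching chunk to a larger block of context.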
Configuration
- Drag the HierarchicalDocumentSplitter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Set Block Sizes to define the set of chunk sizes to split the document into. For example, [512, 256, 128] creates three levels of chunks in descending order.
- Go to the Advanced tab to configure additional settings:
  - Choose a Split By unit: word, sentence, page, or passage.
  - Set Split Overlap to the number of overlapping units between consecutive chunks.
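As a rough illustration of what Split Overlap means, the following pure-Python sketch slides a window of `size` units forward by `size - overlap` units at a time. The `split_with_overlap` helper is a hypothetical name, and the way it handles the final, shorter chunk is an assumption that may differ from the component's actual behavior.

```python
from typing import List


def split_with_overlap(units: List[str], size: int, overlap: int) -> List[List[str]]:
    """Return windows of `size` units where consecutive windows share `overlap` units."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + size])
        if start + size >= len(units):
            break  # the last window already reached the end of the input
    return chunks


words = "a b c d e f g".split()
chunks = split_with_overlap(words, size=4, overlap=2)
# chunks == [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['e', 'f', 'g']]
```

With an overlap of 0, the same input yields two chunks that share no words.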
Connections
HierarchicalDocumentSplitter accepts a list of Document objects and outputs a list of hierarchical Document objects with parent-child relationships.
It typically receives documents from converters or DocumentCleaner, and sends split documents to embedders or DocumentWriter. Use it together with AutoMergingRetriever in query pipelines to leverage the hierarchical structure.
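The general idea behind auto-merging at query time can be sketched as follows. This is not the AutoMergingRetriever source: the `auto_merge` function, the plain string IDs, and the rule "return the parent when more than `threshold` of its children matched" are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def auto_merge(
    matched_leaves: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Replace matched leaves with their parent when enough siblings matched."""
    hits = defaultdict(list)
    for leaf in matched_leaves:
        hits[parent_of[leaf]].append(leaf)

    merged = []
    for parent, matched in hits.items():
        ratio = len(matched) / len(children_of[parent])
        if ratio > threshold:
            merged.append(parent)   # enough children matched: return the parent
        else:
            merged.extend(matched)  # otherwise keep the individual leaves
    return merged


# Hypothetical IDs: two parents with three children each.
children_of = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}
result = auto_merge(["a", "b", "d"], parent_of, children_of, threshold=0.5)
# result == ['p1', 'd']
```

Two of the three children of `p1` matched, so the sketch returns `p1` as a whole, while the single match under `p2` stays a leaf.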
Source Code
To check this component's source code, open hierarchical_document_splitter.py in the Haystack repository.
Usage Examples
Basic Configuration
HierarchicalDocumentSplitter:
  type: haystack.components.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
  init_parameters:
    block_sizes:
      - 512
      - 256
      - 128
    split_overlap: 0
    split_by: word
Using the Component in an Index
This example shows an indexing pipeline using hierarchical splitting with block sizes of 512, 256, and 128 words.
# haystack-pipeline
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
  HierarchicalDocumentSplitter:
    type: haystack.components.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
    init_parameters:
      block_sizes:
        - 512
        - 256
        - 128
      split_overlap: 0
      split_by: word
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: hierarchical-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE
connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: HierarchicalDocumentSplitter.documents
  - sender: HierarchicalDocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of Documents to split into hierarchical blocks. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of hierarchical documents with parent-child relationships. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| block_sizes | Set[int] | | Set of block sizes to split the document into. The blocks are split in descending order. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_by | Literal['word', 'sentence', 'page', 'passage'] | word | The unit for splitting your documents. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of Documents to split into hierarchical blocks. |