HierarchicalDocumentSplitter
Split documents into blocks of different sizes, building a hierarchical tree structure.
Basic Information
- Type: deepset_cloud_custom_nodes.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
- Components it can connect with:
  - Converters: HierarchicalDocumentSplitter receives documents from converters or DocumentCleaner.
  - DocumentCleaner: HierarchicalDocumentSplitter can receive cleaned documents from DocumentCleaner.
  - Embedders: HierarchicalDocumentSplitter sends split documents to document embedders.
  - DocumentWriter: HierarchicalDocumentSplitter can send documents to DocumentWriter for storage.
  - AutoMergingRetriever: Use with AutoMergingRetriever in query pipelines to merge child documents back to parent context.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of hierarchical documents with parent-child relationships. |
Overview
HierarchicalDocumentSplitter splits documents into blocks of different sizes, building a hierarchical tree structure. The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are connected such that smaller blocks are children of parent (larger) blocks.
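The tree described above can be illustrated with a small standalone sketch. This is not the component's implementation: the `Block` class and `build_hierarchy` function are hypothetical names introduced here, splitting is by whitespace-separated words only, and overlap is ignored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical stand-in for a split document (not the Haystack Document class)."""
    block_id: int
    text: str
    level: int                        # 0 = original document (root)
    parent_id: Optional[int] = None
    children_ids: List[int] = field(default_factory=list)


def build_hierarchy(text: str, block_sizes: List[int]) -> List[Block]:
    """Split `text` by words into nested blocks, largest block size first."""
    sizes = sorted(set(block_sizes), reverse=True)
    blocks = [Block(block_id=0, text=text, level=0)]
    frontier = [blocks[0]]            # blocks to be split at the next level
    for level, size in enumerate(sizes, start=1):
        next_frontier = []
        for parent in frontier:
            words = parent.text.split()
            for start in range(0, len(words), size):
                child = Block(
                    block_id=len(blocks),
                    text=" ".join(words[start:start + size]),
                    level=level,
                    parent_id=parent.block_id,
                )
                parent.children_ids.append(child.block_id)
                blocks.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return blocks


# A 10-word document split into blocks of 4 words, then 2 words.
doc = " ".join(f"w{i}" for i in range(10))
tree = build_hierarchy(doc, block_sizes=[2, 4])
leaves = [b for b in tree if not b.children_ids]
```

In this sketch the root has three 4-word children (the last one shorter), and each of those is split again into 2-word leaves that keep a reference to their parent.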
This hierarchical structure is useful for advanced retrieval techniques like auto-merging retrieval, where you retrieve small, specific chunks but can expand to larger parent chunks for more context when needed.
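As a rough illustration of the auto-merging idea, the sketch below replaces retrieved leaf chunks with their parent when a large enough share of that parent's children was retrieved. The `auto_merge` function, the `threshold` value, and the dictionary-based tree are assumptions made for this example. It merges a single level only and is not how AutoMergingRetriever is implemented.

```python
from collections import defaultdict
from typing import Dict, List


def auto_merge(
    retrieved_leaves: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Swap retrieved leaves for their parent when enough siblings were retrieved."""
    hits_per_parent = defaultdict(list)
    for leaf in retrieved_leaves:
        hits_per_parent[parent_of[leaf]].append(leaf)

    result = []
    for parent, hits in hits_per_parent.items():
        share = len(hits) / len(children_of[parent])
        if share > threshold:
            result.append(parent)     # enough siblings matched: return the larger block
        else:
            result.extend(hits)       # otherwise keep the small, specific chunks
    return result


# Two parents with three leaves each; two leaves of p1 and one leaf of p2 are retrieved.
children_of = {"p1": ["a", "b", "c"], "p2": ["d", "e", "f"]}
parent_of = {leaf: p for p, kids in children_of.items() for leaf in kids}
merged = auto_merge(["a", "b", "d"], parent_of, children_of, threshold=0.5)
```

Here two of the three children of `p1` were retrieved, so they are merged into `p1`, while the single hit under `p2` stays as the leaf `d`.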
Key parameters:
- block_sizes: Set of block sizes to split the document into. Blocks are split in descending order of size.
- split_overlap: Number of overlapping units between chunks.
- split_by: The unit for splitting: word, sentence, page, or passage.
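To make the effect of split_overlap concrete, here is a minimal sliding-window sketch for word-based splitting. The `split_words` helper and its `split_length` argument are hypothetical and only approximate the idea. The component's exact handling of overlap and of trailing chunks may differ.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into word chunks where consecutive chunks share `split_overlap` words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    words = text.split()
    step = split_length - split_overlap   # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break                         # the window already reached the end of the text
    return chunks


text = "a b c d e f g h"
no_overlap = split_words(text, split_length=4, split_overlap=0)
with_overlap = split_words(text, split_length=4, split_overlap=2)
```

With no overlap the eight words yield two chunks. With an overlap of two words the window advances by two words at a time and yields three chunks, each sharing its first two words with the end of the previous chunk.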
Usage Example
Using the Component in an Index
This example shows an indexing pipeline using hierarchical splitting with block sizes of 512, 256, and 128 words.
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
  HierarchicalDocumentSplitter:
    type: deepset_cloud_custom_nodes.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
    init_parameters:
      block_sizes:
        - 512
        - 256
        - 128
      split_overlap: 0
      split_by: word
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: hierarchical-documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE
connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: HierarchicalDocumentSplitter.documents
  - sender: HierarchicalDocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| block_sizes | Set[int] | | Set of block sizes to split the document into. The blocks are split in descending order. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_by | Literal['word', 'sentence', 'page', 'passage'] | word | The unit for splitting your documents. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |