HierarchicalDocumentSplitter

Splits documents into blocks of different sizes, building a hierarchical tree structure of those blocks.

Basic Information

  • Type: haystack.components.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of HierarchicalDocument |

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

Splits documents into blocks of different sizes, building a hierarchical tree structure of those blocks.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are connected so that each smaller block is a child of its larger parent block.
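The tree described above can be illustrated with a minimal, self-contained sketch. This is not the component's implementation: the `Block` class and the `build_tree`, `split_words`, and `leaves` helpers are hypothetical names introduced only for illustration, and the sketch assumes word-based splitting with no overlap.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Block:
    """Hypothetical stand-in for one node of the hierarchy."""
    text: str
    block_size: int  # 0 marks the root, which holds the original document
    level: int       # depth in the tree; the root is level 0
    parent: Optional["Block"] = None
    children: List["Block"] = field(default_factory=list)


def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_tree(text: str, block_sizes: Set[int]) -> Block:
    """Build the tree top-down, applying block sizes in descending order."""
    root = Block(text=text, block_size=0, level=0)
    current_level = [root]
    for size in sorted(block_sizes, reverse=True):
        next_level = []
        for parent in current_level:
            for chunk in split_words(parent.text, size):
                child = Block(chunk, size, parent.level + 1, parent)
                parent.children.append(child)
                next_level.append(child)
        current_level = next_level
    return root


def leaves(node: Block) -> List[Block]:
    """Collect the smallest blocks (nodes without children), left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


root = build_tree("one two three four five six seven eight", {4, 2})
for leaf in leaves(root):
    print(leaf.level, repr(leaf.text), "<- parent:", repr(leaf.parent.text))
```

In this sketch the root holds the full text, the two 4-word blocks are its children, and each 4-word block is the parent of two 2-word leaf blocks.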

Usage Example

components:
  HierarchicalDocumentSplitter:
    type: haystack.components.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| block_sizes | Set[int] | | Set of block sizes to split the document into. The blocks are split in descending order. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_by | Literal['word', 'sentence', 'page', 'passage'] | word | The unit for splitting your documents. |
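To make the interaction between a block size and split_overlap concrete, here is a small sliding-window sketch over word units. It is an assumption about how overlapping splits typically work (each new window starts `split_length - split_overlap` units after the previous one), not a statement of the component's exact behavior; `split_with_overlap` and `split_length` are hypothetical names used only for this illustration.

```python
from typing import List


def split_with_overlap(units: List[str], split_length: int, split_overlap: int = 0) -> List[List[str]]:
    """Slide a window of `split_length` units, re-using `split_overlap` units each step."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    step = split_length - split_overlap  # how far the window advances each time
    windows: List[List[str]] = []
    for start in range(0, len(units), step):
        windows.append(units[start:start + split_length])
        # Stop once the current window already reaches the end of the input.
        if start + split_length >= len(units):
            break
    return windows


words = "a b c d e f g".split()
no_overlap = split_with_overlap(words, split_length=4, split_overlap=0)
with_overlap = split_with_overlap(words, split_length=4, split_overlap=1)
print(no_overlap)
print(with_overlap)
```

With an overlap of 1, the last word of one window is repeated as the first word of the next, so neighboring blocks share context at their boundary.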

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |