HierarchicalDocumentSplitter
Splits a document into blocks of different sizes, building a hierarchical tree structure of those blocks.
Basic Information
- Type: haystack_integrations.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of HierarchicalDocument objects. |
Overview
Splits a document into blocks of different sizes, building a hierarchical tree structure of those blocks.
The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are connected so that each smaller block is a child of the larger block it was split from.
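The tree described above can be illustrated with a minimal, standard-library sketch. It is not the component's implementation: the `Block` class and the `build_tree` and `leaves` helpers are hypothetical names introduced only for illustration, and splitting is assumed to be by word with no overlap.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Block:
    """Hypothetical stand-in for one node of the tree (not the component's own class)."""
    words: List[str]
    level: int
    # Exclude the back-reference from repr/eq to avoid infinite recursion.
    parent: Optional["Block"] = field(default=None, repr=False, compare=False)
    children: List["Block"] = field(default_factory=list)


def build_tree(text: str, block_sizes: Set[int]) -> Block:
    """Split text by words into nested blocks, processing the largest size first."""
    root = Block(words=text.split(), level=0)  # the root is the original document
    frontier = [root]
    for size in sorted(block_sizes, reverse=True):  # descending order
        next_frontier = []
        for parent in frontier:
            for start in range(0, len(parent.words), size):
                child = Block(parent.words[start:start + size], parent.level + 1, parent)
                parent.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def leaves(node: Block) -> List[Block]:
    """Collect the smallest blocks, which sit at the bottom of the tree."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


root = build_tree("one two three four five six seven eight", {4, 2})
for leaf in leaves(root):
    print(leaf.level, " ".join(leaf.words))
```

In this sketch the eight-word document becomes two four-word blocks, and each of those becomes two two-word leaves, so every leaf can be traced back to the root through its `parent` links.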
Usage Example
components:
  HierarchicalDocumentSplitter:
    type: components.preprocessors.hierarchical_document_splitter.HierarchicalDocumentSplitter
    init_parameters:
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| block_sizes | Set[int] | | Set of block sizes to split the document into. The blocks are split in descending order. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_by | Literal['word', 'sentence', 'page', 'passage'] | word | The unit for splitting your documents. |
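To make the interaction between a block size and `split_overlap` concrete, here is a minimal sketch for `split_by` set to `word`. The `split_words` function is a hypothetical helper, not part of the component, and the way the real component handles the final short block or validates the overlap is an assumption here.

```python
from typing import List


def split_words(text: str, block_size: int, split_overlap: int = 0) -> List[str]:
    """Split text into blocks of block_size words that share split_overlap words."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not 0 <= split_overlap < block_size:
        raise ValueError("split_overlap must be in the range [0, block_size)")

    words = text.split()
    step = block_size - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + block_size]))
        if start + block_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


no_overlap = split_words("a b c d e f", block_size=4)
with_overlap = split_words("a b c d e f", block_size=4, split_overlap=2)
print(no_overlap)    # ['a b c d', 'e f']
print(with_overlap)  # ['a b c d', 'c d e f']
```

With an overlap of 2, consecutive blocks share their two boundary words, which keeps some context on both sides of each split at the cost of duplicated text.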
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split into hierarchical blocks. |