RecursiveDocumentSplitter
Recursively chunk text into smaller pieces using a list of separators. This approach creates more semantically meaningful chunks compared to fixed-size splitting.
Key Features
- Splits text recursively using an ordered list of separators, from most general to most specific.
- Keeps chunks within the configured size limit, and re-splits larger chunks using the next separator.
- Supports splitting by word, character, or token count.
- Includes built-in sentence-aware splitting using NLTK.
- Uses default separators (paragraph, sentence, line, word) if none are specified.
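The recursive behavior described in the list above can be sketched in plain Python. This is a simplified illustration, not the component's implementation: the function name `recursive_split` and its parameters are hypothetical, it counts words only, and it leaves out the `"sentence"` separator and overlap handling.

```python
def recursive_split(text, separators, max_words):
    """Split text so that each chunk has at most max_words words (simplified sketch)."""
    if len(text.split()) <= max_words:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_words words.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate.split()) <= max_words:
            current = candidate  # the piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece.split()) <= max_words:
            current = piece
        else:
            # The piece alone is too large: re-split it with the next separator.
            chunks.extend(recursive_split(piece, rest, max_words))
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split(
    "one two three\n\nfour five six seven eight\n\nnine",
    separators=["\n\n", "\n", " "],
    max_words=3,
)
print(chunks)  # ['one two three', 'four five six', 'seven eight', 'nine']
```

The middle paragraph exceeds the limit, so only that paragraph is handed down to the finer separators, while the paragraphs that already fit are kept whole.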
Configuration
- Drag the RecursiveDocumentSplitter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Set Split Length to the maximum chunk size.
  - Set Split Overlap to the number of units to overlap between consecutive chunks.
  - Choose a Split Unit: word, char, or token.
  - Set Separators to define an ordered list of separator strings. Use "sentence" as a separator value to enable sentence-aware splitting.
  - Set Sentence Splitter Params to pass optional parameters to the NLTK sentence tokenizer.
Connections
RecursiveDocumentSplitter accepts a list of Document objects and outputs a list of Document objects with smaller text chunks.
It typically receives documents from converters or DocumentCleaner, and sends split documents to embedders or DocumentWriter.
Source Code
To check this component's source code, open recursive_splitter.py in the Haystack repository.
Usage Examples
Basic Configuration
RecursiveDocumentSplitter:
  type: haystack.components.preprocessors.recursive_splitter.RecursiveDocumentSplitter
  init_parameters:
    split_length: 200
    split_overlap: 20
    split_unit: word
    separators:
      - "\n\n"
      - sentence
      - "\n"
      - ' '
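To make the `split_overlap` setting in this configuration more concrete, here is a minimal sketch of what overlapping by word units means. The helper `add_word_overlap` is hypothetical and only illustrates the idea; how the component itself keeps overlapped chunks within `split_length` is not covered here.

```python
def add_word_overlap(chunks, overlap):
    """Prefix each chunk (except the first) with the last `overlap` words of the previous chunk."""
    if overlap <= 0:
        return list(chunks)
    result = []
    previous_words = []
    for chunk in chunks:
        words = chunk.split()
        tail = previous_words[-overlap:]  # empty for the first chunk
        result.append(" ".join(tail + words))
        previous_words = words  # overlap is taken from the original chunk, not the padded one
    return result


overlapped = add_word_overlap(["a b c d", "e f g", "h i"], overlap=2)
print(overlapped)  # ['a b c d', 'c d e f g', 'f g h i']
```

With `split_overlap: 20` and `split_unit: word`, the same idea applies with 20 shared words between neighboring chunks.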
Using the Component in an Index
This example shows an index using recursive splitting with custom separators.
# haystack-pipeline
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
  RecursiveDocumentSplitter:
    type: haystack.components.preprocessors.recursive_splitter.RecursiveDocumentSplitter
    init_parameters:
      split_length: 200
      split_overlap: 20
      split_unit: word
      separators:
        - "\n\n"
        - sentence
        - "\n"
        - " "
      sentence_splitter_params:
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE
connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: RecursiveDocumentSplitter.documents
  - sender: RecursiveDocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of documents to split. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of documents with smaller chunks of text. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_length | int | 200 | The maximum length of each chunk, measured in words by default. It can also be measured in characters or tokens. See the split_unit parameter. |
| split_overlap | int | 0 | The number of units (words, characters, or tokens, depending on split_unit) to overlap between consecutive chunks. |
| split_unit | Literal['word', 'char', 'token'] | word | The unit of the split_length parameter. It can be "word", "char", or "token". If "token" is selected, the text is split into tokens using the tiktoken tokenizer (o200k_base). |
| separators | Optional[List[str]] | None | An optional list of separator strings to use for splitting the text. Separator strings are treated as regular expressions, except for "sentence", which splits the text into sentences using a custom sentence tokenizer based on NLTK. If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used. |
| sentence_splitter_params | Optional[Dict[str, Any]] | None | Optional parameters to pass to the sentence tokenizer. |
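Because the table describes separator strings as regular expressions, a custom separator that contains regex metacharacters may need attention. The sketch below uses only Python's standard `re` module to show the difference between a raw and an escaped pattern. The separator `"***"` is a made-up example, and whether the component escapes separators internally is not stated here, so treat this as an illustration of regex behavior rather than of the component.

```python
import re

text = "v1.0 notes***v2.0 notes"

# Hypothetical custom separator made of regex metacharacters.
raw_separator = "***"

# re.escape produces a pattern that matches the separator literally.
pattern = re.escape(raw_separator)
parts = re.split(pattern, text)
print(parts)  # ['v1.0 notes', 'v2.0 notes']

# Unescaped, "***" is not a valid regular expression ("nothing to repeat").
try:
    re.compile(raw_separator)
    compiles = True
except re.error:
    compiles = False
print(compiles)  # False
```

Separators such as `"\n\n"` or `" "` contain no metacharacters, so they behave the same either way.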
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of documents to split. |