RecursiveDocumentSplitter

Recursively chunk text into smaller pieces using a list of separators.

Basic Information

  • Type: haystack.components.preprocessors.recursive_splitter.RecursiveDocumentSplitter
  • Components it can connect with:
    • Converters: RecursiveDocumentSplitter receives documents to split from converters.
    • DocumentCleaner: RecursiveDocumentSplitter can receive cleaned documents from DocumentCleaner.
    • Embedders: RecursiveDocumentSplitter sends split documents to document embedders.
    • DocumentWriter: RecursiveDocumentSplitter can send documents to DocumentWriter for storage.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents to split. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents with smaller chunks of text. |

Overview

RecursiveDocumentSplitter splits text into smaller chunks by recursively applying a list of separators. This approach creates more semantically meaningful chunks compared to simple fixed-size splitting.

The component applies the separators in order, typically from most general to most specific. For each separator:

  1. The text is split by that separator.
  2. Chunks within the split_length are kept.
  3. Chunks larger than split_length are split again using the next separator.

This continues until all chunks fit within the split_length parameter.

The default separators split by paragraph, then sentence, then line, then word, ensuring chunks respect natural text boundaries.
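
To see the algorithm in action outside of a pipeline, here is a minimal Python sketch; the sample text and parameter values are illustrative, not defaults from this page:

from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Try paragraph breaks first, then sentences, then line breaks, then spaces
# (the default separators), keeping each chunk within 50 words.
splitter = RecursiveDocumentSplitter(split_length=50, split_overlap=0, split_unit="word")
splitter.warm_up()  # initializes the sentence tokenizer used by the "sentence" separator

doc = Document(content="First paragraph.\n\nSecond paragraph. It has two sentences.")
result = splitter.run(documents=[doc])
for chunk in result["documents"]:
    print(repr(chunk.content))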

Usage Example

Using the Component in an Index

This example shows an index using recursive splitting with custom separators.

components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
  RecursiveDocumentSplitter:
    type: haystack.components.preprocessors.recursive_splitter.RecursiveDocumentSplitter
    init_parameters:
      split_length: 200
      split_overlap: 20
      split_unit: word
      separators:
        - "\n\n"
        - sentence
        - "\n"
        - " "
      sentence_splitter_params:
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE

connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: RecursiveDocumentSplitter.documents
  - sender: RecursiveDocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - TextFileToDocument.sources
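
The YAML above is meant for Pipeline Builder. For local experimentation, a rough Python equivalent might look like the sketch below; it swaps OpenSearch for an InMemoryDocumentStore so it runs self-contained, and my_file.txt is a hypothetical input file:

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

store = InMemoryDocumentStore()  # stand-in for the OpenSearch store in the YAML

pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument(encoding="utf-8"))
pipeline.add_component("cleaner", DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True))
pipeline.add_component(
    "splitter",
    RecursiveDocumentSplitter(
        split_length=200,
        split_overlap=20,
        split_unit="word",
        separators=["\n\n", "sentence", "\n", " "],
    ),
)
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipeline.add_component("writer", DocumentWriter(document_store=store))

pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "embedder.documents")
pipeline.connect("embedder.documents", "writer.documents")

pipeline.run({"converter": {"sources": ["my_file.txt"]}})  # hypothetical input file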

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_length | int | 200 | The maximum length of each chunk, measured in words by default; use the split_unit parameter to measure in characters or tokens instead. |
| split_overlap | int | 0 | The number of characters to overlap between consecutive chunks. |
| split_unit | Literal['word', 'char', 'token'] | word | The unit of the split_length parameter: "word", "char", or "token". If "token" is selected, the text is split into tokens using the tiktoken tokenizer (o200k_base). |
| separators | Optional[List[str]] | None | An optional list of separator strings to use for splitting the text. Separators are treated as regular expressions, except for the special separator "sentence", in which case the text is split into sentences using a custom sentence tokenizer based on NLTK. If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used. |
| sentence_splitter_params | Optional[Dict[str, Any]] | None | Optional parameters to pass to the sentence tokenizer. |
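
As an illustration of the token-based mode, a configuration might look like the sketch below. The keys accepted by sentence_splitter_params depend on the underlying sentence tokenizer, so the "language" key shown here is an assumption to verify against your Haystack version:

from haystack.components.preprocessors import RecursiveDocumentSplitter

# Keep each chunk within 256 tokens, counted with tiktoken's o200k_base encoding.
# The sentence tokenizer option below ("language") is an assumption; check the
# sentence tokenizer parameters supported by your Haystack version.
splitter = RecursiveDocumentSplitter(
    split_length=256,
    split_unit="token",
    separators=["\n\n", "sentence", "\n", " "],
    sentence_splitter_params={"language": "en"},
)
splitter.warm_up()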

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents to split. |