
DocumentSplitter

Split long documents into smaller chunks to prepare data for search indexing.

Key Features

  • Splits documents by word, sentence, page, passage, line, or a custom function.
  • Configurable chunk size (split_length) and overlap (split_overlap) between consecutive chunks.
  • Merges chunks smaller than a configurable threshold with the previous chunk.
  • Respects sentence boundaries when splitting by word to avoid breaking mid-sentence.
  • Supports custom splitting functions for advanced use cases.
  • Tracks source_id, page_number, and split_id in split document metadata.
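The interaction of split_length, split_overlap, and split_threshold when splitting by word can be sketched in plain Python. This is an illustrative simplification, not the component's actual implementation:

```python
def split_by_word(text, split_length=200, split_overlap=0, split_threshold=0):
    """Illustrative word-based chunking with overlap and a merge threshold."""
    words = text.split()  # "word" splitting means splitting on spaces
    step = split_length - split_overlap  # each chunk starts this many words after the last
    chunks = [words[i:i + split_length] for i in range(0, len(words), step)]
    # A trailing chunk smaller than the threshold is merged into the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        chunks[-2].extend(chunks[-1])
        chunks.pop()
    return [" ".join(chunk) for chunk in chunks]
```

For example, with split_length=4 and split_overlap=2, each chunk repeats the last two words of the previous chunk, which helps preserve context across chunk boundaries during retrieval.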

Configuration

  1. Drag the DocumentSplitter component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed.

Connections

DocumentSplitter accepts a list of documents (documents) as input and outputs a list of split documents (documents). Each output document includes source_id and page_number metadata fields.

Connect converters or DocumentCleaner to the input. Connect the output to document embedders such as SentenceTransformersDocumentEmbedder, or directly to DocumentWriter for storage.
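The page_number metadata field can be understood as a counter that advances on each form feed character ("\f") encountered in the chunk text. The following is a simplified sketch of that idea, not the component's implementation:

```python
def assign_page_numbers(chunks):
    """Assign a 1-based page number to each chunk, advancing the counter
    for every form feed ("\f") the chunk contains."""
    page = 1
    numbered = []
    for chunk in chunks:
        numbered.append({"content": chunk, "page_number": page})
        page += chunk.count("\f")
    return numbered
```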

Usage Example

Using the Component in an Index

This example shows a typical index where DocumentSplitter chunks documents after cleaning and before embedding.

```yaml
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 20
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      trust_remote_code: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE

connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - TextFileToDocument.sources
```

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | The documents to split. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents with split texts. Each document includes the source_id and page_number metadata fields. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit for splitting your documents. Choose from: `word` to split by spaces (" "), `period` to split by periods ("."), `page` to split by form feed ("\f"), `passage` to split by double line breaks ("\n\n"), `line` to split each line ("\n"), `sentence` to split with the NLTK sentence tokenizer, or `function` to split with a custom function. |
| split_length | int | 200 | The maximum number of units in each split. |
| split_overlap | int | 0 | The number of units that overlap between consecutive splits. |
| split_threshold | int | 0 | The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split. |
| splitting_function | Optional[Callable[[str], List[str]]] | None | Required when split_by is set to "function". A function that accepts a single str as input and returns a list of str representing the chunks after splitting. |
| respect_sentence_boundary | bool | False | Whether to respect sentence boundaries when splitting by "word". If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences. |
| language | Language | en | The language for the NLTK tokenizer. The default is English ("en"). |
| use_split_rules | bool | True | Whether to apply additional split rules when splitting by sentence. |
| extend_abbreviations | bool | True | Whether to extend NLTK's PunktTokenizer abbreviations with a curated list, if available. Currently supported for English ("en") and German ("de"). |
| skip_empty_documents | bool | True | Whether to skip documents with empty content. Set to False when downstream components in the pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents. |
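When split_by is set to "function", splitting_function receives the full document text and returns the list of chunks. The example below is a hypothetical splitting function that chunks Markdown text at level-2 headings; the heading-based logic is illustrative, not part of the component:

```python
import re

def split_on_markdown_h2(text: str) -> list[str]:
    """Split text at level-2 Markdown headings ("## ..."), keeping each
    heading together with the section that follows it. A function with
    this signature (str -> list of str) can serve as splitting_function."""
    # Split at zero-width positions immediately before a "## " at line start.
    parts = re.split(r"(?m)^(?=## )", text)
    return [part for part in parts if part.strip()]
```

In Pipeline Builder, you would select `function` for split_by and supply such a function as splitting_function.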

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | The documents to split. |