DocumentSplitter
Split long documents into smaller chunks. Use this component in your indexes to prepare data for search.
Key Features
- Splits documents by word, sentence, passage, page, line, or a custom function.
- Supports configurable chunk size and overlap between chunks.
- Optionally respects sentence boundaries when splitting by word.
- Adds metadata to each split, including source document ID, page number, and split order.
- Merges very small splits with the previous chunk using a configurable threshold.
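The interaction between chunk size and overlap can be illustrated with a minimal standard-library sketch. This is not the component's implementation: `split_words` is a hypothetical helper, and it normalizes whitespace with `str.split()`, which the real component may not do.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into chunks of at most `split_length` words, where
    consecutive chunks share `split_overlap` words (illustrative only)."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    # Each new chunk starts this many words after the previous one.
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk reaches the end of the text, so no trailing
        # chunk consists only of words that were already covered.
        if start + split_length >= len(words):
            break
    return chunks


chunks = split_words("a b c d e f g h i j", split_length=4, split_overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

With a split length of 4 and an overlap of 1, each chunk repeats the last word of the chunk before it, which helps keep context that would otherwise be cut at a chunk border.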
Configuration
- Drag the DocumentSplitter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Choose a Split By unit: word, sentence, passage, page, line, or function.
  - Set Split Length to the maximum number of units in each chunk.
  - Set Split Overlap to the number of overlapping units between consecutive chunks.
  - Set Split Threshold to define the minimum number of units per chunk. Chunks smaller than this value are merged with the previous chunk.
  - Set Splitting Function if you chose function as the split unit. The function must accept a string and return a list of strings.
  - Toggle Respect Sentence Boundary to avoid splitting in the middle of a sentence when splitting by word.
  - Set Language for the NLTK sentence tokenizer (default: en).
  - Toggle Use Split Rules to apply additional splitting heuristics for sentence splitting.
  - Toggle Extend Abbreviations to improve sentence splitting accuracy for English and German.
  - Toggle Skip Empty Documents to skip documents with empty content.
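As an illustration of the Splitting Function contract (a string in, a list of strings out), here is a minimal sketch. The function name `split_on_markdown_headings` and the heading-based strategy are hypothetical choices for this example, not something the component provides.

```python
import re
from typing import List


def split_on_markdown_headings(text: str) -> List[str]:
    """Hypothetical splitting function: start a new chunk at every line
    that begins with '#'. Accepts a str and returns a list of str."""
    # The zero-width lookahead splits before each heading line, so the
    # heading stays attached to the chunk it opens.
    parts = re.split(r"(?m)^(?=#)", text)
    # Drop empty pieces, such as the one before a heading on the first line.
    return [part for part in parts if part.strip()]


sample = "# A\nintro\n# B\nbody\n"
print(split_on_markdown_headings(sample))  # ['# A\nintro\n', '# B\nbody\n']
```

In Python code, a function with this shape would be passed as `splitting_function` with `split_by` set to `function`.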
Connections
DocumentSplitter accepts a list of Document objects and outputs a list of split Document objects. Each output document includes source_id and page_number metadata fields.
It typically receives documents from converters or DocumentCleaner, and sends split documents to embedders like SentenceTransformersDocumentEmbedder or to DocumentWriter.
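To make the metadata fields concrete, the sketch below splits a text by page (form feed) and records where each split came from. It uses plain dictionaries instead of Document objects, `split_by_page` is a hypothetical helper, and the `split_id` key for the split order is an assumption. The source_id and page_number keys follow the description above.

```python
from typing import Any, Dict, List


def split_by_page(doc_id: str, content: str) -> List[Dict[str, Any]]:
    """Split on form feeds and attach provenance metadata (illustrative only)."""
    splits: List[Dict[str, Any]] = []
    for page_index, page_text in enumerate(content.split("\f")):
        if not page_text.strip():
            continue  # skip blank pages but keep counting them
        splits.append(
            {
                "content": page_text,
                "meta": {
                    "source_id": doc_id,            # ID of the original document
                    "page_number": page_index + 1,  # pages are counted from 1
                    "split_id": len(splits),        # assumed key: order of the split
                },
            }
        )
    return splits


pages = split_by_page("doc-1", "first page\fsecond page\f\ffourth page")
print([p["meta"]["page_number"] for p in pages])  # [1, 2, 4]
```

Because every split carries the ID of its source document, search results built on the chunks can be traced back to the original file and page.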
Source Code
To check this component's source code, open document_splitter.py in the Haystack repository.
Usage Examples
Basic Configuration
DocumentSplitter:
  type: haystack.components.preprocessors.document_splitter.DocumentSplitter
  init_parameters:
    split_by: word
    split_length: 200
    split_overlap: 20
    split_threshold: 0
    respect_sentence_boundary: false
    language: en
    use_split_rules: true
    extend_abbreviations: true
Using the Component in an Index
This example shows a typical index where DocumentSplitter chunks documents after cleaning and before embedding.
# haystack-pipeline
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 20
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      trust_remote_code: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE
connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - TextFileToDocument.sources
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | The documents to split. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | List of documents with split texts. Each document includes source_id and page_number metadata fields. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit for splitting your documents. Choose from: word for splitting by spaces (" "), period for splitting by periods ("."), page for splitting by form feed ("\f"), passage for splitting by double line breaks ("\n\n"), line for splitting each line ("\n"), or sentence for splitting by NLTK sentence tokenizer. |
| split_length | int | 200 | The maximum number of units in each split. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_threshold | int | 0 | The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split. |
| splitting_function | Optional[Callable[[str], List[str]]] | None | Necessary when split_by is set to "function". This is a function which must accept a single str as input and return a list of str as output, representing the chunks after splitting. |
| respect_sentence_boundary | bool | False | Choose whether to respect sentence boundaries when splitting by "word". If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences. |
| language | Language | en | Choose the language for the NLTK tokenizer. The default is English ("en"). |
| use_split_rules | bool | True | Choose whether to use additional split rules when splitting by sentence. |
| extend_abbreviations | bool | True | Choose whether to extend NLTK's PunktTokenizer abbreviations with a list of curated abbreviations, if available. This is currently supported for English ("en") and German ("de"). |
| skip_empty_documents | bool | True | Choose whether to skip documents with empty content. Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents. |
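The effect of split_threshold can be pictured with a short sketch. It assumes splits without overlap, represents each split as a list of units, and uses a hypothetical helper name. The component's actual merging logic may differ in details.

```python
from typing import List


def merge_small_splits(splits: List[List[str]], split_threshold: int) -> List[List[str]]:
    """Attach any split with fewer than `split_threshold` units to the
    split before it (illustrative only, assumes no overlap)."""
    merged: List[List[str]] = []
    for units in splits:
        if merged and len(units) < split_threshold:
            # Too small to stand alone: append it to the previous split.
            merged[-1] = merged[-1] + units
        else:
            merged.append(list(units))
    return merged


splits = [["a", "b", "c"], ["d", "e", "f"], ["g"]]
print(merge_small_splits(splits, split_threshold=2))
# [['a', 'b', 'c'], ['d', 'e', 'f', 'g']]
```

With the default threshold of 0, no split counts as too small, so the input comes back unchanged. A merged split can exceed split_length, which is worth keeping in mind when an embedder has a strict input limit.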
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | The documents to split. |