DocumentSplitter

Split long documents into smaller chunks. Use this component in your indexes to prepare data for search.

Basic Information

  • Type: haystack.components.preprocessors.document_splitter.DocumentSplitter
  • Components it can connect with:
    • Converters: DocumentSplitter receives documents from converters.
    • DocumentCleaner: DocumentSplitter can receive cleaned documents from DocumentCleaner.
    • Embedders: DocumentSplitter sends split documents to document embedders like SentenceTransformersDocumentEmbedder.
    • DocumentWriter: DocumentSplitter can send documents to DocumentWriter for storage.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | The documents to split. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of documents with split texts. Each document includes source_id and page_number metadata fields. |

Overview

DocumentSplitter divides long documents into smaller chunks. This is a common preprocessing step during indexing that helps embedders create meaningful semantic representations and prevents exceeding language model context limits.

The component splits documents by the specified unit (split_by) after a certain number of units (split_length) with optional overlap (split_overlap):

  • split_by: The unit for splitting - word, sentence, passage (paragraph), page, line, period, or function
  • split_length: The maximum number of units in each chunk
  • split_overlap: The number of overlapping units between chunks
  • split_threshold: The minimum number of units per chunk (smaller chunks are attached to the previous one)
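For example, the following minimal sketch (assuming Haystack 2.x; the sample content is illustrative) splits one document into overlapping word-based chunks:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Split after every 200 words, with 20 words of overlap between consecutive chunks.
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=20, split_threshold=0)

doc = Document(content="A long text that you want to index for search. " * 100)  # illustrative content
result = splitter.run(documents=[doc])

print(len(result["documents"]))        # number of chunks produced
print(result["documents"][0].content)  # text of the first chunk
```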

Each split document includes metadata:

  • source_id: Tracks the original document
  • page_number: Tracks the original page number
  • split_id: The order of the split

When splitting by word, you can set respect_sentence_boundary to ensure splits occur only between sentences.
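The sketch below (assuming Haystack 2.x with nltk installed; the sample sentences are illustrative) splits by word while respecting sentence boundaries and prints the metadata added to each chunk:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(
    split_by="word",
    split_length=50,
    respect_sentence_boundary=True,  # uses the NLTK tokenizer, so chunks never cut a sentence in half
    language="en",
)
splitter.warm_up()  # loads the NLTK sentence tokenizer; pipelines call this automatically

doc = Document(content="First sentence. Second sentence. Third sentence.")
chunks = splitter.run(documents=[doc])["documents"]

for chunk in chunks:
    # Each chunk carries source_id, page_number, and split_id metadata.
    print(chunk.meta["source_id"], chunk.meta["page_number"], chunk.meta["split_id"])
```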

Usage Example

Using the Component in an Index

This example shows a typical index where DocumentSplitter chunks documents after cleaning and before embedding.

```yaml
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  DocumentCleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 20
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      trust_remote_code: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: documents-index
          max_chunk_bytes: 104857600
          embedding_dim: 384
          return_embedding: false
          create_index: true
          similarity: cosine
      policy: NONE

connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentCleaner.documents
  - sender: DocumentCleaner.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - TextFileToDocument.sources
```
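The same flow can also be assembled in Python. This is a sketch assuming Haystack 2.x; it swaps the OpenSearch store for InMemoryDocumentStore so it runs without extra integrations, and my_file.txt is a placeholder path:

```python
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()  # stand-in for the OpenSearchDocumentStore above

pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200, split_overlap=20))
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "embedder.documents")
pipeline.connect("embedder.documents", "writer.documents")

pipeline.run({"converter": {"sources": ["my_file.txt"]}})  # placeholder file path
```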

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit for splitting your documents. Choose from: word for splitting by spaces (" "), period for splitting by periods ("."), page for splitting by form feed ("\f"), passage for splitting by double line breaks ("\n\n"), line for splitting each line ("\n"), sentence for splitting by the NLTK sentence tokenizer, or function for splitting with a custom splitting_function. |
| split_length | int | 200 | The maximum number of units in each split. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_threshold | int | 0 | The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split. |
| splitting_function | Optional[Callable[[str], List[str]]] | None | Required when split_by is set to "function". A function that accepts a single str as input and returns a list of str, representing the chunks after splitting. |
| respect_sentence_boundary | bool | False | Choose whether to respect sentence boundaries when splitting by "word". If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences. |
| language | Language | en | Choose the language for the NLTK tokenizer. The default is English ("en"). |
| use_split_rules | bool | True | Choose whether to use additional split rules when splitting by sentence. |
| extend_abbreviations | bool | True | Choose whether to extend NLTK's PunktTokenizer abbreviations with a list of curated abbreviations, if available. This is currently supported for English ("en") and German ("de"). |
| skip_empty_documents | bool | True | Choose whether to skip documents with empty content. Set to False when downstream components in the pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents. |
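To illustrate splitting_function with split_by set to "function", here is a hedged sketch; the "---" separator and the helper name are hypothetical:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

def split_on_divider(text: str) -> list[str]:
    # Hypothetical splitting rule: treat "---" as a chunk divider.
    return text.split("---")

splitter = DocumentSplitter(split_by="function", splitting_function=split_on_divider)
chunks = splitter.run(documents=[Document(content="part one---part two---part three")])["documents"]
print([c.content for c in chunks])  # ['part one', 'part two', 'part three']
```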

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | The documents to split. |