DocumentPreprocessor

A SuperComponent that splits and cleans documents in a single step. It combines DocumentSplitter and DocumentCleaner into one component for use in indexing pipelines.

Key Features

  • Splits documents into smaller chunks using configurable units such as words, sentences, pages, or paragraphs.
  • Cleans documents by removing empty lines, extra whitespace, and repeated substrings such as headers and footers.
  • Supports overlapping splits to preserve context across chunks.
  • Respects sentence boundaries when splitting by word.
  • Supports custom splitting functions.

Configuration

  1. Drag the DocumentPreprocessor component onto the canvas from the Component Library.
  2. Click on the component to open the configuration panel.
  3. Configure the component settings:
    • Choose a Split By unit: word, sentence, passage, page, line, period, or function.
    • Set Split Length to the maximum number of units in each chunk.
    • Set Split Overlap to the number of overlapping units between consecutive chunks.
    • Toggle Remove Empty Lines to remove empty lines from the document.
    • Toggle Remove Extra Whitespaces to remove extra whitespaces.
    • Set Split Threshold to define the minimum number of units per chunk. Chunks smaller than this threshold are merged with the previous chunk.
    • Toggle Respect Sentence Boundary to avoid splitting in the middle of a sentence when splitting by word.
    • Set Language for the NLTK sentence tokenizer (default: en).
    • Toggle Remove Repeated Substrings to remove repeated headers and footers across pages.
    • Toggle Keep ID to retain the original document IDs.
    • Set Remove Substrings to specify a list of strings to remove.
    • Set Remove Regex to specify a regular expression pattern whose matches are removed.
    • Set Unicode Normalization to the normalization form to apply to the text: NFC, NFKC, NFD, or NFKD.
    • Toggle ASCII Only to convert text to ASCII only.
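
How Split Length and Split Overlap interact can be illustrated with a small standalone sketch. It is a simplified model of word-based splitting, not the component's actual implementation: the function name split_words is hypothetical, and the sketch ignores sentence boundaries and the split threshold.

```python
def split_words(text: str, split_length: int, split_overlap: int) -> list[str]:
    """Split text into windows of at most split_length words.

    Consecutive windows share split_overlap words. Illustrative only.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = split_words(
    "one two three four five six seven eight nine ten",
    split_length=4,
    split_overlap=1,
)
for chunk in chunks:
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

Each chunk starts with the last word of the previous one, which is the context the overlap preserves.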

Connections

DocumentPreprocessor accepts a list of Document objects and outputs processed Document objects.

It's typically used in indexes between a document converter and a document embedder, or before DocumentWriter.

Source Code

To check this component's source code, open document_preprocessor.py in the Haystack repository.

Usage Examples

Basic Configuration

DocumentPreprocessor:
  type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
  init_parameters:
    split_by: word
    split_length: 200
    split_overlap: 30
    respect_sentence_boundary: true
    language: en
    remove_empty_lines: true
    remove_extra_whitespaces: true
    remove_repeated_substrings: false

In an Index

This index pipeline uses DocumentPreprocessor to split and clean documents before embedding and writing them to a Document Store:

# haystack-pipeline
components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8

  DocumentPreprocessor:
    type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false

  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: OVERWRITE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
          create_index: true

connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentPreprocessor.documents
  - sender: DocumentPreprocessor.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

inputs:
  files:
    - MultiFileConverter.sources

In this example:

  1. MultiFileConverter converts uploaded files into documents.
  2. DocumentPreprocessor splits documents into chunks of up to 200 words with a 30-word overlap between consecutive chunks, respecting sentence boundaries. It also removes empty lines and extra whitespace.
  3. SentenceTransformersDocumentEmbedder generates embeddings for the processed documents.
  4. DocumentWriter writes the embedded documents to OpenSearch.
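
To get a feel for how many chunks these settings produce, the arithmetic can be sketched as below. This is a rough estimate under a plain sliding-window assumption: the helper name estimate_chunk_count is hypothetical, and because the example enables respect_sentence_boundary, real chunk counts can differ.

```python
import math


def estimate_chunk_count(total_words: int, split_length: int = 200, split_overlap: int = 30) -> int:
    """Estimate the number of chunks for a document of total_words words.

    Assumes a plain sliding window; sentence boundaries are ignored.
    """
    if total_words <= 0:
        return 0
    if total_words <= split_length:
        return 1
    step = split_length - split_overlap  # each new chunk advances by 170 words by default
    return 1 + math.ceil((total_words - split_length) / step)


print(estimate_chunk_count(1000))  # 6
print(estimate_chunk_count(150))   # 1
```

With a 200-word length and a 30-word overlap, each new chunk adds 170 new words, so a 1,000-word document yields roughly six chunks to embed.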

Parameters

Inputs

Parameter | Type | Description
documents | List[Document] | Documents to process.

Outputs

Parameter | Type | Description
documents | List[Document] | Processed list of documents.

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Splitter Parameters:

Parameter | Type | Default | Description
split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit of splitting.
split_length | int | 250 | The maximum number of units (words, lines, pages, etc.) in each split.
split_overlap | int | 0 | The number of overlapping units between consecutive splits.
split_threshold | int | 0 | The minimum number of units per split. If a split is smaller, it's merged with the previous split.
splitting_function | Optional[Callable] | None | A custom function for splitting if split_by="function".
respect_sentence_boundary | bool | False | If True, splits by words but tries not to break inside a sentence.
language | str | en | Language used by the sentence tokenizer.
use_split_rules | bool | True | Whether to apply additional splitting heuristics for the sentence splitter.
extend_abbreviations | bool | True | Whether to extend the sentence splitter with curated abbreviations for certain languages.
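
The effect of split_threshold can be sketched as follows. This is a minimal illustration of the merge rule described above, not the component's own code: the function name merge_small_splits is hypothetical, and the example uses no overlap to keep the result easy to read.

```python
def merge_small_splits(splits: list[list[str]], split_threshold: int) -> list[list[str]]:
    """Merge any split with fewer than split_threshold units into the previous split."""
    merged: list[list[str]] = []
    for split in splits:
        if merged and len(split) < split_threshold:
            merged[-1] = merged[-1] + split  # too small: attach to the previous split
        else:
            merged.append(split)
    return merged


units = "a b c d e f g".split()
# Splits of three units each leave a one-unit tail: [a b c], [d e f], [g]
splits = [units[i:i + 3] for i in range(0, len(units), 3)]
result = merge_small_splits(splits, split_threshold=2)
print(result)  # [['a', 'b', 'c'], ['d', 'e', 'f', 'g']]
```

With the default threshold of 0, no split counts as too small, so nothing is merged.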

Cleaner Parameters:

Parameter | Type | Default | Description
remove_empty_lines | bool | True | If True, removes empty lines.
remove_extra_whitespaces | bool | True | If True, removes extra whitespaces.
remove_repeated_substrings | bool | False | If True, removes repeated substrings like headers/footers across pages.
keep_id | bool | False | If True, keeps the original document IDs.
remove_substrings | Optional[List[str]] | None | A list of strings to remove from the document content.
remove_regex | Optional[str] | None | A regex pattern whose matches are removed from the document content.
unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text.
ascii_only | bool | False | If True, converts text to ASCII only.
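
What several of these cleaning options do to a piece of text can be approximated with the Python standard library. The sketch below is an assumption-laden illustration, not the component's implementation: the function name clean_text, the order of the steps, and the exact regular expressions are all hypothetical choices made for this example.

```python
import re
import unicodedata
from typing import Optional


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_regex: Optional[str] = None,
    unicode_normalization: Optional[str] = None,
    ascii_only: bool = False,
) -> str:
    """Approximate a subset of the cleaner options on a plain string."""
    if unicode_normalization:
        text = unicodedata.normalize(unicode_normalization, text)
    if ascii_only:
        # Decompose accented characters, then drop everything outside ASCII.
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    if remove_empty_lines:
        text = "\n".join(line for line in text.split("\n") if line.strip())
    if remove_extra_whitespaces:
        # Collapse runs of spaces and tabs; line breaks are kept in this sketch.
        text = re.sub(r"[ \t]+", " ", text).strip()
    return text


raw = "Caf\u00e9  menu\n\n\nPage 1\nOpen   daily"
cleaned = clean_text(raw, remove_regex=r"Page \d+", ascii_only=True)
print(repr(cleaned))  # 'Cafe menu\nOpen daily'
```

Here the accented character is reduced to its ASCII base letter, the "Page 1" marker is removed by the regex, and the resulting empty lines and repeated spaces are dropped.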

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter | Type | Description
documents | List[Document] | Documents to process.