DocumentPreprocessor

A SuperComponent that combines DocumentSplitter and DocumentCleaner to split and clean documents in a single step.

Key Features

  • Combines document splitting and cleaning into one component, reducing pipeline complexity.
  • Splits documents by word, sentence, page, passage, line, or a custom function.
  • Configurable chunk size, overlap, and minimum chunk threshold.
  • Removes empty lines, extra whitespace, and repeated substrings such as headers and footers.
  • Supports Unicode normalization and ASCII-only conversion.
  • Respects sentence boundaries when splitting by word.
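
To make the splitting behavior concrete, here is a minimal plain-Python sketch of word-based splitting with overlap. This is an illustration of the idea only, not the Haystack implementation (it ignores sentence boundaries and document metadata):

```python
def split_by_word(text: str, split_length: int, split_overlap: int) -> list[str]:
    """Split text into word chunks of at most split_length words,
    where consecutive chunks share split_overlap words."""
    words = text.split()
    step = split_length - split_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

print(split_by_word("one two three four five six seven eight",
                    split_length=5, split_overlap=2))
# → ['one two three four five', 'four five six seven eight']
```

Note how the second chunk repeats the last two words of the first: overlap preserves context across chunk boundaries, which helps retrieval quality at the cost of some redundancy.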

Configuration

  1. Drag the DocumentPreprocessor component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed.

Connections

DocumentPreprocessor accepts a list of documents (documents) as input and outputs a list of processed documents (documents).

Connect any component that produces documents to the input — typically a converter in an indexing pipeline. Connect the output to document embedders or DocumentWriter.

Usage Example

This indexing pipeline uses DocumentPreprocessor to split and clean documents before embedding and writing them to a Document Store:

```yaml
components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8

  DocumentPreprocessor:
    type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false

  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: OVERWRITE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
          create_index: true

connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentPreprocessor.documents
  - sender: DocumentPreprocessor.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

inputs:
  files:
    - MultiFileConverter.sources
```

In this example:

  1. MultiFileConverter converts uploaded files into documents.
  2. DocumentPreprocessor splits documents into chunks of 200 words with 30-word overlap, respecting sentence boundaries. It also removes empty lines and extra whitespace.
  3. SentenceTransformersDocumentEmbedder generates embeddings for the processed documents.
  4. DocumentWriter writes the embedded documents to OpenSearch.
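
With split_length=200 and split_overlap=30, each new chunk advances by 200 − 30 = 170 words. The following sketch estimates the resulting chunk count for a document of a given word count (an approximation only; the actual count can differ once sentence-boundary adjustments and split_threshold merging apply):

```python
import math

def approx_chunks(n_words: int, split_length: int = 200, split_overlap: int = 30) -> int:
    """Rough chunk count for word splitting with overlap."""
    if n_words <= split_length:
        return 1  # the whole document fits in a single chunk
    step = split_length - split_overlap  # words each chunk advances by
    return math.ceil((n_words - split_length) / step) + 1

print(approx_chunks(1000))  # → 6
```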

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents to process. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Processed list of documents. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Splitter Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit of splitting. |
| split_length | int | 250 | The maximum number of units (words, lines, pages, and so on) in each split. |
| split_overlap | int | 0 | The number of overlapping units between consecutive splits. |
| split_threshold | int | 0 | The minimum number of units per split. If a split is smaller, it's merged with the previous split. |
| splitting_function | Optional[Callable] | None | A custom function for splitting if split_by="function". |
| respect_sentence_boundary | bool | False | If True, splits by words but tries not to break inside a sentence. |
| language | str | en | Language used by the sentence tokenizer. |
| use_split_rules | bool | True | Whether to apply additional splitting heuristics for the sentence splitter. |
| extend_abbreviations | bool | True | Whether to extend the sentence splitter with curated abbreviations for certain languages. |
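
When split_by="function", the splitting_function receives the document text and returns the list of chunks. As a hypothetical illustration (this exact function is not part of Haystack), a custom splitter could break a document on Markdown headings:

```python
import re

def split_on_headings(text: str) -> list[str]:
    """Hypothetical custom splitting function: start a new chunk
    at every Markdown heading line (e.g. '# Title', '## Section')."""
    # Zero-width split: cut immediately before each line starting with '#'.
    parts = re.split(r"(?m)^(?=#+ )", text)
    return [p for p in parts if p.strip()]  # drop empty leading fragment

doc = "# Intro\nHello.\n# Usage\nRun it."
print(split_on_headings(doc))
# → ['# Intro\nHello.\n', '# Usage\nRun it.']
```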

Cleaner Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings like headers/footers across pages. |
| keep_id | bool | False | If True, keeps the original document IDs. |
| remove_substrings | Optional[List[str]] | None | A list of strings to remove from the document content. |
| remove_regex | Optional[str] | None | A regex pattern whose matches are removed from the document content. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. |
| ascii_only | bool | False | If True, converts text to ASCII only. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents to process. |