DocumentPreprocessor
A SuperComponent that splits and cleans documents in a single step. It combines DocumentSplitter and DocumentCleaner into one component for use in indexing pipelines.
Key Features
- Splits documents into smaller chunks using configurable units such as words, sentences, pages, or paragraphs.
- Cleans documents by removing empty lines, extra whitespace, and repeated substrings such as headers and footers.
- Supports overlapping splits to preserve context across chunks.
- Respects sentence boundaries when splitting by word.
- Supports custom splitting functions.
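The overlap behavior listed above can be illustrated with a small standalone sketch. It is not the component's implementation: `split_words` is a hypothetical helper that only assumes chunks are counted in whitespace-separated words and that each new chunk starts `split_length - split_overlap` words after the previous one.

```python
def split_words(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most split_length words that share split_overlap words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Hypothetical input: 7 words, chunks of 4 words with 1 word of overlap.
chunks = split_words("one two three four five six seven", split_length=4, split_overlap=1)
# chunks == ["one two three four", "four five six seven"]
```

The word "four" appears in both chunks, which is the context a retriever would otherwise lose at the chunk boundary.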
Configuration
- Drag the DocumentPreprocessor component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Choose a Split By unit: word, sentence, passage, page, line, or function.
  - Set Split Length to the maximum number of units in each chunk.
  - Set Split Overlap to the number of overlapping units between consecutive chunks.
  - Toggle Remove Empty Lines to remove empty lines from the document.
  - Toggle Remove Extra Whitespaces to remove extra whitespaces.
  - Set Split Threshold to define the minimum number of units per chunk. Chunks smaller than this threshold are merged with the previous chunk.
  - Toggle Respect Sentence Boundary to avoid splitting in the middle of a sentence when splitting by word.
  - Set Language for the NLTK sentence tokenizer (default: en).
  - Toggle Remove Repeated Substrings to remove repeated headers and footers across pages.
  - Toggle Keep ID to retain the original document IDs.
  - Set Remove Substrings to specify a list of strings to remove.
  - Set Remove Regex to specify a regular expression pattern whose matches are removed.
  - Set Unicode Normalization to apply Unicode normalization.
  - Toggle ASCII Only to convert text to ASCII only.
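The Split Threshold setting is easiest to see on a tiny input. The following is a minimal sketch, not the component's code: `merge_small_chunks` is a hypothetical helper that assumes chunks are lists of units and that an undersized chunk is appended to the chunk before it.

```python
def merge_small_chunks(chunks: list[list[str]], split_threshold: int) -> list[list[str]]:
    """Merge every chunk shorter than split_threshold into the chunk before it."""
    merged: list[list[str]] = []
    for chunk in chunks:
        if merged and len(chunk) < split_threshold:
            merged[-1] = merged[-1] + chunk  # too small: attach to the previous chunk
        else:
            merged.append(list(chunk))
    return merged


# Hypothetical input: 7 units split into chunks of 3 leave a final chunk of 1 unit.
units = "a b c d e f g".split()
raw_chunks = [units[i:i + 3] for i in range(0, len(units), 3)]
result = merge_small_chunks(raw_chunks, split_threshold=2)
# result == [["a", "b", "c"], ["d", "e", "f", "g"]]
```

In this sketch the merged chunk ends up with four units, so a threshold above 0 can produce chunks longer than Split Length.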
Connections
DocumentPreprocessor accepts a list of Document objects and outputs processed Document objects.
It's typically used in indexes between a document converter and a document embedder, or before DocumentWriter.
Source Code
To check this component's source code, open document_preprocessor.py in the Haystack repository.
Usage Examples
Basic Configuration
DocumentPreprocessor:
  type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
  init_parameters:
    split_by: word
    split_length: 200
    split_overlap: 30
    respect_sentence_boundary: true
    language: en
    remove_empty_lines: true
    remove_extra_whitespaces: true
    remove_repeated_substrings: false
In an Index
This index pipeline uses DocumentPreprocessor to split and clean documents before embedding and writing them to a Document Store:
# haystack-pipeline
components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
  DocumentPreprocessor:
    type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: OVERWRITE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
          create_index: true
connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentPreprocessor.documents
  - sender: DocumentPreprocessor.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
inputs:
  files:
    - MultiFileConverter.sources
In this example:
- MultiFileConverter converts uploaded files into documents.
- DocumentPreprocessor splits documents into chunks of 200 words with a 30-word overlap, respecting sentence boundaries. It also removes empty lines and extra whitespace.
- SentenceTransformersDocumentEmbedder generates embeddings for the processed documents.
- DocumentWriter writes the embedded documents to OpenSearch.
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Documents to process. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Processed list of documents. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
Splitter Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit of splitting. |
| split_length | int | 250 | The maximum number of units (words, lines, pages, etc.) in each split. |
| split_overlap | int | 0 | The number of overlapping units between consecutive splits. |
| split_threshold | int | 0 | The minimum number of units per split. If a split is smaller, it's merged with the previous split. |
| splitting_function | Optional[Callable] | None | A custom function for splitting if split_by="function". |
| respect_sentence_boundary | bool | False | If True, splits by words but tries not to break inside a sentence. |
| language | str | en | Language used by the sentence tokenizer. |
| use_split_rules | bool | True | Whether to apply additional splitting heuristics for the sentence splitter. |
| extend_abbreviations | bool | True | Whether to extend the sentence splitter with curated abbreviations for certain languages. |
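As an illustration of what a value for splitting_function could look like, here is a hedged sketch. It assumes the callable receives the document text as a single string and returns a list of strings, one per chunk; check the Haystack source before relying on that signature. The function name `split_on_headings` and the heading convention (lines starting with `# `) are hypothetical.

```python
def split_on_headings(text: str) -> list[str]:
    """Start a new chunk at every line that begins with '# '."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines(keepends=True):
        if line.startswith("# ") and current:
            chunks.append("".join(current))  # close the chunk collected so far
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))  # keep the trailing chunk
    return chunks


# Hypothetical input with an intro followed by two headed sections.
sections = split_on_headings("Intro\n# A\nx\n# B\ny")
# sections == ["Intro\n", "# A\nx\n", "# B\ny"]
```

A function like this would be used together with split_by="function"; the other split units are ignored in that case only if the component behaves as the parameter description suggests.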
Cleaner Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings like headers/footers across pages. |
| keep_id | bool | False | If True, keeps the original document IDs. |
| remove_substrings | Optional[List[str]] | None | A list of strings to remove from the document content. |
| remove_regex | Optional[str] | None | A regex pattern whose matches are removed from the document content. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. |
| ascii_only | bool | False | If True, converts text to ASCII only. |
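To make the cleaner options concrete, the sketch below applies most of them to a plain string using only the Python standard library. It is an approximation for illustration, not the component's code: `clean_text` is a hypothetical helper, the order of the steps is a choice made here, and the exact whitespace handling in Haystack may differ. remove_repeated_substrings and keep_id are left out.

```python
import re
import unicodedata
from typing import Optional


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_substrings: Optional[list[str]] = None,
    remove_regex: Optional[str] = None,
    unicode_normalization: Optional[str] = None,
    ascii_only: bool = False,
) -> str:
    if unicode_normalization:
        text = unicodedata.normalize(unicode_normalization, text)
    if ascii_only:
        # Decompose accented characters, then drop everything outside ASCII.
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    if remove_extra_whitespaces:
        text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
        text = "\n".join(line.strip() for line in text.split("\n"))
    if remove_empty_lines:
        text = "\n".join(line for line in text.split("\n") if line.strip())
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    return text


# Hypothetical page text with an accent, extra spaces, blank lines, and a page marker.
cleaned = clean_text(
    "Caf\u00e9   menu\n\n\n  Page 1  \nSoup",
    remove_regex=r"Page \d+\n",
    ascii_only=True,
)
# cleaned == "Cafe menu\nSoup"
```

Because the regex runs after whitespace cleanup in this sketch, the pattern includes the trailing newline so that no empty line is left behind.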
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Documents to process. |