DocumentPreprocessor
A SuperComponent that combines DocumentSplitter and DocumentCleaner to split and clean documents in a single step.
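Outside Pipeline Builder, the same behavior is available as a single Haystack Python component. Below is a minimal sketch, assuming the module path shown in the YAML example further down and that run() returns its results under a documents key:

```python
from haystack import Document
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor

# Split into 200-word chunks with a 30-word overlap, then clean up whitespace.
preprocessor = DocumentPreprocessor(
    split_by="word",
    split_length=200,
    split_overlap=30,
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
)

docs = [Document(content="A long report   with  stray whitespace...\n\n\n...and empty lines.")]
result = preprocessor.run(documents=docs)

# Each chunk comes back as its own cleaned Document.
print(len(result["documents"]), result["documents"][0].content[:60])
```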
Key Features
- Combines document splitting and cleaning into one component, reducing pipeline complexity.
- Splits documents by word, sentence, page, passage, line, or a custom function.
- Configurable chunk size, overlap, and minimum chunk threshold.
- Removes empty lines, extra whitespaces, and repeated substrings such as headers and footers.
- Supports Unicode normalization and ASCII-only conversion.
- Respects sentence boundaries when splitting by word.
Configuration
- Drag the DocumentPreprocessor component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the parameters as needed.
Connections
DocumentPreprocessor accepts a list of documents (documents) as input and outputs a list of processed documents (documents).
Connect any component that produces documents to the input — typically a converter in an indexing pipeline. Connect the output to document embedders or DocumentWriter.
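As a rough Python sketch of that wiring (an illustration, not a Pipeline Builder requirement), the snippet below uses Haystack's Pipeline.connect() API and substitutes a simple text converter and an in-memory store to stay self-contained:

```python
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

pipe = Pipeline()
pipe.add_component("converter", TextFileToDocument())
pipe.add_component("preprocessor", DocumentPreprocessor())
pipe.add_component("writer", DocumentWriter(document_store=InMemoryDocumentStore()))

# Any component that produces documents can feed the preprocessor;
# its output then goes to an embedder or, as here, straight to a writer.
pipe.connect("converter.documents", "preprocessor.documents")
pipe.connect("preprocessor.documents", "writer.documents")

pipe.run({"converter": {"sources": ["notes.txt"]}})
```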
Usage Example
This indexing pipeline uses DocumentPreprocessor to split and clean documents before embedding them and writing them to a Document Store:
components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
  DocumentPreprocessor:
    type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: OVERWRITE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
          create_index: true

connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentPreprocessor.documents
  - sender: DocumentPreprocessor.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

inputs:
  files:
    - MultiFileConverter.sources
In this example:
- MultiFileConverter converts uploaded files into documents.
- DocumentPreprocessor splits documents into chunks of 200 words with a 30-word overlap, respecting sentence boundaries. It also removes empty lines and extra whitespace.
- SentenceTransformersDocumentEmbedder generates embeddings for the processed documents.
- DocumentWriter writes the embedded documents to OpenSearch.
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Processed list of documents. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
Splitter Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit of splitting. |
| split_length | int | 250 | The maximum number of units (words, lines, pages, etc.) in each split. |
| split_overlap | int | 0 | The number of overlapping units between consecutive splits. |
| split_threshold | int | 0 | The minimum number of units per split. If a split is smaller, it's merged with the previous split. |
| splitting_function | Optional[Callable] | None | A custom function for splitting if split_by="function". |
| respect_sentence_boundary | bool | False | If True, splits by words but tries not to break inside a sentence. |
| language | str | en | Language used by the sentence tokenizer. |
| use_split_rules | bool | True | Whether to apply additional splitting heuristics for the sentence splitter. |
| extend_abbreviations | bool | True | Whether to extend the sentence splitter with curated abbreviations for certain languages. |
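For split_by="function", the wrapped splitter expects splitting_function to take a document's text and return the list of chunks (this signature is an assumption based on the splitter this SuperComponent wraps). A sketch with a hypothetical paragraph splitter:

```python
from haystack import Document
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor

def split_on_blank_lines(text: str) -> list[str]:
    # Hypothetical custom splitter: treat blank lines as chunk boundaries.
    return [part for part in text.split("\n\n") if part.strip()]

preprocessor = DocumentPreprocessor(
    split_by="function",
    splitting_function=split_on_blank_lines,
)

result = preprocessor.run(documents=[Document(content="Intro.\n\nBody.\n\nConclusion.")])
print([d.content for d in result["documents"]])
```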
Cleaner Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings like headers/footers across pages. |
| keep_id | bool | False | If True, keeps the original document IDs. |
| remove_substrings | Optional[List[str]] | None | A list of strings to remove from the document content. |
| remove_regex | Optional[str] | None | A regex pattern whose matches are removed from the document content. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. |
| ascii_only | bool | False | If True, converts text to ASCII only. |
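To illustrate the cleaner-side options, the sketch below removes a hypothetical footer string and page-number pattern and applies NFKC normalization; the strings and regex are placeholders, not values the component requires:

```python
from haystack import Document
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor

preprocessor = DocumentPreprocessor(
    split_by="sentence",
    split_length=5,
    remove_substrings=["ACME Corp. Confidential"],  # illustrative footer text
    remove_regex=r"Page \d+ of \d+",                # illustrative page-number pattern
    unicode_normalization="NFKC",
    ascii_only=False,
)

result = preprocessor.run(
    documents=[Document(content="Café menu. Page 1 of 3. ACME Corp. Confidential")]
)
print(result["documents"][0].content)
```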
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |