DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

Basic Information

  • Type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
  • Components it can connect with:
    • Receives documents from any component that outputs them, such as a converter.
    • Sends documents to any component that accepts them, such as a document embedder. In indexes, it usually comes before DocumentWriter.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents to process. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Processed list of documents. |

Overview

DocumentPreprocessor is a SuperComponent that combines DocumentSplitter and DocumentCleaner into a single component. It preprocesses documents by first splitting them into smaller chunks and then cleaning them up.

It's used in indexes to process documents before writing them to a Document Store.

Usage Example

This index pipeline uses DocumentPreprocessor to split and clean documents before embedding and writing them to a Document Store:

components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8

  DocumentPreprocessor:
    type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false

  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: OVERWRITE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
          create_index: true

connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentPreprocessor.documents
  - sender: DocumentPreprocessor.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

inputs:
  files:
    - MultiFileConverter.sources

In this example:

  1. MultiFileConverter converts uploaded files into documents.
  2. DocumentPreprocessor splits documents into chunks of 200 words with 30-word overlap, respecting sentence boundaries. It also removes empty lines and extra whitespace.
  3. SentenceTransformersDocumentEmbedder generates embeddings for the processed documents.
  4. DocumentWriter writes the embedded documents to OpenSearch.
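
The chunking arithmetic in step 2 can be sketched in plain Python. This is a simplified illustration, not the component's implementation: it ignores respect_sentence_boundary, and the helper name split_words is hypothetical. The point is that with split_length 200 and split_overlap 30, each chunk starts 170 words after the previous one.

```python
from typing import List


def split_words(text: str, split_length: int = 200, split_overlap: int = 30) -> List[str]:
    """Split text into word chunks that share `split_overlap` words (hypothetical helper)."""
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks
```

Under these assumptions, a 500-word document produces three chunks of 200, 200, and 160 words, and the last 30 words of each chunk reappear at the start of the next one.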

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Splitter Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit of splitting. |
| split_length | int | 250 | The maximum number of units (words, lines, pages, etc.) in each split. |
| split_overlap | int | 0 | The number of overlapping units between consecutive splits. |
| split_threshold | int | 0 | The minimum number of units per split. If a split is smaller, it's merged with the previous split. |
| splitting_function | Optional[Callable] | None | A custom function for splitting if split_by="function". |
| respect_sentence_boundary | bool | False | If True, splits by words but tries not to break inside a sentence. |
| language | str | en | Language used by the sentence tokenizer. |
| use_split_rules | bool | True | Whether to apply additional splitting heuristics for the sentence splitter. |
| extend_abbreviations | bool | True | Whether to extend the sentence splitter with curated abbreviations for certain languages. |
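
The effect of split_threshold is easier to see in a small sketch. The function below is a hypothetical, simplified model (no overlap, words already tokenized) and is not taken from the component's source. It only shows the documented rule: a split shorter than the threshold is merged into the previous split.

```python
from typing import List


def split_with_threshold(
    words: List[str], split_length: int, split_threshold: int = 0
) -> List[List[str]]:
    """Group words into splits, merging any too-short split into the previous one."""
    chunks: List[List[str]] = []
    for start in range(0, len(words), split_length):
        chunk = words[start:start + split_length]
        if chunks and len(chunk) < split_threshold:
            # Too short to stand alone: attach it to the previous split.
            chunks[-1].extend(chunk)
        else:
            chunks.append(chunk)
    return chunks
```

For example, 23 words with split_length=10 give splits of 10, 10, and 3 words when split_threshold is 0, but splits of 10 and 13 words when split_threshold is 5.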

Cleaner Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings like headers/footers across pages. |
| keep_id | bool | False | If True, keeps the original document IDs. |
| remove_substrings | Optional[List[str]] | None | A list of strings to remove from the document content. |
| remove_regex | Optional[str] | None | A regex pattern whose matches are removed from the document content. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. |
| ascii_only | bool | False | If True, converts text to ASCII only. |
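
To make the cleaner options concrete, here is a rough standard-library approximation of what several of them do to a single string. The function clean_text is hypothetical, and the order of operations and the exact whitespace rules are assumptions for illustration. The real component may differ in details, and remove_repeated_substrings and keep_id are not modeled here.

```python
import re
import unicodedata
from typing import List, Optional


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
    unicode_normalization: Optional[str] = None,
    ascii_only: bool = False,
) -> str:
    """Approximate the cleaner options on one string (illustrative only)."""
    if unicode_normalization:
        text = unicodedata.normalize(unicode_normalization, text)
    if ascii_only:
        # Decompose accented characters, then drop anything outside ASCII.
        text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

    lines = text.split("\n")
    if remove_extra_whitespaces:
        # Collapse runs of spaces and tabs, and trim each line.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    if remove_empty_lines:
        lines = [line for line in lines if line.strip()]
    text = "\n".join(lines)

    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    return text
```

With the defaults, a string such as "Title\n\n\nSome   text\twith  gaps.  \n\n" comes out as two lines, "Title" and "Some text with gaps.".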

Run Method Parameters

These are the parameters you can configure for the component's run() method.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | Documents to process. |