DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

Basic Information

Type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
Components it can connect with:
- Any component that outputs documents, such as a converter. It's usually used in indexes to process documents before they're written to a Document Store.
- Any component that accepts documents, such as a document embedder or DocumentWriter.

Inputs

Parameter	Type	Default	Description
documents	List[Document]		Documents to process.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Processed list of documents.

Overview

DocumentPreprocessor is a SuperComponent that combines DocumentSplitter and DocumentCleaner into a single component. It preprocesses documents by first splitting them into smaller chunks and then cleaning them up.

It's used in indexes to process documents before writing them to a Document Store.

Usage Example

This index pipeline uses DocumentPreprocessor to split and clean documents before embedding and writing them to a Document Store:

components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8

  DocumentPreprocessor:
    type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false

  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: OVERWRITE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
          create_index: true

connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentPreprocessor.documents
  - sender: DocumentPreprocessor.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

inputs:
  files:
    - MultiFileConverter.sources

In this example:

- MultiFileConverter converts uploaded files into documents.
- DocumentPreprocessor splits documents into chunks of 200 words with a 30-word overlap, respecting sentence boundaries. It also removes empty lines and extra whitespace.
- SentenceTransformersDocumentEmbedder generates embeddings for the processed documents.
- DocumentWriter writes the embedded documents to OpenSearch.
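
To make the split settings above more concrete, here is a minimal sketch in plain Python of how a 200-word window with a 30-word overlap moves through a text. The `split_words` helper is hypothetical and isn't part of Haystack. It also ignores sentence boundaries, so the chunks that `respect_sentence_boundary: true` produces will generally differ. It only illustrates the arithmetic: each new chunk starts `split_length - split_overlap` words after the previous one.

```python
def split_words(text: str, split_length: int = 200, split_overlap: int = 30) -> list[str]:
    """Split text into word windows that share split_overlap words with the previous window.

    Illustrative only: this is not Haystack's implementation.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be at least 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # each new chunk starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 400-word text: w0 w1 ... w399
text = " ".join(f"w{i}" for i in range(400))
chunks = split_words(text, split_length=200, split_overlap=30)

for i, chunk in enumerate(chunks):
    chunk_words = chunk.split()
    print(i, chunk_words[0], chunk_words[-1], len(chunk_words))
```

With these settings the window advances 170 words at a time, so the last 30 words of one chunk are repeated at the start of the next.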

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Splitter Parameters:

Parameter	Type	Default	Description
split_by	Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence']	word	The unit of splitting.
split_length	int	250	The maximum number of units (words, lines, pages, etc.) in each split.
split_overlap	int	0	The number of overlapping units between consecutive splits.
split_threshold	int	0	The minimum number of units per split. If a split is smaller, it's merged with the previous split.
splitting_function	Optional[Callable]	None	A custom function for splitting if `split_by="function"`.
respect_sentence_boundary	bool	False	If True, splits by words but tries not to break inside a sentence.
language	str	en	Language used by the sentence tokenizer.
use_split_rules	bool	True	Whether to apply additional splitting heuristics for the sentence splitter.
extend_abbreviations	bool	True	Whether to extend the sentence splitter with curated abbreviations for certain languages.
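
As a rough illustration of the `split_threshold` rule described in the table, the following plain-Python sketch merges any split that has fewer units than the threshold into the split before it. The `merge_short_splits` helper is hypothetical and isn't part of Haystack. It assumes the splits don't overlap and only shows the merge rule.

```python
def merge_short_splits(splits: list[list[str]], split_threshold: int) -> list[list[str]]:
    """Attach splits with fewer than split_threshold units to the previous split.

    Illustrative only: this is not Haystack's implementation.
    """
    merged: list[list[str]] = []
    for units in splits:
        if merged and len(units) < split_threshold:
            # Too short to stand alone: append its units to the previous split.
            merged[-1] = merged[-1] + list(units)
        else:
            merged.append(list(units))
    return merged


splits = [["a", "b", "c"], ["d", "e", "f"], ["g"]]

# With a threshold of 2, the trailing one-unit split joins the previous split.
merged = merge_short_splits(splits, split_threshold=2)
print(merged)

# With the default threshold of 0, nothing is merged.
unchanged = merge_short_splits(splits, split_threshold=0)
print(unchanged)
```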

Cleaner Parameters:

Parameter	Type	Default	Description
remove_empty_lines	bool	True	If True, removes empty lines.
remove_extra_whitespaces	bool	True	If True, removes extra whitespaces.
remove_repeated_substrings	bool	False	If True, removes repeated substrings like headers/footers across pages.
keep_id	bool	False	If True, keeps the original document IDs.
remove_substrings	Optional[List[str]]	None	A list of strings to remove from the document content.
remove_regex	Optional[str]	None	A regex pattern whose matches are removed from the document content.
unicode_normalization	Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']]	None	Unicode normalization form to apply to the text.
ascii_only	bool	False	If True, converts text to ASCII only.
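
To give a feel for what some of the cleaner options do, here is a simplified sketch using only the Python standard library. The `clean_text` function is hypothetical and isn't Haystack's implementation. It covers only `unicode_normalization`, `remove_empty_lines`, and `remove_extra_whitespaces`, and the order in which it applies them is an assumption made for this example.

```python
import re
import unicodedata
from typing import Optional


def clean_text(
    text: str,
    remove_empty_lines: bool = True,
    remove_extra_whitespaces: bool = True,
    unicode_normalization: Optional[str] = None,
) -> str:
    """Apply a simplified version of three cleaner options to a string.

    Illustrative only: this is not Haystack's implementation.
    """
    if unicode_normalization is not None:
        # Accepts "NFC", "NFKC", "NFD", or "NFKD".
        text = unicodedata.normalize(unicode_normalization, text)

    lines = text.split("\n")
    if remove_empty_lines:
        # Drop lines that are empty or contain only whitespace.
        lines = [line for line in lines if line.strip()]
    if remove_extra_whitespaces:
        # Collapse runs of spaces and tabs, then trim each line.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    return "\n".join(lines)


# "\ufb01" is the single-character "fi" ligature; NFKC expands it to "fi".
raw = "Title\n\n\nSome   text \t here \n   \n\ufb01nal"
cleaned = clean_text(raw, unicode_normalization="NFKC")
print(repr(cleaned))
```

In this sketch the blank lines are dropped, repeated spaces and tabs collapse to a single space, and the ligature is replaced by its two-letter form.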

Run Method Parameters

These are the parameters you can configure for the component's run() method.

Parameter	Type	Default	Description
documents	List[Document]		Documents to process.
