DocumentPreprocessor
A SuperComponent that first splits and then cleans documents.
Basic Information
- Type: `haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor`
- Components it can connect with:
  - Any component that produces documents. It's usually used in indexes to process documents before writing them to a Document Store.
  - Any component that consumes documents. It's usually used in indexes before `DocumentWriter`.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Processed list of documents. |
Overview
DocumentPreprocessor is a SuperComponent that combines DocumentSplitter and DocumentCleaner into a single component. It preprocesses documents by first splitting them into smaller chunks and then cleaning them up.
It's used in indexes to process documents before writing them to a Document Store.
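Outside of Pipeline Builder, you can also run the component on its own in Python. The snippet below is a minimal sketch assuming the standard Haystack import path listed above; the sample text is a placeholder.

```python
from haystack import Document
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor

# Split into 200-word chunks with a 30-word overlap, then clean the splits.
preprocessor = DocumentPreprocessor(
    split_by="word",
    split_length=200,
    split_overlap=30,
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
)

result = preprocessor.run(documents=[Document(content="A long text to split and clean...")])
print(result["documents"])
```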
Usage Example
This index pipeline uses DocumentPreprocessor to split and clean documents before embedding and writing them to a Document Store:
```yaml
components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
  DocumentPreprocessor:
    type: haystack.components.preprocessors.document_preprocessor.DocumentPreprocessor
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
  SentenceTransformersDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: OVERWRITE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
          create_index: true

connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentPreprocessor.documents
  - sender: DocumentPreprocessor.documents
    receiver: SentenceTransformersDocumentEmbedder.documents
  - sender: SentenceTransformersDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

inputs:
  files:
    - MultiFileConverter.sources
```
In this example:
- `MultiFileConverter` converts uploaded files into documents.
- `DocumentPreprocessor` splits documents into chunks of 200 words with a 30-word overlap, respecting sentence boundaries. It also removes empty lines and extra whitespace.
- `SentenceTransformersDocumentEmbedder` generates embeddings for the processed documents.
- `DocumentWriter` writes the embedded documents to OpenSearch.
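For comparison, a rough Python equivalent of this index is sketched below. To keep the snippet self-contained it swaps the OpenSearch Document Store for Haystack's in-memory store and uses a placeholder file path; the YAML above remains the canonical configuration.

```python
from haystack import Pipeline
from haystack.components.converters.multi_file_converter import MultiFileConverter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

# In-memory store stands in for OpenSearch so the sketch runs without an external service.
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component(
    "preprocessor",
    DocumentPreprocessor(split_by="word", split_length=200, split_overlap=30, respect_sentence_boundary=True),
)
pipeline.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model="intfloat/e5-base-v2", normalize_embeddings=True),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

pipeline.connect("converter.documents", "preprocessor.documents")
pipeline.connect("preprocessor.documents", "embedder.documents")
pipeline.connect("embedder.documents", "writer.documents")

# "my_file.txt" is a placeholder path.
pipeline.run({"converter": {"sources": ["my_file.txt"]}})
```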
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
Splitter Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit of splitting. |
| split_length | int | 250 | The maximum number of units (words, lines, pages, etc.) in each split. |
| split_overlap | int | 0 | The number of overlapping units between consecutive splits. |
| split_threshold | int | 0 | The minimum number of units per split. If a split is smaller, it's merged with the previous split. |
| splitting_function | Optional[Callable] | None | A custom function for splitting if split_by="function". |
| respect_sentence_boundary | bool | False | If True, splits by words but tries not to break inside a sentence. |
| language | str | en | Language used by the sentence tokenizer. |
| use_split_rules | bool | True | Whether to apply additional splitting heuristics for the sentence splitter. |
| extend_abbreviations | bool | True | Whether to extend the sentence splitter with curated abbreviations for certain languages. |
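The splitting_function parameter only takes effect when split_by="function". As an illustrative sketch, the helper below is a hypothetical splitter that breaks text on blank lines; it is not shipped with the component.

```python
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor

# Hypothetical custom splitter: takes the document text and returns a list of splits.
def split_on_blank_lines(text: str) -> list[str]:
    return [chunk for chunk in text.split("\n\n") if chunk.strip()]

preprocessor = DocumentPreprocessor(
    split_by="function",
    splitting_function=split_on_blank_lines,
)
```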
Cleaner Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| remove_empty_lines | bool | True | If True, removes empty lines. |
| remove_extra_whitespaces | bool | True | If True, removes extra whitespaces. |
| remove_repeated_substrings | bool | False | If True, removes repeated substrings like headers/footers across pages. |
| keep_id | bool | False | If True, keeps the original document IDs. |
| remove_substrings | Optional[List[str]] | None | A list of strings to remove from the document content. |
| remove_regex | Optional[str] | None | A regex pattern whose matches are removed from the document content. |
| unicode_normalization | Optional[Literal['NFC', 'NFKC', 'NFD', 'NFKD']] | None | Unicode normalization form to apply to the text. |
| ascii_only | bool | False | If True, converts text to ASCII only. |
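For illustration, the sketch below combines several cleaning options; the literal string and regex are made-up examples, not defaults.

```python
from haystack.components.preprocessors.document_preprocessor import DocumentPreprocessor

preprocessor = DocumentPreprocessor(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=True,      # strip headers/footers repeated across pages
    remove_substrings=["CONFIDENTIAL"],   # example literal string to remove
    remove_regex=r"\[\d+\]",              # example pattern: drop footnote markers like [12]
    unicode_normalization="NFKC",
    ascii_only=False,
)
```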
Run Method Parameters
These are the parameters you can configure for the component's run() method.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to process. |