PreProcessor takes a single Document as input, cleans it up, and splits it into smaller documents to optimize it for search. Splitting is generally recommended for long Documents as it makes the retriever's job easier and speeds up question answering.

In a pipeline, you can use it to process the output of a TextConverter or PDFToTextConverter.

Basic Information

Pipeline type: Used in indexing pipelines
Nodes that can precede it in a pipeline: PDFToTextConverter, TextConverter, Retriever
Nodes that can follow it in a pipeline: Retriever (always used before EmbeddingRetriever, which creates embeddings and stores them in DocumentStore). PreProcessor can be the last node in an indexing pipeline.
Node input: Documents
Node output: Documents
Available node classes: PreProcessor, RegexPreprocessor (differs from PreProcessor in that its remove_substrings parameter also accepts regular expressions. For more information, see the Parameters section).

Usage Example

...
components:
  - name: Preprocessor
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 20 # Enables the sliding window approach
      language: en
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
  ...
pipelines:
  - name: indexing
    nodes:
      - name: TextFileConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextFileConverter]
  ...
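
To make the sliding window approach mentioned in the configuration comments more concrete, here is a minimal Python sketch of how `split_length` and `split_overlap` could interact when `split_by` is `word`. It is an illustration only, not PreProcessor's actual implementation: the function name `split_words` is hypothetical, and the sketch ignores `split_respect_sentence_boundary`.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words (sliding window).
    Hypothetical helper for illustration; not part of any library.
    """
    if split_length <= 0:
        raise ValueError("split_length must be a positive integer")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    # Each new chunk starts this many words after the previous one.
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"word{i}" for i in range(1, 9))  # word1 ... word8
chunks = split_words(sample, split_length=5, split_overlap=2)
for chunk in chunks:
    print(chunk)
```

With `split_length` set to 5 and `split_overlap` set to 2, the second chunk starts three words after the first, so the last two words of one chunk are repeated at the start of the next.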

Parameters

You can specify the following arguments for both PreProcessor and RegexPreprocessor in the pipeline YAML:

Parameter	Type	Possible Values	Description
`clean_header_footer`	Boolean	`True` `False` (default)	Uses heuristics to remove footers and headers across different pages by searching for the longest common string. This heuristic uses exact matches and works well for footers like Copyright 2019 by XXX, but doesn't detect Page 3 of 4 or similar. Mandatory.
`clean_whitespace`	Boolean	`True` (default) `False`	Strips whitespace at the beginning and end of each line in the text. Mandatory.
`clean_empty_lines`	Boolean	`True` (default) `False`	Removes more than two empty lines in text. Mandatory.
`remove_substrings`	List of strings (PreProcessor) or regular expressions (RegexPreprocessor)		Removes the specified substrings from the text. If you don't provide a value, it defaults to an empty list. With RegexPreprocessor, you can use regular expressions to match the substrings to remove. Optional.
`split_by`	Literal	`word` (default) `sentence` `passage` `page` `None`	Specifies the unit for splitting the document. If set to `None`, disables splitting. Optional.
`split_length`	Integer	Default: `200`	Specifies the maximum number of split units allowed in one document. For example, if you set `split_by` to `word` and `split_length` to `150`, each resulting Document has no more than 150 words. The value must be a positive integer; there are no other constraints on its minimum or maximum. Mandatory.
`split_overlap`	Integer	Default: `0`	Specifies the word overlap between two adjacent documents after a split. If you set it to a positive number, it enables the sliding window approach. For example, if you set `split_by` to `word`, `split_length` to `5`, and `split_overlap` to `2`, the resulting documents overlap like this: Document1: word1, word2, word3, word4, word5; Document2: word4, word5, word6, word7, word8. To ensure there's no overlap among the documents after splitting, set the value to `0`. Mandatory.
`split_respect_sentence_boundary`	Boolean	`True` (default) `False`	Specifies whether to preserve complete sentences when splitting Documents if `split_by` is set to `word`. If set to `True`, the individual split always has complete sentences, and the number of words is less than or equal to `split_length`. Mandatory.
`tokenizer_model_folder`	String	Default: `None`	The path to the folder containing the NLTK PunktSentenceTokenizer models, if you're loading a model from a local path. Leave empty otherwise. Optional.
`language`	String	`ru` (Russian) `sl` (Slovenian) `es` (Spanish) `sv` (Swedish) `tr` (Turkish) `cs` (Czech) `da` (Danish) `nl` (Flemish) `en` (English) `et` (Estonian) `fi` (Finnish) `fr` (French) `de` (German) `el` (Greek) `it` (Italian) `no` (Norwegian) `pl` (Polish) `pt` (Portuguese) `ml` (Malayalam) Default: `en`	Specifies the language used by `nltk.tokenize.sent_tokenize` in the ISO 639 format. Mandatory.
`id_hash_keys`	A list of strings		Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure you don't have duplicate documents in your DocumentStore, list the document attributes you want to use to identify duplicates. For example, to check for duplicates based on document contents, set `id_hash_keys` to `content`. This setting works in combination with the `duplicate_documents` setting in DeepsetCloudDocumentStore. For more information on handling duplicate documents, see Avoiding duplicate documents. Optional.
`progress_bar`	Boolean	`True` (default) `False`	Shows the progress bar. Mandatory.
`add_page_number`	Boolean	`True` `False` (default)	Adds the number of the page on which a paragraph occurs to the document's `page` meta field. It determines the page boundaries using the `\f` character that PDFToTextConverter adds between pages. Mandatory.
`max_chars_check`	Integer	Default: `10_000`	The maximum number of characters a document can have. Each preprocessed document that is longer than the value specified here raises a warning. Mandatory.
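
As a rough illustration of the cleaning parameters and of how `add_page_number` can rely on the `\f` character, the following Python sketch mimics the described behavior with the standard library only. It is not the node's implementation. The names `clean_text`, `remove_regex`, and `page_number_at` are hypothetical, and treating `clean_empty_lines` as collapsing three or more consecutive newlines into two is an assumption.

```python
import re
from typing import List, Optional


def clean_text(
    text: str,
    clean_whitespace: bool = True,
    clean_empty_lines: bool = True,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,  # hypothetical stand-in for the regex variant
) -> str:
    """Apply cleaning steps similar to those described in the table."""
    if clean_whitespace:
        # Strip each line page by page so the "\f" page breaks are kept
        # (str.strip() would otherwise treat "\f" as whitespace).
        pages = text.split("\f")
        pages = ["\n".join(line.strip() for line in page.split("\n")) for page in pages]
        text = "\f".join(pages)
    if clean_empty_lines:
        # Assumption: collapse runs of three or more newlines into two.
        text = re.sub(r"\n{3,}", "\n\n", text)
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    return text


def page_number_at(text: str, char_offset: int) -> int:
    """Return the 1-based page of the character at char_offset, using "\f" as page break."""
    return text.count("\f", 0, char_offset) + 1


raw = "  Intro  \n\n\n\nBody text.\nPage 1 of 2\fMore text.\nPage 2 of 2"
cleaned = clean_text(raw, remove_regex=r"\n?Page \d+ of \d+")
print(repr(cleaned))
print(page_number_at(cleaned, cleaned.index("More")))
```

The regular expression here removes footers such as Page 1 of 2, which the exact-match heuristic behind `clean_header_footer` would not detect.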