PreProcessor Parameters

Check the init and runtime parameters for PreProcessor.

YAML Init Parameters

These are the parameters you can specify in pipeline YAML:

Each entry lists the parameter name, its type, its possible values or default, and a description.
clean_header_footer
Type: Boolean
Possible values: True, False (default)
Uses heuristics to remove headers and footers that repeat across pages by searching for the longest common string.
The heuristic relies on exact matches, so it works well for footers like "Copyright 2019 by XXX" but doesn't detect "Page 3 of 4" or similar.
Mandatory.
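To make the exact-match limitation concrete, here is a minimal sketch of a longest-common-string search over page endings. It is not PreProcessor's implementation: the helper names, the 40-character window, and the sample pages are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from functools import reduce


def longest_common_substring(a, b):
    """Return the longest run of characters that appears in both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def common_footer(pages, n_chars=40):
    """Hypothetical helper: find text shared by the last n_chars of every page."""
    tails = [page[-n_chars:] for page in pages]
    # Folding pairwise keeps only text that is present in every tail.
    return reduce(longest_common_substring, tails)


# Made-up pages: one set with a static footer, one with a numbered footer.
static_pages = [
    "alpha\nCopyright 2019 by XXX",
    "bravo\nCopyright 2019 by XXX",
    "delta\nCopyright 2019 by XXX",
]
numbered_pages = [
    "alpha\nPage 1 of 4",
    "bravo\nPage 2 of 4",
    "delta\nPage 3 of 4",
]

static_footer = common_footer(static_pages)
numbered_footer = common_footer(numbered_pages)
print(repr(static_footer))
print(repr(numbered_footer))
```

In this sketch the static footer is recovered whole, while the numbered footers share only a fragment because the page number differs on every page.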
clean_whitespace
Type: Boolean
Possible values: True (default), False
Strips whitespace at the beginning and end of each line in the text.
Mandatory.
clean_empty_lines
Type: Boolean
Possible values: True (default), False
Removes more than two consecutive empty lines in the text.
Mandatory.
remove_substrings
Type: List of strings (PreProcessor) or regular expression (RegexPreprocessor)
Removes the specified substrings from the text. If you don't provide a value, an empty list is used.
With RegexPreprocessor, you can use a regular expression to match the substrings to remove.
Optional.
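The difference between the two modes can be sketched in plain Python. This is an illustration only: the two helper functions and the sample text are hypothetical, and the sketch does not reproduce how PreProcessor or RegexPreprocessor are implemented.

```python
import re


def remove_literal_substrings(text, substrings):
    """Remove every exact occurrence of each substring (list-of-strings mode)."""
    for substring in substrings:
        text = text.replace(substring, "")
    return text


def remove_by_regex(text, pattern):
    """Remove everything that matches the regular expression (regex mode)."""
    return re.sub(pattern, "", text)


text = "DRAFT Quarterly report. Page 3 of 4. DRAFT"

# Exact strings only: every literal "DRAFT" is removed.
without_draft = remove_literal_substrings(text, ["DRAFT"])

# A pattern also catches variable text such as page counters.
without_page_counter = remove_by_regex(text, r"Page \d+ of \d+\.?")

print(without_draft)
print(without_page_counter)
```

A regular expression is the better fit when the text to remove varies from page to page, as page counters do.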
split_by
Type: Literal
Possible values: word (default), sentence, passage, page, None
Specifies the unit for splitting the document. Setting it to None disables splitting.
Optional.
split_length
Type: Integer
Default: 200
Specifies the maximum number of split units allowed in one document. For example, if you set split_by to word and split_length to 150, each resulting Document has no more than 150 words.
The value must be a positive integer. Beyond that, there are no minimum or maximum limits.
Mandatory.
split_overlap
Type: Integer
Default: 0
Specifies the word overlap between two adjacent documents after a split. Setting it to a positive number enables the sliding window approach. For example, if you set split_by to word, split_length to 5, and split_overlap to 2, the resulting documents overlap like this:
Document1: word1, word2, word3, word4, word5
Document2: word4, word5, word6, word7, word8
To ensure the documents don't overlap after splitting, set the value to 0.
Mandatory.
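The sliding window can be sketched in a few lines of plain Python. This is a minimal illustration of the behavior described above, not the node's actual code; the function name split_words is hypothetical.

```python
def split_words(words, split_length, split_overlap=0):
    """Split a list of words into windows of at most split_length words.

    Consecutive windows share split_overlap words.
    """
    if split_length <= 0:
        raise ValueError("split_length must be a positive integer")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")
    step = split_length - split_overlap  # how far the window moves each time
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + split_length])
        if start + split_length >= len(words):
            break  # this window already reaches the end of the text
    return windows


words = [f"word{i}" for i in range(1, 9)]  # word1 .. word8
documents = split_words(words, split_length=5, split_overlap=2)
for number, document in enumerate(documents, start=1):
    print(f"Document{number}: {', '.join(document)}")
```

With split_length set to 5 and split_overlap set to 2, the window advances three words at a time, which reproduces the Document1 and Document2 example. With split_overlap set to 0, the window advances by the full split_length and the documents share no words.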
split_respect_sentence_boundary
Type: Boolean
Possible values: True (default), False
Specifies whether to preserve complete sentences when splitting Documents with split_by set to word.
If set to True, each split contains only complete sentences, and its number of words is less than or equal to split_length.
Mandatory.
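One way to picture this setting is as greedy packing of whole sentences under a word budget. The sketch below makes several assumptions: it uses a naive regex sentence splitter rather than the NLTK tokenizer, the function name is hypothetical, and it keeps a sentence longer than split_length whole in its own split, which may differ from what the node does.

```python
import re


def split_respecting_sentences(text, split_length):
    """Pack whole sentences into splits of at most split_length words."""
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    splits, current, word_count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new split when adding this sentence would exceed the budget.
        if current and word_count + n_words > split_length:
            splits.append(" ".join(current))
            current, word_count = [], 0
        current.append(sentence)
        word_count += n_words
    if current:
        splits.append(" ".join(current))
    return splits


text = "The cat sat. It was warm. Then the rain came down hard. Everyone left."
splits = split_respecting_sentences(text, split_length=7)
for split in splits:
    print(len(split.split()), "words:", split)
```

No sentence is cut in half, so a split can hold fewer than split_length words.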
tokenizer_model_folder
Type: String
Default: None
The path to the folder containing the NLTK PunktSentenceTokenizer models, if you're loading a model from a local path. Otherwise, leave it empty.
Optional.
language
Type: String
Possible values:
ru (Russian)
sl (Slovenian)
es (Spanish)
sv (Swedish)
tr (Turkish)
cs (Czech)
da (Danish)
nl (Dutch, Flemish)
en (English)
et (Estonian)
fi (Finnish)
fr (French)
de (German)
el (Greek)
it (Italian)
no (Norwegian)
pl (Polish)
pt (Portuguese)
ml (Malayalam)
Default: en
Specifies, as an ISO 639 code, the language that nltk.tokenize.sent_tokenize uses.
Mandatory.
id_hash_keys
Type: List of strings
Generates the document ID from a custom list of strings that refer to the document's attributes.
To avoid duplicate documents in your DocumentStore, list the document attributes you want to use to identify duplicates. For example, to check for duplicates based on document contents, set id_hash_keys to content. This setting works in combination with the duplicate_documents setting in DeepsetCloudDocumentStore.
For more information on handling duplicate documents, see Avoiding duplicate documents.
Optional.
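The effect of choosing different attributes can be sketched as follows. This is an assumption-laden illustration: the make_id helper is hypothetical, documents are modeled as plain dictionaries, and SHA-256 stands in for whatever hash function the node uses, which this page does not specify.

```python
import hashlib


def make_id(document, id_hash_keys):
    """Build a deterministic ID from the selected document attributes."""
    parts = [f"{key}={document[key]!r}" for key in id_hash_keys]
    # SHA-256 is an assumption made for this sketch only.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


doc_a = {"content": "Hello world", "meta": {"source": "a.txt"}}
doc_b = {"content": "Hello world", "meta": {"source": "b.txt"}}

# Hashing only the content: both documents get the same ID (duplicates).
same_by_content = make_id(doc_a, ["content"]) == make_id(doc_b, ["content"])

# Hashing content and meta: the IDs differ, so both documents are kept.
differ_by_meta = make_id(doc_a, ["content", "meta"]) != make_id(doc_b, ["content", "meta"])

print(same_by_content, differ_by_meta)
```

Fewer keys make duplicate detection stricter, because documents that differ only in the omitted attributes get the same ID.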
progress_bar
Type: Boolean
Possible values: True (default), False
Shows the progress bar.
Mandatory.
add_page_number
Type: Boolean
Possible values: True, False (default)
Adds the number of the page on which a paragraph occurs to the document's meta field called page. It determines the page boundaries using the \f character that PDFToTextConverter adds between pages.
Mandatory.
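Counting form feed characters is enough to track the current page. The sketch below illustrates the idea under stated assumptions: the helper name and the sample text are made up, splits are created on blank lines, and the node's own bookkeeping may differ in detail.

```python
def page_numbers_for_splits(splits):
    """Return the 1-based page number on which each split starts."""
    page = 1
    numbers = []
    for split in splits:
        numbers.append(page)
        # Each form feed inside the split means a page boundary was crossed.
        page += split.count("\f")
    return numbers


text = (
    "Intro paragraph.\n\n"
    "Second paragraph.\fThird paragraph, on page two.\n\n"
    "Fourth.\fFifth, on page three."
)
splits = text.split("\n\n")  # passage-like splits for the sake of the example
numbers = page_numbers_for_splits(splits)

for split, number in zip(splits, numbers):
    print({"content": split, "meta": {"page": number}})
```

The second split starts on page 1 but crosses into page 2, so the split after it is assigned page 2.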
max_chars_check
Type: Integer
Default: 10_000
The maximum number of characters a document can have. Each preprocessed document longer than this value raises a warning.
Mandatory.

REST API Runtime Parameters

There are no runtime parameters you can pass to this node when making a request to the Search REST API endpoint.