PreProcessor

Learn how to preprocess your files before running a search on them.

PreProcessor takes a single document as input, cleans it up, and splits it into smaller documents. In a pipeline, you can use it to process the output of a TextConverter or PDFToTextConverter.

Usage

To initialize PreProcessor, run:

# Import PreProcessor:
from haystack.nodes import PreProcessor

# Specify the processor and its arguments:
processor = PreProcessor(
  clean_empty_lines=True,
  clean_whitespace=True,
  language="en",  # Remember to set the right language
)
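
Once initialized, you can run the node on the documents a converter returns. This is a minimal sketch; sample.txt is a placeholder file name:

# Convert a text file into a Document (sample.txt is a placeholder):
from haystack.nodes import TextConverter

converter = TextConverter()
docs = converter.convert(file_path="sample.txt", meta=None)

# Clean the converted document and split it into smaller documents:
preprocessed_docs = processor.process(docs)
print(f"Created {len(preprocessed_docs)} documents")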

Or using YAML:

version: "1.10.0"

components:
 - name: TextFileConverter
   type: TextConverter
 - name: Preprocessor
   type: PreProcessor
   params:
      clean_empty_lines: True
      clean_whitespace: True
      language: en # Remember to set the right language

In a pipeline, you need to specify the input that PreProcessor takes:

pipelines:
 - name: indexing
   nodes:
      - name: TextFileConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextFileConverter]
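
To try the YAML definition above, you can load it with Pipeline.load_from_yaml and run it on a file. This is a sketch; preprocessing_pipeline.yaml and sample.txt are placeholder names, and because this pipeline has no DocumentStore, the run simply returns the cleaned and split documents:

from pathlib import Path
from haystack import Pipeline

# Load the indexing pipeline defined in the YAML above (file name is a placeholder):
indexing_pipeline = Pipeline.load_from_yaml(Path("preprocessing_pipeline.yaml"), pipeline_name="indexing")

# Run it on a file: TextFileConverter converts it, then Preprocessor cleans and splits it
result = indexing_pipeline.run(file_paths=["sample.txt"])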

You can specify the following arguments for PreProcessor:

| Argument | Type | Possible Values | Description |
| --- | --- | --- | --- |
| clean_header_footer | Boolean | True, False | Uses a heuristic to remove footers and headers that repeat across pages by searching for the longest common string. The heuristic uses exact matches, so it works well for footers like "Copyright 2019 by XXX" but doesn't detect "Page 3 of 4" or similar. |
| clean_whitespace | Boolean | True, False | Strips leading and trailing whitespace from each line in the text. |
| clean_empty_lines | Boolean | True, False | Normalizes runs of more than two consecutive empty lines in the text down to two. |
| remove_substrings | List of strings | | Removes the specified substrings from the text. |
| split_by | String | word, sentence, passage, None | Specifies the unit for splitting the document. If set to None, splitting is disabled. |
| split_length | Integer | | Specifies the maximum number of split units allowed in one document. |
| split_overlap | Integer | | Specifies the word overlap between two adjacent documents after a split. Setting it to a positive number enables the sliding window approach. To ensure there is no overlap among the documents after splitting, set the value to 0. |
| split_respect_sentence_boundary | Boolean | True, False | Specifies whether to avoid splitting inside sentences when split_by is set to word. If set to True, each split contains only complete sentences and the number of words is less than or equal to split_length. |
| language | String | Language code in ISO 639 format, for example: en, fr, es, de, ru, sl, sv, tr, cs, da, nl, et, fi, el, it, no, pl, pt | Specifies the language used by nltk.tokenize.sent_tokenize. |
| id_hash_keys | List of strings | | Generates the document ID from a custom list of strings that refer to the document's attributes. |
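
For example, to enable the sliding window approach described above, you can combine split_by, split_length, and split_overlap. This is a sketch; the sample text is a placeholder:

from haystack import Document
from haystack.nodes import PreProcessor

# Split into chunks of at most 100 words, with a 10-word sliding-window overlap:
processor = PreProcessor(
  clean_whitespace=True,
  clean_empty_lines=True,
  split_by="word",
  split_length=100,
  split_overlap=10,
  split_respect_sentence_boundary=False,  # configure overlap and sentence boundaries separately
  language="en",
)

docs = processor.process([Document(content="Replace this with a long text you want to split ...")])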
