PreProcessor takes a single Document as input, cleans it up, and splits it into smaller documents to optimize it for search. Splitting is generally recommended for long Documents as it makes the retriever's job easier and speeds up question answering.
In a pipeline, you can use it to process the output of a TextConverter or PDFToTextConverter.
- Pipeline type: Used in indexing pipelines
- Position in a pipeline: As early as possible but after file converters (TextConverter and PDFToTextConverter).
- Node input: Documents
- Node output: Documents
- Available node classes: PreProcessor
    ...
    components:
      - name: Preprocessor
        type: PreProcessor
        params:
          # With a vector-based retriever, it's good to split your documents into smaller ones
          split_by: word # The unit by which you want to split the documents
          split_length: 250 # The max number of words in a document
          split_overlap: 20 # Enables the sliding window approach
          language: en
          split_respect_sentence_boundary: True # Retains complete sentences in split documents
    ...
    pipelines:
      - name: indexing
        nodes:
          - name: TextFileConverter
            inputs: [File]
          - name: Preprocessor
            inputs: [TextFileConverter]
    ...
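To get a feel for how `split_length` and `split_overlap` interact when `split_by` is `word`, here is a minimal, self-contained sketch of a sliding-window word splitter. It is not the PreProcessor implementation: the `split_words` function is hypothetical, it ignores sentence boundaries, and the values 5 and 2 are assumptions chosen only to keep the output short.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `split_length` words.

    Adjacent chunks share `split_overlap` words (sliding window).
    Hypothetical helper for illustration, not part of any library.
    """
    if split_length <= 0:
        raise ValueError("split_length must be a positive integer")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"word{i}" for i in range(1, 9))  # word1 ... word8
docs = split_words(text, split_length=5, split_overlap=2)
for number, doc in enumerate(docs, start=1):
    print(f"Document{number}: {doc}")
# Document1: word1 word2 word3 word4 word5
# Document2: word4 word5 word6 word7 word8
```

With an overlap of 0, the window advances by a full `split_length` each time, so adjacent documents share no words.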
You can specify the following arguments for PreProcessor:
|Boolean||Uses heuristics to remove footers and headers across different pages by searching for the longest common string.|
This heuristic uses exact matches and works well for footers like Copyright 2019 by XXX, but doesn't detect Page 3 of 4 or similar.
|Boolean||Strips whitespace at the beginning and end of each line in the text.|
|Boolean||Removes more than two empty lines in text.|
|List of strings||Removes the specified substrings from the text. If you don't provide any value, it creates an empty list.|
|Literal||Specifies the unit for splitting the document. If set to |
|Integer||Default: ||Specifies the maximum number of the split unit that is allowed in one document. For example, if you set |
It must be a positive integer. Beyond that, there are no constraints on the minimum or maximum value.
|Integer||Default: ||Specifies the word overlap between two adjacent documents after a split. If you set it to a positive number, it enables the sliding window approach. For example, if you set |
Document1: word1, word2, word3, word4, word5
Document2: word4, word5, word6, word7, word8
To ensure there's no overlap among the documents after splitting, set the value to
|Boolean||Specifies whether to preserve complete sentences when splitting Documents if |
If set to
|String||Default: ||The path to the folder containing the NLTK PunktSentenceTokenizer models if loading a model from a local path. Leave empty otherwise.|
|Specifies the language used by |
|List of strings||Generates the document ID from a custom list of strings that refer to the document's attributes.|
|Boolean||Shows the progress bar.|
|Boolean||Adds the number of the page a paragraph occurs in to the document's meta field called |
|Integer||Default: ||The maximum number of characters a document can have. Each preprocessed document that is longer than the value specified here raises a warning.|
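The header and footer removal described above relies on finding the longest common string across pages. The following sketch shows why an exact-match heuristic of this kind finds a footer like Copyright 2019 by XXX but recovers only a fragment of a footer whose page number changes. It is an illustration under stated assumptions, not the PreProcessor code: `longest_common_substring`, `find_common_footer`, and the `n_chars` parameter are hypothetical names, and folding the comparison pairwise over the pages is a simplification.

```python
from typing import List


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest string that occurs verbatim in both a and b."""
    best_end, best_len = 0, 0
    # previous[j] = length of the common suffix of a[:i-1] and b[:j]
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]


def find_common_footer(pages: List[str], n_chars: int = 50) -> str:
    """Approximate the string shared by the last `n_chars` of every page.

    Simplification: the common string is narrowed down pairwise, page by page.
    """
    if not pages:
        return ""
    tails = [page[-n_chars:] for page in pages]
    common = tails[0]
    for tail in tails[1:]:
        common = longest_common_substring(common, tail)
    return common.strip()


bodies = ["Intro to search", "Splitting documents", "Tuning the retriever"]

# Identical footer on every page: the exact match recovers it completely.
copyright_pages = [body + "\nCopyright 2019 by XXX" for body in bodies]
print(find_common_footer(copyright_pages))  # Copyright 2019 by XXX

# Footer with a changing page number: only a fragment matches exactly.
numbered_pages = [f"{body}\nPage {n} of 3" for n, body in enumerate(bodies, start=1)]
print(find_common_footer(numbered_pages))  # Page
```

In the second case the page number breaks the exact match, so the full footer is never identified as a single common string.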