PreProcessors
These components are used in indexing pipelines to prepare your data for search by normalizing whitespace, removing empty lines, or splitting documents into smaller chunks.
- DocumentCleaner: Makes document text more readable by removing extra whitespace, empty lines, and similar clutter.
- DocumentSplitter: Splits documents into shorter chunks.
- NLTKDocumentSplitter: Splits a list of documents into a list of shorter documents, using NLTK to detect sentence boundaries.
- TextCleaner: Removes regex matches, punctuation, and numbers from text.
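
To make the cleaning and splitting steps concrete, here is a minimal standard-library sketch of what such preprocessing does conceptually. It is not the implementation of the components listed above: the function names `clean_text` and `split_by_words` and the parameters `split_length` and `split_overlap` are hypothetical and chosen only for illustration.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of spaces and tabs, trim each line, and drop empty lines.

    Hypothetical helper for illustration; not the library's DocumentCleaner.
    """
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def split_by_words(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words. Hypothetical helper for
    illustration; not the library's DocumentSplitter.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once this chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


if __name__ == "__main__":
    raw = "Hello   world \n\n\n  second\tline  "
    cleaned = clean_text(raw)
    print(cleaned)
    print(split_by_words(cleaned, split_length=2, split_overlap=1))
```

In an indexing pipeline, cleaning typically runs before splitting so that chunk boundaries are computed on normalized text. The overlap parameter in this sketch shows one common way to keep context that would otherwise be cut at a chunk boundary.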