PreProcessors

Use these components in indexing pipelines to prepare your data for search: they normalize whitespace, remove empty lines, and split documents into smaller chunks.


DocumentCleaner: Makes document text more readable by removing extra whitespace, empty lines, and similar clutter.
DocumentSplitter: Splits documents into shorter chunks.
NLTKDocumentSplitter: Uses NLTK to split a list of documents into a list of shorter documents.
TextCleaner: Removes regex matches, punctuation, and numbers from text.
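
To make the two core ideas concrete, here is a minimal, library-independent sketch of cleaning and splitting in plain Python. It is not the implementation of any component listed above: the helpers `clean_text` and `split_by_words` and the parameters `split_length` and `split_overlap` are hypothetical names chosen for illustration only, and the real components may behave differently.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper: collapse runs of spaces/tabs and drop empty lines."""
    lines = []
    for line in text.splitlines():
        # Collapse consecutive spaces and tabs, then trim both ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that are empty after trimming
            lines.append(line)
    return "\n".join(lines)


def split_by_words(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Hypothetical helper: split text into chunks of at most `split_length`
    words, where consecutive chunks share `split_overlap` words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()  # whitespace tokenization; line breaks are not preserved
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


raw = "Hello   world\n\n\n  second\tline  "
cleaned = clean_text(raw)
chunks = split_by_words(cleaned, split_length=3, split_overlap=1)
```

In this sketch, cleaning runs before splitting so that chunk sizes are counted on normalized text rather than on stray whitespace.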
