PreProcessors

These components are used in indexing pipelines to prepare your data for search by normalizing whitespace, removing empty lines, or splitting documents into smaller chunks.

  • DocumentCleaner: Makes document text more readable by removing extra whitespace, empty lines, and similar noise.
  • DocumentSplitter: Splits documents into shorter chunks.
  • NLTKDocumentSplitter: Splits a list of documents into a list of shorter documents, using NLTK to detect sentence boundaries.
  • TextCleaner: Removes regexes, punctuation, and numbers from text.
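
To make the two core operations concrete, here is a minimal standard-library sketch of whitespace cleaning and word-based splitting. It is an illustration only and not the components' actual implementation: the helper names `clean_text` and `split_by_words`, and the parameters `split_length` and `split_overlap` as used here, are assumptions made for this example.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines (hypothetical helper)."""
    cleaned_lines = []
    for line in text.splitlines():
        # Replace any run of spaces or tabs with a single space, then trim the ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that are empty after trimming
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


def split_by_words(text: str, split_length: int, split_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `split_length` words (hypothetical helper).

    Consecutive chunks share `split_overlap` words.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the end of the text,
        # so no trailing chunk consists only of overlap words.
        if start + split_length >= len(words):
            break
    return chunks


if __name__ == "__main__":
    raw = "Indexing   pipelines\n\n\n  prepare\tdata   for search.  "
    cleaned = clean_text(raw)
    print(cleaned)
    print(split_by_words(cleaned, split_length=3, split_overlap=1))
```

In an indexing pipeline the cleaning step would typically run before the splitting step, so that chunk boundaries are computed on normalized text.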