Preprocessing Data with Pipeline Nodes
Learn about optimal ways to prepare your data using pipeline nodes available in deepset Cloud. If you need tips and guidelines, you'll find them here.
Indexing Pipeline for Preprocessing
The indexing pipeline converts your files to documents, preprocesses them, and finally stores them in the DeepsetCloudDocumentStore. The query pipeline then uses the documents from the DocumentStore for search.
deepset Cloud offers preprocessing nodes that you can add to your indexing pipeline. This way, when the pipeline runs, your files are automatically converted, split, and cleaned.
deepset Cloud supports PDF and TXT file types.
To learn more about files and documents, see Basic Concepts.
How to Prepare Your Files
Here's an outline of how to plan file preprocessing:
Your files determine which nodes to use in the indexing pipeline:
- If all your files are of one file type, use a file converter appropriate for this file type (PDFToTextConverter or TextConverter) as the first node in your indexing pipeline.
- If you have multiple file types, use FileTypeClassifier as the first node in your indexing pipeline and a file converter as the second node. FileTypeClassifier classifies your files based on their extension and sends them to the converter that can best handle them.
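The routing idea can be illustrated with a minimal sketch in plain Python. This is not the FileTypeClassifier implementation. The mapping `CONVERTER_BY_EXTENSION` and the function `route_by_extension` are hypothetical names introduced only for illustration, and the sketch assumes that only the two file types named on this page are accepted.

```python
from pathlib import Path

# Hypothetical mapping (an assumption for illustration): file extension -> name
# of the converter node mentioned in the text.
CONVERTER_BY_EXTENSION = {
    ".txt": "TextConverter",
    ".pdf": "PDFToTextConverter",
}


def route_by_extension(file_names):
    """Group file names by the converter that would handle them."""
    routes = {}
    for name in file_names:
        # Compare extensions case-insensitively, so "report.PDF" counts as a PDF.
        suffix = Path(name).suffix.lower()
        converter = CONVERTER_BY_EXTENSION.get(suffix)
        if converter is None:
            raise ValueError(f"Unsupported file type: {name}")
        routes.setdefault(converter, []).append(name)
    return routes
```

Calling `route_by_extension(["notes.txt", "report.pdf"])` groups each file under the converter name for its extension. In an actual indexing pipeline, the classifier node performs this step for you.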
The converter's task is to convert your files into documents. However, the documents you obtain this way may not be of the optimal length for the retriever you want to use and may still need to be cleaned up. PreProcessor is the node that handles the cleaning and splitting of documents. It removes headers and footers so that they don't break up the flow of sentences across pages, deletes empty lines, and splits your documents into smaller ones.
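To make the cleaning step concrete, here is a minimal sketch in plain Python of whitespace and empty-line removal. It is an illustration only, not the PreProcessor code. The function name `clean_text` is hypothetical, and header and footer removal is deliberately left out because it needs to compare text across pages.

```python
def clean_text(text):
    """Strip leading and trailing whitespace from each line and drop empty lines."""
    stripped_lines = (line.strip() for line in text.splitlines())
    # Keep only lines that still contain text after stripping.
    return "\n".join(line for line in stripped_lines if line)
```

For example, text with padded lines separated by blank lines comes back as compact lines joined by single newlines.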
Smaller documents speed up your pipeline. They're also optimal for dense retrievers, which often can't handle longer text passages. For example, DensePassageRetriever was trained on documents 100 words long, so that's the length we recommend for dense retrievers. Sparse retrievers can work on slightly longer documents of around 200 to 300 words.
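Splitting by word count can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how PreProcessor is implemented: the function `split_by_words` and its `split_length` parameter are names chosen for this example, words are separated on whitespace only, and sentence boundaries are ignored.

```python
def split_by_words(text, split_length=100):
    """Split text into consecutive chunks of at most split_length words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    words = text.split()
    # Step through the word list in windows of split_length words.
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]
```

With the default of 100 words, a 250-word text yields three chunks, the last one shorter than the others. Passing `split_length=250` would give chunks in the range suggested above for sparse retrievers.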
Use these suggestions as a starting point for your indexing pipeline. You may need to experiment with your settings to reach the optimal values for your use case.
For examples of indexing pipelines, see Sample Pipelines.
Pipeline Nodes for Preprocessing
There are a number of nodes that you can use in your indexing pipeline to preprocess your files. Have a look at this table to help you choose the right nodes:
Preprocessing Step | Node That Does It |
---|---|
Sort files by type and route them to the appropriate converter for each file type. | FileTypeClassifier |
Convert a text file to a document object. | TextConverter |
Convert a PDF file to a document object. | PDFToTextConverter |
Validate text language based on the ISO 639-1 format. | TextConverter, PDFToTextConverter |
Remove numeric rows from tables. | TextConverter, PDFToTextConverter |
Add metadata to the returned document. | TextConverter |
Split long documents into smaller ones. | PreProcessor |
Get rid of headers, footers, whitespace, and empty lines. | PreProcessor |