FileTypeClassifier
Use the FileTypeClassifier node to route files in an indexing pipeline to appropriate file converters.
FileTypeClassifier classifies files based on their extension and sends them to a file converter that can handle them. For example, if you have a collection of files of different types, such as PDFs, text files, Markdown, and HTML files, you can use FileTypeClassifier to route them to converters that can further process them into Documents.
FileTypeClassifier takes a file path as input and outputs the same path on the output edge corresponding to the file's extension. Output edge 1 serves text files and output edge 2 serves PDF files.
Note that deepset Cloud currently supports TXT and PDF file types.
Basic Information
- Pipeline type: Used in indexing pipelines.
- Nodes that can precede it in a pipeline: Used as the first node in indexing pipelines, takes [File] as input.
- Nodes that can follow it in a pipeline: PDFToTextConverter and TextConverter
- Node input: File path
- Node output: File path
- Supported types: FileTypeClassifier
Branching Output
By default, FileTypeClassifier has five output branches, or outgoing edges. When it receives a file, it routes the file through one of these edges to a file converter, such as PDFToTextConverter or TextConverter, which then converts it into a Document.
These are the default outgoing edges:
Outgoing Edge | File Type |
---|---|
1 | Text |
2 | PDF |
3 | Markdown |
4 | DOCX |
5 | PPTX |
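
The routing in the table above can be pictured as a lookup from file extension to edge number. The following Python snippet is a minimal sketch of that idea, not the node's actual implementation. The names `route_to_edge` and `DEFAULT_SUPPORTED_TYPES` are hypothetical, and the sketch assumes that the edge number is the 1-based position of the extension in the list of supported types and that extensions are compared case-insensitively.

```python
from pathlib import Path

# Default extensions in edge order, copied from the table above.
# The constant name is hypothetical and used only for this illustration.
DEFAULT_SUPPORTED_TYPES = ["txt", "pdf", "md", "docx", "pptx"]


def route_to_edge(file_path, supported_types=DEFAULT_SUPPORTED_TYPES):
    """Return the outgoing edge name for a file, based only on its extension."""
    # Assumption: the extension is compared in lowercase, without the leading dot.
    extension = Path(file_path).suffix.lstrip(".").lower()
    if extension not in supported_types:
        raise ValueError(f"Unsupported file type: {extension!r}")
    # Assumption: edge numbers are 1-based positions in supported_types.
    return f"output_{supported_types.index(extension) + 1}"


if __name__ == "__main__":
    for path in ["notes.txt", "report.pdf", "README.md"]:
        print(path, "->", route_to_edge(path))
```

Under these assumptions, changing the order of the supported types would also change which edge serves which file type, so the converters connected to each edge would need to follow the same order.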
Usage Example
You can use FileTypeClassifier as the first node in your indexing pipeline. First, define it in the components section of your pipeline definition file and then in the pipelines section:
components:
  - name: FileClassifier
    type: FileTypeClassifier
    params:
      supported_types: ["txt", "pdf"]
  ...
pipelines:
  - name: indexing
    nodes:
      - name: FileClassifier
        inputs: [File]
      # Then specify the input for the file converters:
      - name: PDFToTextConverter
        inputs: [FileClassifier.output_2] # output_2 because output edge 2 serves PDF files
      - name: TextConverter
        inputs: [FileClassifier.output_1] # output_1 because output edge 1 serves TXT files
Parameters
You can specify the following parameters for FileTypeClassifier in the pipeline YAML:
Parameter | Type | Possible Values | Description |
---|---|---|---|
supported_types | A list of strings | File extensions, such as txt, md, html, pdf, docx, and so on. Default: txt, pdf, md, docx, pptx | Optional. Specifies the file types that this node can distinguish. The list can contain at most 10 file extensions and must not contain duplicates. |
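
The two constraints on `supported_types` (at most 10 entries, no duplicates) can be checked before a pipeline definition is submitted. The following Python snippet is a rough sketch of such a check. The function `validate_supported_types` and the constant `MAX_SUPPORTED_TYPES` are hypothetical helpers written for this illustration and are not part of the product.

```python
# Limit stated in the parameter description above (hypothetical constant name).
MAX_SUPPORTED_TYPES = 10


def validate_supported_types(supported_types):
    """Return the list unchanged, or raise ValueError if it breaks a documented constraint."""
    if len(supported_types) > MAX_SUPPORTED_TYPES:
        raise ValueError(
            f"supported_types has {len(supported_types)} items; "
            f"the maximum is {MAX_SUPPORTED_TYPES}"
        )
    # A set drops repeated values, so a shorter set means the list has duplicates.
    if len(set(supported_types)) != len(supported_types):
        raise ValueError("supported_types must not contain duplicate extensions")
    return supported_types


if __name__ == "__main__":
    print(validate_supported_types(["txt", "pdf"]))
```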