FileTypeClassifier
Use the FileTypeClassifier node to route files in an indexing pipeline to the appropriate file converters.
FileTypeClassifier classifies files by their extension and sends each one to a file converter that can handle it. For example, if you have a collection of PDF, text, Markdown, and HTML files, you can use FileTypeClassifier to route them to the converters that turn them into documents.
FileTypeClassifier takes a file path as input and outputs the same path on the output edge that corresponds to the file's extension.
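The routing rule can be illustrated with a minimal Python sketch. This is not the node's actual implementation: the function name `route_by_extension` is hypothetical, and the case-insensitive comparison and the error raised for unsupported extensions are assumptions made for the example.

```python
from pathlib import Path


def route_by_extension(file_path, supported_types):
    """Return (output_edge, file_path) for a path, mimicking the routing rule.

    Hypothetical helper for illustration only, not part of the library.
    """
    # Take the extension without the leading dot.
    # Assumption: the comparison is case-insensitive.
    extension = Path(file_path).suffix.lstrip(".").lower()
    if extension not in supported_types:
        # Assumption: unsupported extensions are rejected.
        raise ValueError(f"Unsupported file extension: {extension!r}")
    # Output edges are numbered from 1, in the order of supported_types.
    edge_number = supported_types.index(extension) + 1
    return f"output_{edge_number}", file_path


edge, path = route_by_extension("docs/report.pdf", ["txt", "pdf"])
print(edge, path)
```

The path itself passes through unchanged; only the chosen edge depends on the extension and its position in the list.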
Usage
You can use FileTypeClassifier as the first node in your indexing pipeline. First, define it in the components section of your pipeline definition file:
components:
  - name: FileClassifier
    type: FileTypeClassifier
    params:
      supported_types: ["txt", "pdf"]
Then, add FileTypeClassifier to your indexing pipeline:
pipelines:
  - name: indexing
    nodes:
      - name: FileClassifier
        inputs: [File]
      # Then, specify the input for each file converter:
      - name: PDFToTextConverter
        inputs: [FileClassifier.output_2]  # output_2 because "pdf" is the second extension in "supported_types"
      - name: TextConverter
        inputs: [FileClassifier.output_1]  # output_1 because "txt" is the first extension in "supported_types"
Arguments
You can specify the following arguments for FileTypeClassifier:
Argument | Type | Possible Values | Description |
---|---|---|---|
supported_types | List of strings | File extensions, such as txt, md, html, pdf, and docx. | Specifies the file types that this node can distinguish. The list can contain at most 10 file extensions and must not contain duplicates. The position of an extension in the list determines its output edge: the first extension maps to output_1, the second to output_2. |
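The constraints on supported_types can be checked before a pipeline definition is written. The following is a minimal sketch of such a check, not the library's own validation: the function name `validate_supported_types` and the error messages are hypothetical.

```python
def validate_supported_types(supported_types, max_types=10):
    """Check a supported_types list against the documented constraints.

    Hypothetical helper for illustration only, not part of the library.
    """
    # The documentation limits the list to 10 extensions.
    if len(supported_types) > max_types:
        raise ValueError(
            f"supported_types has {len(supported_types)} items; "
            f"the maximum is {max_types}."
        )
    # Duplicate extensions are not allowed.
    if len(set(supported_types)) != len(supported_types):
        raise ValueError("supported_types must not contain duplicates.")
    return list(supported_types)


print(validate_supported_types(["txt", "pdf", "md"]))
```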