MultiFileConverter
Convert multiple file types to documents in a single operation.
Basic Information
- Type: haystack.components.converters.multi_file_converter.MultiFileConverter
- Components it can connect with:
  - FilesInput: MultiFileConverter receives files from FilesInput.
  - Preprocessors: MultiFileConverter can send converted documents to a preprocessor such as DocumentSplitter.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | A list of file paths or byte streams to convert. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of converted documents. Only included if at least one file is successfully converted. |
| failed | List[str] | | A list of files that failed to convert. |
Overview
MultiFileConverter handles the conversion of multiple file types to documents in a single operation. It automatically detects file types and applies the appropriate converter for each file.
MultiFileConverter is a SuperComponent that combines FileTypeRouter, nine converters, and DocumentJoiner into a single component. This means you can use it to convert multiple file types without having to manually connect the components.
The component supports these file types:
- CSV (text/csv) through an underlying CSVToDocument converter.
- DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) through an underlying DOCXToDocument converter.
- HTML (text/html) through an underlying HTMLToDocument converter.
- JSON (application/json) through an underlying JSONConverter.
- MD (text/markdown) through an underlying MarkdownToDocument converter.
- TEXT (text/plain) through an underlying TextFileToDocument converter.
- PDF (application/pdf) through an underlying PDFMinerToDocument converter.
- PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) through an underlying PPTXToDocument converter.
- XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) through an underlying XLSXToDocument converter.
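The routing idea behind this list can be sketched in plain Python. This is only an illustration, not the component's implementation: `SUFFIX_TO_MIME`, `MIME_TO_CONVERTER`, and `route_by_mime_type` are hypothetical names, the file names are made up, and the sketch guesses the type from the file extension alone.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical lookup table: file extension -> MIME type.
SUFFIX_TO_MIME = {
    ".csv": "text/csv",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".html": "text/html",
    ".json": "application/json",
    ".md": "text/markdown",
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Hypothetical lookup table: MIME type -> name of the converter that handles it.
MIME_TO_CONVERTER = {
    "text/csv": "CSVToDocument",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "DOCXToDocument",
    "text/html": "HTMLToDocument",
    "application/json": "JSONConverter",
    "text/markdown": "MarkdownToDocument",
    "text/plain": "TextFileToDocument",
    "application/pdf": "PDFMinerToDocument",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "PPTXToDocument",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "XLSXToDocument",
}


def route_by_mime_type(sources):
    """Group file paths by converter name; collect unsupported files separately."""
    routed = defaultdict(list)
    unclassified = []
    for source in sources:
        # Compare extensions case-insensitively, so "data.XLSX" still matches.
        mime = SUFFIX_TO_MIME.get(Path(source).suffix.lower())
        if mime is None:
            unclassified.append(str(source))
        else:
            routed[MIME_TO_CONVERTER[mime]].append(str(source))
    return dict(routed), unclassified


routed, unclassified = route_by_mime_type(
    ["report.pdf", "notes.md", "data.XLSX", "image.png"]
)
print(routed)
print(unclassified)
```

A check like this can be useful before uploading, to spot files that have no matching converter.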
If a file fails to convert, it's added to the failed list in the output. The documents output is only included if at least one file is successfully converted.
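Because documents can be missing from the output, code that consumes the result should not index it directly. A minimal sketch, assuming the result is a plain dictionary shaped like the Outputs table; `summarize_result` is a hypothetical helper and the sample dictionaries are made up for illustration:

```python
def summarize_result(result):
    """Return (documents, failed), using empty lists for missing keys."""
    # "documents" is only present when at least one file was converted.
    documents = result.get("documents", [])
    # Assumption: treat a missing "failed" key the same as an empty list.
    failed = result.get("failed", [])
    return documents, failed


# Made-up results: one where every file failed, one with a partial success.
all_failed = {"failed": ["broken.pdf"]}
partial = {"documents": ["<Document 1>"], "failed": ["broken.pdf"]}

docs_a, failed_a = summarize_result(all_failed)
docs_b, failed_b = summarize_result(partial)
print(len(docs_a), failed_a)
print(len(docs_b), failed_b)
```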
Usage Example
Initializing the Component
components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
      json_content_key: text
Using the Component in an Index
In this index, MultiFileConverter receives files, converts them, and then sends them to DocumentSplitter.
components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
      json_content_key: content
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
      skip_empty_documents: true
  DeepsetNvidiaDocumentEmbedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      model: intfloat/multilingual-e5-base
      prefix: ''
      suffix: ''
      batch_size: 32
      meta_fields_to_embed:
      embedding_separator: \n
      truncate:
      normalize_embeddings: true
      timeout:
      backend_kwargs:
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: DeepsetNvidiaDocumentEmbedder.documents
  - sender: DeepsetNvidiaDocumentEmbedder.documents
    receiver: DocumentWriter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - MultiFileConverter.sources
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| encoding | str | utf-8 | The encoding to use when reading text files. |
| json_content_key | str | content | The key to extract content from JSON files. |
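To make the two parameters concrete, here is a rough sketch of what decoding a JSON file with encoding and pulling out its text with json_content_key amounts to. It uses only the standard library and is not the underlying JSONConverter implementation. `extract_json_content` and the sample payload are hypothetical, and treating the remaining fields as metadata is an assumption made for illustration.

```python
import json


def extract_json_content(raw, encoding="utf-8", json_content_key="content"):
    """Decode raw bytes, parse them as JSON, and split content from other fields."""
    data = json.loads(raw.decode(encoding))
    # Raises KeyError if the configured key is not present in the JSON object.
    content = data[json_content_key]
    # Assumption: keep every other top-level field as metadata.
    meta = {key: value for key, value in data.items() if key != json_content_key}
    return content, meta


# Made-up payload; the non-ASCII characters show why the encoding matters.
raw = '{"content": "Grüße aus Berlin", "author": "Ada"}'.encode("utf-8")
content, meta = extract_json_content(raw)
print(content, meta)
```

If the JSON files store their text under a different key, such as text, the same call would pass json_content_key="text".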
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | A list of file paths or byte streams to convert. |