
MultiFileConverter

Convert multiple file types to documents in a single operation.

Basic Information

  • Type: haystack.components.converters.multi_file_converter.MultiFileConverter
  • Components it can connect with:
    • FilesInput: MultiFileConverter receives files from FilesInput.
    • Preprocessors: MultiFileConverter can send converted documents to a preprocessor such as DocumentSplitter.

Inputs

Parameter | Type | Default | Description
sources | List[Union[str, Path, ByteStream]] | | A list of file paths or byte streams to convert.

Outputs

Parameter | Type | Default | Description
documents | List[Document] | | A list of converted documents. Only included if at least one file is successfully converted.
failed | List[str] | | A list of files that failed to convert.

Overview

MultiFileConverter handles the conversion of multiple file types to documents in a single operation. It automatically detects file types and applies the appropriate converter for each file.

MultiFileConverter is a SuperComponent that combines FileTypeRouter, nine converters, and DocumentJoiner into a single component. This means you can use it to convert multiple file types without having to manually connect the components.

The component supports these file types:

  • CSV (text/csv) through an underlying CSVToDocument converter.
  • DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) through an underlying DOCXToDocument converter.
  • HTML (text/html) through an underlying HTMLToDocument converter.
  • JSON (application/json) through an underlying JSONConverter.
  • MD (text/markdown) through an underlying MarkdownToDocument converter.
  • TEXT (text/plain) through an underlying TextFileToDocument converter.
  • PDF (application/pdf) through an underlying PDFMinerToDocument converter.
  • PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) through an underlying PPTXToDocument converter.
  • XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) through an underlying XLSXToDocument converter.
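
To make the routing idea concrete, here is a minimal sketch of grouping files by MIME type before handing each group to a converter. It is an illustration only, not the component's implementation: the `EXTENSION_TO_MIME` table and the `group_by_mime` helper are hypothetical names, and the sketch assumes the type can be inferred from the file extension, which may not match how the component actually detects types.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical extension-to-MIME table covering the nine supported types.
# XLSX uses the registered "spreadsheetml.sheet" media type.
EXTENSION_TO_MIME = {
    ".csv": "text/csv",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".html": "text/html",
    ".json": "application/json",
    ".md": "text/markdown",
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def group_by_mime(paths):
    """Group paths by MIME type and collect paths with unsupported extensions."""
    groups = defaultdict(list)
    unsupported = []
    for raw in paths:
        path = Path(raw)
        # Compare extensions case-insensitively so "data.CSV" is still CSV.
        mime = EXTENSION_TO_MIME.get(path.suffix.lower())
        if mime is None:
            unsupported.append(str(path))
        else:
            groups[mime].append(str(path))
    return dict(groups), unsupported


groups, unsupported = group_by_mime(["report.pdf", "notes.md", "data.CSV", "photo.png"])
```

In this sketch, each key of `groups` would correspond to one underlying converter, and `unsupported` holds the files that none of the nine converters would accept.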

If a file fails to convert, it's added to the failed list in the output. The documents output is only included if at least one file is successfully converted.
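
Because the documents output can be missing, code that consumes the result should not index it directly. The following is a minimal sketch, assuming the result is a plain dictionary with the documents and failed keys described in the Outputs table; `summarize_result` and the sample dictionaries are hypothetical and only stand in for a real run.

```python
def summarize_result(result):
    """Return (documents, failed) from a converter result dictionary."""
    # "documents" may be absent when no file converts, so fall back to [].
    documents = result.get("documents", [])
    failed = result.get("failed", [])
    return documents, failed


# Hypothetical result where every file failed: no "documents" key at all.
all_failed = {"failed": ["broken.pdf"]}
docs, failed = summarize_result(all_failed)

# Hypothetical result where one file converted and one failed.
partial = {"documents": ["<Document 1>"], "failed": ["broken.pdf"]}
partial_docs, partial_failed = summarize_result(partial)
```

Using `dict.get` with a default keeps downstream code from raising `KeyError` when a whole batch fails.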

Usage Example

Initializing the Component

components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
      json_content_key: text

Using the Component in an Index

In this index, MultiFileConverter receives files, converts them, and then sends them to DocumentSplitter.

components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
      json_content_key: content
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
      skip_empty_documents: true
  DeepsetNvidiaDocumentEmbedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      model: intfloat/multilingual-e5-base
      prefix: ''
      suffix: ''
      batch_size: 32
      meta_fields_to_embed:
      embedding_separator: \n
      truncate:
      normalize_embeddings: true
      timeout:
      backend_kwargs:
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:

connections:
  - sender: MultiFileConverter.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: DeepsetNvidiaDocumentEmbedder.documents
  - sender: DeepsetNvidiaDocumentEmbedder.documents
    receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - MultiFileConverter.sources
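
A configuration like the one above can be sanity-checked before it is deployed. The sketch below assumes the YAML has already been loaded into a plain Python dictionary (for example with a YAML parser, which is not shown here); `find_unknown_endpoints` is a hypothetical helper and not part of Haystack or the platform. It only checks that each connection names a component defined under components.

```python
def find_unknown_endpoints(pipeline):
    """Return connection endpoints whose component is not defined."""
    components = set(pipeline.get("components", {}))
    unknown = []
    for connection in pipeline.get("connections", []):
        for role in ("sender", "receiver"):
            # Endpoints look like "ComponentName.socket_name".
            component_name = connection[role].split(".", 1)[0]
            if component_name not in components:
                unknown.append(connection[role])
    return unknown


# Dictionary mirroring the component names and connections of the index above.
pipeline = {
    "components": {
        "MultiFileConverter": {},
        "DocumentSplitter": {},
        "DeepsetNvidiaDocumentEmbedder": {},
        "DocumentWriter": {},
    },
    "connections": [
        {"sender": "MultiFileConverter.documents", "receiver": "DocumentSplitter.documents"},
        {"sender": "DocumentSplitter.documents", "receiver": "DeepsetNvidiaDocumentEmbedder.documents"},
        {"sender": "DeepsetNvidiaDocumentEmbedder.documents", "receiver": "DocumentWriter.documents"},
    ],
}

unknown = find_unknown_endpoints(pipeline)
```

An empty `unknown` list means every sender and receiver refers to a defined component; a misspelled component name would show up in the list.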

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter | Type | Default | Description
encoding | str | utf-8 | The encoding to use when reading text files.
json_content_key | str | content | The key to extract content from JSON files.
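
To illustrate what json_content_key selects, here is a rough sketch that reads one key from a flat JSON object. It is not the JSONConverter implementation, which may handle nested data and metadata differently; `extract_content` is a hypothetical helper, and the sample JSON strings are made up for the example.

```python
import json


def extract_content(raw_json, content_key="content"):
    """Return the value stored under content_key, or None if it is missing."""
    data = json.loads(raw_json)
    # Only flat JSON objects are handled in this sketch.
    if not isinstance(data, dict) or content_key not in data:
        return None
    return str(data[content_key])


# With the default key, the "content" field becomes the document text.
default_text = extract_content('{"content": "hello", "id": 1}')

# A file that stores its text under "text" needs json_content_key set to "text".
missing = extract_content('{"text": "hi"}')
custom_text = extract_content('{"text": "hi"}', content_key="text")
```

This is why the two YAML examples on this page can use different values (text and content): the key should match the field that holds the text in your JSON files.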

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter | Type | Default | Description
sources | List[Union[str, Path, ByteStream]] | | A list of file paths or byte streams to convert.