MultiFileConverter

Convert multiple file types to documents in a single operation.

Basic Information

Type: haystack.components.converters.multi_file_converter.MultiFileConverter
Components it can connect with:
- FilesInput: MultiFileConverter receives files from FilesInput.
- Preprocessors: MultiFileConverter can send converted documents to a preprocessor such as DocumentSplitter.

Inputs

Parameter	Type	Default	Description
sources	List[Union[str, Path, ByteStream]]		A list of file paths or byte streams to convert.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		A list of converted documents. Only included if at least one file is successfully converted.
failed	List[str]		A list of files that failed to convert.

Overview

MultiFileConverter handles the conversion of multiple file types to documents in a single operation. It automatically detects file types and applies the appropriate converter for each file.

MultiFileConverter is a SuperComponent that combines FileTypeRouter, nine converters, and DocumentJoiner into a single component. This means you can use it to convert multiple file types without having to manually connect the components.

The component supports these file types:

CSV (text/csv) through an underlying CSVToDocument converter.
DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) through an underlying DOCXToDocument converter.
HTML (text/html) through an underlying HTMLToDocument converter.
JSON (application/json) through an underlying JSONConverter.
MD (text/markdown) through an underlying MarkdownToDocument converter.
TEXT (text/plain) through an underlying TextFileToDocument converter.
PDF (application/pdf) through an underlying PDFMinerToDocument converter.
PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) through an underlying PPTXToDocument converter.
XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) through an underlying XLSXToDocument converter.
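
To make the routing idea concrete, here is a minimal sketch of a lookup from file extension to MIME type to converter name. It is not the component's actual implementation: the extension table and the pick_converter helper are assumptions made for illustration, and only the MIME types and converter names correspond to the nine supported types.

```python
from pathlib import PurePath

# Assumed extension -> MIME type table for the nine supported types.
# The component itself may detect types differently.
MIME_BY_EXTENSION = {
    ".csv": "text/csv",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".html": "text/html",
    ".json": "application/json",
    ".md": "text/markdown",
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# MIME type -> name of the underlying converter.
CONVERTER_BY_MIME = {
    "text/csv": "CSVToDocument",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "DOCXToDocument",
    "text/html": "HTMLToDocument",
    "application/json": "JSONConverter",
    "text/markdown": "MarkdownToDocument",
    "text/plain": "TextFileToDocument",
    "application/pdf": "PDFMinerToDocument",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "PPTXToDocument",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "XLSXToDocument",
}


def pick_converter(source):
    """Return the converter name for a file path, or None if the type is unsupported.

    Hypothetical helper: it looks only at the file extension.
    """
    extension = PurePath(str(source)).suffix.lower()
    mime_type = MIME_BY_EXTENSION.get(extension)
    return CONVERTER_BY_MIME.get(mime_type)


print(pick_converter("reports/q3.PDF"))
print(pick_converter("archive.zip"))
```

A file whose extension is not in the table has no matching converter, which is why pick_converter returns None for it in this sketch.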

If a file fails to convert, it's added to the failed list in the output. The documents output is only included if at least one file is successfully converted.
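
Because documents can be missing from the output, code that consumes the result should not assume the key is present. The sketch below assumes the output arrives as a plain dictionary with the documents and failed keys described above. The summarize_conversion helper and the sample dictionaries are hypothetical.

```python
def summarize_conversion(result):
    """Return (number of converted documents, list of failed files).

    Uses .get() with defaults because the "documents" key is only present
    when at least one file was converted successfully.
    """
    documents = result.get("documents", [])
    failed = result.get("failed", [])
    return len(documents), list(failed)


# Hypothetical output where every file failed: no "documents" key at all.
all_failed = {"failed": ["broken.pdf", "corrupt.docx"]}
print(summarize_conversion(all_failed))

# Hypothetical output where two files converted and none failed.
all_converted = {"documents": ["doc-1", "doc-2"], "failed": []}
print(summarize_conversion(all_converted))
```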

Usage Example

Initializing the Component

components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
      json_content_key: text

Using the Component in an Index

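In this index, MultiFileConverter receives files, converts them to documents, and sends the documents to DocumentSplitter. The split documents are then embedded by DeepsetNvidiaDocumentEmbedder and written to the document store by DocumentWriter.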

components:
  MultiFileConverter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
      json_content_key: content
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function:
      respect_sentence_boundary: false
      language: en
      use_split_rules: true
      extend_abbreviations: true
      skip_empty_documents: true
  DeepsetNvidiaDocumentEmbedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      model: intfloat/multilingual-e5-base
      prefix: ''
      suffix: ''
      batch_size: 32
      meta_fields_to_embed:
      embedding_separator: "\n"
      truncate:
      normalize_embeddings: true
      timeout:
      backend_kwargs:
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:

connections:
- sender: MultiFileConverter.documents
  receiver: DocumentSplitter.documents
- sender: DocumentSplitter.documents
  receiver: DeepsetNvidiaDocumentEmbedder.documents
- sender: DeepsetNvidiaDocumentEmbedder.documents
  receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
  - MultiFileConverter.sources

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
encoding	str	utf-8	The encoding to use when reading text files.
json_content_key	str	content	The key to extract content from JSON files.
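
As a rough illustration of what json_content_key selects, the sketch below reads one field of a JSON object as the text content. This is an assumption about the general idea, not the underlying JSONConverter's code, which may handle nested data, metadata, and missing keys differently. The extract_content helper is hypothetical.

```python
import json


def extract_content(raw_json, content_key="content"):
    """Return the value stored under content_key, or None if it is absent.

    Hypothetical helper that only handles a top-level JSON object.
    """
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        return None
    return record.get(content_key)


sample = '{"content": "Quarterly results improved.", "author": "Ada"}'
print(extract_content(sample))

# With json_content_key set to "text", a different field is read.
print(extract_content('{"text": "Hello"}', content_key="text"))
```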

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
sources	List[Union[str, Path, ByteStream]]		A list of file paths or byte streams to convert.
