
UnstructuredFileConverter

Converts files to Haystack Documents using the Unstructured API, which supports a wide range of file types.

Key Features

  • Converts many file types through the Unstructured API.
  • Three document creation modes: one document per file, per page, or per element.
  • Configurable element separator for controlling text concatenation.
  • Works with the hosted Unstructured API or a locally running instance.
  • Optional progress bar for tracking batch conversions.

Configuration

Authentication

You need an Unstructured API key to use the hosted version of this component. Set the UNSTRUCTURED_API_KEY environment variable or store it as a workspace secret. If you run Unstructured locally, no API key is needed.
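
The key requirement above can be sketched in plain Python. This is a minimal illustration of the rule, not the component's own code: `HOSTED_URL` is a placeholder value and `resolve_api_key` is a hypothetical helper. Only the `UNSTRUCTURED_API_KEY` variable name comes from this page.

```python
import os
from typing import Optional

# Placeholder for illustration only; the real default is the constant
# UNSTRUCTURED_HOSTED_API_URL listed under Init Parameters.
HOSTED_URL = "https://hosted.example/general/v0/general"


def resolve_api_key(api_url: str = HOSTED_URL) -> Optional[str]:
    """Read the key from the environment; only the hosted URL requires one."""
    key = os.environ.get("UNSTRUCTURED_API_KEY")
    if api_url == HOSTED_URL and not key:
        # The hosted API cannot be used without a key.
        raise ValueError("Set UNSTRUCTURED_API_KEY to use the hosted API.")
    # For a local instance this may be None, which is acceptable.
    return key
```

With a local URL such as "http://localhost:8000/general/v0/general", the helper returns whatever the environment holds, including nothing at all.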

To add the component to a pipeline in Pipeline Builder:

  1. Drag the UnstructuredFileConverter component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed.

Connections

UnstructuredFileConverter accepts a list of file paths or directories as input. It outputs a list of Haystack documents.

Connect the pipeline's file path input to its paths input. Connect its documents output to a DocumentSplitter or DocumentWriter.
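
Outside Pipeline Builder, the same input and output can be exercised directly in Python. The sketch below assumes that the Unstructured integration for Haystack is installed, that `UNSTRUCTURED_API_KEY` is set (or that `api_url` is changed to point to a local instance), and that the import path matches the integration's public module. `convert_files` is a hypothetical wrapper, not part of the component.

```python
from typing import Any, Dict, List, Optional


def convert_files(paths: List[str], meta: Optional[Dict[str, Any]] = None) -> list:
    """Convert files with UnstructuredFileConverter and return the Documents."""
    # Imported lazily so this module loads even without the integration installed.
    from haystack_integrations.components.converters.unstructured import (
        UnstructuredFileConverter,
    )

    converter = UnstructuredFileConverter(
        document_creation_mode="one-doc-per-file",
        separator="\n\n",
        progress_bar=False,
    )
    # run() returns a dictionary; the Documents are under the "documents" key.
    result = converter.run(paths=paths, meta=meta)
    return result["documents"]
```

A call such as `convert_files(["reports/q1.pdf", "reports/archive"], meta={"source": "reports"})` would pass one file and one directory; both paths are placeholders.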

Usage Example

components:
  UnstructuredFileConverter:
    type: unstructured.src.haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
    init_parameters:

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| paths | Union[List[str], List[os.PathLike]] | | List of paths to convert. Paths can be files or directories. If a path is a directory, all files in the directory are converted. Subdirectories are ignored. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of paths, because the two lists will be zipped. If the paths contain directories, meta can only be a single dictionary (same metadata for all files). |
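
The directory rule for `paths` can be illustrated with the standard library alone. This is a sketch of the described behavior, not the component's implementation; `expand_paths` is a hypothetical helper name, and silently skipping paths that do not exist is a choice made for this sketch.

```python
import os
from pathlib import Path
from typing import Iterable, List, Union


def expand_paths(paths: Iterable[Union[str, os.PathLike]]) -> List[Path]:
    """Expand directories to the files directly inside them (no recursion)."""
    files: List[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Only direct children that are files; subdirectories are skipped.
            files.extend(sorted(p for p in path.iterdir() if p.is_file()))
        elif path.is_file():
            files.append(path)
        # Paths that do not exist are skipped in this sketch.
    return files
```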

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | List of Haystack Documents, returned in a dictionary under the key documents. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_url | str | UNSTRUCTURED_HOSTED_API_URL | URL of the Unstructured API. Defaults to the URL of the hosted version. If you run the API locally, specify the URL of your local API (e.g. "http://localhost:8000/general/v0/general"). |
| api_key | Optional[Secret] | Secret.from_env_var('UNSTRUCTURED_API_KEY', strict=False) | API key for the Unstructured API. It can be passed explicitly or read from the environment variable UNSTRUCTURED_API_KEY (recommended). If you run the API locally, it is not needed. |
| document_creation_mode | Literal['one-doc-per-file', 'one-doc-per-page', 'one-doc-per-element'] | one-doc-per-file | How to create Haystack Documents from the elements returned by Unstructured. "one-doc-per-file": One Haystack Document per file. All elements are concatenated into one text field. "one-doc-per-page": One Haystack Document per page. All elements on a page are concatenated into one text field. "one-doc-per-element": One Haystack Document per element. Each element is converted to a Haystack Document. |
| separator | str | "\n\n" | Separator between elements when concatenating them into one text field. |
| unstructured_kwargs | Optional[Dict[str, Any]] | None | Additional parameters that are passed to the Unstructured API. For the available parameters, see the Unstructured API docs (https://unstructured.io/). |
| progress_bar | bool | True | Whether to show a progress bar during the conversion. |
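
How `document_creation_mode` and `separator` interact can be shown with a small standalone sketch. It is not the component's implementation: elements are modeled here as plain dictionaries with assumed `text` and `page_number` keys, and `build_texts` is a hypothetical helper that returns only the text of each resulting document.

```python
from collections import defaultdict
from typing import Dict, List


def build_texts(
    elements: List[dict],
    mode: str = "one-doc-per-file",
    separator: str = "\n\n",
) -> List[str]:
    """Return one text per document that the given mode would produce."""
    if mode == "one-doc-per-file":
        # Every element of the file is joined into a single text.
        return [separator.join(el["text"] for el in elements)]
    if mode == "one-doc-per-page":
        pages: Dict[int, List[str]] = defaultdict(list)
        for el in elements:
            pages[el["page_number"]].append(el["text"])
        # Dictionaries keep insertion order, so pages stay in first-seen order.
        return [separator.join(texts) for texts in pages.values()]
    if mode == "one-doc-per-element":
        # No concatenation happens, so the separator is not used.
        return [el["text"] for el in elements]
    raise ValueError(f"Unknown mode: {mode}")
```

For three elements "Title" and "Body" on page 1 and "End" on page 2, the sketch yields one text in file mode, two in page mode ("Title\n\nBody" and "End"), and three in element mode.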

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| paths | Union[List[str], List[os.PathLike]] | | List of paths to convert. Paths can be files or directories. If a path is a directory, all files in the directory are converted. Subdirectories are ignored. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of paths, because the two lists will be zipped. If the paths contain directories, meta can only be a single dictionary (same metadata for all files). |
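
The two accepted shapes of `meta` can be sketched as follows. This is an illustration of the rule described in the table, under the assumption that a length mismatch is reported as an error; `normalize_meta` is a hypothetical helper, and the directory restriction is not modeled.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: Meta, paths: List[str]) -> List[Dict[str, Any]]:
    """Return one metadata dictionary per path."""
    if meta is None:
        return [{} for _ in paths]
    if isinstance(meta, dict):
        # A single dictionary is copied for every path.
        return [dict(meta) for _ in paths]
    if len(meta) != len(paths):
        # The lists are paired item by item, so their lengths must agree.
        raise ValueError("The meta list must have the same length as paths.")
    return [dict(m) for m in meta]
```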