UnstructuredFileConverter
Convert files to Haystack Documents using the Unstructured.io API, either the hosted version or a locally running instance. For supported file types and API parameters, see the Unstructured.io docs.
Key Features
- Supports a wide range of file types via the Unstructured.io API.
- Works with both the hosted Unstructured.io API and a locally running instance.
- Configurable document creation mode: one document per file, per page, or per element.
- Configurable separator for concatenating elements.
- Supports passing additional parameters to the Unstructured.io API.
Configuration
- Drag the UnstructuredFileConverter component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Set the UNSTRUCTURED_API_KEY secret in your workspace if using the hosted API. For instructions, see Add Secrets. If you run the API locally, an API key is not needed.
  - Set the api_url. If using the hosted API, you can leave this as the default. If running locally, set it to your local API URL (for example, http://localhost:8000/general/v0/general).
  - Choose the document_creation_mode.
- Go to the Advanced tab to configure unstructured_kwargs, separator, and progress_bar.
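The rule in the steps above (an API key is needed for the hosted API but not for a local instance) can be sketched in plain Python. This is an illustration only, not the component's code: `HOSTED_API_URL` is a placeholder standing in for the component's real default, and `check_settings` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional

# Placeholder for the hosted default URL (an assumption for illustration);
# the component's real default is its UNSTRUCTURED_HOSTED_API_URL constant.
HOSTED_API_URL = "https://hosted.example.invalid/general/v0/general"
# Local URL taken from the example in the configuration steps.
LOCAL_API_URL = "http://localhost:8000/general/v0/general"


def check_settings(
    api_url: str = HOSTED_API_URL,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical check: require UNSTRUCTURED_API_KEY only for the hosted API."""
    env = os.environ if env is None else env
    uses_hosted_api = api_url == HOSTED_API_URL
    if uses_hosted_api and not env.get("UNSTRUCTURED_API_KEY"):
        raise ValueError("Set the UNSTRUCTURED_API_KEY secret to use the hosted API")
    return api_url


# A local instance passes the check without any key configured.
local_url = check_settings(LOCAL_API_URL, env={})
```

Passing `env` explicitly keeps the sketch testable; when it is omitted, the helper reads the process environment.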
Connections
UnstructuredFileConverter receives file paths as input. It sends converted documents to downstream processing components like DocumentSplitter or DocumentWriter.
Source Code
To check this component's source code, open converter.py in the Haystack Core Integrations repository.
Usage Examples
Basic Configuration
UnstructuredFileConverter:
  type: haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - UNSTRUCTURED_API_KEY
      strict: false
Connect the pipeline's file path input to the component's paths input, and connect the component's documents output to a DocumentSplitter or DocumentWriter.
components:
  UnstructuredFileConverter:
    type: haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
    init_parameters:
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| paths | Union[List[str], List[os.PathLike]] | List of paths to convert. Paths can be files or directories. If a path is a directory, all files in the directory are converted. Subdirectories are ignored. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of paths, because the two lists will be zipped. Note that if the paths contain directories, meta can only be a single dictionary (same metadata for all files). |
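The rules for `meta` described in the table can be sketched as a small standalone function. This is a minimal illustration under stated assumptions, not the component's implementation: `pair_meta_with_paths` is a hypothetical name, and the error messages are invented for the example.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


def pair_meta_with_paths(
    paths: List[Union[str, Path]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return one metadata dictionary per input path."""
    if meta is None:
        return [{} for _ in paths]
    if isinstance(meta, dict):
        # A single dictionary is applied to every path.
        return [dict(meta) for _ in paths]
    # A list of dictionaries is not allowed when any path is a directory.
    if any(Path(p).is_dir() for p in paths):
        raise ValueError("meta must be a single dictionary when paths contain directories")
    # A list must line up one-to-one with the paths, because the two are zipped.
    if len(meta) != len(paths):
        raise ValueError("the length of meta must match the number of paths")
    return [dict(m) for m in meta]


shared = pair_meta_with_paths(["a.pdf", "b.pdf"], {"source": "upload"})
per_file = pair_meta_with_paths(["a.pdf", "b.pdf"], [{"lang": "en"}, {"lang": "de"}])
```

Here `shared` holds the same dictionary content for both paths, while `per_file` keeps each dictionary paired with the path at the same position.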
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of Haystack Documents. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_url | str | UNSTRUCTURED_HOSTED_API_URL | URL of the Unstructured.io API. Defaults to the URL of the hosted version. If you run the API locally, specify the URL of your local API (for example, "http://localhost:8000/general/v0/general"). |
| api_key | Optional[Secret] | Secret.from_env_var('UNSTRUCTURED_API_KEY', strict=False) | API key for the Unstructured.io API. It can be explicitly passed or read from the environment variable UNSTRUCTURED_API_KEY (recommended). If you run the API locally, it is not needed. |
| document_creation_mode | Literal['one-doc-per-file', 'one-doc-per-page', 'one-doc-per-element'] | one-doc-per-file | How to create Haystack Documents from the elements returned by Unstructured.io. "one-doc-per-file": One Haystack Document per file. All elements are concatenated into one text field. "one-doc-per-page": One Haystack Document per page. All elements on a page are concatenated into one text field. "one-doc-per-element": One Haystack Document per element. Each element is converted to a Haystack Document. |
| separator | str | \n\n | Separator between elements when concatenating them into one text field. |
| unstructured_kwargs | Optional[Dict[str, Any]] | None | Additional parameters that are passed to the Unstructured.io API. For the available parameters, see Unstructured.io API docs. |
| progress_bar | bool | True | Whether to show a progress bar during the conversion. |
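To make the three `document_creation_mode` values and the `separator` concrete, the sketch below groups a few made-up elements into document texts. It is an illustration in plain Python, not the component's code: the element dictionaries with `text` and `page_number` keys are an assumed, simplified shape, and `build_texts` is a hypothetical helper.

```python
from typing import Any, Dict, List


def build_texts(
    elements: List[Dict[str, Any]],
    mode: str = "one-doc-per-file",
    separator: str = "\n\n",
) -> List[str]:
    """Hypothetical helper: return the text of each document a mode would produce."""
    if mode == "one-doc-per-file":
        # All elements are concatenated into a single text.
        return [separator.join(e["text"] for e in elements)]
    if mode == "one-doc-per-page":
        # Elements are grouped by page, keeping pages in first-seen order.
        pages: Dict[int, List[str]] = {}
        for e in elements:
            pages.setdefault(e.get("page_number", 1), []).append(e["text"])
        return [separator.join(texts) for texts in pages.values()]
    if mode == "one-doc-per-element":
        # Each element becomes its own document, so no separator is needed.
        return [e["text"] for e in elements]
    raise ValueError(f"Unknown document_creation_mode: {mode}")


# Made-up elements for illustration.
elements = [
    {"text": "Title", "page_number": 1},
    {"text": "Intro", "page_number": 1},
    {"text": "Details", "page_number": 2},
]

per_file = build_texts(elements, "one-doc-per-file")        # ["Title\n\nIntro\n\nDetails"]
per_page = build_texts(elements, "one-doc-per-page")        # ["Title\n\nIntro", "Details"]
per_element = build_texts(elements, "one-doc-per-element")  # ["Title", "Intro", "Details"]
```

The separator only matters in the first two modes, where several elements end up in one text field.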
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | Union[List[str], List[os.PathLike]] | | List of paths to convert. Paths can be files or directories. If a path is a directory, all files in the directory are converted. Subdirectories are ignored. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of paths, because the two lists will be zipped. Please note that if the paths contain directories, meta can only be a single dictionary (same metadata for all files). |
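The directory behavior described for `paths` (files directly inside a directory are converted, subdirectories are ignored) can be sketched with `pathlib`. This is an illustrative approximation, not the component's code; `expand_paths` is a hypothetical name, and skipping paths that do not exist is an assumption of this sketch.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def expand_paths(paths: List[Union[str, Path]]) -> List[Path]:
    """Hypothetical helper: list the files that would be converted."""
    files: List[Path] = []
    for path in map(Path, paths):
        if path.is_dir():
            # Take only files directly inside the directory; do not recurse.
            files.extend(sorted(child for child in path.iterdir() if child.is_file()))
        elif path.is_file():
            files.append(path)
    return files


# Demonstration with a temporary directory containing one file and one subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a")
    (root / "nested").mkdir()
    (root / "nested" / "b.txt").write_text("b")
    found = [p.name for p in expand_paths([root])]
```

In this demonstration `found` contains only `a.txt`, because `nested/b.txt` sits in a subdirectory.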