UnstructuredFileConverter
A component for converting files to Haystack Documents using the Unstructured API (https://unstructured.io/), either hosted or running locally.
Basic Information
- Type: haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | Union[List[str], List[os.PathLike]] | | List of paths to convert. Paths can be files or directories. If a path is a directory, all files directly inside it are converted. Subdirectories are ignored. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of paths, because the two lists will be zipped. Please note that if the paths contain directories, meta can only be a single dictionary (same metadata for all files). |
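The directory rule for paths can be illustrated with a minimal sketch. It is not the component's actual implementation: expand_paths is a hypothetical helper that only mirrors the behavior described in the table (files are kept, directories contribute their direct children, subdirectories are skipped).

```python
import os
import tempfile
from pathlib import Path
from typing import List, Union


def expand_paths(paths: List[Union[str, os.PathLike]]) -> List[Path]:
    """Hypothetical helper: flatten files and directories into a list of files.

    Directories contribute only their direct children; subdirectories are skipped.
    """
    files: List[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_file():
            files.append(path)
        elif path.is_dir():
            # Non-recursive listing, sorted for a deterministic order.
            files.extend(sorted(child for child in path.iterdir() if child.is_file()))
    return files


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a")
    (root / "nested").mkdir()
    (root / "nested" / "b.txt").write_text("b")
    found = [p.name for p in expand_paths([root])]

print(found)  # the file inside "nested" is not included
```

If nested files also need converting, one option is to list the subdirectories explicitly in paths.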
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Haystack Documents, returned under the documents key of the output dictionary. |
Overview
Pipeline examples and the most common component connections are still being added.
A component for converting files to Haystack Documents using the Unstructured API (hosted or running locally).
For the supported file types and the specific API parameters, see the Unstructured docs.
Usage example:
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# Make sure to either set the environment variable UNSTRUCTURED_API_KEY
# or run the Unstructured API locally:
# docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest
# --port 8000 --host 0.0.0.0
converter = UnstructuredFileConverter(
    # api_url="http://localhost:8000/general/v0/general"  # <-- Uncomment this if running Unstructured locally
)
# The result is a dictionary; the converted Documents are under the "documents" key.
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
Usage Example
components:
  UnstructuredFileConverter:
    type: haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
    init_parameters: {}
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_url | str | UNSTRUCTURED_HOSTED_API_URL | URL of the Unstructured API. Defaults to the URL of the hosted version. If you run the API locally, specify the URL of your local API (for example, "http://localhost:8000/general/v0/general"). |
| api_key | Optional[Secret] | Secret.from_env_var('UNSTRUCTURED_API_KEY', strict=False) | API key for the Unstructured API. It can be passed explicitly or read from the environment variable UNSTRUCTURED_API_KEY (recommended). It is not needed if you run the API locally. |
| document_creation_mode | Literal['one-doc-per-file', 'one-doc-per-page', 'one-doc-per-element'] | one-doc-per-file | How to create Haystack Documents from the elements returned by Unstructured. "one-doc-per-file": One Haystack Document per file. All elements are concatenated into one text field. "one-doc-per-page": One Haystack Document per page. All elements on a page are concatenated into one text field. "one-doc-per-element": One Haystack Document per element. Each element is converted to a Haystack Document. |
| separator | str | \n\n | Separator between elements when concatenating them into one text field. |
| unstructured_kwargs | Optional[Dict[str, Any]] | None | Additional parameters that are passed to the Unstructured API. For the available parameters, see the Unstructured API docs. |
| progress_bar | bool | True | Whether to show a progress bar during the conversion. |
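To make the three document_creation_mode values and the separator concrete, here is a minimal sketch of the grouping logic. It is an illustration under assumptions, not the component's code: group_elements is a hypothetical function, and elements are simplified to plain dictionaries with text and page_number keys, which is not the exact shape Unstructured returns.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_elements(
    elements: List[Dict[str, Any]],
    mode: str = "one-doc-per-file",
    separator: str = "\n\n",
) -> List[str]:
    """Hypothetical helper: return one text per Document that the mode would create."""
    if mode == "one-doc-per-file":
        # Everything in the file is concatenated into a single text.
        return [separator.join(el["text"] for el in elements)]
    if mode == "one-doc-per-page":
        # Elements are bucketed by page; dict insertion order keeps pages in order of appearance.
        pages: Dict[int, List[str]] = defaultdict(list)
        for el in elements:
            pages[el.get("page_number", 1)].append(el["text"])
        return [separator.join(texts) for texts in pages.values()]
    if mode == "one-doc-per-element":
        # No concatenation, so the separator is not used.
        return [el["text"] for el in elements]
    raise ValueError(f"Unknown document_creation_mode: {mode}")


elements = [
    {"text": "Title", "page_number": 1},
    {"text": "Intro paragraph.", "page_number": 1},
    {"text": "Second page text.", "page_number": 2},
]

per_file = group_elements(elements, "one-doc-per-file")
per_page = group_elements(elements, "one-doc-per-page")
per_element = group_elements(elements, "one-doc-per-element")

print(len(per_file), len(per_page), len(per_element))  # → 1 2 3
```

The separator only matters in the two concatenating modes. With one-doc-per-element, each element becomes its own Document, which may suit fine-grained retrieval but produces many more Documents per file.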
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | Union[List[str], List[os.PathLike]] | | List of paths to convert. Paths can be files or directories. If a path is a directory, all files directly inside it are converted. Subdirectories are ignored. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of paths, because the two lists will be zipped. Please note that if the paths contain directories, meta can only be a single dictionary (same metadata for all files). |
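The rules for meta can be sketched as follows. This is an assumption-labeled illustration rather than the library's implementation: align_meta and its has_directories flag are hypothetical names that only restate the table's rules (a single dictionary is applied to every file, a list must match the number of paths, and a list is not allowed when paths contain directories).

```python
from typing import Any, Dict, List, Optional, Union

Meta = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def align_meta(meta: Meta, n_paths: int, has_directories: bool = False) -> List[Dict[str, Any]]:
    """Hypothetical helper: produce one metadata dictionary per path."""
    if meta is None:
        return [{} for _ in range(n_paths)]
    if isinstance(meta, dict):
        # A single dictionary is copied onto every produced Document.
        return [dict(meta) for _ in range(n_paths)]
    if has_directories:
        raise ValueError("meta must be a single dictionary when paths contain directories")
    if len(meta) != n_paths:
        raise ValueError("the length of meta must match the number of paths")
    # A list is matched with paths position by position.
    return [dict(m) for m in meta]


shared = align_meta({"source": "upload"}, n_paths=2)
per_path = align_meta([{"lang": "en"}, {"lang": "de"}], n_paths=2)

print(shared)    # → [{'source': 'upload'}, {'source': 'upload'}]
print(per_path)  # → [{'lang': 'en'}, {'lang': 'de'}]
```

A list cannot be combined with directories because the number of files inside a directory is not known from paths alone, so there is nothing unambiguous to pair the list entries with.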