DeepsetFileDownloader
Download files referenced by documents to the local filesystem. Use this component in visual question answering pipelines to fetch PDFs and images before conversion.
This component is deprecated. Use S3Downloader from the Haystack AWS integration instead. Existing pipelines that use this component continue to work for now.
Key Features
- Downloads files associated with documents from platform storage to local disk.
- Filters downloads by file extension.
- Caches downloaded files locally with a configurable cache size.
- Returns updated documents with file_path set in metadata.
- Supports concurrent downloads for multiple files.
Configuration
- Drag the DeepsetFileDownloader component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Set file_extensions to limit which file types are downloaded.
- Configure sources_target_type and max_cache_size as needed.
This component requires a warm-up step before it can download files. The platform handles this automatically when the pipeline runs.
Connections
DeepsetFileDownloader accepts documents or sources as input. It outputs documents with file_path metadata and sources in the configured target type.
Connect a Ranker or Retriever that returns documents with file_id metadata to the input. Then connect the documents or sources output to DeepsetPDFDocumentToBase64Image, DeepsetFileToBase64Image, or a visual Generator.
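As a rough sketch of the wiring described above, the connections section of a pipeline might look like the following. The component names `ranker`, `image_downloader`, and `pdf_to_image` are illustrative assumptions. The `ranker.documents` output socket is also assumed, so check the socket names of the component you actually use.

```yaml
connections:
  # Hypothetical upstream Ranker that returns documents with file_id in metadata
  - sender: ranker.documents
    receiver: image_downloader.documents
  # Downloaded documents, now carrying file_path metadata, go on to the converter
  - sender: image_downloader.documents
    receiver: pdf_to_image.documents
```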
Usage Example
This example downloads PDF files before converting them to images:
components:
  image_downloader:
    type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
    init_parameters:
      file_extensions:
        - .pdf
      sources_target_type: str
      max_cache_size: 100
  pdf_to_image:
    type: deepset_cloud_custom_nodes.converters.pdf_to_image.DeepsetPDFDocumentToBase64Image
    init_parameters:
      detail: auto
connections:
  - sender: image_downloader.documents
    receiver: pdf_to_image.documents
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | Optional[List[Document]] | None | Documents with file_id in metadata to download. |
| sources | Optional[List[Union[ByteStream, UUID, str]]] | None | File IDs or byte streams to download. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents with file_path set in metadata. |
| sources | List[Union[str, Path, ByteStream]] | | Downloaded file paths or byte streams in the configured target type. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_extensions | Optional[List[str]] | None | File extensions to download, such as [".pdf", ".png"]. If None, all file types are downloaded. |
| sources_target_type | Literal["str", "pathlib.Path", "haystack.dataclasses.ByteStream"] | str | Type of the sources returned in the output. |
| max_cache_size | int | 100 | Maximum number of files to keep in the local cache. |
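To illustrate how these parameters map to pipeline YAML, here is a minimal sketch that sets all three. The values are examples only, not recommendations: the component name `image_downloader` and the cache size of 50 are assumptions, and `pathlib.Path` is one of the allowed `sources_target_type` values listed in the table.

```yaml
components:
  image_downloader:
    type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
    init_parameters:
      # Download only PDF and PNG files; omit this to download all file types
      file_extensions:
        - .pdf
        - .png
      # Return the sources output as pathlib.Path instead of the default str
      sources_target_type: pathlib.Path
      # Illustrative value: keep at most 50 files in the local cache
      max_cache_size: 50
```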
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | Optional[List[Document]] | None | Documents with file_id in metadata to download. |
| sources | Optional[List[Union[ByteStream, UUID, str]]] | None | File IDs or byte streams to download. |