S3Downloader
Download files from AWS S3 buckets to the local filesystem. Use this component in indexes when your documents reference files stored in S3 that need to be processed locally.
Key Features
- Downloads files from Amazon S3 buckets for pipeline processing.
- Supports concurrent downloads with a configurable number of workers.
- Filters by file extension so that only specific file types are downloaded.
- Caches downloaded files to improve performance on repeated runs.
- Supports custom S3 key generation functions for flexible file path mapping.
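The extension filtering and concurrent download features listed above can be pictured with a minimal sketch. This is not the component's implementation: `select_by_extension`, `download_all`, and `fake_fetch` are hypothetical names, the fetch step is a stub standing in for a real S3 call, and the case-insensitive suffix match is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath
from typing import Callable, Dict, List, Optional


def select_by_extension(names: List[str], extensions: Optional[List[str]]) -> List[str]:
    """Keep only names whose suffix is in `extensions`; None allows everything."""
    if extensions is None:
        return list(names)
    # Assumption: suffixes are compared case-insensitively.
    allowed = {ext.lower() for ext in extensions}
    return [n for n in names if PurePosixPath(n).suffix.lower() in allowed]


def download_all(
    names: List[str], fetch: Callable[[str], str], max_workers: int = 32
) -> Dict[str, str]:
    """Run `fetch` for every name on a thread pool; map each name to its local path."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each name with its own result.
        return dict(zip(names, pool.map(fetch, names)))


def fake_fetch(name: str) -> str:
    """Hypothetical stand-in for an S3 download; returns the would-be local path."""
    return f"/tmp/downloads/{name}"


wanted = select_by_extension(["a.pdf", "b.png", "c.TXT"], [".pdf", ".txt"])
local_paths = download_all(wanted, fake_fetch, max_workers=4)
```

In this sketch, `max_workers` plays the same role as the component parameter of that name: it caps how many downloads run at once.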
Configuration
To use this component, you need AWS credentials. Connect Haystack Platform to your AWS account by creating secrets with the following keys: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION.
For details on how to create secrets, see Add Secrets.
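If you also run the component outside the platform, the same three keys are typically read from environment variables. The following is a small, hypothetical pre-flight check (the helper name `missing_aws_keys` is not part of any library) that reports which of them are unset:

```python
import os
from typing import List, Mapping

# The three secret keys named in this section.
REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required keys that are unset or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_aws_keys()
    if missing:
        print("Missing AWS settings:", ", ".join(missing))
```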
- Drag the S3Downloader component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the parameters as needed. You can set the AWS credentials, file root path, file extensions, and other parameters directly or through environment variables.
Connections
S3Downloader accepts a list of documents with S3 file references in their metadata as input. It outputs a list of documents with the file_path metadata updated to point to the downloaded local files.
Connect a converter like MultiFileConverter to the documents input to provide file references. Connect the documents output to a cleaner or splitter for further processing.
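To make the input and output shapes concrete, here is a rough sketch that models documents as plain dictionaries. The real component works on Haystack `Document` objects; the helper `attach_local_paths` and the `<root>/<file_name>` layout of the local path are assumptions made only for illustration.

```python
from typing import Any, Dict, List


def attach_local_paths(
    documents: List[Dict[str, Any]],
    file_root_path: str,
    file_name_meta_key: str = "file_name",
) -> List[Dict[str, Any]]:
    """Return copies of the documents with meta['file_path'] set to a local path."""
    updated = []
    for doc in documents:
        meta = dict(doc["meta"])  # copy so the input documents are not mutated
        # Assumption: the local file lives at <file_root_path>/<file name>.
        meta["file_path"] = f"{file_root_path.rstrip('/')}/{meta[file_name_meta_key]}"
        updated.append({**doc, "meta": meta})
    return updated


docs_in = [{"content": "", "meta": {"file_name": "report.pdf"}}]
docs_out = attach_local_paths(docs_in, "/tmp/s3_files")
```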
Usage Example
This is an example indexing pipeline with S3Downloader to download and process files from S3:
components:
  s3_downloader:
    type: haystack_integrations.components.downloaders.s3.s3_downloader.S3Downloader
    init_parameters:
      aws_access_key_id:
        type: env_var
        env_vars:
          - AWS_ACCESS_KEY_ID
        strict: false
      aws_secret_access_key:
        type: env_var
        env_vars:
          - AWS_SECRET_ACCESS_KEY
        strict: false
      aws_session_token:
        type: env_var
        env_vars:
          - AWS_SESSION_TOKEN
        strict: false
      aws_region_name:
        type: env_var
        env_vars:
          - AWS_DEFAULT_REGION
        strict: false
      aws_profile_name:
        type: env_var
        env_vars:
          - AWS_PROFILE
        strict: false
      boto3_config:
      file_root_path:
      file_extensions:
        - .pdf
        - .txt
        - .docx
        - .html
      file_name_meta_key: file_name
      max_workers: 32
      max_cache_size: 100
      s3_key_generation_function: deepset_cloud_custom_nodes.utils.storage.get_s3_key
  converter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
  cleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
      keep_id: false
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
      split_threshold: 0
  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'default'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE
connections:
  - sender: cleaner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
  - sender: converter.documents
    receiver: s3_downloader.documents
  - sender: s3_downloader.documents
    receiver: cleaner.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - converter.sources
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents with S3 file references in their metadata. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents with the file_path metadata updated to point to the downloaded local files. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| aws_access_key_id | Optional[Secret] | Secret.from_env_var('AWS_ACCESS_KEY_ID') | AWS access key ID. |
| aws_secret_access_key | Optional[Secret] | Secret.from_env_var('AWS_SECRET_ACCESS_KEY') | AWS secret access key. |
| aws_session_token | Optional[Secret] | Secret.from_env_var('AWS_SESSION_TOKEN') | AWS session token for temporary credentials. |
| aws_region_name | Optional[Secret] | Secret.from_env_var('AWS_DEFAULT_REGION') | AWS region name. |
| aws_profile_name | Optional[Secret] | Secret.from_env_var('AWS_PROFILE') | AWS profile name. |
| boto3_config | Optional[Dict[str, Any]] | None | Configuration for the boto3 client. |
| file_root_path | Optional[str] | None | The path where files are downloaded. Can be set through this parameter or the FILE_ROOT_PATH environment variable. If the specified directory doesn't exist, it's created. |
| file_extensions | Optional[List[str]] | None | File extensions permitted for download. By default, all file extensions are allowed. |
| file_name_meta_key | str | "file_name" | The metadata key that contains the file name to download. |
| max_workers | int | 32 | Maximum number of workers for concurrent downloads. |
| max_cache_size | int | 100 | Maximum number of files to cache. |
| s3_key_generation_function | Optional[Callable] | None | A function to generate S3 keys from documents. |
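As an illustration of `s3_key_generation_function`, the sketch below builds an S3 key from document metadata. It assumes the function receives a single document and returns the key as a string, which the table above does not spell out. `Doc` is a hypothetical stand-in for a Haystack `Document`, and the `tenant` metadata key is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document; only `meta` is modeled."""
    meta: Dict[str, Any] = field(default_factory=dict)


def tenant_prefixed_key(document: Doc) -> str:
    """Build an S3 key of the form '<tenant>/<file_name>' from the metadata."""
    # "tenant" is an illustrative meta key; documents without it fall back to "shared".
    tenant = document.meta.get("tenant", "shared")
    return f"{tenant}/{document.meta['file_name']}"


key = tenant_prefixed_key(Doc(meta={"tenant": "acme", "file_name": "report.pdf"}))
```

A function like this is useful when the object key in the bucket differs from the plain file name stored under `file_name_meta_key`.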
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents with S3 file references in their metadata. |