S3Downloader

Download files from AWS S3 buckets to the local filesystem.

Basic Information

Type: haystack_integrations.components.downloaders.s3.s3_downloader.S3Downloader
Components it can connect with:
- MultiFileConverter: S3Downloader can send downloaded file paths to a converter.

Inputs

Parameter	Type	Default	Description
documents	List[Document]		A list of documents with S3 file references in their metadata.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Documents with the `file_path` metadata updated to point to the downloaded local files.

Overview

Use S3Downloader to download files from Amazon S3 buckets in your pipeline. This component is useful when your documents reference files stored in S3 that need to be processed locally.

The component supports concurrent downloads and file caching to improve performance.
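
To make the effect of `max_workers` and `max_cache_size` more concrete, here is a minimal Python sketch of the general pattern: a thread pool fetches files in parallel, and a bounded cache avoids fetching the same key twice. This is not the component's actual implementation. `BoundedFileCache`, `download_all`, and the `fetch` callable are hypothetical names, and the least-recently-used eviction policy is an assumption.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class BoundedFileCache:
    """Hypothetical cache that keeps at most max_size entries (LRU eviction)."""

    def __init__(self, max_size: int = 100) -> None:
        self.max_size = max_size
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: str, local_path: str) -> None:
        self._entries[key] = local_path
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._entries)


def download_all(
    keys: List[str],
    fetch: Callable[[str], str],
    cache: BoundedFileCache,
    max_workers: int = 32,
) -> Dict[str, str]:
    """Return a mapping of key -> local path, fetching only uncached keys."""
    results: Dict[str, str] = {}
    missing: List[str] = []
    # dict.fromkeys removes duplicate keys while preserving order
    for key in dict.fromkeys(keys):
        cached = cache.get(key)
        if cached is not None:
            results[key] = cached
        else:
            missing.append(key)
    # Fetch the uncached keys in parallel; the cache is only touched in this thread
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for key, local_path in zip(missing, pool.map(fetch, missing)):
            cache.put(key, local_path)
            results[key] = local_path
    return results
```

In this sketch, a larger `max_workers` allows more simultaneous fetches, while `max_cache_size` caps how many downloaded files are remembered between calls.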

File Extension Filtering

Use the `file_extensions` parameter to download only specific file types, which avoids unnecessary downloads and reduces processing time. For example, `file_extensions=[".pdf", ".txt"]` downloads only PDF and TXT files and skips all others.
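
The filtering idea can be sketched in a few lines of Python. This is an illustration only, not the component's code: `filter_by_extension` is a hypothetical helper, and case-insensitive matching is an assumption of the sketch.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def filter_by_extension(
    file_names: Iterable[str], file_extensions: Optional[List[str]]
) -> List[str]:
    """Keep only the file names whose suffix is listed in file_extensions.

    None means "no filtering", mirroring the documented default.
    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    if file_extensions is None:
        return list(file_names)
    allowed = {ext.lower() for ext in file_extensions}
    return [name for name in file_names if Path(name).suffix.lower() in allowed]


# Only the PDF and TXT files pass the filter
selected = filter_by_extension(
    ["report.pdf", "notes.TXT", "photo.png"], [".pdf", ".txt"]
)
```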

Custom S3 Key Generation

By default, the component uses the file name from Document metadata as the S3 key. It reads the name from the `file_name` metadata key, which you can change with `file_name_meta_key`. If your S3 file structure doesn't match the file names in metadata, provide an optional `s3_key_generation_function` to customize how S3 keys are generated from Document metadata.
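
As an illustration, a key generation function might look like the sketch below. It assumes the function receives a single Document and returns the S3 key as a string. The `Document` class here is a stand-in so the example runs without Haystack installed, and `tenant` and `folder` are hypothetical metadata keys.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Stand-in for the Haystack Document class; only the meta field is modeled."""

    meta: Dict[str, Any] = field(default_factory=dict)


def tenant_prefixed_key(document: Document) -> str:
    """Build an S3 key such as 'acme/reports/q1.pdf' from document metadata.

    'tenant' and 'folder' are hypothetical metadata keys used for illustration.
    """
    meta = document.meta
    parts = [meta.get("tenant"), meta.get("folder"), meta["file_name"]]
    # Skip missing or empty parts and trim stray slashes before joining
    return "/".join(str(part).strip("/") for part in parts if part)


doc = Document(meta={"tenant": "acme", "folder": "reports/", "file_name": "q1.pdf"})
key = tenant_prefixed_key(doc)
```

In a pipeline YAML, such a function would be referenced by its import path, as the `s3_key_generation_function` entry in the usage example does.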

Authorization

You need AWS credentials to access S3 buckets. To connect Haystack Platform with your AWS account, create secrets with the following keys:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION

For details on how to create secrets, see Add Secrets.
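
When testing outside the platform, it can help to check that these values are present before running a pipeline. The sketch below assumes the secrets are exposed as environment variables with the same names, as the `env_var` entries in the usage example suggest. `missing_aws_vars` is a hypothetical helper, not part of any library.

```python
import os
from typing import List, Mapping

REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required AWS variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_DEFAULT_REGION": "eu-central-1"}
missing = missing_aws_vars(example_env)
```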

Usage Example

The following example indexing pipeline uses S3Downloader to download files from S3 and process them:

components:
  s3_downloader:
    type: haystack_integrations.components.downloaders.s3.s3_downloader.S3Downloader
    init_parameters:
      aws_access_key_id:
        type: env_var
        env_vars:
        - AWS_ACCESS_KEY_ID
        strict: false
      aws_secret_access_key:
        type: env_var
        env_vars:
        - AWS_SECRET_ACCESS_KEY
        strict: false
      aws_session_token:
        type: env_var
        env_vars:
        - AWS_SESSION_TOKEN
        strict: false
      aws_region_name:
        type: env_var
        env_vars:
        - AWS_DEFAULT_REGION
        strict: false
      aws_profile_name:
        type: env_var
        env_vars:
        - AWS_PROFILE
        strict: false
      boto3_config:
      file_root_path:
      file_extensions:
      - .pdf
      - .txt
      - .docx
      - .html
      file_name_meta_key: file_name
      max_workers: 32
      max_cache_size: 100
      s3_key_generation_function: deepset_cloud_custom_nodes.utils.storage.get_s3_key

  converter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8

  cleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
      keep_id: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
      split_threshold: 0

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      model: intfloat/e5-base-v2
      normalize_embeddings: true

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'default'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

connections:
- sender: cleaner.documents
  receiver: splitter.documents
- sender: splitter.documents
  receiver: document_embedder.documents
- sender: document_embedder.documents
  receiver: writer.documents
- sender: converter.documents
  receiver: s3_downloader.documents
- sender: s3_downloader.documents
  receiver: cleaner.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
  - converter.sources

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
aws_access_key_id	Optional[Secret]	Secret.from_env_var('AWS_ACCESS_KEY_ID')	AWS access key ID.
aws_secret_access_key	Optional[Secret]	Secret.from_env_var('AWS_SECRET_ACCESS_KEY')	AWS secret access key.
aws_session_token	Optional[Secret]	Secret.from_env_var('AWS_SESSION_TOKEN')	AWS session token for temporary credentials.
aws_region_name	Optional[Secret]	Secret.from_env_var('AWS_DEFAULT_REGION')	AWS region name.
aws_profile_name	Optional[Secret]	Secret.from_env_var('AWS_PROFILE')	AWS profile name.
boto3_config	Optional[Dict[str, Any]]	None	Configuration for the boto3 client.
file_root_path	Optional[str]	None	The path where files are downloaded. Can be set through this parameter or the `FILE_ROOT_PATH` environment variable. If the specified directory doesn't exist, it's created.
file_extensions	Optional[List[str]]	None	File extensions permitted for download. By default, all file extensions are allowed.
file_name_meta_key	str	"file_name"	The metadata key that contains the file name to download.
max_workers	int	32	Maximum number of workers for concurrent downloads.
max_cache_size	int	100	Maximum number of files to cache.
s3_key_generation_function	Optional[Callable]	None	A function to generate S3 keys from documents.

Run Method Parameters

These are the parameters you can configure for the component's run() method. You can pass these parameters at query time through the API, in Playground, or when running a job.

Parameter	Type	Default	Description
documents	List[Document]		A list of documents with S3 file references in their metadata.
