Upload Files

Upload your data to deepset Cloud. The files you upload are then turned into documents and indexed when you deploy your pipeline. The files must be in TXT or PDF format.

📘

Things to note:

  • You must be an Admin to perform this task.
  • Currently, deepset Cloud supports files encoded in UTF-8 format. For files with different encoding formats, you may experience issues with display or formatting.
  • A single file cannot be larger than 200 MB. If a file exceeds this size, it won't be uploaded. You can use the deepset Cloud SDK to upload larger files.
  • Whenever you add a file, it's preprocessed (indexed) by all deployed pipelines.
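The format, size, and encoding notes above can be checked locally before you upload. The following is a minimal sketch, not part of deepset Cloud or its SDK: `check_file` is a hypothetical helper, and it assumes that "200 MB" means 200 * 1024 * 1024 bytes. It does not cover `.meta.json` metadata files.

```python
from pathlib import Path
from typing import List

# Assumption: "200 MB" is read as 200 * 1024 * 1024 bytes; adjust if the
# platform counts megabytes differently.
MAX_FILE_SIZE_BYTES = 200 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".txt", ".pdf"}


def check_file(path: Path) -> List[str]:
    """Return the problems found for one file (an empty list means OK)."""
    problems = []
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension '{path.suffix}'")
    if path.stat().st_size > MAX_FILE_SIZE_BYTES:
        problems.append("larger than 200 MB")
    if suffix == ".txt":
        # Text files are expected to be UTF-8 encoded.
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("not valid UTF-8")
    return problems
```

You could run `check_file` over every path in your upload folder and skip or convert the files it flags.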

Prerequisites

  • Make sure you have a workspace to upload your files to. You can easily create one if needed:

      1. In deepset Cloud, click the workspace name in the upper left corner.

      2. Type the name of the workspace you want to create. You can create up to 10 workspaces.

        The workspace list expanded. At the bottom of the list there's an empty field where you can type the name of the new workspace and the Create button.

Synchronous and Asynchronous Upload

There are two ways in which you can upload your files: synchronous and asynchronous.

Synchronous upload:

  • Happens immediately, and you get direct feedback.
  • Is available through UI or an API endpoint.
  • Is relatively slow and not recommended for large numbers of files.

Asynchronous upload:

  • Uses sessions. You create a session and upload files to this session. Each session has an ID and you can check its status at any time. A session expires after 24 hours.
  • Is recommended for large numbers of files. The suggested limit is 50,000 files per session.
  • Is faster than the synchronous upload, but it can take some time until the files are listed in deepset Cloud after they're uploaded. This means if you have deployed a pipeline, you may need to wait a while for it to run on the newly uploaded files.
  • Is available through API endpoints and SDK.

Choosing the Best Method

If you just have a few files to upload and don't need to add metadata to them, choose: Synchronous upload from the UI.

If you:

  • Have more files to upload
  • Want to add metadata to your files
  • Need direct feedback about your upload
  • Don't mind using a slower method

choose: Synchronous upload with a REST API endpoint.

If you:

  • Need to upload fast
  • Want to add metadata to your files
  • Have a lot of files to upload
  • Don't mind waiting a while until your files are indexed

choose: Asynchronous upload.

Metadata

You can add metadata to your files if you upload them with the SDK or REST API. These metadata act as search filters at query time. To learn more, see Add Search Filters.

If you're uploading files with metadata, the resulting Documents inherit the metadata from your files.

Upload Asynchronously

Use REST API endpoints or the SDK to create a session and upload your files.

There's no limit on the session size, but we recommend uploading no more than 50,000 files per session. When the upload finishes and the session is closed, you can check the status of your files and see which ones were uploaded and which ones failed.
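If you have more files than the recommended maximum, one option is to split them into batches and open one session per batch. Here's a minimal sketch of the splitting step only. `batch_for_sessions` is a hypothetical helper, not an SDK function, and it doesn't try to keep a file and its metadata file in the same batch.

```python
from typing import Iterator, List, Sequence

MAX_FILES_PER_SESSION = 50_000  # recommended maximum per session


def batch_for_sessions(
    paths: Sequence[str], batch_size: int = MAX_FILES_PER_SESSION
) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])
```

Each yielded batch can then be uploaded in its own session.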

Preparing file metadata

To add metadata to your files, create one metadata file for each TXT or PDF file you upload. The metadata file must be a JSON file with the same name as the file whose metadata it contains and the extension meta.json.

For example, if you're uploading a file called example.txt, the metadata file should be called example.txt.meta.json. If you're uploading a file called example.pdf, the metadata file should be example.pdf.meta.json.

Here's the format of metadata in your *.meta.json files: {"meta_key1": "value1", "meta_key2": "value2"}.
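As an illustration of the naming rule, here's a small sketch that writes a metadata file next to a file you plan to upload. `write_meta_file` is a hypothetical helper and the keys are placeholders; only the `<file name>.meta.json` naming and the flat JSON format come from this page.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_meta_file(file_path: Path, meta: Dict[str, Any]) -> Path:
    """Write `meta` as JSON to '<file name>.meta.json' in the same folder."""
    # example.txt -> example.txt.meta.json (the original extension is kept)
    meta_path = file_path.with_name(file_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta), encoding="utf-8")
    return meta_path
```

For example, `write_meta_file(Path("example.pdf"), {"meta_key1": "value1"})` creates `example.pdf.meta.json` in the same folder.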

Here's how the asynchronous upload works:

Upload with the REST API

First, you send a request to open a session. You get a URL and an authentication configuration as a response. You use the URL to upload your files. Then, you close the session. This is when your files are sent to deepset Cloud, and the upload is finished.

The upload starts after you close a session. This means you can use an open session to add more uploads. If you don't close your session, it's automatically closed after 24 hours.

  1. Generate an API Key. You need this to connect to deepset Cloud.
  2. (Optional) Prepare your metadata files. See Add Search Filters Through Metadata.
  3. Use the Create Upload Session API endpoint to open a session.
  4. Send your files to the URL you received in the response, using the authentication configuration from the response. You can send both your metadata and raw files to this URL.
  5. When you've sent all your files, close the session to start the upload.
  6. Wait a while until your files are listed in deepset Cloud.

An example script to upload files in a session

In this scenario, you save the files locally first:

from dataclasses import dataclass
import os
from pathlib import Path
import time
from typing import Any, Dict, List, Tuple, Optional
from uuid import UUID
import asyncio
import aiofiles # for async file reading
import httpx
import structlog # for nice logs
from tqdm.asyncio import tqdm # to visualise progress of async task completion

DC_ENDPOINT = "https://api.cloud.deepset.ai/api/v1"
WORKSPACE_NAME="<your workspace>"
API_TOKEN="<your_token>"
log = structlog.get_logger(__name__)

def create_session() -> Tuple[UUID, Dict[Any, Any]]:
    response = httpx.post(
        f"{DC_ENDPOINT}/workspaces/{WORKSPACE_NAME}/upload_sessions",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={},
        timeout=120,
    )
    assert response.status_code == 201
    return (
        UUID(response.json()["session_id"]),
        response.json()["aws_prefixed_request_config"],
    )


def close_session(session_id: UUID) -> None:
    response = httpx.put(
        f"{DC_ENDPOINT}/workspaces/{WORKSPACE_NAME}/upload_sessions/{session_id}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"status": "CLOSED"},
        timeout=120,
    )
    assert response.status_code == 204, f"status code should be '204', got '{response.status_code}' with content '{response.text}'"


@dataclass
class IngestionStatus:
    finished_files: int
    failed_files: int

def get_session_status(session_id: UUID) -> IngestionStatus:
    response = httpx.get(
        f"{DC_ENDPOINT}/workspaces/{WORKSPACE_NAME}/upload_sessions/{session_id}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=120,
    )
    assert response.status_code == 200
    response_body = response.json()
    print(response_body)
    return IngestionStatus(
        finished_files=response_body["ingestion_status"]["finished_files"],
        failed_files=response_body["ingestion_status"]["failed_files"],
    )

@dataclass
class UploadFileResult:
    file: str
    exception: Optional[Exception] = None

async def upload_file_to_s3(
    file_path: str,
    aws_prefixed_request_config: Dict[Any, Any],
    semaphore: asyncio.BoundedSemaphore,
    concurrency=10,
) -> UploadFileResult:
    # resolve the name up front so it's also available if opening the file fails
    file_name = os.path.basename(file_path)
    # upload file asynchronously using prefixed request config
    async with semaphore:
        async with httpx.AsyncClient(
            limits=httpx.Limits(
                max_keepalive_connections=concurrency, max_connections=concurrency
            )
        ) as client:
            try:
                async with aiofiles.open(file_path, "rb") as file:
                    content = await file.read()
                    response = await client.post(
                        aws_prefixed_request_config["url"],
                        data=aws_prefixed_request_config["fields"],
                        files={"file": (file_name, content)},
                        timeout=2000,
                    )
                    assert response.status_code == 204, f"status code should be '204', got '{response.status_code}'"
            except Exception as exc:
                # report the failure instead of raising so other uploads continue
                return UploadFileResult(file=file_name, exception=exc)

    return UploadFileResult(file=file_name)

def wait_for_files_to_be_ingested(file_paths, exceptions, session_id):
    total_uploaded_files = len(file_paths) - len(exceptions)
    total_processed_files = 0
    while total_processed_files < total_uploaded_files:
        session_status = get_session_status(session_id)
        log.info(
            "Polling status",
            failed_files=session_status.failed_files,
            finished_files=session_status.finished_files,
        )
        total_processed_files = (
            session_status.failed_files + session_status.finished_files
        )
        time.sleep(3)


#### the main flow ####
async def main():
    session_id, aws_prefixed_request_config = create_session()

    # get a list of files as below, alternatively give a list of explicit paths
    # (directories are skipped because only files can be uploaded)
    file_paths = [p for p in Path('./path/to/data/dir').glob('*') if p.is_file()]
    concurrency = 10
    semaphore = asyncio.BoundedSemaphore(concurrency)
    tasks = []
    for file_path in file_paths:
        # upload files
        tasks.append(
            upload_file_to_s3(file_path, aws_prefixed_request_config, semaphore=semaphore, concurrency=concurrency)
        )

    results:List[UploadFileResult] = await tqdm.gather(*tasks)

    exceptions = [r for r in results if r.exception is not None]

    log.info(
        "files uploaded",
        successful=len(results) - len(exceptions),
        failed=len(exceptions),
    )

    if len(exceptions):
        log.warning("upload exceptions", exceptions=exceptions)

    # close session once you are done with uploading
    # the ingestion will start once the session is closed
    close_session(session_id)
    
    # wait for files to be ingested into deepsetCloud
    wait_for_files_to_be_ingested(file_paths, exceptions, session_id)

if __name__ == '__main__':
    asyncio.run(main())

You can check the status of your session with the Get Session Status API endpoint.
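The polling loop in the script above runs until every uploaded file is reported as finished or failed, so it never returns if a file is never reported. Here's a hedged sketch of a variant with a timeout. `wait_until_processed` and its parameters are hypothetical names, and the status lookup is passed in as a function so that the sketch stays independent of the API.

```python
import time
from typing import Callable, Tuple


def wait_until_processed(
    get_counts: Callable[[], Tuple[int, int]],
    expected_files: int,
    timeout_s: float = 300.0,
    poll_interval_s: float = 3.0,
) -> Tuple[int, int]:
    """Poll `get_counts` (returns finished, failed) until all files are processed.

    Raises TimeoutError if the counts don't reach `expected_files` in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        finished, failed = get_counts()
        if finished + failed >= expected_files:
            return finished, failed
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {finished + failed} of {expected_files} files processed "
                f"after {timeout_s} seconds"
            )
        time.sleep(poll_interval_s)
```

With the script above, `get_counts` could wrap `get_session_status(session_id)` and return its `finished_files` and `failed_files` values.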

Upload with the SDK

This method is best if you have many files to upload. It uses sessions under the hood and makes it possible to add metadata to your files. The easiest way to upload files is to use our open source deepset Cloud SDK package. It comes with a command-line interface (CLI), a set of examples, and documentation to help you get started.

  1. Install the SDK.
  2. Generate an API Key for accessing deepset Cloud.
  3. (Optional) Prepare your metadata files. See Add Search Filters Through Metadata.
  4. Choose how you want to upload your files and run an appropriate command:
  • To upload files using the SDK CLI:

      1. Log in to the SDK. This creates an .ENV file with your API key for accessing deepset Cloud and the default workspace for all operations.
      2. Run the following command to upload the files:
        deepset-cloud upload <path_to_upload_folder> 
        
        python -m deepset_cloud_sdk.cli upload <path_to_upload_folder> 
        
  • To upload files from a folder using a Python method, create a script and run it. You can pass the API key and default workspace directly in the upload() method.
    Here's an example script you can use:

    from pathlib import Path
    
    from deepset_cloud_sdk.workflows.sync_client.files import upload
    
    ## Uploads all files from a given path.
    upload(
        paths=[Path("<your_path_to_the_upload_folder>")],
        api_key="<deepsetCloud_API_key>",
        workspace_name="<default_workspace>",
        blocking=True,  # waits until the files are displayed in deepset Cloud,
        # this may take a couple of minutes
        timeout_s=300,  # the timeout for the `blocking` parameter in number of seconds
        show_progress=True,  # shows the progress bar
        recursive=True,  # uploads files from all subfolders as well
    )
    
    

Also check the tutorials that explain how to upload files step-by-step: Tutorial: Uploading Files with CLI and Tutorial: Uploading Files with Python Methods.

An example script for uploading with the SDK without saving the files locally

from typing import List
from deepset_cloud_sdk.workflows.sync_client.files import upload, upload_texts, WriteMode, DeepsetCloudFile
import httpx
from urllib.parse import urlparse

api_key="<YOUR-API-KEY>"

# Update this to create names that fit your use case
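def parse_filename(url):
    # use the last path segment of the URL as the file name
    name = urlparse(url).path.rsplit("/", 1)[-1]
    # take the part after the last dot; empty if the name has no dot
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""

    # deepset cloud only accepts .txt and .pdf files
    if extension not in ["txt", "pdf"]:
        name = f"{name}.txt"
    return name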


def get_files(urls):
    upload_files: List[DeepsetCloudFile] = []
    for url in urls:
        response = httpx.get(url)

        # example metadata
        meta = {
            "url": url,
        }

        # you need to add the download content here
        file = DeepsetCloudFile(
            text=response.content, # the type of response.content is 'bytes'
            name=parse_filename(url),
            meta=meta
        )
        upload_files.append(file)

    return upload_files

# some text files, a markdown file, and a pdf file
download_urls=[
    'https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt',
    'https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt',
    'https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md',
    'https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf'
    ]

files = get_files(download_urls)

# A misleading method name, but you can use this to upload pdf and txt files
upload_texts(
    workspace_name="upload-test-123",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
    files=files,
    # by default blocking=True - by setting to False it will mean that you can immediately
    # continue uploading another batch of files
    # You can still validate they are uploaded via the get_upload_session function in the SDK if required, but it may take a
    # few minutes.
    blocking=False,
    timeout_s=300,  # optional, by default 300
    show_progress=True,  # optional, by default True
    api_key=api_key,
    write_mode=WriteMode.OVERWRITE
)

Upload Synchronously

🚧

Not recommended for a large number of files

If you have more than a few hundred files to upload, we recommend using the Python SDK or REST API. It's faster and more stable.

Choose the best option for you:

Upload from the UI

  1. In deepset Cloud, go to Files > Upload Files.
  2. Drag your files and folders to deepset Cloud. You can upload PDF and TXT files.
  3. Click Upload. It may take a while for the files to be processed and displayed on the Files page.

Upload with the REST API

Here's the request that you can send to upload your files. For more information, you can also see the upload file endpoint documentation. You need to Generate an API Key first.

Here's a copyable example of two requests: one that uploads a file you already have, and one that creates the file from text in the same request:

# This is an example request to send when you're uploading a file:

curl --request POST \
     --url https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE_NAME>/files \
     --header 'accept: application/json' \
     --header 'authorization: Bearer <YOUR_API_KEY>' \
     --header 'content-type: multipart/form-data' \
     --form 'meta={"key1":"value1", "key2":"value2"}' \
     --form file=@<YOUR_FILE.PDF>
     
# This is an example request if you're creating the file during upload:
curl --request POST \
     --url 'https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE_NAME>/files?file_name=myFile.txt' \
     --header 'accept: application/json' \
     --header 'authorization: Bearer <YOUR_API_KEY>' \
     --header 'content-type: multipart/form-data' \
     --form 'meta={"key1":"value1", "key2":"value2"}' \
     --form 'text=This is the file text'
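
If you prefer Python over curl, the first request (uploading a file you already have) could be sketched with httpx roughly as follows. The function names are hypothetical, and the URL, headers, and form fields simply mirror the curl example above; treat this as a starting point and check the upload file endpoint documentation for the authoritative request format.

```python
import json
from pathlib import Path
from typing import Any, Dict

API_URL = "https://api.cloud.deepset.ai/api/v1"


def build_upload_request(workspace: str, api_key: str, meta: Dict[str, Any]) -> Dict[str, Any]:
    """Build the URL, headers, and form fields used by the curl example."""
    return {
        "url": f"{API_URL}/workspaces/{workspace}/files",
        "headers": {
            "accept": "application/json",
            "authorization": f"Bearer {api_key}",
        },
        # `meta` travels as a JSON string in a form field, as in the curl example
        "data": {"meta": json.dumps(meta)},
    }


def upload_file(path: Path, workspace: str, api_key: str, meta: Dict[str, Any]):
    """Send one file as multipart/form-data and return the HTTP response."""
    import httpx  # third-party; imported here so the builder has no dependencies

    request = build_upload_request(workspace, api_key, meta)
    with path.open("rb") as handle:
        # httpx sets the multipart content type and boundary itself
        response = httpx.post(
            request["url"],
            headers=request["headers"],
            data=request["data"],
            files={"file": (path.name, handle)},
            timeout=120,
        )
    response.raise_for_status()
    return response
```

The content-type header is left out on purpose: when `files` is passed, httpx generates the multipart header together with its boundary.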