Upload Files

Upload your data to deepset Cloud. The files you upload are turned into documents and indexed when you deploy your pipeline. For the formats you can upload, see Supported File Types below.

📘

Things to note:

  • You must be an Admin to perform this task.
  • Currently, deepset Cloud supports files encoded in UTF-8 format. For files with different encoding formats, you may experience issues with display or formatting.
  • A single file cannot be larger than 200 MB. If a file exceeds this size, it won't be uploaded. You can use the deepset Cloud SDK to upload larger files.
  • Whenever you add a file, it's preprocessed (indexed) by all deployed pipelines.

About This Task

Supported File Types

You can upload files of the following types:

  • TXT
  • PDF
  • DOCX
  • PPTX
  • XLSX
  • XML
  • CSV
  • HTML
  • MD
  • JSON

To preprocess most of these file types, you can add Converter nodes to your indexing pipeline. For file types without an available converter, we recommend preprocessing the files outside of deepset Cloud, as in the sketch below. To learn more, see PreProcessing Data with Pipeline Nodes.
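For example, here's a minimal sketch of such outside preprocessing, assuming a hypothetical records.json file whose entries you want to index as individual TXT files:

import json
from pathlib import Path

# Hypothetical input: a JSON file containing a list of {"id": ..., "text": ...} records.
records = json.loads(Path("records.json").read_text(encoding="utf-8"))

out_dir = Path("converted")
out_dir.mkdir(exist_ok=True)

# Write each record as a UTF-8 TXT file that deepset Cloud can index directly.
for record in records:
    (out_dir / f"{record['id']}.txt").write_text(record["text"], encoding="utf-8")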

Synchronous and Asynchronous Upload

You can upload your files in two ways: synchronously or asynchronously.

Synchronous upload:

  • Happens immediately, and you get direct feedback.
  • Is available through the UI or an API endpoint.
  • Is relatively slow and not recommended for large numbers of files.

Asynchronous upload:

  • Uses sessions. You create a session and upload files to this session. Each session has an ID and you can check its status at any time. A session expires after 24 hours.
  • Is recommended for large numbers of files. The suggested limit is 50,000 files per session.
  • Is faster than the synchronous upload, but it can take some time until the files are listed in deepset Cloud after they're uploaded. This means if you have deployed a pipeline, you may need to wait a while for it to run on the newly uploaded files.
  • Is available through API endpoints and SDK.

Choosing the Best Method

| If you... | ...choose |
| --- | --- |
| Just have a few files to upload and don't need to add metadata to them | Synchronous upload from the UI. |
| Have more files to upload, want to add metadata to your files, need direct feedback about your upload, and don't mind using a slower method | Synchronous upload with a REST API endpoint. |
| Need to upload fast, want to add metadata to your files, have a lot of files to upload, and don't mind waiting a while until your files are indexed | Asynchronous upload. |

Metadata

You can add metadata to your files by uploading them with the SDK or REST API. This metadata then acts as search filters at query time. To learn more about the metadata format and how to add and use it, see Working with Metadata.

If you're uploading files with metadata, the resulting Documents inherit the metadata from your files.

Prerequisites

  • Make sure you have a workspace to upload your files to. You can easily create one if needed:

      1. In deepset Cloud, click the workspace name in the upper left corner.

      2. Type the name of the workspace you want to create. You can create up to 10 workspaces.

        (Screenshot: the expanded workspace list, with a field at the bottom for typing the new workspace name and a Create button.)

Upload Asynchronously

Use REST API endpoints or the SDK to create a session and upload your files.

There's no limit on the session size, but we recommend uploading at most 50,000 files per session. When the upload finishes and the session is closed, you can check the status of your files and see which ones succeeded and which failed.

Preparing file metadata

To add metadata to your files, create one metadata file for each TXT or PDF file you upload. The metadata file must be a JSON file with the same name as the file whose metadata it contains, followed by the extension .meta.json.

For example, if you're uploading a file called example.txt, the metadata file should be called example.txt.meta.json. If you're uploading a file called example.pdf, the metadata file should be example.pdf.meta.json.

Here's the format of metadata in your *.meta.json files: {"meta_key1": "value1", "meta_key2": "value2"}.
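For instance, here's a minimal sketch that generates a companion .meta.json file for every TXT and PDF file in a directory; the directory path and metadata keys are placeholders:

import json
from pathlib import Path

source_dir = Path("./data")  # hypothetical directory holding your raw files

for file_path in source_dir.glob("*"):
    if file_path.suffix.lower() not in (".txt", ".pdf"):
        continue
    # example.txt -> example.txt.meta.json
    meta_path = file_path.with_name(file_path.name + ".meta.json")
    meta = {"file_name": file_path.name, "category": "demo"}  # placeholder keys
    meta_path.write_text(json.dumps(meta), encoding="utf-8")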

Upload with the REST API

First, you send a request to open a session. You get a URL and an authentication configuration as a response. You use the URL to upload your files. Then, you close the session. This is when your files are sent to deepset Cloud, and the upload is finished.

The upload starts after you close a session. This means you can use an open session to add more uploads. If you don't close your session, it's automatically closed after 24 hours.

  1. Generate an API Key. You need it to connect to deepset Cloud.
  2. (Optional) Prepare your metadata files. See Add Search Filters Through Metadata.
  3. Use the Create Upload Session API endpoint to open a session.
  4. Send your files to the URL you received in the response, using the authentication configuration from the response. You can send both your metadata files and your raw files to this URL.
  5. When you have sent all your files, close the session to start the upload.
  6. Wait a while until your files are listed in deepset Cloud.
An example script to upload files in a session

This scenario assumes the files are stored locally:

from dataclasses import dataclass
import os
from pathlib import Path
import time
from typing import Any, Dict, List, Tuple, Optional
from uuid import UUID
import asyncio
import aiofiles # for async file reading
import httpx
import structlog # for nice logs
from tqdm.asyncio import tqdm # to visualise progress of async task completion

DC_ENDPOINT = "https://api.cloud.deepset.ai/api/v1"
WORKSPACE_NAME = "<your workspace>"
API_TOKEN = "<your_token>"
log = structlog.get_logger(__name__)

def create_session() -> Tuple[UUID, Dict[Any, Any]]:
    response = httpx.post(
        f"{DC_ENDPOINT}/workspaces/{WORKSPACE_NAME}/upload_sessions",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={},
        timeout=120,
    )
    assert response.status_code == 201
    return (
        UUID(response.json()["session_id"]),
        response.json()["aws_prefixed_request_config"],
    )


def close_session(session_id: UUID) -> None:
    response = httpx.put(
        f"{DC_ENDPOINT}/workspaces/{WORKSPACE_NAME}/upload_sessions/{session_id}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"status": "CLOSED"},
        timeout=120,
    )
    assert response.status_code == 204, f"status code should be '204', got '{response.status_code}' with content '{response.text}'"


@dataclass
class IngestionStatus:
    finished_files: int
    failed_files: int

def get_session_status(session_id: UUID) -> IngestionStatus:
    response = httpx.get(
        f"{DC_ENDPOINT}/workspaces/{WORKSPACE_NAME}/upload_sessions/{session_id}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=120,
    )
    assert response.status_code == 200
    response_body = response.json()
    print(response_body)
    return IngestionStatus(
        finished_files=response_body["ingestion_status"]["finished_files"],
        failed_files=response_body["ingestion_status"]["failed_files"],
    )

@dataclass
class UploadFileResult:
    file: str
    exception: Optional[Exception] = None

async def upload_file_to_s3(
    file_path: str,
    aws_prefixed_request_config: Dict[Any, Any],
    semaphore: asyncio.BoundedSemaphore,
    concurrency: int = 10,
) -> UploadFileResult:
    # resolve the name up front so it's available even if opening the file fails
    file_name = os.path.basename(file_path)
    # upload the file asynchronously using the prefixed request config
    async with semaphore:
        async with httpx.AsyncClient(
            limits=httpx.Limits(
                max_keepalive_connections=concurrency, max_connections=concurrency
            )
        ) as client:
            try:
                async with aiofiles.open(file_path, "rb") as file:
                    content = await file.read()
                    response = await client.post(
                        aws_prefixed_request_config["url"],
                        data=aws_prefixed_request_config["fields"],
                        files={"file": (file_name, content)},
                        timeout=2000,
                    )
                    assert response.status_code == 204, f"status code should be '204', got '{response.status_code}'"
            except Exception as exc:
                return UploadFileResult(file=file_name, exception=exc)

    return UploadFileResult(file=file_name)

def wait_for_files_to_be_ingested(file_paths, exceptions, session_id):
    total_uploaded_files = len(file_paths) - len(exceptions)
    total_processed_files = 0
    while total_processed_files < total_uploaded_files:
        session_status = get_session_status(session_id)
        log.info(
            "Polling status",
            failed_files=session_status.failed_files,
            finished_files=session_status.finished_files,
        )
        total_processed_files = (
            session_status.failed_files + session_status.finished_files
        )
        time.sleep(3)


#### the main flow ####
async def main():
    session_id, aws_prefixed_request_config = create_session()

    # get a list of files as below, alternatively give a list of explicit paths
    file_paths = [p for p in Path('./path/to/data/dir').glob('*') if p.is_file()]
    concurrency = 10
    semaphore = asyncio.BoundedSemaphore(concurrency)
    tasks = []
    for file_path in file_paths:
        # upload files
        tasks.append(
            upload_file_to_s3(file_path, aws_prefixed_request_config, semaphore=semaphore, concurrency=concurrency)
        )

    results: List[UploadFileResult] = await tqdm.gather(*tasks)

    exceptions = [r for r in results if r.exception is not None]

    log.info(
        "files uploaded",
        successful=len(results) - len(exceptions),
        failed=len(exceptions),
    )

    if len(exceptions):
        log.warning("upload exceptions", exceptions=exceptions)

    # close session once you are done with uploading
    # the ingestion will start once the session is closed
    close_session(session_id)
    
    # wait for the files to be ingested into deepset Cloud
    wait_for_files_to_be_ingested(file_paths, exceptions, session_id)

if __name__ == '__main__':
    asyncio.run(main())

You can check the status of your session with the Get Session Status API endpoint.

Upload with the SDK

This method is best if you have many files to upload. It uses sessions under the hood and makes it possible to add metadata to your files. You can use the SDK's command-line interface or its Python methods.
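Here's a minimal sketch of the Python method, assuming you've installed the SDK with pip install deepset-cloud-sdk; check the SDK reference for the exact parameter names:

from pathlib import Path

from deepset_cloud_sdk.workflows.sync_client.files import upload

# Upload all files from a local directory to your workspace.
# blocking=True waits until the files are listed in deepset Cloud.
upload(
    paths=[Path("./data")],
    api_key="<your_api_key>",
    workspace_name="<your_workspace>",
    blocking=True,
    show_progress=True,
    recursive=True,
)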

See also our tutorials.

Upload Synchronously

🚧

Not recommended for a large number of files

If you have more than a few hundred files to upload, we recommend using the Python SDK or REST API. It's faster and more stable.

Choose the best option for you:

Upload from the UI

  1. In deepset Cloud, go to Files > Upload Files.
  2. Drag your files and folders to deepset Cloud.
  3. Click Upload. It may take a while for the files to be processed and displayed on the Files page.

Upload with the REST API

Here's the request you can send to upload your files. For more information, see the Upload File endpoint documentation. You need to Generate an API Key first.

You can either upload a file you already have or create a file during the upload by sending its text in the request. Here's a copiable example of both requests:

# This is an example request to send when you're uploading a file:

curl --request POST \
     --url https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE_NAME>/files \
     --header 'accept: application/json' \
     --header 'authorization: Bearer <YOUR_API_KEY>' \
     --header 'content-type: multipart/form-data' \
     --form 'meta={"key1":"value1", "key2":"value2"}' \
     --form file=@<YOUR_FILE.PDF>
     
# This is an example request if you're creating the file during upload:
curl --request POST \
     --url 'https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE_NAME>/files?file_name=myFile.txt' \
     --header 'accept: application/json' \
     --header 'authorization: Bearer <YOUR_API_KEY>' \
     --header 'content-type: multipart/form-data' \
     --form 'meta={"key1":"value1", "key2":"value2"}' \
     --form 'text=This is the file text'
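If you prefer Python, here's an equivalent sketch of both requests using the requests library; the workspace name, API key, file name, and metadata keys are placeholders:

import json
import requests

API_URL = "https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE_NAME>/files"
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}
META = json.dumps({"key1": "value1", "key2": "value2"})

# Upload a file you already have (multipart form, like the first curl request).
with open("YOUR_FILE.pdf", "rb") as f:
    response = requests.post(
        API_URL,
        headers=HEADERS,
        files={"file": f, "meta": (None, META)},  # (None, ...) sends meta as a plain form field
    )
response.raise_for_status()

# Create a file from text during the upload (like the second curl request).
response = requests.post(
    API_URL,
    headers=HEADERS,
    params={"file_name": "myFile.txt"},
    files={"text": (None, "This is the file text"), "meta": (None, META)},
)
response.raise_for_status()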