# Upload Files with Python

Use this method if you have many files to upload or want to upload files with metadata.

***

## About This Task

Uploading files using the Python methods included in the SDK is asynchronous and uses sessions under the hood. It's best for uploading large numbers of files with metadata. To upload files using this method, create and run a Python script. You can find example scripts at the bottom of this page.

To learn more, see [Upload Files](/docs/how-to-guides/working-with-your-data/upload-files.mdx) and [Working with Metadata](/docs/how-to-guides/working-with-your-data/working-with-metadata/use-metadata-in-your-search-system.mdx).

### Sessions

<AsyncUploadInfo />

### Folder Structure

<UploadFolderStructureInfo />

### File Extensions

<FileExtensionWarning />

## Prerequisites

<SdkInstallationSteps />

## Upload Scripts Examples

### Upload From a Folder

Here is an example of a synchronous and asynchronous way to upload files from a folder.

**Note**: When using Jupyter Notebooks, apply `nest_asyncio` before you load the SDK:

```python
# Install it first with: pip install nest-asyncio
import nest_asyncio

nest_asyncio.apply()

from deepset_cloud_sdk.workflows.sync_client.files import list_files
```

Switch between the tabs to see the sync and async examples:

<Tabs>
  <TabItem value="sync" label="Sync">
  
  ```python
  from pathlib import Path

  from deepset_cloud_sdk.workflows.sync_client.files import upload

  # Uploads all files from a given path
  upload(
      paths=[Path("<your_path_to_the_upload_folder>")],
      api_key="<deepset_API_key>",
      workspace_name="<default_workspace>",
      blocking=True,  # waits until the files are displayed in deepset,
      # this may take a couple of minutes
      timeout_s=300,  # the timeout for the `blocking` parameter in number of seconds
      show_progress=True,  # shows the progress bar
      recursive=True,  # uploads files from all subfolders as well
  )
  ```

  </TabItem>
  <TabItem value="async" label="Async">
  
  ```python
  import asyncio
  from pathlib import Path

  from deepset_cloud_sdk.workflows.async_client.files import upload


  # Uploads all files from a given path.
  async def my_async_context() -> None:
      await upload(
          paths=[Path("<your_path_to_the_upload_folder>")],
          api_key="<deepsetCloud_API_key>",
          workspace_name="<default_workspace>",
          blocking=True,  # waits until the files are displayed in deepset Cloud,
          # this may take a couple of minutes
          timeout_s=300,  # the timeout for the `blocking` parameter in number of seconds
          show_progress=True,  # shows the progress bar
          recursive=True,  # uploads files from all subfolders as well
      )


  # Run the async function
  if __name__ == "__main__":
      asyncio.run(my_async_context())
  ```

  </TabItem>
</Tabs>
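
Before you upload a large folder, it can help to check which files a recursive upload would walk over. The following is a minimal sketch that uses only the standard library. `preview_files` is a hypothetical helper, not part of the SDK, and it doesn't replicate any filtering the SDK may apply to file extensions or metadata files.

```python
from pathlib import Path
from typing import List


def preview_files(folder: Path, recursive: bool = True) -> List[Path]:
    """List regular files in `folder`, optionally descending into subfolders.

    Hypothetical helper for illustration only; it is not part of the SDK.
    """
    pattern = "**/*" if recursive else "*"
    return sorted(path for path in folder.glob(pattern) if path.is_file())
```

For example, `preview_files(Path("my_folder"), recursive=False)` lists only the files directly inside `my_folder`, which roughly mirrors the difference the `recursive` parameter makes in the upload call.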

### Upload Bytes

You can upload files as bytes to a <ProductShortName /> workspace. This method is suitable for all file types. Here are examples of a synchronous and an asynchronous way to do this:

<Tabs>
  <TabItem value="sync" label="Sync">
  
  ```python
  from deepset_cloud_sdk.workflows.sync_client.files import upload_bytes, DeepsetCloudFileBytes

  upload_bytes(
      api_key="<deepsetCloud_API_key>",
      workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
      files=[
          DeepsetCloudFileBytes(
              name="example.txt",
              file_bytes=b"this is text",
              meta={"key": "value"},  # optional
          )
      ],
      blocking=True,  # optional, by default True
      timeout_s=300,  # optional, by default 300
  )
  ```

  </TabItem>
  <TabItem value="async" label="Async">
  
  ```python
  import asyncio

  from deepset_cloud_sdk.workflows.async_client.files import upload_bytes, DeepsetCloudFileBytes


  async def my_async_context() -> None:
      await upload_bytes(
          api_key="<deepsetCloud_API_key>",
          workspace_name="<default_workspace>",  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
          files=[
              DeepsetCloudFileBytes(
                  name="example.txt",
                  file_bytes=b"this is some byte text",
                  meta={"key": "value"},  # optional
              )
          ],
          blocking=True,  # optional, by default True
          timeout_s=300,  # optional, by default 300
      )


  # Run the async function
  if __name__ == "__main__":
      asyncio.run(my_async_context())
  ```

  </TabItem>
</Tabs>
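
If the bytes come from a local file rather than from a literal, you can read them with `Path.read_bytes()`. The sketch below only prepares the `name`, `file_bytes`, and `meta` values used in the examples above. `file_to_bytes_kwargs` is a hypothetical helper, and passing its result to `DeepsetCloudFileBytes` is shown as a comment because it assumes the SDK is installed.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def file_to_bytes_kwargs(path: Path, meta: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Read a local file and return keyword arguments for DeepsetCloudFileBytes.

    Hypothetical helper for illustration; the keys mirror the parameters
    used in the upload_bytes examples (name, file_bytes, meta).
    """
    return {
        "name": path.name,
        "file_bytes": path.read_bytes(),
        "meta": meta,
    }


# With the SDK installed, the result could be used like this:
# from deepset_cloud_sdk.workflows.sync_client.files import DeepsetCloudFileBytes
# file = DeepsetCloudFileBytes(**file_to_bytes_kwargs(Path("example.txt"), {"key": "value"}))
```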

### Synchronize GitHub Files with <ProductName />

Here's an example script that loads TXT and MD files from GitHub and sends them to <ProductName />. It fetches the content of each file as text from GitHub and forwards it to <ProductShortName />.

```python
from typing import List
from urllib.parse import urlparse

import httpx  # third-party HTTP client: pip install httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key here
API_KEY: str = "<YOUR-API-KEY>"

def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename

def fetch_and_prepare_files(urls: List[str]) -> List[DeepsetCloudFile]:
    """Fetches files from URLs and converts them to DeepsetCloudFile objects.

    These objects can be uploaded to the Haystack Enterprise Platform directly without
    first being copied to disk.

    :param urls: List of URLs to fetch files from
    :return: List of DeepsetCloudFile objects
    """
    files_to_upload: List[DeepsetCloudFile] = []

    for url in urls:
        response = httpx.get(url)
        response.raise_for_status()

        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )
        files_to_upload.append(file)

    return files_to_upload

# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

files = fetch_and_prepare_files(DOWNLOAD_URLS)

# Upload .txt and .md files to deepset Cloud
upload_texts(
    workspace_name="upload-test-123",  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
    files=files,
    blocking=False,  # Set to False for non-blocking uploads
    timeout_s=300,  # Optional, default is 300 seconds
    show_progress=True,  # Optional, default is True
    api_key=API_KEY,
    write_mode=WriteMode.OVERWRITE,
)

```

### Download Files from a URL and Upload to <ProductName />

#### Using Multiprocessing

This script downloads TXT and MD files from the URLs you specify and then uploads them to <ProductName /> in parallel, using a pool of worker processes. The pool size (`processes` in `multiprocessing.Pool(processes=3)`) sets how many files are handled at once, and the number of processes that can truly run in parallel is limited by the number of CPU cores in your system. To use all cores, set the number of processes to `multiprocessing.cpu_count()`.

```python
import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx  # third-party HTTP client: pip install httpx

from deepset_cloud_sdk.workflows.sync_client.files import upload_texts, WriteMode, DeepsetCloudFile

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"

def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename

def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL, converts it to a DeepsetCloudFile object, and uploads it.

    The object is uploaded to the Haystack Enterprise Platform directly, without
    first being copied to disk.

    :param url: URL to fetch files from
    """
    response = httpx.get(url)
    response.raise_for_status()

    file = DeepsetCloudFile(
        text=response.text,
        name=_parse_filename(url),
        meta={"url": url},
    )

    upload_texts(
        workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" by default
        files=[file],
        blocking=False,  # Set to False for non-blocking uploads
        timeout_s=300,  # Optional, default is 300 seconds
        show_progress=True,  # Optional, default is True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )

# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

if __name__ == '__main__':
    # Upload .txt and .md files to deepset Cloud
    # Use a pool of three worker processes, one per URL, to download and upload the files
    with multiprocessing.Pool(processes=3) as pool:
        results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)
```
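
As a rough sketch of the sizing advice above, the pool size can be derived from `multiprocessing.cpu_count()` and capped at the number of URLs, since extra workers would sit idle. `choose_pool_size` is a hypothetical helper, not part of the SDK.

```python
import multiprocessing
from typing import Sequence


def choose_pool_size(urls: Sequence[str]) -> int:
    """Pick a worker count: at most one per URL, at most one per CPU core, and at least one."""
    return max(1, min(len(urls), multiprocessing.cpu_count()))


if __name__ == "__main__":
    urls = ["https://example.com/a.txt", "https://example.com/b.txt"]
    print(choose_pool_size(urls))  # the result depends on the machine's core count
```

You could then pass the result as `processes` to `multiprocessing.Pool` in the script above.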

#### Async

This example downloads files from a URL you specify and then uploads them asynchronously to <ProductName />:

```python
import asyncio
from typing import List
from urllib.parse import urlparse

import httpx  # third-party HTTP client: pip install httpx

from deepset_cloud_sdk.workflows.sync_client.files import WriteMode, DeepsetCloudFile
from deepset_cloud_sdk.workflows.async_client.files import upload_texts

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"

def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename

async def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL, converts it to a DeepsetCloudFile object, and uploads it.

    The object is uploaded to the Haystack Enterprise Platform directly, without
    first being copied to disk.

    :param url: URL to fetch file from
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()

        file = DeepsetCloudFile(
            text=response.text,
            name=_parse_filename(url),
            meta={"url": url},
        )

        await upload_texts(
            workspace_name=WORKSPACE,  # optional, uses "DEFAULT_WORKSPACE_NAME" environment variable by default
            files=[file],
            blocking=False,  # Set to False for non-blocking uploads
            timeout_s=300,  # Optional, default is 300 seconds
            show_progress=True,  # Optional, default is True
            api_key=API_KEY,
            write_mode=WriteMode.OVERWRITE,
        )

async def main(urls: List[str]) -> None:
    """Main function to run the asynchronous fetching and uploading of files."""
    tasks = [fetch_and_upload_file(url) for url in urls]
    await asyncio.gather(*tasks)

# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
]

# Run the main function
if __name__ == "__main__":
    asyncio.run(main(DOWNLOAD_URLS))
```
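
`asyncio.gather` in the script above starts one task per URL at the same time. If you have many URLs, you may want to cap how many run concurrently. This is a minimal sketch of one common way to do that with `asyncio.Semaphore`. `run_bounded` and `fake_upload` are hypothetical names, and `fake_upload` stands in for `fetch_and_upload_file` so the example runs without the SDK or network access.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    urls: List[str],
    worker: Callable[[str], Awaitable[None]],
    max_concurrency: int = 5,
) -> None:
    """Run `worker(url)` for every URL, with at most `max_concurrency` running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> None:
        # Each task waits here until one of the semaphore slots is free.
        async with semaphore:
            await worker(url)

    await asyncio.gather(*(guarded(url) for url in urls))


async def fake_upload(url: str) -> None:
    """Stand-in for fetch_and_upload_file; it only simulates some I/O wait."""
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    example_urls = [f"https://example.com/file{i}.txt" for i in range(10)]
    asyncio.run(run_bounded(example_urls, fake_upload, max_concurrency=3))
```

In the script above, this would mean calling `run_bounded(urls, fetch_and_upload_file)` from `main` instead of building the task list directly.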

#### From Memory in Byte Format

Here's an example of how to fetch files, including a PDF, from the URLs you specify as bytes and then upload them to <ProductName />:

```python
import multiprocessing
from typing import List
from urllib.parse import urlparse

import httpx  # third-party HTTP client: pip install httpx

from deepset_cloud_sdk.workflows.sync_client.files import (
    DeepsetCloudFileBytes,
    WriteMode,
    upload_bytes,
)

# Place your API key and workspace name here
API_KEY: str = "<YOUR-API-KEY>"
WORKSPACE: str = "<YOUR-WORKSPACE-NAME>"

def _parse_filename(url: str) -> str:
    """Parses the filename from a URL.

    :param url: URL to parse the filename from
    :return: Filename
    """
    path = urlparse(url).path
    filename = path.split("/")[-1]
    return filename

def fetch_and_upload_file(url: str) -> None:
    """Fetches a file from the given URL, converts it to a DeepsetCloudFileBytes object, and uploads it.

    The object is uploaded to the Haystack Enterprise Platform directly, without
    first being copied to disk.

    :param url: URL to fetch files from
    """
    response = httpx.get(url)
    response.raise_for_status()

    file = DeepsetCloudFileBytes(
        file_bytes=response.content,
        name=_parse_filename(url),
        meta={"url": url},
    )

    upload_bytes(
        workspace_name=WORKSPACE,  # optional, by default the environment variable "DEFAULT_WORKSPACE_NAME" is used
        files=[file],
        # by default blocking=True - by setting to False it will mean that you can immediately
        # continue uploading another batch of files
        blocking=False,
        timeout_s=300,  # optional, by default 300
        show_progress=True,  # optional, by default True
        api_key=API_KEY,
        write_mode=WriteMode.OVERWRITE,
    )

# URLs of files to download and upload
DOWNLOAD_URLS: List[str] = [
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example.txt",
    "https://raw.githubusercontent.com/deepset-ai/deepset-cloud-sdk/main/test-upload/example2.txt",
    "https://raw.githubusercontent.com/deepset-ai/haystack/main/README.md",
    "https://sherlock-holm.es/stories/pdf/letter/1-sided/advs.pdf",
]

if __name__ == '__main__':
    # Upload .txt, .md, and .pdf files to deepset Cloud
    # Use a pool of four worker processes, one per URL, to download and upload the files
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(fetch_and_upload_file, DOWNLOAD_URLS)
```

### Google Colab Notebook

Here's a Colab notebook with different upload scenarios you can test: [Upload files with SDK in Google Colab](https://colab.research.google.com/drive/1y2KMB606h-57BafCkhuiaXFWo4gDKtG3).
