Tutorial: Uploading Files with Metadata through SDK CLI

Learn how to quickly upload a large number of files with metadata. In this tutorial, you'll upload a set of hotel reviews, but you can replace these files with your own. You'll use the SDK package from the command line.

  • Level: Beginner
  • Time to complete: 10 minutes
  • Prerequisites:
    • You must be an Admin to complete this tutorial.
    • The workspace where you want to upload the files must already be created in deepset Cloud. In this tutorial, we call the workspace hotel_reviews.
  • Goal: After completing this tutorial, you will have uploaded a set of hotel reviews with metadata to a deepset Cloud workspace. You can replace this dataset with your custom one.

Prepare Your Files

This tutorial uses a set of hotel reviews, each with its own metadata. You can also use your own files; just make sure they're in TXT or PDF format.
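If you bring your own files, each file needs a companion metadata file to mirror the tutorial dataset. The following is a minimal sketch, not part of the SDK. It assumes the metadata file is named after the full file name plus `.meta.json` (for example, `review.txt.meta.json`) and holds a flat JSON object. The helper name `write_metadata` and the example field names are hypothetical.

```python
import json
from pathlib import Path


def write_metadata(file_path, metadata):
    """Write a companion metadata file next to file_path and return its path.

    Assumption: the metadata file is called <file name>.meta.json,
    for example review.txt.meta.json for review.txt.
    """
    file_path = Path(file_path)
    meta_path = file_path.with_name(file_path.name + ".meta.json")
    # Store the metadata as a JSON object of key-value pairs.
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path


# Example call (hypothetical file and field names):
# write_metadata("my_reviews/review.txt", {"hotel_name": "Example Inn", "rating": 4})
```

Running the helper once per file gives you a folder with the same layout as the hotel reviews dataset used below.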

  1. Download the hotel reviews dataset.
  2. Extract the files to a folder called hotel_reviews in your Documents folder. This can take a couple of minutes.

Result: You have 5956 files in the \Documents\hotel_reviews folder: 2978 TXT files and 2978 JSON files. Each TXT file is accompanied by a .meta.json file containing the metadata for that text file.
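Before uploading, you may want to confirm that the extracted folder looks as expected. Here is a minimal sketch using only the Python standard library. It assumes the metadata for `review.txt` is stored in `review.txt.meta.json`; the function name `summarize_folder` is hypothetical.

```python
from pathlib import Path


def summarize_folder(folder):
    """Count text and metadata files and list text files without metadata."""
    root = Path(folder)
    # rglob walks subfolders too, mirroring a recursive upload.
    files = [p for p in root.rglob("*") if p.is_file()]
    texts = [p for p in files if p.suffix == ".txt"]
    metas = [p for p in files if p.name.endswith(".meta.json")]
    # Assumption: metadata for review.txt lives in review.txt.meta.json.
    missing = [
        p.relative_to(root).as_posix()
        for p in texts
        if not p.with_name(p.name + ".meta.json").exists()
    ]
    return {
        "total": len(files),
        "txt": len(texts),
        "meta": len(metas),
        "missing_meta": sorted(missing),
    }


# Example call (path is an assumption; adjust it to your system):
# print(summarize_folder(Path.home() / "Documents" / "hotel_reviews"))
```

For the tutorial dataset, the summary should report 5956 files in total, 2978 TXT files, 2978 metadata files, and an empty `missing_meta` list.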

Install the SDK

  1. Open the command line and run:
    pip install deepset-cloud-sdk
    
  2. Wait until the installation finishes with a success message.

Result: You have installed the deepset Cloud SDK. It comes with a command line interface that we'll use to upload the files.
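If you want to double-check the installation from Python, the standard library can report the installed version of a package. This is a small sketch; the helper name `installed_version` is hypothetical, and the package name is the one used in the pip command above.

```python
from importlib import metadata


def installed_version(package="deepset-cloud-sdk"):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the SDK is installed in this environment, otherwise None.
print(installed_version())
```

If the result is None, make sure you ran pip in the same Python environment you're checking.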

Obtain the API Key

  1. Log in to deepset Cloud.
  2. Click your name in the top right corner and select Connections.
  3. Under API Keys, click Add new key.
  4. Select the expiration date for your key and click Generate key.
  5. Copy the key and save it to a notepad.
  6. Click Add new key.

Result: You have an API key saved in a file. You can now use it to upload your files.

Upload Files

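  1. Open the command line and run the following command to log in to deepset Cloud:
    deepset-cloud login

    Alternatively, if the deepset-cloud command is not available in your shell, run the CLI as a Python module:
    python -m deepset_cloud_sdk.cli login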
    
  2. When prompted, paste your API key.
  3. Type the name of the deepset Cloud workspace where you want to upload the files. This creates an .env file with the information you just provided. The SDK uses the information from this file when uploading files.
  4. Run this command to upload the files, including all subfolders of the hotel_reviews folder, and overwrite any files with the same name that already exist in the workspace:
    deepset-cloud upload <path_to_hotel_reviews_folder> --recursive --write-mode OVERWRITE

    Alternatively, run the CLI as a Python module:
    python -m deepset_cloud_sdk.cli upload <path_to_hotel_reviews_folder> --recursive --write-mode OVERWRITE

  5. Wait until the upload finishes successfully. The info messages should report that 5956 files were uploaded and that 2978 files are listed in deepset Cloud. Only half of the files are listed because the metadata files are not shown in deepset Cloud.

Result: You have uploaded all your files, including the ones from the subfolders. Let's now see if they're showing up in deepset Cloud.
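If you'd rather trigger the upload from a script, one option is to wrap the CLI call shown above. The sketch below only assembles and runs that same command with the flags used in this tutorial. The function names `build_upload_command` and `run_upload` are hypothetical, and the sketch assumes you have already logged in with the CLI.

```python
import subprocess


def build_upload_command(folder, recursive=True, write_mode="OVERWRITE"):
    """Assemble the deepset-cloud upload command used in this tutorial."""
    command = ["deepset-cloud", "upload", str(folder)]
    if recursive:
        # Include files from all subfolders.
        command.append("--recursive")
    # OVERWRITE is the only write mode this tutorial uses.
    command += ["--write-mode", write_mode]
    return command


def run_upload(folder):
    """Run the upload; raises CalledProcessError if the CLI exits with an error."""
    return subprocess.run(build_upload_command(folder), check=True)


# Example call (replace the placeholder with the path to your folder):
# run_upload("<path_to_hotel_reviews_folder>")
```

Passing the command as a list avoids shell quoting problems when the folder path contains spaces.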

Verify the Upload

  • In the command line, list the uploaded files by running:

    deepset-cloud list-files

    Alternatively, run the CLI as a Python module:

    python -m deepset_cloud_sdk.cli list-files

    You should see a list of files with the file ID, URL, name, size, metadata, and creation date for each file.

    With this many files, it's easier to verify the upload in the deepset Cloud UI.

  • In deepset Cloud:

    1. Switch to the hotel_reviews workspace where you uploaded the files and click Dashboard. The Workspace Statistics section shows 3K files in the workspace (the number is rounded up).
    2. In the left navigation, click Files. The total number of files is 2978.
  • Now, let's check if the metadata was uploaded.

    • One way to do this is to open a random file and then click View Metadata on the file preview.

    • Metadata shows up as search filters, so let's check if that's the case. You need a pipeline to run a search, so if you don't have one in this workspace, let's quickly create one:

      1. Go to Pipelines > New Pipeline.

      2. In YAML Editor, click Create Pipeline > In Empty File.

      3. Copy the following pipeline and paste it into the YAML editor, replacing the current contents:

        version: '1.21.0'
        name: 'QuestionAnswering_en-test'
        
        components:
          - name: DocumentStore
            type: DeepsetCloudDocumentStore 
          - name: Retriever 
            type: EmbeddingRetriever 
            params:
              document_store: DocumentStore
              embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 
              model_format: sentence_transformers
              top_k: 20 
          - name: Reader 
            type: FARMReader 
            params:
              model_name_or_path: deepset/deberta-v3-base-squad2 
              context_window_size: 700 
          - name: FileTypeClassifier 
            type: FileTypeClassifier
          - name: TextConverter 
            type: TextConverter
          - name: PDFConverter 
            type: PDFToTextConverter
          - name: Preprocessor 
            type: PreProcessor
            params:
              split_by: word 
              split_length: 250 
              split_overlap: 30 
              split_respect_sentence_boundary: True 
              language: en 
        
        pipelines:
          - name: query
            nodes:
              - name: Retriever
                inputs: [Query]
              - name: Reader
                inputs: [Retriever]
          - name: indexing
            nodes:
              - name: FileTypeClassifier
                inputs: [File]
              - name: TextConverter
                inputs: [FileTypeClassifier.output_1] 
              - name: PDFConverter
                inputs: [FileTypeClassifier.output_2] 
              - name: Preprocessor
                inputs: [TextConverter, PDFConverter]
              - name: Retriever
                inputs: [Preprocessor]
              - name: DocumentStore
                inputs: [Retriever]
        
        
      4. Save the pipeline.

      5. In the top right corner of the YAML editor, click Deploy.

      6. Return to the Pipelines page and wait until your pipeline is deployed and indexed.

      7. When the pipeline is indexed, click Search.

      8. Select your pipeline. The search filters now list all the metadata from the uploaded files as search criteria.
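To make the Preprocessor settings in the pipeline above more concrete, here is a rough sketch of what splitting by word with a length of 250 and an overlap of 30 means. This is an illustration only, not the pipeline's actual implementation: it ignores split_respect_sentence_boundary, and the function name `split_words` is hypothetical.

```python
def split_words(text, split_length=250, split_overlap=30):
    """Split text into chunks of split_length words that share split_overlap words."""
    words = text.split()
    # Each new chunk starts this many words after the previous one.
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(600))
chunks = split_words(sample)
print([len(chunk.split()) for chunk in chunks])  # [250, 250, 160]
```

With these settings, the last 30 words of one chunk reappear at the start of the next, so a passage that falls on a chunk boundary still appears whole in at least one chunk.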
