- Level: Beginner
- Time to complete: 10 minutes
- You must be an Admin to complete this tutorial.
- The workspace where you want to upload the files must already be created in deepset Cloud. In this tutorial, we call the workspace hotel_reviews.
- Goal: After completing this tutorial, you will have uploaded a set of hotel reviews with metadata to a deepset Cloud workspace. You can replace this dataset with your custom one.
This tutorial uses a set of hotel reviews with accompanying metadata. You can also use your own files; just make sure they're in TXT or PDF format.
- Download the hotel reviews dataset.
- Extract the files to a folder called hotel_reviews in your Documents folder. This can take a couple of minutes.
Result: You have 5,956 files in the Documents\hotel_reviews folder: 2,978 TXT files and 2,978 JSON files. Each TXT file is accompanied by a .meta.json file containing the text file's metadata.
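If you're preparing your own dataset, the hotel reviews follow a sidecar convention: each `review.txt` is paired with a `review.txt.meta.json` holding a flat JSON object of metadata. Here's a minimal sketch of that layout, with a quick sanity check that every TXT file has its sidecar. The folder name and the `city`/`rating` keys are made-up examples, not fields from the actual dataset:

```python
import json
from pathlib import Path

folder = Path("my_reviews")  # illustrative folder; the tutorial uses hotel_reviews
folder.mkdir(exist_ok=True)

# Write one review and its metadata sidecar. The metadata keys (city, rating)
# are placeholder examples; use whatever keys fit your own dataset.
(folder / "review_001.txt").write_text(
    "The room was spotless and the staff were friendly."
)
(folder / "review_001.txt.meta.json").write_text(
    json.dumps({"city": "Berlin", "rating": 5})
)

# Sanity check: list any TXT files (including in subfolders) that are
# missing a matching .meta.json sidecar.
missing = [
    txt.name
    for txt in sorted(folder.rglob("*.txt"))
    if not txt.with_name(txt.name + ".meta.json").exists()
]
print("files missing metadata:", missing)
```

Running this on the extracted hotel_reviews folder instead should also report an empty list, since every review ships with its metadata file.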
- Open the command line and run:
pip install deepset-cloud-sdk
- Wait until the installation finishes with a success message.
Result: You have installed the deepset Cloud SDK. It comes with a command line interface that we'll use to upload the files.
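If you want to double-check the installation from Python as well, a quick probe like the following tells you whether the package is importable in the current environment (this only inspects the local install; it doesn't contact deepset Cloud):

```python
import importlib.util

# True if the deepset-cloud-sdk package is importable in this environment.
installed = importlib.util.find_spec("deepset_cloud_sdk") is not None
print("deepset-cloud-sdk importable:", installed)
```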
- Log in to deepset Cloud.
- Click your name in the top right corner and select Connections.
- Under API Keys, click Add new key.
- Select the expiration date for your key and click Generate key.
- Copy the key and save it somewhere safe, such as a text file.
Result: You have an API key saved in a file. You can now use it to upload your files.
- Open the command line and run the following command to log in to deepset Cloud:
python -m deepset_cloud_sdk.cli login
- When prompted, paste your API key.
- Type the name of the deepset Cloud workspace where you want to upload the files. This creates an .env file with the information you just provided. The SDK uses the information from this file when uploading files.
- Run one of the following equivalent commands to upload the files, including all the subfolders of the hotel_reviews folder, overwriting any files with the same name that already exist in the workspace:
deepset-cloud upload <path_to_hotel_reviews_folder> --recursive --write-mode OVERWRITE
or:
python -m deepset_cloud_sdk.cli upload <path_to_hotel_reviews_folder> --recursive --write-mode OVERWRITE
- Wait until the upload finishes successfully. All 5,956 files are uploaded, but only half of them (2,978) are listed in deepset Cloud, because the metadata files aren't shown there.
Result: You have uploaded all your files, including the ones from the subfolders. Let's now see if they're showing up in deepset Cloud.
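As a side note, the login command earlier stored your settings in a plain key=value `.env` file. If you want to reuse those values in your own scripts, a minimal stdlib-only parser looks like this. The keys in the demo file (`API_KEY`, `WORKSPACE`) are placeholders; open the `.env` the SDK generated to see the variable names it actually writes:

```python
from pathlib import Path

def read_env(path: Path) -> dict[str, str]:
    """Parse a simple KEY=VALUE .env file, skipping blanks and comments."""
    values = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values

# Demo with placeholder keys -- not necessarily what the SDK writes.
Path("demo.env").write_text('API_KEY="abc123"\nWORKSPACE="hotel_reviews"\n')
print(read_env(Path("demo.env")))
```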
In the command line, list the uploaded files by running:
python -m deepset_cloud_sdk.cli list-files
You should see a list of files with each file's ID, URL, name, size, metadata, and creation date.
With this many files, it's easier to verify the upload in the deepset Cloud UI.
In deepset Cloud:
- Switch to the hotel_reviews workspace where you uploaded the files and click Dashboard. It shows 3K files in the workspace (the number is rounded up).
- In the left navigation, click Files. You can see that the total number of files is 2,978.
Now, let's check if the metadata was uploaded.
One way to do this is to open a random file and then click View Metadata on the file preview.
Metadata values show up as search filters, so let's check that this works. You need a pipeline to run a search, so if you don't have one in this workspace, quickly create one:
Go to Pipelines > New Pipeline.
In YAML Editor, click Create Pipeline > In Empty File.
Copy the following pipeline and paste it into the YAML, replacing the current YAML contents:
version: '1.21.0'
name: 'QuestionAnswering_en-test'

components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1
      model_format: sentence_transformers
      top_k: 20
  - name: Reader
    type: FARMReader
    params:
      model_name_or_path: deepset/deberta-v3-base-squad2
      context_window_size: 700
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: TextConverter
    type: TextConverter
  - name: PDFConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 250
      split_overlap: 30
      split_respect_sentence_boundary: True
      language: en

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
Save the pipeline.
In the top right corner of the YAML editor, click Deploy.
Return to the Pipelines page and wait until your pipeline is deployed and indexed.
When the pipeline is indexed, click Search.
Select your pipeline, and you'll see all the metadata now available as search criteria: