Tutorial: Automatically Tagging Your Data with an LLM

Create an index that uses a large language model (LLM) to tag your data and add the tags to each file's metadata. Labelling data with an LLM is a fast, consistent, and scalable way of adding metadata to your files.

  • Level: Beginner
  • Time to complete: 15 to 20 minutes
  • Prerequisites:
    • A basic understanding of how indexes work. To learn more, see Indexes.
    • A deepset workspace where you'll upload the data and create the index. For instructions on how to create a workspace, see Quick Start Guide.
    • An API key for the deepset workspace. This can be a service key for a workspace. For information on how to generate it, see Generate API Keys.
  • Goal: After completing this tutorial, you'll have created an index that uses an LLM to tag your data. The tags the LLM generates will be stored in each file's metadata so you can use them in your search app. You'll use an example dataset of emails to do this.
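
To make the goal concrete, here is a minimal conceptual sketch of what "storing tags in each file's metadata" means. It is not the platform's implementation: FileRecord, keyword_tagger, tag_files, and the "tags" key are hypothetical names chosen for illustration, and a simple keyword rule stands in for the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FileRecord:
    """Hypothetical stand-in for an uploaded file: its text plus metadata."""
    name: str
    text: str
    meta: Dict[str, object] = field(default_factory=dict)


def keyword_tagger(text: str) -> List[str]:
    """Placeholder for the LLM: derives tags from simple keyword rules."""
    rules = {"invoice": "billing", "meeting": "scheduling", "contract": "legal"}
    lowered = text.lower()
    return sorted({tag for word, tag in rules.items() if word in lowered})


def tag_files(files: List[FileRecord], tagger: Callable[[str], List[str]]) -> None:
    """Store the tagger's output under a 'tags' key in each file's metadata."""
    for f in files:
        f.meta["tags"] = tagger(f.text)


# Two made-up emails, used only to show the shape of the result.
emails = [
    FileRecord("1.txt", "Please find the invoice attached."),
    FileRecord("2.txt", "Can we move the meeting? The contract is ready."),
]
tag_files(emails, keyword_tagger)
for e in emails:
    print(e.name, e.meta)
```

In the tutorial, the index does this work for you: the LLM replaces the keyword rule, and the tags end up in the metadata of the files in your workspace.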

Upload Data

First, let's upload the data to the deepset workspace where we'll then create the auto-labelling index.

  1. Download the sample dataset and unpack it on your computer. This is a collection of emails you'll tag. You can also use your own files.

  2. Log in to deepset AI Platform, make sure you're in the right workspace, and go to Files.

  3. Click Upload Files, drag the files you unpacked in step 1 and drop them in the Upload Files window. (You must select all files in the folder.)

  4. Click Upload and wait until the files are in the workspace. It may take a while as you're uploading over 8,000 files.

Because there are over 8,000 files to upload, you can instead upload them with the SDK CLI, which is faster:

  1. Open the terminal and run this command to install the deepset SDK:
    pip install deepset-cloud-sdk
  2. When the installation finishes with a success message, run:
    deepset-cloud login
  3. When prompted:
    1. Type eu as the environment.
    2. Paste your API key.
    3. Type the name of the workspace where you want to upload the files.
  4. Run this command to upload the files:
    deepset-cloud upload <path to the emails dataset> --write-mode OVERWRITE

For example, on a Mac, if you saved the unpacked folder in Downloads, the command would be:

deepset-cloud upload ~/Downloads/emails/

Result: You have uploaded all your files to your deepset workspace.
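
If you use the SDK CLI, a small script can help you check the dataset before uploading. The following is a sketch only: it counts the files directly inside the unpacked folder and prints the upload command from step 4 for that folder. The helper names count_files and build_upload_command and the ~/Downloads/emails location are assumptions for illustration. The script does not call the CLI itself.

```python
import shlex
from pathlib import Path
from typing import List


def count_files(folder: Path) -> int:
    """Count the regular files directly inside folder (subfolders are ignored)."""
    return sum(1 for p in folder.iterdir() if p.is_file())


def build_upload_command(folder: Path, write_mode: str = "OVERWRITE") -> List[str]:
    """Build the upload command from step 4 as an argument list."""
    return ["deepset-cloud", "upload", str(folder), "--write-mode", write_mode]


if __name__ == "__main__":
    # Assumed location of the unpacked dataset; change it to match your setup.
    dataset = Path.home() / "Downloads" / "emails"
    if dataset.is_dir():
        print(f"{count_files(dataset)} files found in {dataset}")
        print("Command to run:", shlex.join(build_upload_command(dataset)))
    else:
        print(f"Folder not found: {dataset}")
```

Comparing the printed count with the number of files shown on the Files page after the upload is a quick way to confirm that nothing was skipped.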

Create an Index

Once your data is in deepset, you can create an index that will be the starting point for the auto-labelling system. The index prepares files for search and writes them into a document store, where a query pipeline can then access them.

  1. Go to Indexes and click Create Index. This opens available index templates.

  2. Click the AI-Generated Metadata for Files template to use it.

  3. Leave the default index name and click Create Index. You land in Builder with the index open for editing.