Labeling Projects
Labeling projects help you create evaluation datasets in an organized way. An admin configures the project and invites labelers to annotate the data.
To evaluate a pipeline, you need an evaluation dataset. Such a dataset contains the queries and the expected answers. During experiments, deepset Cloud runs the queries through your pipeline and compares the answers it returns to the gold answers from the dataset.
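To make the idea concrete, here is a minimal sketch of the kind of comparison an experiment performs. This is an illustration only, not deepset Cloud's internal evaluation logic; the dataset layout, the `run_pipeline` stand-in, and the exact-match metric are hypothetical.

```python
# Illustrative only: how an experiment conceptually compares pipeline answers
# to gold answers. The dataset layout and metric are hypothetical examples.

evaluation_dataset = [
    {"query": "When was the company founded?", "gold_answers": ["1999", "in 1999"]},
    {"query": "Who wrote the annual report?", "gold_answers": ["Jane Doe"]},
]

def run_pipeline(query: str) -> str:
    # Stand-in for your deployed pipeline; in deepset Cloud this runs server-side
    # during an experiment.
    canned = {
        "When was the company founded?": "1999",
        "Who wrote the annual report?": "John Smith",
    }
    return canned[query]

def exact_match(predicted: str, gold_answers: list[str]) -> bool:
    """True if the predicted answer matches any gold answer (case-insensitive)."""
    return predicted.strip().lower() in {g.strip().lower() for g in gold_answers}

matches = [
    exact_match(run_pipeline(item["query"]), item["gold_answers"])
    for item in evaluation_dataset
]
print(f"Exact-match accuracy: {sum(matches) / len(matches):.2%}")  # 50.00%
```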
Labeling projects are a means to create an annotated dataset that you can then use to run experiments on your pipeline and evaluate its performance. An admin creates a labeling project and configures all its settings. The admin then invites the labelers to annotate the results.
The labelers can't edit the project settings. They can run queries and label the results.
Labeling Projects for Document Retrieval
Currently, deepset Cloud supports labeling projects for document search systems only.
Setting up a labeling project for a document pipeline involves:

- Creating a pipeline to pre-filter documents. To annotate data for document retrieval, the labelers use the Search page. They ask a query and get a list of documents as a result. Under the hood, it's the pipeline you added to your labeling project that fetches these documents. The labelers then indicate whether each document is relevant or not relevant to the query. If they're unsure, they can also flag a document for review by someone else.

  deepset Cloud offers a pipeline template created specifically for document retrieval labeling projects. All you need to do is give it a name and deploy it. The template combines different retrieval methods with a custom component that randomly interleaves the retrieval results (see the sketch after this list).

  You can also use a pipeline from the workspace where you're creating the labeling project. Remember, though, that this pipeline must be different from the document retrieval pipeline you'll be evaluating with the resulting evaluation dataset. If the pipeline you use to create the evaluation set is the same as, or too similar to, the pipeline you evaluate with this dataset, you'll get a perfect score, but it won't reflect your pipeline's actual performance.

- Creating labeling guidelines. Add any best practices, expectations, and advice for the labelers. They can refer to the guidelines while annotating. This step is optional but recommended.

- Setting up a query target. Specify the total number of queries you want the labelers to run. The recommended number is between 50 and 300 queries, which is enough to build a solid evaluation dataset. This step is optional, but it helps you make sure the resulting dataset contains enough queries. It also adds an element of gamification: labelers can check the project's progress and how many queries each of them has run.

- Preparing and uploading the files. Currently, the project runs on all files in the deepset Cloud workspace.
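The interleaving step deserves a closer look: mixing results from several retrievers in random order prevents labelers from always seeing one retriever's documents at the top, which would bias the labels. The snippet below is a minimal sketch of that idea in plain Python. It is not the actual template component; the function and variable names are made up for illustration.

```python
import random

def interleave_randomly(result_lists: list[list[str]], seed: int | None = None) -> list[str]:
    """Randomly interleave ranked document lists from several retrievers.

    At each step, one retriever is picked at random and its next document is
    taken, skipping duplicates, so no single retriever dominates the top of
    the list shown to labelers.
    """
    rng = random.Random(seed)
    queues = [list(results) for results in result_lists if results]
    merged, seen = [], set()
    while queues:
        queue = rng.choice(queues)
        doc = queue.pop(0)
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
        if not queue:
            queues.remove(queue)
    return merged

# Example: results from a keyword retriever and an embedding retriever.
bm25_results = ["doc_3", "doc_7", "doc_1"]
embedding_results = ["doc_1", "doc_9", "doc_4"]
print(interleave_randomly([bm25_results, embedding_results], seed=42))
```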
After you create a project, you can invite a team of labelers to help you label. To do this, you add them as Admin users to your deepset Cloud organization and point them to the labeling project.
Once you're happy with the number of labels, you can export them into a CSV file. You can do this from the project options on the Labeling page.
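If you want to inspect the exported labels programmatically, you can load the CSV with standard tooling. A minimal sketch, assuming hypothetical column names such as `query` and `label`; check the header of your actual export, as the names used here are not guaranteed.

```python
import csv
from collections import Counter

# The file name and column names ("query", "label") are assumptions for
# illustration; adjust them to match your exported file.
with open("exported_labels.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} labels across {len({row['query'] for row in rows})} queries")
print(Counter(row["label"] for row in rows))  # e.g. relevant vs. not relevant counts
```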