Evaluation Datasets

An evaluation dataset is a file with gold answers for your search system. Learn about the format of the dataset and how to prepare it.

What's an Evaluation Dataset?

An evaluation dataset is an annotated set of data held back from your model. The annotations, or labels, can be question-answer pairs (for a question-answering system) or question-passage pairs (for an information retrieval system). They indicate the gold answers, which are the answers that you would expect your search system to return.

The evaluation dataset in deepset Cloud is based on the files you uploaded to Data > Files. After you add your evaluation set, deepset Cloud automatically matches the labels in your dataset with the files in your workspace by file name. If some labels have no matching file, deepset Cloud lets you know. The evaluation dataset only works for the files that existed in deepset Cloud when you uploaded the evaluation set.
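
Because labels are matched to files by name, it can help to check locally for labels that point at files you have not uploaded before you add the evaluation set. The following is a minimal sketch of such a pre-check, not deepset Cloud's own matching logic. It assumes exact, case-sensitive comparison of the file_name column against a list of uploaded file names that you supply yourself. The function name and the sample rows are hypothetical.

```python
import csv
import io


def find_unmatched_labels(csv_text, uploaded_file_names):
    """Return the sorted file names that labels reference but that were not uploaded.

    Assumption: names are compared exactly and case-sensitively.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    uploaded = set(uploaded_file_names)
    return sorted(
        {row["file_name"] for row in reader if row["file_name"] not in uploaded}
    )


# Hypothetical labels and workspace file names, for illustration only.
sample_csv = (
    "question,file_name\n"
    "Who is the headmaster?,book1.txt\n"
    "Where is the school?,book2.txt\n"
)
unmatched = find_unmatched_labels(sample_csv, ["book1.txt"])
print(unmatched)
```

An empty result means every label refers to a file name in your list.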

Evaluation Dataset for Pipeline Evaluation

During evaluation, deepset Cloud runs the questions from your evaluation dataset through the pipeline being evaluated, letting the system look for the answers in all your files. The more files you provide, the harder the task is for the system. deepset Cloud then compares the answers returned by your search system to the gold answers from your evaluation dataset and, based on the results, calculates metrics you can use to tweak your pipeline and improve its performance.
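
To illustrate what comparing returned answers to gold answers can look like, here is a minimal sketch of one common question-answering metric, exact match. This is an assumption for illustration: this page does not state which metrics deepset Cloud calculates or how it normalizes answers. The function names and sample answers are hypothetical.

```python
def normalize(answer):
    """Lowercase the answer and collapse runs of whitespace."""
    return " ".join(answer.lower().split())


def exact_match_rate(predictions, gold_answers):
    """Return the fraction of predictions equal to their gold answer after normalization."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have the same length")
    if not gold_answers:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(predictions, gold_answers)
    )
    return hits / len(gold_answers)


# Hypothetical answers: the first pair matches after normalization, the second does not.
rate = exact_match_rate(["Paris", "42 km"], ["paris", "40 km"])
print(rate)
```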

Dataset Format

The evaluation dataset must be a .csv file with the following columns:

  • question
  • text
  • context
  • file_name
  • answer_start
  • answer_end
  • filters (optional)
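
A quick way to catch format problems before uploading is to check the header row of your file against the column list above. The following is a minimal sketch using Python's standard csv module. The helper name and the sample headers are hypothetical, and the check covers column names only, not the values in each row.

```python
import csv
import io

# Column names taken from the list above; "filters" is optional.
REQUIRED_COLUMNS = [
    "question",
    "text",
    "context",
    "file_name",
    "answer_start",
    "answer_end",
]


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Hypothetical header rows, for illustration only.
complete_header = "question,text,context,file_name,answer_start,answer_end,filters\n"
incomplete_header = "question,text,file_name\n"

print(missing_columns(complete_header))
print(missing_columns(incomplete_header))
```

To check a file on disk, read its contents with open(path, newline="", encoding="utf-8") and pass the text to the function.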

Here's an evaluation dataset for Harry Potter. This example is meant to show you the format your dataset should follow.

Preparing Your Own Evaluation Dataset

If you want to prepare an annotated dataset for question answering, you can use deepset's Annotation Tool.