Tutorial: Creating an Evaluation Dataset for Document Search
This tutorial guides you through setting up a labeling project and labeling the retrieved documents to create an evaluation dataset.
- Level: Beginner
- Time to complete:
  - Creating the project: 5 minutes
  - Labeling: 5 minutes if you label five queries as we suggest for the sake of this tutorial. (In real-life scenarios, we recommend you label at least 100 queries.)
- Prerequisites:
  - This tutorial assumes some knowledge of retrieval methods.
  - You must be an Admin to create and label a project.
  - You must have a deepset Cloud workspace where you'll create this project.
- Goal: After completing this tutorial, you'll have created and set up a labeling project for document search to create an annotated dataset for your document retrieval pipeline. You'll also know how to run queries and indicate whether the resulting documents are relevant.
To create the project, you'll use the pipeline template available in deepset Cloud, and a set of business and sports articles scraped from https://www.thenews.com.pk/. You can replace the pipeline and the files with your own.
- Keywords: Document retrieval, evaluation dataset, annotated dataset, labeling
Create Your Workspace
Create a separate workspace for your project. You can skip this step if you already have a workspace you want to use.
- Log in to deepset Cloud with your Admin account.
- Click the toggle icon next to the workspace name, type "labeling" as the name of the new workspace, and click Create.
Result: You created a workspace, and it's showing in the list of workspaces if you click the workspace toggle icon.
Upload Files
The files you upload are first preprocessed and indexed. Once this is done, you'll label them as part of your project. This tutorial uses a set of newspaper articles, but you can replace them with your own files.
- Download the .zip file from gdrive and unpack it on your computer. It contains one file called Articles.txt.
- Go back to deepset Cloud, make sure you're in the labeling workspace, and go to Files.
- Click Upload Files.
- Select the file you extracted in step 1, drop it into the Upload Files window, and click Upload.
Result: Your file is uploaded, and you can see it on the Files page in your workspace.
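If you prefer to unpack the archive from a script rather than by hand, the following is a minimal sketch using only the Python standard library. The function name `unpack_articles` and both paths are placeholders chosen for illustration; they are not part of deepset Cloud. The upload itself still happens on the Files page as described above.

```python
import zipfile
from pathlib import Path


def unpack_articles(zip_path, target_dir):
    """Extract the archive into target_dir and return the file names it contained."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        # Skip directory entries so only real files are reported.
        return [info.filename for info in archive.infolist() if not info.is_dir()]


# Example call (both paths are placeholders for wherever you saved the download):
# unpack_articles("articles.zip", "articles")
```

For the archive used in this tutorial, the returned list should contain a single entry, Articles.txt, which is the file you then drop into the Upload Files window.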
Create a Labeling Project
- In the navigation, click Labeling and choose New project.
- Type "document-search" as the project name, type "The goal of this project is to create an evaluation dataset for a document search pipeline running on newspaper articles." as the description, and click Create.
You land on the project page, where you can see its overview and the steps you must complete.
- Click Settings.
- Under Add a Pipeline, click Select a pipeline and choose Use Template. This is a pipeline template designed specifically for labeling.
- Type "news_articles" as the pipeline name and click Create Pipeline.
You land in a new browser tab with the pipeline open in Pipeline Designer.
- Click Deploy.
- Return to the browser tab with your labeling project settings.
- Under Labeling Guidelines, click Add Guidelines, and paste this text:
In document retrieval, the query is a question and the answer is one document, or one passage of text.
# What's a good question?
A fact-seeking question that asks about information present in the documents.
# What's a bad question?
Don't ask ambiguous, opinion-seeking, or incomprehensible questions.
# How to mark an answer?
Click thumbs up for correct documents, thumbs down for incorrect documents,
click the flag if you're unsure and want another labeler to check the document.
- Type 5 as the query target.
Tip: For real projects, we recommend you run at least 100 queries. But for the sake of this tutorial, let's just run 5.
Result: You've configured your labeling project. You're ready to invite labelers to your project and start labeling.
Invite Labelers
Now, let's add people to your labeling project so that they can start labeling. You must add them as Admins to your organization.
- In deepset Cloud, click your initials in the top right corner and choose Organization.
- Click Invite Users.
- Type the user details, choose Admin as the role, and click Send Invite. The user receives an email asking them to set a password. Once they do, they can log in to deepset Cloud.
- Repeat this procedure until you have invited all the labelers.
- Ask the labelers to log in to deepset Cloud.
Result: You have invited labelers to your organization, and they can now start labeling.
Label the Documents
Now that everything is ready, you can start labeling.
- In deepset Cloud, make sure you're in the labeling workspace, and click Labeling in the navigation.
- Open the document-search project and click Start Labeling. You're redirected to the Labeling Query page.
- Type "what was special about the medals for 2016 olympics?" as the query and click Search.
- Mark the first document as relevant.
- Type "who is Kim Kardashian dating?" and label the results. You can see that the Query Target leaderboard is updated to show the number of queries already run.
- Continue asking questions until you reach the query target of 5. Here are example questions you can ask:
"how did Brexit impact tourism in London?"
"how did the launch of Pokemon Go influence Nintendo's shares?"
"how much did Oppenheimer Blue cost?"
Or any other question related to world news or sports.
Result: You have labeled the required number of queries. The project is completed and you can now export the labels.
Export the Labels
- In the navigation, click Labeling.
- Find the document-search project and click the ellipsis icon on its card.
- Choose Export Labels (.csv). The labels are downloaded to your computer as a CSV file.
Result: Congratulations! You created, configured, and completed a labeling project. You also exported the labels to your computer. You can now use them as an evaluation dataset for experiments.
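As a first sanity check on the exported file, you might inspect its columns and then compute, for each query, the share of labeled documents that were marked relevant. The sketch below uses only the Python standard library. This tutorial does not document the layout of the exported CSV, so the column names and the value that marks a relevant document are passed in as parameters; the ones in the example call are hypothetical and you should replace them with what `csv_columns` reports for your file.

```python
import csv
from collections import defaultdict


def csv_columns(csv_path):
    """Return the header row so you can see which columns the export contains."""
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return next(csv.reader(handle), [])


def relevant_share_per_query(csv_path, query_col, label_col, relevant_value):
    """Return {query: fraction of its labeled documents marked relevant}."""
    totals = defaultdict(int)
    relevant = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            query = row[query_col]
            totals[query] += 1
            if row[label_col] == relevant_value:
                relevant[query] += 1
    return {query: relevant[query] / totals[query] for query in totals}


# Example call. The file name, column names, and label value are assumptions:
# print(csv_columns("labels.csv"))
# print(relevant_share_per_query("labels.csv", "query", "label", "relevant"))
```

A query with a low share is one where the pipeline returned mostly irrelevant documents, which makes it a useful starting point when you compare retrieval settings in later experiments.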