Tutorial: Building Your First Document Retrieval App

This tutorial shows you the quickest way to build a document retrieval system. You upload the sample files through the UI and create the document retrieval pipeline from a template.

  • Level: Beginner
  • Time to complete: 10 minutes
  • Prerequisites:
    • This tutorial assumes a basic knowledge of NLP.
    • You must be an Admin to complete this tutorial.
    • Make sure you have a deepset Cloud workspace where the information retrieval pipeline will run.
  • Goal: After completing this tutorial, you will have built a complete document retrieval system for English that can find relevant NHS documents.

Upload Files

First, let's get the files the search will run on into deepset Cloud.

  1. Download the .zip file from Google Drive and unzip it to a location on your computer.
  2. Log in to deepset Cloud, switch to the right workspace, and go to Data>Files.
  3. Click Upload Files, then drag the files you unpacked in step 1 and drop them into the Upload Files window.
  4. Click Upload.
  5. Wait until the upload finishes. You should have around 900 files. You can check the number of files on the Dashboard.

Result: Your files have been uploaded and are shown on the Files page.

The Files page with the NHS files uploaded.
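
If you want to confirm the file count before uploading, the unzip step can also be done in a short script. This is a minimal sketch using only the Python standard library; the function name and any paths you pass to it are illustrative assumptions, not part of deepset Cloud.

```python
import zipfile
from pathlib import Path


def extract_and_count(zip_path: str, target_dir: str) -> int:
    """Unpack the archive into target_dir and return how many files it now holds."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Count regular files only; directories inside the archive are skipped.
    return sum(1 for path in target.rglob("*") if path.is_file())
```

Calling extract_and_count with the path of the downloaded archive and an empty target folder should return a number close to the roughly 900 files mentioned in step 5.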

Create a Pipeline

The next step is to define the components of your search app. We'll use a document retrieval template with an embedding-based retriever to create the pipeline.

  1. Go to Pipelines>New Pipeline.
  2. Under YAML Editor, click Create Pipeline and select From Template.
The YAML Editor component with the From Template option underlined.
  3. When the templates show up, find the Semantic Document Search template and click Use Template.

  4. When the Pipeline Designer opens, change the pipeline name in line 7 to NHS_doc_retrieval and save the pipeline.

    # If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
    # This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
    # Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
    
    # This is a document search pipeline that searches for documents based on semantic similarity. It uses a vector-based search.
    version: '1.21.0'
    name: 'NHS_doc_retrieval'
    
  5. Click Deploy to start indexing and prepare your pipeline for search.

  6. Return to the Pipelines page and wait until the status of your pipeline changes to Indexed. This can take a couple of minutes.
    Tip: When you hover your mouse over the status, you can see how many files have already been indexed.

Result: You created and deployed a pipeline, which means your documents have been indexed, and you can now run a search. Your pipeline shows on the Pipelines page with the status Indexed.

The Pipelines page with the NHS doc retrieval pipeline shown as indexed and deployed
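
To build some intuition for what an embedding-based retriever does with the indexed documents, here is a toy sketch of ranking by vector similarity. It is not the template's implementation: the three-dimensional vectors are hand-made stand-ins for real model embeddings, and the file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (name, score) pairs for the top_k documents most similar to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hand-made toy vectors; a real retriever computes these with an embedding model.
docs = {
    "eczema.txt": [0.9, 0.1, 0.0],
    "asthma.txt": [0.1, 0.9, 0.1],
    "sprains.txt": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedded search query

ranking = rank_documents(query, docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The query vector points in nearly the same direction as the first document's vector, so that document ranks first. Indexing is the step where document vectors like these are computed and stored.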

Search

Let's see what the pipeline can do.

  1. Go to Search.
  2. Choose NHS_doc_retrieval as the pipeline.
  3. Type "How do I treat atopic skin?" and search for relevant documents. You should get a list of documents sorted by relevance, with the most relevant ones first.

Result: Congratulations! You have built a search system that can retrieve documents related to health. You can now run health-related queries, and it will find relevant documents.
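
The same query can in principle be sent programmatically instead of through the Search page. The sketch below only builds the HTTP request and does not send it. The base URL, the endpoint path, the "queries" payload field, and bearer-token authentication are assumptions about the deepset Cloud REST API that you should check against the API reference; the workspace name and API key are placeholders.

```python
import json
import urllib.request

# Assumed base URL of the REST API; verify it in the API reference.
API_BASE = "https://api.cloud.deepset.ai/api/v1"


def build_search_request(workspace, pipeline, query, api_key):
    """Build (but do not send) a POST request that runs one query on a pipeline."""
    # Assumed endpoint layout: /workspaces/<workspace>/pipelines/<pipeline>/search
    url = f"{API_BASE}/workspaces/{workspace}/pipelines/{pipeline}/search"
    body = json.dumps({"queries": [query]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder workspace name and API key; replace them with your own values.
request = build_search_request(
    "my-workspace", "NHS_doc_retrieval", "How do I treat atopic skin?", "YOUR_API_KEY"
)
print(request.get_method(), request.full_url)
```

To run the search, pass the request to urllib.request.urlopen and parse the JSON response. The response format is not shown here because this tutorial does not describe it.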