About Experiments

After you've created a pipeline, it’s time to evaluate it using experiments. Test your pipeline against an evaluation dataset and share it with other users to collect their feedback.


Supported Nodes

Currently, deepset Cloud supports the evaluation of Preprocessor, Retriever, and Reader nodes. You can combine all types of nodes in your pipelines, but only these three are evaluated.


Experiments are an essential step in designing your pipeline as they help you to:

  • Find the best pipeline configuration for your use case.
  • Evaluate if the pipeline performance is sufficient to move it to production.
  • Generate new hypotheses for improving your pipeline.
  • Track the settings you used for previous versions of your pipeline.

During an experiment run, deepset Cloud runs all the questions from your evaluation dataset through the pipeline that you chose for evaluation. The pipeline searches for answers in all the files that you indicated for your experiment run. deepset Cloud then compares the results that the pipeline returned with the answers that you labeled in the evaluation dataset and calculates the metrics that you can use to tweak your pipeline.
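To make the comparison step concrete, here is a minimal sketch of what such an evaluation loop could look like. It is not deepset Cloud's implementation: the `run_experiment` function, the toy dataset, and the stand-in pipeline are all hypothetical, and the exact match and F1 definitions follow the common SQuAD-style convention (lowercasing, stripping punctuation and English articles), which is an assumption about how answers are compared.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predicted, expected):
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(predicted) == normalize(expected))


def f1_score(predicted, expected):
    """Token-level F1 between the predicted and the labeled answer."""
    pred_tokens = normalize(predicted).split()
    gold_tokens = normalize(expected).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def run_experiment(pipeline, dataset):
    """Run every question through the pipeline and average the metrics.

    `pipeline` is a hypothetical callable mapping a question to one
    predicted answer; `dataset` is a list of (question, labeled answer).
    """
    em_total = 0.0
    f1_total = 0.0
    for question, expected in dataset:
        predicted = pipeline(question)
        em_total += exact_match(predicted, expected)
        f1_total += f1_score(predicted, expected)
    n = len(dataset)
    return {"exact_match": em_total / n, "f1": f1_total / n}


# Toy evaluation dataset and a stand-in "pipeline" (both made up).
toy_dataset = [
    ("Who wrote Faust?", "Johann Wolfgang von Goethe"),
    ("What is the capital of France?", "Paris"),
]
toy_answers = {
    "Who wrote Faust?": "Goethe",
    "What is the capital of France?": "Paris.",
}

if __name__ == "__main__":
    print(run_experiment(lambda q: toy_answers[q], toy_dataset))
```

In this toy run, the first prediction only partially overlaps with the labeled answer, so it lowers the average F1 without counting as an exact match, while the second prediction matches exactly once punctuation is normalized.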

Running experiments is a formal way to evaluate your model and see how often it predicts the correct answer.

Overview of Pipeline Evaluation

Evaluating a pipeline involves several steps. This image shows you an overview of the whole process:

 A diagram showing the steps for pipeline evaluation. The steps are then described below.

  1. First, you choose the files to include in the evaluation. The pipeline you're evaluating will run on these files. The more files there are, the more difficult it is to find the right answer.


Coming soon!

Currently, the evaluation runs on all the files in your workspace, but we're working on making it possible to choose a subset of files for an experiment.

  2. The evaluation dataset contains the annotated data used to estimate your model's skills. You can choose one dataset per evaluation run.
  3. An experiment is the part where you actually check how your pipeline is performing. When an experiment finishes, you can review its details on the Experiment Details page, where you can check:
  • The experiment status. If the experiment failed, check the Debug section to see what went wrong.
  • The details of the experiment: the pipeline and the evaluation set used.
  • Metrics for pipeline components. You can see metrics for both integrated and isolated evaluation. For more information about metrics, see Experiments and Metrics.
  • The pipeline parameters and configuration used for this experiment. They may differ from the actual pipeline because you can update your pipeline just for an experiment run, without modifying the actual pipeline.
    You can't edit your pipeline in this view.
  • Detailed predictions. Here you can see how your pipeline performed and compare the answers it returned (predicted answers) with the expected answers. For each predicted answer, deepset Cloud displays the exact match, F1 score, and rank. The predictions are shown for each node separately.
    You can export these data into a CSV file. Open the node whose predictions you want to export and click Download CSV.
    If you're not happy with your pipeline performance, you can try exchanging the Reader or Retriever nodes, or changing the model you're using. When you've done this, run the experiment again using the updated pipeline and check whether the results improved.
    If you need more information about the metrics and their meaning, see Experiment Metrics.
  4. With deepset Cloud, you can easily demonstrate what your pipeline can do. Invite people to your organization, let them test your search, and collect their feedback. Everyone listed on the Organization page can run a search with your pipelines. Have a look at Guidelines for Onboarding Your Users to ensure you set the right expectations for your pipeline performance.

  5. Reviewing search statistics is another way to check how your pipeline is doing. You can find all the data on the Dashboard in the LATEST REQUESTS section. Here you can check:

  • The actual query
  • The answer with the highest score
  • The pipeline used for the search
  • Top file, which for a QA pipeline is the file that contains the top answer, and for a document retrieval pipeline is the file with the highest score
  • Who ran the search
  • How many seconds it took to find the answer
