Experiment Metrics

An experiment runs a search pipeline against an evaluation dataset to check whether the results it returns are accurate.

Understanding the Metrics

Every time you run an experiment, it outputs metrics you can use to measure your pipeline's performance. There are different metrics for information retrieval and question answering pipelines. Here's an explanation of what they tell you:

Metrics for Information Retrieval Pipelines

Recall

Recall measures how many times the correct document was among the retrieved documents over a set of queries. For a single query, the output is binary: the correct document is either among the retrieved documents or not.

Over the entire dataset, a recall score is a number between zero (no query retrieved a correct document) and one (every query retrieved a correct document).

If there are multiple correct documents for one query, the metric recall_single_hit measures whether at least one of the correct documents is retrieved, and the metric recall_multi_hit measures what share of the correct documents for that query is retrieved.

Recall is affected by the number of documents that the retriever returns. The fewer documents it returns, the less likely it is that the correct documents are among them. Ensure that you set the retriever's top_k parameter to an appropriate value in the pipeline that you are evaluating.
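The two recall variants and the effect of top_k can be illustrated with a minimal sketch. The function names and the toy data below are illustrative assumptions, not the platform's implementation.

```python
def recall_single_hit(retrieved, relevant):
    """Return 1.0 if at least one correct document was retrieved, else 0.0."""
    return 1.0 if set(retrieved) & set(relevant) else 0.0


def recall_multi_hit(retrieved, relevant):
    """Return the fraction of the correct documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def mean_over_queries(metric, results, top_k):
    """Average a per-query metric, keeping only the first top_k retrieved documents."""
    scores = [metric(retrieved[:top_k], relevant) for retrieved, relevant in results]
    return sum(scores) / len(scores)


# Hypothetical evaluation data: (ranked retrieved document IDs, correct document IDs).
results = [
    (["d1", "d2", "d3"], ["d3", "d9"]),
    (["d4", "d5", "d6"], ["d4"]),
]

single_at_3 = mean_over_queries(recall_single_hit, results, top_k=3)  # 1.0
multi_at_3 = mean_over_queries(recall_multi_hit, results, top_k=3)    # 0.75
single_at_2 = mean_over_queries(recall_single_hit, results, top_k=2)  # 0.5
```

Lowering top_k from 3 to 2 drops the only correct document retrieved for the first query, which halves the single-hit recall.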

Mean Reciprocal Rank (MRR)

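MRR is based on the position (rank) of the first correctly retrieved document. For each query, it takes the reciprocal of that rank (1 for the first position, 1/2 for the second, and so on, or zero if no correct document is retrieved) and averages the values over all queries. It does this to account for the fact that a query elicits multiple responses of varying relevance.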

MRR can be a value between zero (no matches) and one (the system retrieved a correct document for all queries as the top result).
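As a rough sketch of the calculation, assuming each query comes with a ranked list of retrieved document IDs and a set of correct IDs (the names and data here are hypothetical):

```python
def reciprocal_rank(retrieved, relevant):
    """Return 1/rank of the first correct document, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


# Hypothetical data: (ranked retrieved document IDs, correct document IDs).
results = [
    (["d1", "d2", "d3"], ["d2"]),  # first correct document at rank 2 -> 0.5
    (["d4", "d5"], ["d4"]),        # first correct document at rank 1 -> 1.0
    (["d6", "d7"], ["d9"]),        # no correct document retrieved   -> 0.0
]

mrr = mean_reciprocal_rank(results)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```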

Mean Average Precision (mAP)

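mAP takes into account the position of every correctly retrieved document. For each query, it averages the precision measured at the rank of each correct document, and then takes the mean over all queries. It can be a value between zero (no matches) and one (the system ranked the correct documents at the top for all queries).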

This metric is handy when there is more than one correct document to be retrieved.
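A minimal sketch of one common way to compute this metric follows. It assumes average precision is normalized by the total number of correct documents for the query; implementations differ on this detail, and the names and data are illustrative only.

```python
def average_precision(retrieved, relevant):
    """Average the precision measured at the rank of each correct document."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by all correct documents, retrieved or not.
    return precision_sum / len(relevant)


def mean_average_precision(results):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)


# Hypothetical data: (ranked retrieved document IDs, correct document IDs).
results = [
    (["d1", "d2", "d3", "d4"], ["d1", "d3"]),  # (1/1 + 2/3) / 2, about 0.83
    (["d5", "d6"], ["d6"]),                    # (1/2) / 1 = 0.5
]

map_score = mean_average_precision(results)  # about 0.67
```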

Normalized Discounted Cumulative Gain (NDCG)

NDCG is a ranking performance measure that focuses on the positions of the relevant documents in the search results. It discounts the relevance of each retrieved document by its position, sums up the discounted values, and normalizes the result by the score of an ideal ranking.

NDCG is a float number between 0.0 and 1.0, where 1.0 is the perfect score.
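The sketch below shows one way to compute NDCG for a single query. It assumes binary relevance (a document is either correct or not) and the usual logarithmic discount; graded relevance and other discount functions are also in use, and the function names are hypothetical.

```python
import math


def dcg(gains):
    """Sum the gains, each discounted by log2(rank + 1)."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(retrieved, relevant):
    """Divide the DCG of the actual ranking by the DCG of an ideal ranking."""
    relevant = set(relevant)
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in retrieved]
    # Ideal ranking: all correct documents that fit in the list come first.
    ideal_gains = [1.0] * min(len(relevant), len(retrieved))
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


perfect = ndcg(["d2", "d1", "d3"], ["d2"])  # correct document at rank 1 -> 1.0
lower = ndcg(["d1", "d2", "d3"], ["d2"])    # correct document at rank 2 -> about 0.63
```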

Precision

Precision is the proportion of retrieved documents that are relevant to the query. For example, if the system retrieves 40 documents and 30 are relevant, its precision is 30/40 = 0.75.
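The example above can be reproduced with a short sketch. The function name and the document IDs are made up for illustration.

```python
def precision(retrieved, relevant):
    """Return the fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


# 40 retrieved documents, of which the first 30 are relevant.
retrieved = [f"d{i}" for i in range(40)]
relevant = [f"d{i}" for i in range(30)]

score = precision(retrieved, relevant)  # 30 / 40 = 0.75
```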

Metrics for Question Answering Pipelines

Exact Match (EM)

Exact match measures the proportion of cases where the predicted answer is identical to the correct answer. For example, for the annotated question-answer pair "What is Haystack?" + "A question answering library in Python", even a predicted answer like "A Python question answering library" would yield a zero score because it does not match the expected answer exactly.
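A minimal sketch of the idea follows, assuming a plain string comparison after trimming surrounding whitespace. Real implementations may normalize case or punctuation before comparing, and the function names here are hypothetical.

```python
def exact_match(prediction, label):
    """Return 1.0 if the predicted answer equals the labeled answer, else 0.0."""
    # Assumption: only surrounding whitespace is ignored.
    return 1.0 if prediction.strip() == label.strip() else 0.0


def exact_match_score(pairs):
    """Proportion of (prediction, label) pairs that match exactly."""
    return sum(exact_match(p, l) for p, l in pairs) / len(pairs)


label = "A question answering library in Python"
pairs = [
    ("A question answering library in Python", label),  # identical -> 1.0
    ("A Python question answering library", label),     # reworded  -> 0.0
]

em = exact_match_score(pairs)  # 0.5
```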

F1

The F1 score is more forgiving than EM: it measures the word overlap between the labeled and the predicted answer. Whenever the EM is 1, F1 is also 1.
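One common way to compute a word-overlap F1 is sketched below. It assumes lowercased, whitespace-separated tokens; actual tokenization and normalization may differ, and the function name is illustrative.

```python
from collections import Counter


def f1_score(prediction, label):
    """Harmonic mean of token-level precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    label_tokens = label.lower().split()
    # Multiset intersection counts each shared word at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(label_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


label = "A question answering library in Python"
prediction = "A Python question answering library"

# All 5 predicted words appear in the 6-word label:
# precision = 5/5, recall = 5/6, F1 = 10/11, about 0.91.
f1 = f1_score(prediction, label)
```

The reworded answer that scores zero on exact match still gets a high F1, because word order does not matter for the overlap.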

Semantic Answer Similarity (SAS)

SAS uses a transformer-based, cross-encoder architecture to evaluate the semantic similarity of two answers rather than their lexical overlap. SAS is particularly useful for finding cases where F1 doesn't give a good indication of the validity of a predicted answer.

Integrated and Isolated Evaluation

There are two ways in which the nodes in your pipeline are evaluated: how they perform in the pipeline (integrated evaluation) and how they perform in isolation (isolated evaluation).

The Experiment Details page shows metrics for both integrated and isolated evaluation.

Integrated Evaluation

In integrated evaluation, a node is evaluated based on the actual output of the pipeline compared with the expected output. In this setup, a node receives the predictions from the preceding node as input, so the performance of the preceding node affects the node you're evaluating. For example, when evaluating a reader in a pipeline, remember that it returns results based on the documents it received from the retriever. If the retriever passes the wrong documents to the reader, the reader's performance will be poor.

Isolated Evaluation

In isolated evaluation, a node receives only the documents that contain the answers as input. So, the isolated evaluation shows how the node performs if it gets the perfect input from the preceding node.

Isolated evaluation is currently only possible for readers.

How to Make Sense of It

The two types of evaluation can give you a hint of which node to improve. If a reader performs poorly in integrated evaluation but well in isolated evaluation, there's something wrong with the input it receives, and you should improve your retriever.

If the results from both evaluation types don't differ much, the reader itself is the limiting factor, and you may need to improve it to achieve better results.
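This rule of thumb can be written down as a small helper. The function name, the use of F1 as the compared metric, and the 0.1 gap threshold are all assumptions made for illustration; choose a metric and threshold that fit your own data.

```python
def suggest_node_to_improve(integrated_f1, isolated_f1, gap_threshold=0.1):
    """Suggest which node to work on, based on the reader's two scores."""
    # A large gap means the reader does well with perfect input,
    # so the documents it gets from the retriever are the likely problem.
    if isolated_f1 - integrated_f1 > gap_threshold:
        return "retriever"
    # Similar scores mean better input would not help much.
    return "reader"


large_gap = suggest_node_to_improve(integrated_f1=0.4, isolated_f1=0.8)
small_gap = suggest_node_to_improve(integrated_f1=0.45, isolated_f1=0.5)
```

Here large_gap is "retriever" and small_gap is "reader".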