Evaluation Metrics

Running a search pipeline against an evaluation dataset to check if the results it returns are accurate is called an experiment.

Understanding the Metrics

Every time you run an experiment, it outputs metrics you can use to measure your pipeline's performance. There are different metrics for RAG, document search, and question answering pipelines. Here's an explanation of what they tell you:

Metrics for RAG Pipelines

Groundedness

The groundedness score measures how well a large language model's (LLM) responses stick to your documents. The score is a value between 0 and 1, with 1 indicating that all responses are grounded in the documents. Pipelines that achieve scores near 1 are less prone to producing hallucinations, which makes them more reliable and accurate.

No Answer Ratio

No answer ratio is the proportion of queries for which the model claims there's no answer in the documents. In these cases, the model replies with phrases like "The question cannot be answered" or "The answer is not in the documents", depending on what's in the prompt. No answer ratio ranges between 0 and 1, with 0 meaning the model always provides an answer and 1 meaning it never does.
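
Here's a minimal sketch of how such a ratio could be computed, assuming the refusal phrases produced by your prompt are known in advance (the phrases below are just examples, not the ones used internally):

```python
# Illustrative sketch only, not deepset Cloud's implementation.
REFUSALS = {
    "The question cannot be answered",
    "The answer is not in the documents",
}  # assumed refusal phrases; depends on your prompt

def no_answer_ratio(predicted_answers):
    """Fraction of predictions that are refusals or empty."""
    no_answers = sum(
        1 for answer in predicted_answers
        if not answer.strip() or answer.strip() in REFUSALS
    )
    return no_answers / len(predicted_answers)

answers = ["Berlin", "The answer is not in the documents", "42", ""]
print(no_answer_ratio(answers))  # 0.5
```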

Query Latency

Query latency refers to the time (measured in seconds) it takes to process a query, from the moment it's issued to when the answer is generated. Lower latency means a quicker response from the system. The speed largely depends on the model generating the answers, so if the response time isn't cutting it for you, experimenting with different LLMs might help. Sometimes you have to balance between getting speedy results and ensuring accuracy.
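
If you want a quick sense of latency outside of an experiment, you can time queries yourself. This sketch assumes a pipeline object with a run(query=...) method; adapt it to whatever interface you actually call:

```python
import time

def timed_query(pipeline, query):
    """Run a single query and return the result plus its latency in seconds."""
    start = time.perf_counter()
    result = pipeline.run(query=query)  # assumed interface, adjust as needed
    latency = time.perf_counter() - start
    return result, latency
```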

Optimizing the Scores

Balancing a high groundedness score with a low no answer ratio is challenging but crucial for effective RAG pipelines: it's hard to achieve both at once, yet that combination is exactly what you're aiming for in your RAG system.

If you're not happy with the metrics, try changing:

  • the LLM you're using
  • the prompt
  • the Retriever or the Retriever parameters, such as top_k.

Metrics for Document Search Pipelines

Recall

Recall measures how many times the correct document was among the retrieved documents over a set of queries. For a single query, the output is binary: the correct document is either among the retrieved documents or not.

Over the entire dataset, a recall score is a number between zero (no query retrieved the proper document) and one (all queries retrieved the correct documents).

If there are multiple correct documents for one query, the metric recall_single_hit measures if at least one of the correct documents is retrieved, and the metric recall_multi_hit measures how many correct documents for one query are retrieved.

Recall is affected by the number of documents that the retriever returns. The fewer documents returned, the more difficult it is to retrieve the correct ones. Make sure you set the retriever's top_k parameter to an appropriate value in the pipeline you're evaluating.
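
To make the single-hit and multi-hit variants concrete, here's an illustrative sketch (not the exact implementation) of both, computed for one query; the dataset-level score is the mean over all queries:

```python
def recall_single_hit(retrieved_ids, relevant_ids):
    """1.0 if at least one relevant document was retrieved, else 0.0."""
    return 1.0 if set(retrieved_ids) & set(relevant_ids) else 0.0

def recall_multi_hit(retrieved_ids, relevant_ids):
    """Fraction of the query's relevant documents that were retrieved."""
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

# One query with two relevant documents, only one of which was retrieved:
retrieved = ["doc1", "doc7", "doc3"]
relevant = ["doc3", "doc9"]
print(recall_single_hit(retrieved, relevant))  # 1.0
print(recall_multi_hit(retrieved, relevant))   # 0.5
```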

Mean Reciprocal Rank (MRR)

MRR is based on the position (rank) of the first correctly retrieved document: for each query, it takes the reciprocal of that rank (1 for the top result, 1/2 for the second, and so on) and averages these values over all queries. This accounts for the fact that a query elicits multiple responses of varying relevance.

MRR can be a value between zero (no matches) and one (the system retrieved a correct document for all queries as the top result).
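
Here's a small illustrative sketch of the computation (a simplified version, not the exact implementation):

```python
def mean_reciprocal_rank(ranked_results, relevant_ids_per_query):
    """ranked_results: one ranked list of document IDs per query."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(ranked_results, relevant_ids_per_query):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank  # only the first correct document counts
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Query 1 finds a correct document at rank 1, query 2 at rank 3:
print(mean_reciprocal_rank([["d1", "d2"], ["d5", "d6", "d2"]],
                           [{"d1"}, {"d2"}]))  # (1 + 1/3) / 2 ≈ 0.67
```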

Mean Average Precision (mAP)

mAP takes into account the position of every correctly retrieved document. For each query, it averages the precision at each rank where a correct document appears, and then averages these values over all queries. It can be a value between zero (no matches) and one (the system retrieved correct documents at the top positions for all queries).

This metric is handy when there is more than one correct document to be retrieved.
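
As a simplified sketch of the per-query part (average precision), dividing by the number of relevant documents so that missed documents also lower the score:

```python
def average_precision(retrieved_ids, relevant_ids):
    """Average of precision@k at every rank k where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Two relevant documents, retrieved at ranks 1 and 3:
print(average_precision(["d1", "d4", "d2"], {"d1", "d2"}))  # (1/1 + 2/3) / 2 ≈ 0.83

# mAP is then the mean of average_precision over all queries.
```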

Normalized Discounted Cumulative Gain (NDCG)

NDCG is a ranking performance measure that focuses on the positions of the relevant documents in the search results. It adds up the relevance of each retrieved document, discounted by the position at which it appears, and then normalizes the result by the score of an ideal ranking.

NDCG is a float number between 0.0 and 1.0, where 1.0 is the perfect score.
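
A minimal sketch of the standard formulation, using a logarithmic discount (the exact gain and discount functions can vary between implementations):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the retrieved documents, in the order they were returned:
print(ndcg([3, 1, 0, 2]))  # ≈ 0.94
```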

Precision

Precision tells you how precise the system is: of all the retrieved documents, how many were relevant to the query. For example, if the system retrieves 40 documents and 30 of them are relevant, its precision is 30/40 = 3/4.
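
In code, a simplified version for a single query looks like this:

```python
def precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are relevant to the query."""
    retrieved = set(retrieved_ids)
    return len(retrieved & set(relevant_ids)) / len(retrieved)

print(precision(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d9"]))  # 0.75
```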

Metrics for Question Answering Pipelines

Exact Match (EM)

An exact match measures the proportion of cases where the predicted answer is identical to the correct answer. For example, for the annotated question-answer pair What is Haystack? + A question answering library in Python, even a predicted answer like A Python question answering library would yield a zero score because it does not match the expected answer 100%.
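
A bare-bones sketch of the comparison (real implementations usually also normalize punctuation and articles before comparing, which this version skips):

```python
def exact_match(predicted, expected):
    """1 if the predicted answer is identical to the expected one, else 0."""
    return int(predicted.strip().lower() == expected.strip().lower())

print(exact_match("A Python question answering library",
                  "A question answering library in Python"))  # 0
print(exact_match("A question answering library in Python",
                  "A question answering library in Python"))  # 1
```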

F1

The F1 score is more forgiving and measures the word overlap between the labeled and the predicted answer. Whenever the EM is 1, F1 is also 1.
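
Here's a simplified, set-based sketch of the idea (standard implementations count overlapping tokens with multiplicity, but the principle is the same):

```python
def answer_f1(predicted, expected):
    """Word-overlap F1 between a predicted and a labeled answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    common = set(pred_tokens) & set(gold_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("A Python question answering library",
                "A question answering library in Python"))  # ≈ 0.91
```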

Semantic Answer Similarity (SAS)

SAS uses a transformer-based, cross-encoder architecture to evaluate the semantic similarity of two answers rather than their lexical overlap. SAS is particularly useful for seeking out cases where F1 doesn't give a good indication of the validity of a predicted answer.
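
If you want to experiment with a similar score yourself, a cross-encoder trained on semantic textual similarity works well. This sketch uses the sentence-transformers library; the model name is just an example, not necessarily the one used for SAS here:

```python
from sentence_transformers import CrossEncoder

# Example STS cross-encoder; swap in whichever similarity model you prefer.
model = CrossEncoder("cross-encoder/stsb-roberta-large")
score = model.predict([("A question answering library in Python",
                        "A Python question answering library")])[0]
print(score)  # close to 1.0 for semantically equivalent answers
```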

Integrated and Isolated Evaluation

There are two ways in which the nodes in your pipeline are evaluated: how they perform in the pipeline (integrated evaluation) and in isolation (isolated evaluation).

The Experiment Details page shows metrics for both integrated and isolated evaluation.

Integrated Evaluation

In integrated evaluation, a node is evaluated based on the actual output of the pipeline compared with the expected output. Keep in mind that in integrated evaluation, a node receives the predictions from the preceding node as input, so the performance of the preceding node affects the node you're evaluating. For example, when evaluating a reader in a pipeline, remember that it returns results based on the documents it received from the retriever. If the retriever passes the wrong documents to the reader, the reader's performance will be poor.

Isolated Evaluation

In isolated evaluation, a node receives only the documents that contain the answers as input. So, the isolated evaluation shows how the node performs if it gets the perfect input from the preceding node.
Isolated evaluation is currently only possible for readers.
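
Conceptually, the difference looks like this. The sketch below is hypothetical: retriever and reader stand for any components with retrieve()- and predict()-style methods, not a specific API:

```python
def run_integrated(reader, retriever, query):
    """Integrated: the reader works on whatever the retriever actually returned,
    so a weak retriever drags the reader's score down."""
    return reader.predict(query, retriever.retrieve(query))

def run_isolated(reader, query, gold_documents):
    """Isolated: the reader gets the gold documents, that is, perfect input,
    so the score reflects the reader alone."""
    return reader.predict(query, gold_documents)

# The returned answers are then scored with EM, F1, or SAS as usual.
```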

How to Make Sense of It

The two types of evaluation can give you a hint of which node to improve. Suppose a reader performs poorly in integrated evaluation but shows good performance in isolated evaluation. In that case, it means there's something wrong with the input it receives, and you should improve your retriever.

If the results from both evaluation types don't differ that much, you may need to improve the reader to achieve better results.