Check the Groundedness Score

Ensure your AI's answers are data-backed and reliable. Learn how to access and interpret the score using the Groundedness Observability Dashboard.

What Is the Groundedness Score?

The groundedness score is a metric for retrieval-augmented generation (RAG) pipelines that measures how well the generated answers are grounded in your documents. For RAG pipelines, it's essential that the answers the LLM generates are grounded in your data: this ensures the generated content is based on information you can rely on and verify. Groundedness is especially important in apps where accuracy is critical, and users are more likely to trust a system that consistently provides grounded, accurate information.

You can monitor your RAG pipelines' groundedness score using the Groundedness Observability Dashboard. The score ranges from 0 (poor groundedness) to 1 (very good; all answers are grounded in the data). It's calculated using a cross-encoder model as follows:

  1. For each sentence that requires verification, the model determines the highest groundedness score among all references supporting that sentence.
  2. The model calculates the average of these maximum scores for all verified sentences in a document. This gives the document's groundedness score.
  3. The overall groundedness score is the average of all individual document groundedness scores.

Only sentences that require verification are included in this calculation.
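The three aggregation steps above can be sketched in code. This is a minimal illustration only: the cross-encoder scoring itself is not shown, and the nested-list input format is an assumption made for this example.

```python
# Sketch of the groundedness aggregation described above.
# Input format (an assumption for this example):
#   documents -> sentences -> list of per-reference cross-encoder scores.

def sentence_score(reference_scores):
    """Step 1: for each sentence, the best-supporting reference wins."""
    return max(reference_scores)

def document_score(sentences):
    """Step 2: average the per-sentence maxima within one document."""
    maxima = [sentence_score(refs) for refs in sentences]
    return sum(maxima) / len(maxima)

def overall_score(documents):
    """Step 3: average the document scores to get the overall score."""
    doc_scores = [document_score(doc) for doc in documents]
    return sum(doc_scores) / len(doc_scores)

# Two documents; each sentence lists cross-encoder scores for its references.
answers = [
    [[1.0], [0.5]],        # doc 1: per-sentence maxima 1.0 and 0.5 -> 0.75
    [[0.25, 0.75]],        # doc 2: single sentence, max 0.75 -> 0.75
]
print(overall_score(answers))  # 0.75
```

Note that only sentences requiring verification would be fed into this calculation; purely conversational sentences are skipped.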

Using the Observability Dashboard

  1. Log in to deepset Cloud and go to Groundedness.
  2. Choose the pipeline whose groundedness you want to check. The dashboard displays the data for that pipeline.
*The Groundedness Observability Dashboard showing the groundedness score for a RAG pipeline.*

Navigating the Dashboard

At the top of the dashboard, you can check the overall groundedness score for your pipeline (1). The graph in the Groundedness Score section shows how the score changed over time. Changes in groundedness can happen if the data, the model, or the pipeline is updated. By hovering your mouse over any point on the graph, you can see the average groundedness score for answers at that point in time (2).

*The Groundedness Score graph: the summary box shows a score of 0.71, labeled "Fair," over 19 queries. Hovering over a point on the line reveals the average groundedness score at that time.*

You can choose the time range for the data and switch between pipelines. The groundedness score is available only for retrieval-augmented generation (RAG) pipelines.

The Documents Referenced section shows you how a document's ranking correlates with its reference frequency. The ranking comes from the pipeline (from the last node that ranks documents, typically a Ranker or a Retriever). Beneath each rank, you can see a percentage representing that document's share of all references.

*The Documents Referenced bar chart: rank 1 has 10 references (38.46% of the total), rank 2 has 8 (30.77%), rank 3 has 6 (23.08%), and rank 4 has 2 (7.69%).*
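The per-rank percentages are each rank's reference count divided by the total number of references. A small sketch, using the counts from the example chart above:

```python
# Derive each rank's share of all references, as shown in the
# Documents Referenced chart (counts taken from the example chart).
references_per_rank = {1: 10, 2: 8, 3: 6, 4: 2}

total = sum(references_per_rank.values())
shares = {rank: round(100 * count / total, 2)
          for rank, count in references_per_rank.items()}
print(shares)  # {1: 38.46, 2: 30.77, 3: 23.08, 4: 7.69}
```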

Understanding these metrics can help in several ways:

  • It indicates your Retriever's performance. If low-ranked documents are referenced more often than high-ranked ones, the Retriever's ranking doesn't reflect how useful the documents actually are, and it could be improved.
  • It's an opportunity to save costs. By identifying and excluding documents that are rarely used as references, you can reduce the number of tokens sent to the model in the prompt. For example, if documents ranked at 4 are not referenced anywhere, you can set the pipeline's top_k to 3. This way only documents ranked 1 to 3 are sent in the prompt as the context to generate answers.
    (Tip: Modify the top_k parameter of the node that sends documents to PromptNode. In a RAG pipeline, this is typically the Retriever.)