Monitor Pipeline Performance

Learn how deepset Cloud handles the volume of requests your search system sends and what response speed you can expect.


Scalability

Unexpected surges in traffic can be a challenge, but deepset Cloud handles them seamlessly. It uses autoscaling to adjust the underlying infrastructure automatically based on usage, dynamically reallocating resources so that performance stays optimal regardless of the number of concurrent requests. Scaling is fully automated and requires no manual adjustments from you.
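Because scaling happens server-side, your client code only needs to parallelize its requests. Here's a minimal sketch of sending a burst of concurrent queries; the `search` function is a stand-in stub, not the real deepset Cloud client, so replace it with your actual API call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def search(query: str) -> dict:
    # Stand-in for a real search request to a deployed pipeline;
    # swap in your actual deepset Cloud API call here.
    time.sleep(0.01)  # simulate network + inference latency
    return {"query": query, "answers": []}

queries = [f"question {i}" for i in range(20)]

# Fire all queries concurrently; autoscaling absorbs the burst on the
# server side, so the client just needs a thread pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(search, queries))

print(len(results))  # 20
```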


Speed

Several factors influence the speed of your search system:

  • Model size
    Pipelines with large models are slower. With LLMs, it’s often a tradeoff between speed and performance. Large models, like GPT-4, may give better answers than smaller models, but they’re also much slower because of their size.
  • Where your model runs
    You can run models locally, on your machine, or remotely, through an API. Small models used with Retrievers or Readers are fast to run locally because they don't need much compute. Large models, like ChatGPT or GPT-4, need dedicated, optimized hardware, so it's usually faster to run them remotely.
  • Pipeline configuration
    Some components or their settings can make your pipeline slower. For example, a Ranker improves the results but also slows the system down. The same goes for a Reader with a high top_k value. It’s often about finding the balance between speed and performance.
  • The length of generated responses
    For generative QA pipelines, the number of tokens you want the model to generate as an answer influences the speed of your system. The longer the answer, the slower the system.
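One practical way to weigh these factors is to measure them: time your pipeline under different configurations and compare. The sketch below uses stub functions in place of real pipelines (the names and sleep durations are purely illustrative), but the timing helper works unchanged against a real search call:

```python
import time
from statistics import mean

def timed(fn, *args, repeats=5):
    """Return the mean wall-clock latency of fn(*args) over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return mean(samples)

# Stubs standing in for real deployments: a larger model or a higher
# top_k typically means more work per query.
def small_pipeline(query):
    time.sleep(0.005)  # simulated fast configuration

def large_pipeline(query):
    time.sleep(0.02)   # simulated heavy configuration

fast = timed(small_pipeline, "test query")
slow = timed(large_pipeline, "test query")
print(fast < slow)  # True: the heavier configuration is measurably slower
```

Running the same comparison against your deployed pipelines tells you what a configuration change actually costs in latency, rather than guessing.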

Optimizing the speed of your pipeline is all about finding the balance among these factors. It involves choosing the right-sized model, running it where it's most efficient, and configuring your pipeline to balance speed against the quality of results.

Pipeline Statistics

The deepset Cloud dashboard gives you basic information about your pipeline, such as the average response time or the number of searches run. If you want to check which queries were asked, the top answers, and how long it took to find them, click the name of the pipeline on the Pipelines page. This takes you to Pipeline Details, where you can view all the information about your pipeline.
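If you also collect response times client-side, you can compute the same kind of summary statistics yourself. The latency values below are hypothetical placeholders:

```python
from statistics import mean

# Hypothetical per-query latencies in seconds, as you might collect
# client-side while querying your pipeline.
latencies = [0.42, 0.38, 0.55, 1.20, 0.47]

avg = mean(latencies)       # average response time, as on the dashboard
worst = max(latencies)      # the slowest query is often the interesting one
print(f"average response time: {avg:.2f}s, slowest query: {worst:.2f}s")
```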



Changes on their way

We're still actively working on this feature to make it better. This page describes its current, first implementation. We'll be updating it soon to make it smoother.

Check your pipeline's logs to see what has happened since it was deployed. You can view the logs on the Pipeline Details page (just click the pipeline name to get there). Expand messages for more details and possible actions.
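When scanning logs for issues, a common pattern is to filter by severity first. The entries below are invented `(level, message)` pairs for illustration; real deepset Cloud logs carry more fields, but the filtering idea is the same:

```python
# Hypothetical log entries; real logs have richer structure.
logs = [
    ("INFO", "Pipeline deployed"),
    ("WARNING", "Slow response from model endpoint"),
    ("ERROR", "Query failed: timeout"),
    ("INFO", "Query served in 0.4s"),
]

# Surface only the entries that may need action.
actionable = [msg for level, msg in logs if level in {"WARNING", "ERROR"}]
for msg in actionable:
    print(msg)
```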

Groundedness Score

Use the Groundedness Observability dashboard to track your RAG pipeline's groundedness score. This score tells you whether the answers the pipeline generates are grounded in your documents. On the dashboard, you can observe how the score fluctuates over time and check whether the highest-ranked documents are also the ones referenced most often.
For more information on Groundedness Observability, see Check the Groundedness Score.
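To build intuition for what such a score captures, here is a deliberately simplified toy version: the fraction of answer sentences that appear verbatim in any retrieved document. This is an illustration only, not deepset Cloud's actual groundedness metric:

```python
def groundedness(answer_sentences, documents):
    """Toy groundedness: the fraction of answer sentences found
    (as case-insensitive substrings) in any retrieved document.
    NOT the real metric -- an illustration of the concept only."""
    if not answer_sentences:
        return 0.0
    supported = sum(
        any(sentence.lower() in doc.lower() for doc in documents)
        for sentence in answer_sentences
    )
    return supported / len(answer_sentences)

docs = ["The Eiffel Tower is 330 metres tall and stands in Paris."]
answer = ["The Eiffel Tower is 330 metres tall", "It was painted blue in 2024"]
print(groundedness(answer, docs))  # 0.5 -- one of two sentences is grounded
```

A low score like this flags answers that mix grounded statements with unsupported ones, which is exactly the pattern the dashboard helps you spot over time.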