Monitor Pipeline Performance
If you're wondering how many requests you can send from your search system to deepset Cloud or what speed you can expect, here are your answers.
Scaling
Unexpected surges in traffic are a challenge, but deepset Cloud handles scaling seamlessly in response to increased demand. It uses autoscaling, which automatically adjusts the infrastructure based on usage. This ensures there are no service disruptions, regardless of the number of concurrent requests: the system dynamically reallocates resources to maintain optimal performance. Scaling is entirely automated and doesn't require any manual adjustments from you.
Speed
Several factors influence the speed of your search system:
- Model size
  Pipelines with large models are slower. With LLMs, it's often a tradeoff between speed and answer quality: large models, like GPT-4, may give better answers than smaller models, but they're also much slower because of their size.
- Where your model runs
  You can run models locally, on your machine, or remotely, through an API. Small models used with Retrievers or Readers are fast to run locally because they don't need much computing power. Large models, like ChatGPT or GPT-4, need dedicated, optimized hardware, so running them remotely is usually faster.
- Pipeline configuration
  Some components or their settings can slow down your pipeline. For example, a Ranker improves the results but also slows the system down. The same goes for a Reader with a high top_k value. It's often about finding the balance between speed and quality.
- The length of generated responses
  For generative QA pipelines, the number of tokens you want the model to generate as an answer influences your system's speed. The longer the answer, the slower the system.
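One practical way to weigh these factors is to measure how latency changes as you vary a single setting, such as top_k. The sketch below is a minimal, self-contained illustration of that approach: `run_query` is a hypothetical stand-in for a call to your own deployed pipeline (it simulates latency growing with top_k), not a deepset Cloud API.

```python
import time
from statistics import median

def run_query(query: str, top_k: int) -> list[str]:
    # Hypothetical stand-in for a call to your deployed pipeline.
    # Here, each extra retrieved document is pretended to add latency.
    time.sleep(0.001 * top_k)
    return [f"doc-{i}" for i in range(top_k)]

def benchmark(query: str, top_k: int, runs: int = 5) -> float:
    """Return the median query latency in seconds over several runs."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(query, top_k)
        latencies.append(time.perf_counter() - start)
    return median(latencies)

for top_k in (5, 20, 50):
    print(f"top_k={top_k}: {benchmark('what is autoscaling?', top_k):.4f}s")
```

Swapping `run_query` for a real call to your pipeline lets you compare configurations (model size, top_k, with or without a Ranker) on the queries your users actually ask.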
Optimizing the speed of your pipeline is all about finding the balance among these factors. It involves choosing the right-sized model, running it where it’s fastest, and finding a configuration with optimal performance.
Pipeline Statistics
The deepset Cloud dashboard gives you basic information about your pipeline, such as the average response time or the number of searches run. To check what queries were asked, the top answers, and how long it took to find them, click the pipeline's name on the Pipelines page. This takes you to Pipeline Details, where you can view all the information about your pipeline.
Logs
Changes on their way
We're still actively working on this feature. This page describes its current, first implementation, which we'll be improving soon.
Check a pipeline's logs to see what has happened since the pipeline was deployed. You can view the logs on the Pipeline Details page (click the pipeline name to get there). Expand log messages for more details and possible actions.
Groundedness Score
Use the Groundedness Observability dashboard to track your RAG pipeline's groundedness score. This score tells you if the pipeline's answers are grounded in your documents. On the dashboard, you can observe how the score fluctuates over time and verify if the documents with the highest rank are referenced most often.
For more information on Groundedness Observability, see Check the Groundedness Score.
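deepset Cloud computes the groundedness score for you, but the underlying idea can be illustrated with a toy metric: the fraction of answer sentences that are supported by at least one retrieved document. The sketch below is only an illustration of the concept, using simple word overlap (the `toy_groundedness` function and its 0.5 overlap threshold are assumptions, not the actual scoring method):

```python
def toy_groundedness(answer_sentences: list[str], documents: list[str],
                     threshold: float = 0.5) -> float:
    """Toy metric: fraction of answer sentences whose words mostly
    appear in at least one source document."""
    doc_words = [set(doc.lower().split()) for doc in documents]
    grounded = 0
    for sentence in answer_sentences:
        words = {w.strip(".,!?") for w in sentence.lower().split()}
        if not words:
            continue
        # Best overlap with any single document, as a fraction of the sentence.
        overlap = max(len(words & dw) / len(words) for dw in doc_words)
        if overlap >= threshold:
            grounded += 1
    return grounded / len(answer_sentences)

docs = ["deepset Cloud uses autoscaling to handle traffic surges."]
answer = ["deepset cloud uses autoscaling", "It was founded on the moon"]
print(toy_groundedness(answer, docs))  # prints 0.5: one of two sentences is grounded
```

A real groundedness check is more sophisticated (it matches answer statements against referenced passages semantically, not by word overlap), but the intuition is the same: the higher the score, the more of the answer is backed by your documents.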