Enable Streaming

Streaming means that a large language model (LLM) shows text as it generates it, rather than waiting for the entire response to be ready. It's similar to watching someone type in real time. You can enable streaming for the Generators in your pipelines.

Streaming is a technique often used in chat interfaces. It makes responses feel faster because users see the output immediately and can start reading while the rest of the text generates. It also makes it possible to interrupt the LLM if needed. This is particularly useful for longer responses, where generation may take several seconds to complete.

Enabling Streaming

To enable streaming, set the Generator's streaming_callback parameter to deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback. If your pipeline has multiple Generators, you can enable streaming for each one.

If none of the Generators has streaming_callback set, the last Generator in the pipeline streams by default.
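
The default rule can be modeled in a few lines of Python. This is an illustrative sketch only, not platform code: the function name streaming_generators and the (name, init_parameters) input shape are hypothetical, and it assumes that "no streaming_callback is set" means that none of the Generators in the pipeline sets it.

```python
STREAMING_CALLBACK = "deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback"


def streaming_generators(generators):
    """Return the names of the Generators expected to stream.

    `generators` is a list of (name, init_parameters) pairs in pipeline order.
    Hypothetical helper that only models the rule described above.
    """
    explicit = [
        name
        for name, params in generators
        if params.get("streaming_callback") == STREAMING_CALLBACK
    ]
    if explicit:
        return explicit
    # Assumed default: with no callback set anywhere, the last Generator streams.
    return [generators[-1][0]] if generators else []


# Two Generators, neither with a callback: only the last one is expected to stream.
no_callbacks = [("chat_summary_llm", {}), ("qa_llm", {})]
print(streaming_generators(no_callbacks))

# Both Generators set the callback: both are expected to stream.
both = [
    ("chat_summary_llm", {"streaming_callback": STREAMING_CALLBACK}),
    ("qa_llm", {"streaming_callback": STREAMING_CALLBACK}),
]
print(streaming_generators(both))
```

Treat the result as a reading aid for your configuration rather than a guarantee of runtime behavior.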

Here is an example of a Generator with streaming enabled:

YAML configuration:

CohereGenerator:
    type: haystack_integrations.components.generators.cohere.generator.CohereGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - COHERE_API_KEY
        - CO_API_KEY
        strict: false
      model: command-r
      streaming_callback: deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback

Streaming with API

You can use streaming with the stream API endpoints: Chat Stream and Search Stream. This is an example request to the Search Stream endpoint, shown first with cURL and then with Python.

cURL:

curl --request POST \
     --url https://api.cloud.deepset.ai/api/v1/workspaces/WORKSPACE_NAME/pipelines/PIPELINE_NAME/search-stream \
     --header 'accept: application/json' \
     --header 'authorization: Bearer DEEPSET_API_KEY' \
     --header 'content-type: application/json' \
     --data '
{
  "debug": false,
  "include_result": true,
  "view_prompts": false,
  "query": "who started all-girl bands?"
}
'

Python:

import requests

url = "https://api.cloud.deepset.ai/api/v1/workspaces/WORKSPACE_NAME/pipelines/PIPELINE_NAME/search-stream"

payload = {
    "debug": False,
    "include_result": True,
    "view_prompts": False,
    "query": "who started all-girl bands?"
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Bearer DEEPSET_API_KEY"
}

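# stream=True makes requests hand over the body incrementally instead of buffering it all
response = requests.post(url, json=payload, headers=headers, stream=True)
response.raise_for_status()

# Print each non-empty line as soon as it arrives
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))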

Replace:

  • WORKSPACE_NAME: With the name of the workspace containing your pipeline.
  • PIPELINE_NAME: With the name of the pipeline to use for search.
  • DEEPSET_API_KEY: With your deepset API key.
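
If you call the endpoint from several places, filling in the placeholders can be wrapped in a small helper. The sketch below is a convenience under stated assumptions, not part of any deepset client library: the names search_stream_url and auth_headers are hypothetical, and percent-encoding the workspace and pipeline names is only a precaution in case they contain characters that are not URL-safe.

```python
from urllib.parse import quote

BASE_URL = "https://api.cloud.deepset.ai/api/v1"


def search_stream_url(workspace_name, pipeline_name):
    """Build the Search Stream URL for a workspace and pipeline (hypothetical helper)."""
    workspace = quote(workspace_name, safe="")
    pipeline = quote(pipeline_name, safe="")
    return f"{BASE_URL}/workspaces/{workspace}/pipelines/{pipeline}/search-stream"


def auth_headers(api_key):
    """Return the same headers as the request example, with the API key filled in."""
    return {
        "accept": "application/json",
        "content-type": "application/json",
        "authorization": f"Bearer {api_key}",
    }


print(search_stream_url("default", "my-pipeline"))
```

Reading the API key from an environment variable rather than hard-coding it keeps the key out of your source files.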

Determining Which Generator Streamed

If your pipeline includes multiple Generators with streaming enabled, you can determine which Generator streamed a specific chunk of data by checking its name in the API response. This information is available in the delta field, under meta.deepset_cloud.component.

Below is a partial example of a response from the Search Stream endpoint, showing two streaming-enabled Generators: chat_summary_llm and qa_llm.

{
   "query_id":"290a1f96-57d6-4843-8ed7-2a224142398b",
   "delta":{
      "text":"girl bands?",
      "meta":{
         "index":0,
         "deepset_cloud":{
            "component":"chat_summary_llm" // this is the name of the Generator that streamed
         }
      }
   },
   "type":"delta"
}


{
   "query_id":"290a1f96-57d6-4843-8ed7-2a224142398b",
   "delta":{
      "text":"Base",
      "meta":{
         "index":0,
         "deepset_cloud":{
            "component":"qa_llm" // this is the name of the Generator that streamed
         }
      }
   },
   "type":"delta"
}
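
On the client side, the component name can be used to keep the output of each Generator separate. The following is a minimal sketch, assuming each event has already been isolated as one JSON object shaped like the examples above. The function name text_by_component is hypothetical. How events are framed on the wire is not shown here, so strip any framing before passing events in.

```python
import json
from collections import defaultdict


def text_by_component(events):
    """Concatenate streamed text per Generator name.

    `events` is an iterable of dicts or JSON strings, one per streamed event.
    """
    texts = defaultdict(str)
    for event in events:
        if isinstance(event, (str, bytes)):
            event = json.loads(event)
        if event.get("type") != "delta":
            continue  # skip anything that is not a text delta
        delta = event.get("delta", {})
        meta = delta.get("meta", {})
        # Fall back to "unknown" if the component name is missing (assumed edge case)
        component = meta.get("deepset_cloud", {}).get("component", "unknown")
        texts[component] += delta.get("text", "")
    return dict(texts)


# The two events from the example above, without the inline comments
events = [
    {
        "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
        "delta": {
            "text": "girl bands?",
            "meta": {"index": 0, "deepset_cloud": {"component": "chat_summary_llm"}},
        },
        "type": "delta",
    },
    {
        "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
        "delta": {
            "text": "Base",
            "meta": {"index": 0, "deepset_cloud": {"component": "qa_llm"}},
        },
        "type": "delta",
    },
]

print(text_by_component(events))
```

Grouping by component lets a chat interface show, for example, only the text from qa_llm while discarding intermediate output from chat_summary_llm.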