Enable Streaming
Streaming means showing a large language model's output as it is generated, rather than waiting for the entire response to be ready before showing it. It's similar to watching someone type in real time. Enable streaming for the LLMs in your pipelines.
About This Task
Streaming is a technique often used in chat interfaces. It makes the responses seem faster as users can immediately see the output and can start reading while the rest of the text generates. It also makes it possible to interrupt the LLM if needed. This is particularly useful for longer responses where waiting for the generation to complete may take a couple of seconds.
In Haystack Enterprise Platform, streaming is enabled by default for all LLMs.
Streaming in API Endpoints
When using Haystack Platform through the API, use the stream endpoints to get streaming responses. For example, use the Chat Stream endpoint to get a streaming response from a chat pipeline or the Search Stream endpoint to get a streaming response from a search pipeline.
Configure Streaming Components
Use the top-level streaming_components field in your pipeline YAML to control which components stream their output. This is the recommended way to configure streaming. It works consistently across both the Playground and the API endpoints.
Enable Streaming for Specific Components
To specify which components should stream, list their names in the streaming_components field:
max_runs_per_component: 100
streaming_components:
  - llm_1
  - llm_2
  - agent
components:
  llm_1:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    # ... component configuration
  llm_2:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    # ... component configuration
  agent:
    type: haystack.components.agents.agent.Agent
    # ... component configuration
Enable Streaming for All Components
To enable streaming for all compatible components in your pipeline, use the wildcard value:
max_runs_per_component: 100
streaming_components: all
components:
  # ... your pipeline components
Default Behavior
If you don't include the streaming_components field, only the last streaming-capable component in your pipeline streams its output.
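The three cases above (an explicit list, the all wildcard, and an omitted field) can be sketched as a small helper. This is only an illustration of the documented rules, not the platform's implementation: resolve_streaming_components is a hypothetical name, the set of streaming-capable components is supplied by the caller, and "last" is assumed to mean the last capable component in the order given.

```python
from typing import Optional, Union


def resolve_streaming_components(
    component_names: list[str],
    streaming_capable: set[str],
    streaming_components: Optional[Union[str, list[str]]] = None,
) -> list[str]:
    """Return the components expected to stream, in the order given.

    Hypothetical helper: the ordering and the capability set are assumptions
    made by the caller, not something read from the platform.
    """
    capable_in_order = [n for n in component_names if n in streaming_capable]
    if streaming_components is None:
        # Field omitted: only the last streaming-capable component streams.
        return capable_in_order[-1:]
    if streaming_components == "all":
        # Wildcard: every compatible component streams.
        return capable_in_order
    # Explicit list: keep the listed names that exist in the pipeline.
    wanted = set(streaming_components)
    return [n for n in component_names if n in wanted]


names = ["retriever", "llm_1", "llm_2", "agent"]
capable = {"llm_1", "llm_2", "agent"}
print(resolve_streaming_components(names, capable))
print(resolve_streaming_components(names, capable, "all"))
print(resolve_streaming_components(names, capable, ["llm_1"]))
```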
Legacy: Streaming with streaming_callback
This approach is supported for backward compatibility with existing pipelines. For new pipelines, use the streaming_components field instead.
Older pipelines may have streaming enabled by setting the streaming_callback init parameter on individual components instead of using the streaming_components YAML field.
This approach only works predictably in the Playground. The API uses the same behavior by default: include_tool_calls defaults to "rendered", which embeds tool calls as inline markdown in delta events rather than emitting separate tool_call_delta events. See Stream request parameters for details.
For legacy pipelines, set streaming_callback: deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback on each component you want to stream. All configured components receive streaming output through the API. This differs from the streaming_components field: when you omit that field, only the last streaming-capable component streams by default.
Generators and ChatGenerators
CohereGenerator:
  type: haystack_integrations.components.generators.cohere.generator.CohereGenerator
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - COHERE_API_KEY
        - CO_API_KEY
      strict: false
    model: command-r
    streaming_callback: deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback
Agents
For agents, streaming_callback is set at the agent level, not on the inner chat_generator:
agent:
  type: haystack.components.agents.agent.Agent
  init_parameters:
    chat_generator:
      type: haystack.components.generators.chat.openai.OpenAIChatGenerator
      init_parameters:
        api_key:
          type: env_var
          env_vars:
            - OPENAI_API_KEY
          strict: false
        model: gpt-4o
        streaming_callback: # leave empty on the inner generator
    system_prompt: You are a deep research assistant.
    streaming_callback: deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback
    tools:
      # ...
Streaming with API
You can use streaming with the stream API endpoints: Chat Stream and Search Stream. This is an example request to the Search Stream endpoint.
cURL
curl --request POST \
  --url https://api.cloud.deepset.ai/api/v1/workspaces/WORKSPACE_NAME/pipelines/PIPELINE_NAME/search-stream \
  --header 'accept: application/json' \
  --header 'authorization: Bearer DEEPSET_API_KEY' \
  --header 'content-type: application/json' \
  --data '
{
  "debug": false,
  "include_result": true,
  "view_prompts": false,
  "query": "who started all-girl bands?"
}
'
Python
import requests

url = "https://api.cloud.deepset.ai/api/v1/workspaces/WORKSPACE_NAME/pipelines/PIPELINE_NAME/search-stream"
payload = {
    "debug": False,
    "include_result": True,
    "view_prompts": False,
    "query": "who started all-girl bands?",
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Bearer DEEPSET_API_KEY",
}
response = requests.post(url, json=payload, headers=headers)
print(response.text)
Replace:
- WORKSPACE_NAME: With the name of the workspace containing your pipeline.
- PIPELINE_NAME: With the name of the pipeline to use for search.
- DEEPSET_API_KEY: With your Haystack Platform API key.
Default Streaming Behavior in the API
By default, the streaming endpoints only stream output from the last streaming-capable component in your pipeline. This matches the default pipeline behavior described in the Default Behavior section above.
To stream output from more than one component or from a specific component, configure the streaming_components field in your pipeline YAML before using the API. See Enable Streaming for Specific Components for details.
Stream Request Parameters
These request body fields control which SSE event types you receive from search-stream and chat-stream:
| Parameter | Default | Effect |
|---|---|---|
| include_result | false | When true, emits a result event with the full pipeline output before the terminal done frame |
| include_tool_calls | rendered | Controls how agent tool calls appear in the stream (see below) |
| include_tool_call_results | false | When true, emits tool_call_result events after each tool completes |
| include_reasoning | false | When true, emits reasoning events during model reasoning steps |
include_tool_calls
The default is rendered. This setting only affects agent pipelines that make tool calls:
| Value | Behavior |
|---|---|
| rendered (default) | Tool calls are rendered as inline markdown inside delta events. No tool_call_delta events are emitted. |
| true | Tool calls emit structured tool_call_delta events as they progress. |
| false | Tool calls are omitted from the stream entirely. |
The rendered default matches Playground behavior and is the simplest option for chat-style UIs. Set include_tool_calls to true when you need programmatic access to tool names, arguments, and call IDs with tool_call_delta events.
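As a rough sketch, the request parameters and their documented defaults can be collected in a small payload builder. build_stream_payload is a hypothetical helper name, and the client-side check on include_tool_calls is an assumption about what is convenient to validate before sending, not something the API requires.

```python
from typing import Any, Union


def build_stream_payload(
    query: str,
    include_result: bool = False,
    include_tool_calls: Union[str, bool] = "rendered",
    include_tool_call_results: bool = False,
    include_reasoning: bool = False,
) -> dict[str, Any]:
    """Build a request body for search-stream or chat-stream.

    Defaults mirror the documented defaults. Hypothetical helper.
    """
    # Accept only the three documented values: "rendered", true, or false.
    if not (include_tool_calls == "rendered" or isinstance(include_tool_calls, bool)):
        raise ValueError('include_tool_calls must be "rendered", True, or False')
    return {
        "query": query,
        "include_result": include_result,
        "include_tool_calls": include_tool_calls,
        "include_tool_call_results": include_tool_call_results,
        "include_reasoning": include_reasoning,
    }


# Request structured tool_call_delta events instead of rendered markdown.
print(build_stream_payload("who started all-girl bands?", include_tool_calls=True))
```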
Streaming Event Types
The streaming endpoints use Server-Sent Events (SSE) and return a sequence of events. Each event has a type field that identifies what it contains. Every event except ping also includes a query_id. Event types are mutually exclusive, so each event is exactly one of the following:
| Event type | When it's sent | What it contains |
|---|---|---|
| delta | During generation | delta object with text and meta; optional start (first chunk from a component), finish_reason (when generation ends), and index (component index when multiple components stream) |
| result | End of stream (when include_result=true) | The full pipeline result |
| done | After the last event, on success | Signals that the stream completed successfully. Not sent when an error occurs. |
| error | When the pipeline fails | An error message and an error_category that classifies who can fix the problem. Sent instead of done. |
| tool_call_delta | During agent tool use (only when include_tool_calls=true) | Structured information about a tool call in progress. Not emitted when the default "rendered" mode is used. |
| tool_call_result | After a tool call completes (when include_tool_call_results=true) | The result returned by a tool |
| reasoning | During agent reasoning (when include_reasoning=true) | A reasoning step from the model |
| ping | Periodically | Keep-alive signal. Contains only type; no query_id or other fields |
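If you prefer not to depend on an SSE client library, a minimal parser for the standard SSE wire format could look like the sketch below. It assumes each event's data field holds one JSON object, as in the examples on this page. iter_sse_json is a hypothetical name, and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per SSE event.

    Follows the standard SSE framing: events end at a blank line, "data:"
    lines carry the payload, and lines starting with ":" are comments.
    An unterminated trailing event is dropped, as the SSE standard specifies.
    """
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if any.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)


sample = [
    'data: {"type": "ping"}',
    "",
    'data: {"type": "done", "query_id": "abc"}',
    "",
]
for event in iter_sse_json(sample):
    print(event["type"])
```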
The delta Event
delta events carry incremental text as the pipeline generates a response. The delta object holds the new text and metadata (including which component produced the chunk). Three optional top-level fields help you format output across multiple streaming components:
| Field | Type | When present | Meaning |
|---|---|---|---|
| start | boolean | First chunk from a component | true marks the beginning of output from that component |
| finish_reason | string | Last chunk from a component | Why generation stopped: stop, length, tool_calls, content_filter, or tool_call_results |
| index | integer | Multiple streaming components | Identifies which content part is being updated |
{
  "type": "delta",
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "delta": {
    "text": "Hello",
    "meta": {
      "deepset_cloud": {
        "component": "llm_1"
      }
    }
  },
  "start": true,
  "finish_reason": "stop",
  "index": 0
}
start, finish_reason, and index are omitted when they do not apply to a given chunk.
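One way to use start and finish_reason for formatting is sketched below: print a label when a component begins, and end the line when it finishes. render_deltas is a hypothetical helper, the label format is arbitrary, and the sample events are hand-written to match the shape shown above.

```python
def render_deltas(events: list[dict]) -> str:
    """Concatenate delta text, labelling each component's output.

    Hypothetical helper: uses `start` to open a labelled section and
    `finish_reason` to close it with a newline.
    """
    parts: list[str] = []
    for event in events:
        if event.get("type") != "delta":
            continue  # ignore result, ping, and other event types
        delta = event["delta"]
        if event.get("start"):
            meta = delta.get("meta", {})
            component = meta.get("deepset_cloud", {}).get("component", "unknown")
            parts.append(f"[{component}] ")
        parts.append(delta["text"])
        if event.get("finish_reason"):
            parts.append("\n")
    return "".join(parts)


meta = {"deepset_cloud": {"component": "llm_1"}}
events = [
    {"type": "delta", "delta": {"text": "Hel", "meta": meta}, "start": True, "index": 0},
    {"type": "ping"},
    {"type": "delta", "delta": {"text": "lo", "meta": meta}, "finish_reason": "stop", "index": 0},
]
print(render_deltas(events), end="")
```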
The done Event
The done event is the last event in a successful stream. When you receive it, the stream is complete and you can stop listening.
{
  "type": "done",
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b"
}
When a pipeline fails, an error event is sent instead of done. The error frame is terminal — no further events follow it.
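Because done and error are the two terminal frames, a client can reduce a stream to a single outcome. The sketch below is one possible approach. stream_outcome is a hypothetical name, and the "incomplete" label for a stream that ends without a terminal frame (for example, a dropped connection) is an assumption of this example rather than platform terminology.

```python
from typing import Iterable, Optional


def stream_outcome(events: Iterable[dict]) -> tuple[str, Optional[dict]]:
    """Return ("done", None), ("error", event), or ("incomplete", None)."""
    for event in events:
        kind = event.get("type")
        if kind == "done":
            return "done", None
        if kind == "error":
            # The error frame is terminal, so stop reading here.
            return "error", event
    # The iterator ended without a terminal frame.
    return "incomplete", None


ok = [{"type": "delta"}, {"type": "ping"}, {"type": "done", "query_id": "q1"}]
print(stream_outcome(ok)[0])
```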
The error Event
Every failed stream ends with an error event. The event includes a human-readable error message and an error_category field that tells you whether the caller or the platform is responsible for the failure.
{
  "type": "error",
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "error": "Invalid pipeline configuration.",
  "error_category": "user_error"
}
The error_category field can have one of these values:
| Value | Meaning | Examples |
|---|---|---|
| user_error | The caller can fix this | Invalid request input, Pydantic validation errors, provider 4xx responses |
| system_error | A platform or pipeline failure | Pipeline runtime errors, schema mismatches, provider 5xx responses |
| unknown | Cause could not be classified | Catch-all for unrecognised failures |
| timeout | The request exceeded the time limit | Pipeline did not finish within the allowed duration |
If you maintain a strict list of known event types or categories in your client code, include done and all four error_category values so your application does not break when it receives them.
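A client might normalize error events along these lines. The sketch accepts all four documented categories and falls back to unknown for anything else. The retry policy (retrying system_error and timeout only) is purely an assumption for illustration, as is the helper name classify_error.

```python
KNOWN_CATEGORIES = {"user_error", "system_error", "unknown", "timeout"}
# Assumption for this sketch: only these categories are worth retrying.
RETRYABLE_CATEGORIES = {"system_error", "timeout"}


def classify_error(event: dict) -> dict:
    """Normalize an error event into category, message, and a retry hint."""
    category = event.get("error_category", "unknown")
    if category not in KNOWN_CATEGORIES:
        # Tolerate categories this client does not know about.
        category = "unknown"
    return {
        "category": category,
        "message": event.get("error", ""),
        "retry": category in RETRYABLE_CATEGORIES,
    }


event = {
    "type": "error",
    "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
    "error": "Invalid pipeline configuration.",
    "error_category": "user_error",
}
print(classify_error(event))
```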
The ping Event
The server sends ping events periodically to keep the connection alive during long-running streams. Unlike every other event type, ping carries no query_id — only the type field:
{
  "type": "ping"
}
You can safely ignore ping events in your client logic.
Here is an example showing how to handle all event types. The example sets include_tool_calls to true so that structured tool_call_delta events are emitted — omit this field to use the default "rendered" mode, where tool calls appear as markdown inside delta events instead:
import asyncio
import json

import httpx
from httpx_sse import EventSource

TOKEN = "DEEPSET_API_KEY"
PIPELINE_URL = "https://api.cloud.deepset.ai/api/v1/workspaces/WORKSPACE_NAME/pipelines/PIPELINE_NAME"


# Note: the match statement below requires Python 3.10 or later.
async def main():
    query = {
        "query": "How does streaming work with deepset?",
        "include_result": True,
        # Default is "rendered" (tool calls as markdown in delta events).
        # Set to true to receive structured tool_call_delta events instead:
        "include_tool_calls": True,
        "include_tool_call_results": True,
    }
    headers = {
        "Authorization": f"Bearer {TOKEN}"
    }
    async with httpx.AsyncClient(base_url=PIPELINE_URL, headers=headers, timeout=httpx.Timeout(300.0)) as client:
        async with client.stream("POST", "/search-stream", json=query) as response:
            if response.status_code != 200:
                # Read the body before parsing it, because the response is streamed
                await response.aread()
                print(f"An error occurred with status code: {response.status_code}")
                print(response.json()["errors"][0])
                return
            event_source = EventSource(response)
            async for event in event_source.aiter_sse():
                event_data = json.loads(event.data)
                chunk_type = event_data["type"]
                match chunk_type:
                    case "delta":
                        delta = event_data["delta"]
                        if event_data.get("start"):
                            print("\n\nAnswer: ", flush=True, end="")
                        token: str = delta["text"]
                        print(token, flush=True, end="\n" if event_data.get("finish_reason") else "")
                    case "result":
                        print("\n\nPipeline result: ")
                        print(json.dumps(event_data["result"]))
                    case "done":
                        # The stream completed successfully; stop processing
                        print("\n\nStream complete.")
                        break
                    case "error":
                        print("\n\nAn error occurred while streaming:")
                        print(event_data["error"])
                        if category := event_data.get("error_category"):
                            print(f"Category: {category}")
                        break
                    case "tool_call_delta":
                        tool_call_delta = event_data["tool_call_delta"]
                        if tool_call_delta["tool_name"]:
                            tool_id = tool_call_delta["id"]
                            tool_name = tool_call_delta["tool_name"]
                            print(f"\n\nTool call {tool_id} started {tool_name} with arguments: ")
                        elif tool_call_delta["arguments"]:
                            print(tool_call_delta["arguments"], flush=True, end="")
                    case "tool_call_result":
                        tool_call_result = event_data["tool_call_result"]
                        tool_id = tool_call_result["origin"]["id"]
                        tool_name = tool_call_result["origin"]["tool_name"]
                        if tool_call_result["error"]:
                            print(f"\n\nTool call {tool_name} with id {tool_id} failed.")
                        else:
                            print(f"\n\nTool call {tool_name} with id {tool_id} result:")
                            print(tool_call_result["result"])
                    case "reasoning":
                        reasoning = event_data["reasoning"]
                        if event_data.get("start"):
                            print("\n\nReasoning: ")
                        print(f"{reasoning['reasoning_text']}", flush=True, end="")
                    case "ping":
                        continue  # keep-alive; no query_id on this event type


asyncio.run(main())
Replace:
- WORKSPACE_NAME: With the name of the workspace containing your pipeline.
- PIPELINE_NAME: With the name of the pipeline to use for search.
- DEEPSET_API_KEY: With your Haystack Platform API key.
Configuring Component Outputs in Streaming
By default, streaming endpoints return outputs from all components in your pipeline. You can control which component outputs are included using the include_outputs_from parameter. This parameter accepts an array of component names.
For example, to only receive outputs from specific components:
{
  "query": "your question here",
  "include_outputs_from": ["retriever", "generator"]
}
If include_outputs_from is not specified, the streaming response will include outputs from all components in the pipeline.
Determining Which Generator Streamed
If your pipeline includes multiple Generators with streaming enabled, you can determine which Generator streamed a specific chunk by checking the component name in the API response. The name is available in the delta object, under meta.deepset_cloud.component.
Below is a partial example of a response from the Search Stream endpoint, showing two streaming-enabled Generators: chat_summary_llm and qa_llm.
{
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "delta": {
    "text": "girl bands?",
    "meta": {
      "index": 0,
      "deepset_cloud": {
        "component": "chat_summary_llm"
      }
    }
  },
  "type": "delta"
}
{
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "delta": {
    "text": "Base",
    "meta": {
      "index": 0,
      "deepset_cloud": {
        "component": "qa_llm"
      }
    }
  },
  "type": "delta"
}
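To separate the output of several Generators on the client side, you could group chunk text by the component name. The following is a minimal sketch that assumes every delta event carries the component name at delta.meta.deepset_cloud.component, as in the response above. text_by_component is a hypothetical helper name.

```python
from collections import defaultdict


def text_by_component(events: list[dict]) -> dict[str, str]:
    """Join streamed text per component, keyed by component name."""
    chunks: dict[str, list[str]] = defaultdict(list)
    for event in events:
        if event.get("type") != "delta":
            continue  # only delta events carry generated text
        delta = event["delta"]
        component = delta["meta"]["deepset_cloud"]["component"]
        chunks[component].append(delta["text"])
    return {name: "".join(parts) for name, parts in chunks.items()}


events = [
    {"type": "delta", "delta": {"text": "girl bands?",
                                "meta": {"index": 0, "deepset_cloud": {"component": "chat_summary_llm"}}}},
    {"type": "delta", "delta": {"text": "Base",
                                "meta": {"index": 0, "deepset_cloud": {"component": "qa_llm"}}}},
]
print(text_by_component(events))
```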