Enable Streaming

Streaming means that a large language model (LLM) displays text as it is generated rather than waiting for the entire response to be ready. It's similar to watching someone type in real time. Enable streaming for the LLMs in your pipelines.


About This Task

Streaming is a technique often used in chat interfaces. It makes responses feel faster because users see the output immediately and can start reading while the rest of the text generates. It also makes it possible to interrupt the LLM if needed. This is particularly useful for longer responses, where waiting for generation to complete may take several seconds.

In Haystack Enterprise Platform, streaming is enabled by default for all LLMs.

Streaming in API endpoints

When using Haystack Platform through the API, use the stream endpoints to get streaming responses. For example, use the Chat Stream endpoint to get a streaming response from a chat pipeline or the Search Stream endpoint to get a streaming response from a search pipeline.

Configure Streaming Components

Use the top-level streaming_components field in your pipeline YAML to control which components stream their output. This is the recommended way to configure streaming. It works consistently across both the Playground and the API endpoints.

Enable Streaming for Specific Components

To specify which components should stream, list their names in the streaming_components field:

max_runs_per_component: 100
streaming_components:
  - llm_1
  - llm_2
  - agent
components:
  llm_1:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    # ... component configuration
  llm_2:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    # ... component configuration
  agent:
    type: haystack.components.agents.agent.Agent
    # ... component configuration

Enable Streaming for All Components

To enable streaming for all compatible components in your pipeline, use the wildcard value:

max_runs_per_component: 100
streaming_components: all
components:
  # ... your pipeline components

Default Behavior

If you don't include the streaming_components field, only the last streaming-capable component in your pipeline streams its output.
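The three cases above (an explicit list, the all wildcard, and the omitted field) can be summarized in a short sketch. This is an illustration only, not the platform's implementation: resolve_streaming_components and its capable_in_order argument are hypothetical names, and the sketch assumes you already know which components are streaming-capable and in what order they run.

```python
def resolve_streaming_components(pipeline_config, capable_in_order):
    """Return the names of the components expected to stream.

    pipeline_config: the parsed pipeline YAML as a dict.
    capable_in_order: streaming-capable component names in pipeline order
    (a hypothetical input; the platform determines this internally).
    """
    setting = pipeline_config.get("streaming_components")
    if setting is None:
        # Field omitted: only the last streaming-capable component streams.
        return capable_in_order[-1:]
    if setting == "all":
        # Wildcard: every compatible component streams.
        return list(capable_in_order)
    if not isinstance(setting, list):
        raise ValueError("streaming_components must be a list of names or 'all'")
    # Explicit list: keep the listed names, in pipeline order.
    # Ignoring names that are not streaming-capable is an assumption.
    return [name for name in capable_in_order if name in setting]
```

For example, with capable_in_order set to llm_1, llm_2, agent and no streaming_components field, the sketch returns only agent.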

Legacy: Streaming with streaming_callback

Note: This approach is supported for backward compatibility with existing pipelines. For new pipelines, use the streaming_components field instead.

Older pipelines may have streaming enabled by setting the streaming_callback init parameter on individual components instead of using the streaming_components YAML field.

This approach only works predictably in the Playground. The API uses the same behavior by default: include_tool_calls defaults to "rendered", which embeds tool calls as inline markdown in delta events rather than emitting separate tool_call_delta events. See Stream request parameters for details.

For legacy pipelines, set streaming_callback: deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback on each component you want to stream. All configured components receive streaming output through the API. This differs from the streaming_components field: when you omit that field, only the last streaming-capable component streams by default.

Generators and ChatGenerators

CohereGenerator:
  type: haystack_integrations.components.generators.cohere.generator.CohereGenerator
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - COHERE_API_KEY
        - CO_API_KEY
      strict: false
    model: command-r
    streaming_callback: deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback

Agents

For agents, streaming_callback is set at the agent level, not on the inner chat_generator:

agent:
  type: haystack.components.agents.agent.Agent
  init_parameters:
    chat_generator:
      type: haystack.components.generators.chat.openai.OpenAIChatGenerator
      init_parameters:
        api_key:
          type: env_var
          env_vars:
            - OPENAI_API_KEY
          strict: false
        model: gpt-4o
        streaming_callback: # leave empty on the inner generator
    system_prompt: You are a deep research assistant.
    streaming_callback: deepset_cloud_custom_nodes.callbacks.streaming.streaming_callback
    tools:
      # ...

Streaming with API

You can use streaming with the stream API endpoints: Chat Stream and Search Stream. This is an example request to the Search Stream endpoint.

curl --request POST \
  --url https://api.cloud.deepset.ai/api/v1/workspaces/WORKSPACE_NAME/pipelines/PIPELINE_NAME/search-stream \
  --header 'accept: application/json' \
  --header 'authorization: Bearer DEEPSET_API_KEY' \
  --header 'content-type: application/json' \
  --data '
{
  "debug": false,
  "include_result": true,
  "view_prompts": false,
  "query": "who started all-girl bands?"
}
'

Replace:

  • WORKSPACE_NAME: With the name of the workspace containing your pipeline.
  • PIPELINE_NAME: With the name of the pipeline to use for search.
  • DEEPSET_API_KEY: With your Haystack Platform API key.
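If you prefer to build the same request in Python, the following sketch mirrors the curl call using only the standard library. It constructs the request object without sending it. The helper name build_search_stream_request is hypothetical; the URL, headers, and body fields are copied from the curl example above.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.cloud.deepset.ai/api/v1"


def build_search_stream_request(workspace, pipeline, api_key, query):
    """Build (but do not send) a POST request to the Search Stream endpoint."""
    url = (
        f"{API_BASE}/workspaces/{urllib.parse.quote(workspace)}"
        f"/pipelines/{urllib.parse.quote(pipeline)}/search-stream"
    )
    body = {
        "debug": False,
        "include_result": True,
        "view_prompts": False,
        "query": query,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, you could pass the returned object to urllib.request.urlopen and read the response line by line.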

Default Streaming Behavior in the API

By default, the streaming endpoints only stream output from the last streaming-capable component in your pipeline. This matches the default pipeline behavior described in the Default Behavior section above.

To stream output from more than one component or from a specific component, configure the streaming_components field in your pipeline YAML before using the API. See Enable Streaming for Specific Components for details.

Stream Request Parameters

These request body fields control which SSE event types you receive from search-stream and chat-stream:

| Parameter | Default | Effect |
| --- | --- | --- |
| include_result | false | When true, emits a result event with the full pipeline output before the terminal done frame |
| include_tool_calls | rendered | Controls how agent tool calls appear in the stream (see below) |
| include_tool_call_results | false | When true, emits tool_call_result events after each tool completes |
| include_reasoning | false | When true, emits reasoning events during model reasoning steps |

include_tool_calls

The default is rendered. This setting only affects agent pipelines that make tool calls:

| Value | Behavior |
| --- | --- |
| rendered (default) | Tool calls are rendered as inline markdown inside delta events. No tool_call_delta events are emitted. |
| true | Tool calls emit structured tool_call_delta events as they progress |
| false | Tool calls are omitted from the stream entirely |

The rendered default matches Playground behavior and is the simplest option for chat-style UIs. Set include_tool_calls to true when you need programmatic access to tool names, arguments, and call IDs with tool_call_delta events.
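As a rough illustration of how these parameters fit together, the sketch below assembles a request body and rejects values of include_tool_calls other than the three documented ones. The function build_stream_body is a hypothetical helper, not part of any SDK, and the client-side validation is an assumption added for convenience.

```python
def build_stream_body(
    query,
    include_tool_calls="rendered",
    include_result=False,
    include_tool_call_results=False,
    include_reasoning=False,
):
    """Assemble a request body for search-stream or chat-stream."""
    # Documented values: the string "rendered", or the booleans true/false.
    if include_tool_calls != "rendered" and not isinstance(include_tool_calls, bool):
        raise ValueError('include_tool_calls must be "rendered", True, or False')
    return {
        "query": query,
        "include_result": include_result,
        "include_tool_calls": include_tool_calls,
        "include_tool_call_results": include_tool_call_results,
        "include_reasoning": include_reasoning,
    }
```

Calling build_stream_body("my question", include_tool_calls=True) produces a body that requests structured tool_call_delta events.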

Streaming Event Types

The streaming endpoints use Server-Sent Events (SSE) and return a sequence of events. Each event has a type field that identifies what it contains. Most events also include a query_id. The ping event is the only exception. The event types are mutually exclusive, so each event has exactly one of the following types:

| Event type | When it's sent | What it contains |
| --- | --- | --- |
| delta | During generation | delta object with text and meta; optional start (first chunk from a component), finish_reason (when generation ends), and index (component index when multiple components stream) |
| result | End of stream (when include_result=true) | The full pipeline result |
| done | After the last event, on success | Signals that the stream completed successfully. Not sent when an error occurs. |
| error | When the pipeline fails | An error message and an error_category that classifies who can fix the problem. Sent instead of done. |
| tool_call_delta | During agent tool use (only when include_tool_calls=true) | Structured information about a tool call in progress. Not emitted when the default "rendered" mode is used |
| tool_call_result | After a tool call completes (when include_tool_call_results=true) | The result returned by a tool |
| reasoning | During agent reasoning (when include_reasoning=true) | A reasoning step from the model |
| ping | Periodically | Keep-alive signal. Contains only type, with no query_id or other fields |

The delta Event

delta events carry incremental text as the pipeline generates a response. The delta object holds the new text and metadata (including which component produced the chunk). Three optional top-level fields help you format output across multiple streaming components:

| Field | Type | When present | Meaning |
| --- | --- | --- | --- |
| start | boolean | First chunk from a component | true marks the beginning of output from that component |
| finish_reason | string | Last chunk from a component | Why generation stopped: stop, length, tool_calls, content_filter, or tool_call_results |
| index | integer | Multiple streaming components | Identifies which content part is being updated |

{
  "type": "delta",
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "delta": {
    "text": "Hello",
    "meta": {
      "deepset_cloud": {
        "component": "llm_1"
      }
    }
  },
  "start": true,
  "finish_reason": "stop",
  "index": 0
}

start, finish_reason, and index are omitted when they do not apply to a given chunk.
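One way to use the start flag is to separate the output of different components when assembling the final text. The following is a minimal sketch that works on already-decoded event dictionaries; render_deltas is a hypothetical helper, and separating components with a blank line is a formatting choice rather than anything the API requires.

```python
def render_deltas(events):
    """Join delta text, inserting a blank line when a new component starts."""
    parts = []
    for event in events:
        if event.get("type") != "delta":
            continue  # ignore ping, result, done, and other event types
        if event.get("start") and parts:
            # A later component begins its output: separate it visually.
            parts.append("\n\n")
        parts.append(event["delta"]["text"])
    return "".join(parts)
```

Because start is omitted on chunks where it does not apply, the sketch reads it with get rather than indexing the dictionary directly.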

The done event

The done event is the last event in a successful stream. When you receive it, the stream is complete and you can stop listening.

{
  "type": "done",
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b"
}

When a pipeline fails, an error event is sent instead of done. The error frame is terminal — no further events follow it.

The error event

Every failed stream ends with an error event. The event includes a human-readable error message and an error_category field that tells you whether the caller or the platform is responsible for the failure.

{
  "type": "error",
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "error": "Invalid pipeline configuration.",
  "error_category": "user_error"
}

The error_category field can have one of these values:

| Value | Meaning | Examples |
| --- | --- | --- |
| user_error | The caller can fix this | Invalid request input, Pydantic validation errors, provider 4xx responses |
| system_error | A platform or pipeline failure | Pipeline runtime errors, schema mismatches, provider 5xx responses |
| unknown | Cause could not be classified | Catch-all for unrecognised failures |
| timeout | The request exceeded the time limit | Pipeline did not finish within the allowed duration |

If you maintain a strict list of known event types or categories in your client code, include done and all four error_category values so your application does not break when it receives them.
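A tolerant client might treat the two terminal events as sketched below. The names KNOWN_ERROR_CATEGORIES and summarize_terminal_event are hypothetical, and mapping an unrecognized category to unknown is an assumption about how you may want to handle categories added in the future.

```python
KNOWN_ERROR_CATEGORIES = {"user_error", "system_error", "unknown", "timeout"}


def summarize_terminal_event(event):
    """Return (succeeded, message) for a done or error event, else None."""
    kind = event.get("type")
    if kind == "done":
        return (True, "stream completed")
    if kind == "error":
        category = event.get("error_category", "unknown")
        if category not in KNOWN_ERROR_CATEGORIES:
            # Tolerate categories this client does not know about.
            category = "unknown"
        return (False, f"{category}: {event.get('error', '')}")
    # Not a terminal event: the stream is still in progress.
    return None
```

A caller can keep reading events until this function returns a value other than None.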

The ping event

The server sends ping events periodically to keep the connection alive during long-running streams. Unlike every other event type, ping carries no query_id — only the type field:

{
  "type": "ping"
}

You can safely ignore ping events in your client logic.
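If you read the stream without an SSE client library, a minimal sketch of decoding the frames and dropping keep-alives could look like this. It assumes standard SSE framing, where each event's JSON payload arrives in data: lines and events are separated by a blank line. The function iter_events is a hypothetical helper.

```python
import json


def iter_events(sse_text):
    """Yield decoded JSON payloads from raw SSE text, skipping ping events."""
    data_lines = []
    for line in sse_text.splitlines():
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space after the colon
            data_lines.append(value)
        elif line == "" and data_lines:
            # A blank line ends the current event.
            payload = json.loads("\n".join(data_lines))
            data_lines = []
            if payload.get("type") != "ping":
                yield payload
```

Other SSE fields, such as event: or id:, are ignored in this sketch because every payload carries its own type field.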

Here is an example showing how to handle all event types. The example sets include_tool_calls to true so that structured tool_call_delta events are emitted — omit this field to use the default "rendered" mode, where tool calls appear as markdown inside delta events instead:

import httpx
import json
from httpx_sse import EventSource
import asyncio

TOKEN = "DEEPSET_API_KEY"
PIPELINE_URL = "https://api.cloud.deepset.ai/api/v1/workspaces/WORKSPACE_NAME/pipelines/PIPELINE_NAME"


async def main():
    query = {
        "query": "How does streaming work with deepset?",
        "include_result": True,
        # Default is "rendered" (tool calls as markdown in delta events).
        # Set to true to receive structured tool_call_delta events instead:
        "include_tool_calls": True,
        "include_tool_call_results": True,
    }
    headers = {
        "Authorization": f"Bearer {TOKEN}"
    }
    async with httpx.AsyncClient(base_url=PIPELINE_URL, headers=headers, timeout=httpx.Timeout(300.0)) as client:
        async with client.stream("POST", "/search-stream", json=query) as response:
            if response.status_code != 200:
                await response.aread()
                print(f"An error occurred with status code: {response.status_code}")
                print(response.json()["errors"][0])
                return

            event_source = EventSource(response)
            async for event in event_source.aiter_sse():
                event_data = json.loads(event.data)
                chunk_type = event_data["type"]
                match chunk_type:
                    case "delta":
                        delta = event_data["delta"]
                        if event_data.get("start"):
                            print("\n\nAnswer: ", flush=True, end="")
                        token: str = delta["text"]
                        print(token, flush=True, end="\n" if event_data.get("finish_reason") else "")
                    case "result":
                        print("\n\nPipeline result: ")
                        print(json.dumps(event_data["result"]))
                    case "done":
                        # The stream completed successfully; stop processing
                        print("\n\nStream complete.")
                        break
                    case "error":
                        print("\n\nAn error occurred while streaming:")
                        print(event_data["error"])
                        if category := event_data.get("error_category"):
                            print(f"Category: {category}")
                        break
                    case "tool_call_delta":
                        tool_call_delta = event_data["tool_call_delta"]
                        if tool_call_delta["tool_name"]:
                            tool_id = tool_call_delta["id"]
                            tool_name = tool_call_delta["tool_name"]
                            print(f"\n\nTool call {tool_id} started {tool_name} with arguments: ")
                        elif tool_call_delta["arguments"]:
                            print(tool_call_delta["arguments"], flush=True, end="")
                    case "tool_call_result":
                        tool_call_result = event_data["tool_call_result"]
                        tool_id = tool_call_result["origin"]["id"]
                        tool_name = tool_call_result["origin"]["tool_name"]
                        if tool_call_result["error"]:
                            print(f"\n\nTool call {tool_name} with id {tool_id} failed.")
                        else:
                            print(f"\n\nTool call {tool_name} with id {tool_id} result:")
                            print(tool_call_result["result"])
                    case "reasoning":
                        reasoning = event_data["reasoning"]
                        if event_data.get("start"):
                            print("\n\nReasoning: ")
                        print(f"{reasoning['reasoning_text']}", flush=True, end="")
                    case "ping":
                        continue  # keep-alive; no query_id on this event type


asyncio.run(main())

Replace:

  • WORKSPACE_NAME: With the name of the workspace containing your pipeline.
  • PIPELINE_NAME: With the name of the pipeline to use for search.
  • DEEPSET_API_KEY: With your Haystack Platform API key.

Configuring Component Outputs in Streaming

By default, streaming endpoints return outputs from all components in your pipeline. You can control which component outputs are included using the include_outputs_from parameter. This parameter accepts an array of component names.

For example, to only receive outputs from specific components:

{
"query": "your question here",
"include_outputs_from": ["retriever", "generator"]
}

If include_outputs_from is not specified, the streaming response will include outputs from all components in the pipeline.

Determining Which Generator Streamed

If your pipeline includes multiple Generators with streaming enabled, you can determine which Generator streamed a specific chunk of data by checking its name in the API response. This information is available in the delta field.

Below is a partial example of a response from the Search Stream endpoint, showing two streaming-enabled Generators: chat_summary_llm and qa_llm.

{
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "delta": {
    "text": "girl bands?",
    "meta": {
      "index": 0,
      "deepset_cloud": {
        "component": "chat_summary_llm"
      }
    }
  },
  "type": "delta"
}

{
  "query_id": "290a1f96-57d6-4843-8ed7-2a224142398b",
  "delta": {
    "text": "Base",
    "meta": {
      "index": 0,
      "deepset_cloud": {
        "component": "qa_llm"
      }
    }
  },
  "type": "delta"
}
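To keep the output of each Generator separate on the client side, you could group chunks by the component name found in the delta metadata. The sketch below operates on decoded event dictionaries shaped like the ones above; text_by_component is a hypothetical helper, and the fallback key unknown for chunks without a component name is an assumption.

```python
from collections import defaultdict


def text_by_component(events):
    """Concatenate delta text per component, keyed by component name."""
    texts = defaultdict(str)
    for event in events:
        if event.get("type") != "delta":
            continue  # only delta events carry generated text
        delta = event["delta"]
        component = (
            delta.get("meta", {}).get("deepset_cloud", {}).get("component", "unknown")
        )
        texts[component] += delta["text"]
    return dict(texts)
```

Applied to the two events shown above, the result maps chat_summary_llm to "girl bands?" and qa_llm to "Base".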
