LlamaStackChatGenerator

Generate text using models available on a Llama Stack server.

Basic Information

  • Type: haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator
  • Components it can connect with:
    • ChatPromptBuilder: LlamaStackChatGenerator receives a rendered prompt from ChatPromptBuilder.
    • DeepsetAnswerBuilder: LlamaStackChatGenerator sends the generated replies to DeepsetAnswerBuilder through OutputAdapter.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| messages | List[ChatMessage] | | A list of ChatMessage instances representing the input messages. |
| streaming_callback | Optional[StreamingCallbackT] | None | A callback function called when a new token is received from the stream. |
| generation_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for text generation. These parameters override the parameters set in the pipeline configuration. |
| tools | Optional[Union[List[Tool], Toolset]] | None | A list of tools or a Toolset for which the model can prepare calls. If set, it overrides the tools parameter set during component initialization. |
| tools_strict | Optional[bool] | None | Whether to enable strict schema adherence for tool calls. |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| replies | List[ChatMessage] | A list containing the generated responses as ChatMessage instances. |

Overview

Use LlamaStackChatGenerator to generate text with models available on a Llama Stack server. The Llama Stack server supports multiple inference providers, including Ollama, vLLM, Together AI, and various cloud providers.

For a complete list of providers, see the Llama Stack documentation.

You can pass any text generation parameters valid for the OpenAI chat completion API directly to this component using the generation_kwargs parameter.

This component uses the ChatMessage format for structuring both input and output, ensuring coherent and contextually relevant responses in chat-based text generation scenarios.
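
For orientation, here is a minimal sketch of calling the component directly in Python. It assumes a Llama Stack server is already running on localhost:8321 and serving the ollama/llama3.2:3b model (the same setup as the YAML example below); the import path follows the component type shown above.

```python
# A minimal sketch, assuming a Llama Stack server is running on
# localhost:8321 and serving the ollama/llama3.2:3b model.
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

generator = LlamaStackChatGenerator(
    model="ollama/llama3.2:3b",
    # OpenAI-style generation parameters are passed through as-is.
    generation_kwargs={"temperature": 0.2, "max_tokens": 256},
)

# Both input and output use the ChatMessage format.
messages = [ChatMessage.from_user("Explain retrieval-augmented generation in one sentence.")]
result = generator.run(messages=messages)
print(result["replies"][0].text)
```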

Prerequisites

To use this chat generator, set up a Llama Stack server with an inference provider and make sure a model is available. For a quick start on setting up the server with Ollama, see the Llama Stack documentation.

Usage Example

This is an example RAG pipeline with LlamaStackChatGenerator and DeepsetAnswerBuilder connected through OutputAdapter:

```yaml
components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Standard-Index-English'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20
      fuzziness: 0

  query_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  embedding_retriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Standard-Index-English'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20

  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: intfloat/simlm-msmarco-reranker
      top_k: 8

  meta_field_grouping_ranker:
    type: haystack.components.rankers.meta_field_grouping_ranker.MetaFieldGroupingRanker
    init_parameters:
      group_by: file_id
      subgroup_by:
      sort_docs_by: split_id

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm

  ChatPromptBuilder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      template:
        - _content:
            - text: "You are a helpful assistant answering the user's questions based on the provided documents.\nDo not use your own knowledge.\n"
          _role: system
        - _content:
            - text: "Provided documents:\n{% for document in documents %}\nDocument [{{ loop.index }}] :\n{{ document.content }}\n{% endfor %}\n\nQuestion: {{ query }}\n"
          _role: user
      required_variables:
      variables:

  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ replies[0] }}'
      output_type: List[str]
      custom_filters:
      unsafe: false

  LlamaStackChatGenerator:
    type: haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator
    init_parameters:
      model: ollama/llama3.2:3b
      api_base_url: http://localhost:8321/v1/openai/v1
      streaming_callback:
      generation_kwargs:
      tools:
      timeout:
      max_retries:
      http_client_kwargs:

connections:
  - sender: bm25_retriever.documents
    receiver: document_joiner.documents
  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding
  - sender: embedding_retriever.documents
    receiver: document_joiner.documents
  - sender: document_joiner.documents
    receiver: ranker.documents
  - sender: ranker.documents
    receiver: meta_field_grouping_ranker.documents
  - sender: meta_field_grouping_ranker.documents
    receiver: answer_builder.documents
  - sender: meta_field_grouping_ranker.documents
    receiver: ChatPromptBuilder.documents
  - sender: OutputAdapter.output
    receiver: answer_builder.replies
  - sender: ChatPromptBuilder.prompt
    receiver: LlamaStackChatGenerator.messages
  - sender: LlamaStackChatGenerator.replies
    receiver: OutputAdapter.replies

inputs:
  query:
    - "bm25_retriever.query"
    - "query_embedder.text"
    - "ranker.query"
    - "answer_builder.query"
    - "ChatPromptBuilder.query"
  filters:
    - "bm25_retriever.filters"
    - "embedding_retriever.filters"

outputs:
  documents: "meta_field_grouping_ranker.documents"
  answers: "answer_builder.answers"

max_runs_per_component: 100

metadata: {}
```

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | | The name of the model to use for chat completion. This depends on the inference provider used for the Llama Stack server. |
| api_base_url | str | http://localhost:8321/v1/openai/v1 | The Llama Stack API base URL. If not specified, localhost is used with the default port 8321. |
| organization | Optional[str] | None | Your organization ID. |
| streaming_callback | Optional[StreamingCallbackT] | None | A callback function called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument. |
| generation_kwargs | Optional[Dict[str, Any]] | None | Other parameters to use for the model. These parameters are sent directly to the Llama Stack endpoint. See the Llama Stack API documentation for more details. Supported parameters include: max_tokens (the maximum number of tokens the output text can have), temperature (sampling temperature for creativity control), top_p (nucleus sampling probability mass), stream (whether to stream back partial progress), safe_prompt (whether to inject a safety prompt before all conversations), random_seed (the seed to use for random sampling), and response_format (a JSON schema or Pydantic model that enforces the structure of the model's response; see the sketch after this table). |
| timeout | Optional[int] | None | Timeout for client calls. If not set, it defaults to either the OPENAI_TIMEOUT environment variable or 30 seconds. |
| tools | Optional[Union[List[Tool], Toolset]] | None | A list of tools or a Toolset for which the model can prepare calls. Each tool should have a unique name. |
| tools_strict | bool | False | Whether to enable strict schema adherence for tool calls. If set to True, the model follows exactly the schema provided in the parameters field of the tool definition, but this may increase latency. |
| max_retries | Optional[int] | None | Maximum number of retries to contact the server after an internal error. If not set, it defaults to either the OPENAI_MAX_RETRIES environment variable or five. |
| http_client_kwargs | Optional[Dict[str, Any]] | None | A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient. For more information, see the HTTPX documentation. |
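
As a hedged illustration of the response_format option listed under generation_kwargs above, the sketch below requests structured output with a Pydantic model. Whether the model actually honors the schema depends on the inference provider behind your Llama Stack server; CityInfo is a hypothetical schema used only for illustration.

```python
# A sketch of structured output via response_format, assuming the model
# behind the Llama Stack server supports OpenAI-style structured responses.
from pydantic import BaseModel
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

class CityInfo(BaseModel):  # hypothetical schema, for illustration only
    city: str
    country: str

generator = LlamaStackChatGenerator(
    model="ollama/llama3.2:3b",
    generation_kwargs={"response_format": CityInfo},
)

reply = generator.run(messages=[ChatMessage.from_user("Where is the Eiffel Tower?")])["replies"][0]
print(reply.text)  # a JSON string conforming to the CityInfo schema
```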

Run Method Parameters

These are the parameters you can configure for the component's run() method. You can pass these parameters at query time through the API, in Playground, or when running a job.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| messages | List[ChatMessage] | | A list of ChatMessage instances representing the input messages. |
| streaming_callback | Optional[StreamingCallbackT] | None | A callback function called when a new token is received from the stream. |
| generation_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for text generation. These parameters override the parameters set in the pipeline configuration. |
| tools | Optional[Union[List[Tool], Toolset]] | None | A list of tools or a Toolset for which the model can prepare calls. If set, it overrides the tools parameter set during component initialization (see the sketch after this table). |
| tools_strict | Optional[bool] | None | Whether to enable strict schema adherence for tool calls. |
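
To illustrate run-time overrides, here is a hedged sketch that passes tools and generation_kwargs directly to run(), overriding whatever was set at initialization. The get_weather function and its JSON schema are hypothetical; Tool is Haystack's standard tool abstraction.

```python
# A sketch of run-time parameter overrides: tools and generation_kwargs
# passed to run() take precedence over values set at initialization.
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

def get_weather(city: str) -> str:  # hypothetical tool function
    return f"Sunny in {city}"

weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

generator = LlamaStackChatGenerator(model="ollama/llama3.2:3b")
result = generator.run(
    messages=[ChatMessage.from_user("What's the weather in Paris?")],
    tools=[weather_tool],                    # overrides tools set at init
    generation_kwargs={"temperature": 0.0},  # overrides init-time kwargs
)

# If the model decided to call the tool, the reply carries tool calls.
print(result["replies"][0].tool_calls)
```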