DeepsetSnowflakeRetriever

Use this component to connect to a Snowflake database and execute an SQL query.

Basic Information

  • Pipeline type: Used in query pipelines.
  • Type: deepset_cloud_custom_nodes.retrievers.snowflake_retriever.DeepsetSnowflakeRetriever
  • Components it can connect with:
    • DeepsetSnowflakeRetriever typically follows a Generator that turns the natural language query into SQL. To make the Generator's output compatible with DeepsetSnowflakeRetriever's input, add OutputAdapter between them. The Generator returns replies as a list of strings, while DeepsetSnowflakeRetriever requires a query in the form of a single string. The OutputAdapter can convert the list of strings into a single string. For more details, see the Usage Examples section.
    • DeepsetSnowflakeRetriever can pass its output table to a PromptBuilder, which then includes it in a rendered prompt for a Generator to create the final answer.

Inputs

NameTypeDescription
queryStringA SQL query to execute against a Snowflake database.

Outputs

NameTypeDescription
tableStringThe database table matching the query formatted as Markdown.
dataframepd.DataFrameA pandas dataframe (like an Excel spreadsheet) containing the results of your Snowflake database query

Overview

Snowflake is a cloud-based data warehousing platform that provides an SQL database engine. With DeepsetSnowflakeRetriever, you can pass all the details needed to query your data in Snowflake with your deepset Cloud pipeline.

You can use DeepsetSnowflakeRetriever on its own. Then, you need to provide an SQL query to it, and you'll get a database table formatted as Markdown and the query as an output. However, a common way is to use DeepsetSnowflakeRetriever preceded by a Generator instructed to turn the user query into SQL followed by another Generator that answers the query based on the database tables DeepsetSnowflakeRetriever retrieved.

Usage Example

This is an example of a query pipeline with DeepsetSnowflakeRetriever and two Generators: one to translate a natural language query into SQL and send it to DeepsetSnowflakeRetriever and another to construct an answer.

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/v2.0/docs/create-a-pipeline#create-a-pipeline-using-pipeline-editor.
# This section defines components that you want to use in your pipelines. Each component must have a name and a type. You can also set the component's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying the connections in the pipeline.
# Type is the class path of the component. You can check the type on the component's documentation page.
components:
  bm25_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 1024
          similarity: cosine
      top_k: 20 # The number of results to return

  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: "BAAI/bge-m3"
      tokenizer_kwargs:
        model_max_length: 1024

  embedding_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 1024
          similarity: cosine
      top_k: 20 # The number of results to return

  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: "BAAI/bge-reranker-v2-m3"
      top_k: 5
      model_kwargs:
        torch_dtype: "torch.float16"
      tokenizer_kwargs:
        model_max_length: 1024

  sql_prompt_builder:
      type: haystack.components.builders.prompt_builder.PromptBuilder
      init_parameters:
        template: |-
          You are a SQL expert working with Snowflake.
  
          Your task is to create a Snowflake SQL query for the given question.
  
          Refrain from explaining your answer. Your answer must be the SQL query
          in plain text format without using Markdown.
  
          Here are some relevant tables, a description about them, and their
          columns:
  
          {% for document in documents %}
  
          Document[{{ loop.index }}]:
  
          {{ document.content }}
  
          {% endfor %}
  
          User's question: {{ question }}
  
          Generated SQL query:

  sql_llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": False}
      model: "gpt-4o"
      generation_kwargs:
        max_tokens: 650
        temperature: 0.0
        seed: 0

  snowflake_retriever:
    type: deepset_cloud_custom_nodes.retrievers.snowflake_retriever.DeepsetSnowflakeRetriever
    init_parameters:
      user: "<snowflake-user-identifier>"
      account: "<snowflake-account-identifier>"

  replies_to_sql:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ replies[0] }}'
      output_type: str

  display_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are an expert data analyst.

        Your role is to answer the user's question {{ question }} using the information
        in the table.

        You will base your response solely on the information provided in the
        table(s).

        Do not rely on your knowledge base; only the data that is in the table.

        Refrain from using the term "table" in your response, but instead, use
        the word "data".

        If the table is blank say:

        "The specific answer can't be found in the database. Try rephrasing your
        question."

        Additionally, you will present the table in a tabular format and provide
        the SQL query used to extract the relevant rows from the database in
        Markdown.

        If the table is larger than 10 rows, display the most important rows up
        to 10 rows. Your answer must be detailed and provide insights based on
        the question and the available data.

        SQL query:
        {{ sql_query }}

        Table:
        {{ table }}

        Answer:

  display_llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": False}
      model: "gpt-4o"
      generation_kwargs:
        max_tokens: 2000
        temperature: 0.0
        seed: 0

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm
  
connections:
  - sender: bm25_retriever.documents
    receiver: document_joiner.documents
  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding
  - sender: embedding_retriever.documents
    receiver: document_joiner.documents
  - sender: document_joiner.documents
    receiver: ranker.documents
  - sender: ranker.documents
    receiver: sql_prompt_builder.documents
  - sender: ranker.documents
    receiver: answer_builder.documents
  - sender: sql_prompt_builder.prompt
    receiver: sql_llm.prompt
  - sender: sql_llm.replies
    receiver: replies_to_sql.replies
  - sender: replies_to_sql.output
    receiver: snowflake_retriever.query
  - sender: snowflake_retriever.table
    receiver: display_prompt_builder.table
  - sender: replies_to_sql.output
    receiver: display_prompt_builder.sql_query
  - sender: display_prompt_builder.prompt
    receiver: display_llm.prompt
  - sender: display_prompt_builder.prompt
    receiver: answer_builder.prompt
  - sender: display_llm.replies
    receiver: answer_builder.replies

inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
  - "bm25_retriever.query"
  - "query_embedder.text"
  - "ranker.query"
  - "sql_prompt_builder.question"
  - "display_prompt_builder.question"
  - "answer_builder.query"

  filters:  # These components will receive a potential query filter as input
  - "bm25_retriever.filters"
  - "embedding_retriever.filters"

outputs:  # Defines the output of your pipeline
  documents: "ranker.documents"  # The output of the pipeline is the retrieved documents
  answers: "answer_builder.answers"  # The output of the pipeline is the generated answers

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/v2.0/docs/create-a-pipeline#create-a-pipeline-using-pipeline-editor.
# This section defines components that you want to use in your pipelines. Each component must have a name and a type. You can also set the component's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying the connections in the pipeline.
# Type is the class path of the component. You can check the type on the component's documentation page.
components:
  bm25_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 1024
          similarity: cosine
      top_k: 20 # The number of results to return

  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: "BAAI/bge-m3"
      tokenizer_kwargs:
        model_max_length: 1024

  embedding_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          use_ssl: True
          verify_certs: False
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
          embedding_dim: 1024
          similarity: cosine
      top_k: 20 # The number of results to return

  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  ranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: "BAAI/bge-reranker-v2-m3"
      top_k: 5
      model_kwargs:
        torch_dtype: "torch.float16"
      tokenizer_kwargs:
        model_max_length: 1024

  sql_prompt_builder:
      type: haystack.components.builders.prompt_builder.PromptBuilder
      init_parameters:
        template: |-
          You are a SQL expert working with Snowflake.
  
          Your task is to create a Snowflake SQL query for the given question.
  
          Refrain from explaining your answer. Your answer must be the SQL query
          in plain text format without using Markdown.
  
          Here are some relevant tables, a description about them, and their
          columns:
  
          {% for document in documents %}
  
          Document[{{ loop.index }}]:
  
          {{ document.content }}
  
          {% endfor %}
  
          User's question: {{ question }}
  
          Generated SQL query:

  sql_llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": False}
      model: "gpt-4o"
      generation_kwargs:
        max_tokens: 650
        temperature: 0.0
        seed: 0

  snowflake_retriever:
    type: deepset_cloud_custom_nodes.retrievers.snowflake_retriever.DeepsetSnowflakeRetriever
    init_parameters:
      user: "<snowflake-user-identifier>"
      account: "<snowflake-account-identifier>"

  replies_to_sql:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ replies[0] }}'
      output_type: str

  display_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are an expert data analyst.

        Your role is to answer the user's question {{ question }} using the information
        in the table.

        You will base your response solely on the information provided in the
        table(s).

        Do not rely on your knowledge base; only the data that is in the table.

        Refrain from using the term "table" in your response, but instead, use
        the word "data".

        If the table is blank say:

        "The specific answer can't be found in the database. Try rephrasing your
        question."

        Additionally, you will present the table in a tabular format and provide
        the SQL query used to extract the relevant rows from the database in
        Markdown.

        If the table is larger than 10 rows, display the most important rows up
        to 10 rows. Your answer must be detailed and provide insights based on
        the question and the available data.

        SQL query:
        {{ sql_query }}

        Table:
        {{ table }}

        Answer:

  display_llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": False}
      model: "gpt-4o"
      generation_kwargs:
        max_tokens: 2000
        temperature: 0.0
        seed: 0

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm
  
connections:
  - sender: bm25_retriever.documents
    receiver: document_joiner.documents
  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding
  - sender: embedding_retriever.documents
    receiver: document_joiner.documents
  - sender: document_joiner.documents
    receiver: ranker.documents
  - sender: ranker.documents
    receiver: sql_prompt_builder.documents
  - sender: ranker.documents
    receiver: answer_builder.documents
  - sender: sql_prompt_builder.prompt
    receiver: sql_llm.prompt
  - sender: sql_llm.replies
    receiver: replies_to_sql.replies
  - sender: replies_to_sql.output
    receiver: snowflake_retriever.query
  - sender: snowflake_retriever.table
    receiver: display_prompt_builder.table
  - sender: replies_to_sql.output
    receiver: display_prompt_builder.sql_query
  - sender: display_prompt_builder.prompt
    receiver: display_llm.prompt
  - sender: display_prompt_builder.prompt
    receiver: answer_builder.prompt
  - sender: display_llm.replies
    receiver: answer_builder.replies

inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
  - "bm25_retriever.query"
  - "query_embedder.text"
  - "ranker.query"
  - "sql_prompt_builder.question"
  - "display_prompt_builder.question"
  - "answer_builder.query"

  filters:  # These components will receive a potential query filter as input
  - "bm25_retriever.filters"
  - "embedding_retriever.filters"

outputs:  # Defines the output of your pipeline
  documents: "ranker.documents"  # The output of the pipeline is the retrieved documents
  answers: "answer_builder.answers"  # The output of the pipeline is the generated answers

You can use one of our ready-made Text-to-SQL pipeline templates.

Init Parameters

ParameterTypePossible valuesDescription
userStringUser's Snowflake login.
Required.
accountStringSnowflake account identifier.
Required.
api_keySecretDefault: Secret.from_env_var("SNOWFLAKE_API_KEY")Snowflake account password.
Required.
databaseString Default: NoneName of the database to use.
Optional.
db_schemaStringDefault: NoneName of the schema to use.
Optional.
warehouseString Default: NoneName of the warehouse to use.
Optional.
login_timeoutInteger Default: 60Timeout in seconds for login. The default is 60 seconds.
Optional.