About this Task

deepset AI Platform provides the SnowflakeTableRetriever component that connects to your Snowflake database. To query your Snowflake data, you add this component to your query pipeline. It takes a SQL query as input and returns a database table that matches the query and the generated answer.

Prerequisites

You need the following information about your Snowflake database:

Snowflake account identifier
Snowflake user login
Warehouse name
Schema name
Database name

Query Your Snowflake Database

First, connect deepset AI Platform to Snowflake through the Integrations page. You can add the integration only for a particular workspace or for the whole organization:

Add Workspace-Level Integration

Click your profile icon and choose Settings.
Go to Workspace>Integrations.
Find the provider you want to connect and click Connect next to them.
Enter the API key and any other required details.
Click Connect. You can use this integration in pipelines and indexes in the current workspace.

Add Organization-Level Integration

Click your profile icon and choose Settings.
Go to Organization>Integrations.
Find the provider you want to connect and click Connect next to them.
Enter the API key and any other required details.
Click Connect. You can use this integration in pipelines and indexes in all workspaces in the current organization.

Then, add SnowflakeTableRetriever to your query pipeline or use one of our ready-made Text-to-SQL pipeline templates.

Usage Example

This is an example of a query pipeline with SnowflakeTableRetriever and two Generators: one to translate a natural language query into SQL and send it to SnowflakeTableRetriever and another to construct an answer.

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/v2.0/docs/create-a-pipeline#create-a-pipeline-using-pipeline-editor.
# This section defines components that you want to use in your pipelines. Each component must have a name and a type. You can also set the component's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying the connections in the pipeline.
# Type is the class path of the component. You can check the type on the component's documentation page.
components:
  bm25_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20 # The number of results to return
      fuzziness: 0

  query_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
    init_parameters:
      normalize_embeddings: true
      model: "BAAI/bge-m3"

  embedding_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20 # The number of results to return

  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: "BAAI/bge-reranker-v2-m3"
      top_k: 5

  sql_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are a SQL expert working with Snowflake.
        Your task is to create a Snowflake SQL query for the given question.
        Refrain from explaining your answer. Your answer must be the SQL query in plain text format without using Markdown.
        Here are some relevant tables, a description about them, and their columns:
        {% for document in documents %}
        Document [{{ loop.index }}] :
        {{ document.content }}
        {% endfor %}

        User's question: {{ question }}    
        Generated SQL query:

  sql_llm:
    type: >-
      deepset_cloud_custom_nodes.generators.deepset_amazon_bedrock_generator.DeepsetAmazonBedrockGenerator
    init_parameters:
      model: anthropic.claude-3-5-sonnet-20240620-v1:0
      aws_region_name: us-east-1
      max_length: 1000
      model_max_length: 200000
      temperature: 0

  snowflake_retriever:
    type: haystack_integrations.components.retrievers.snowflake.snowflake_table_retriever.SnowflakeTableRetriever
    init_parameters:
      user: "<snowflake-user-identifier>"
      account: "<snowflake-account-identifier>"
      warehouse: "<snowflake-warehouse-identifier>"
      database: "<snowflake-database-identifier>"
      db_schema: "<snowflake-schema-identifier>"

  replies_to_sql:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ replies[0] }}'
      output_type: str

  display_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are an expert data analyst.
        Your role is to answer the user's question {{ question }} using the information in the table.
        You will base your response solely on the information provided in the table(s).
        Do not rely on your knowledge base; only the data that is in the table.
        Refrain from using the term "table" in your response, but instead, use the word "data".

        If the table is blank say:
        "The specific answer can't be found in the database. Try rephrasing your question."

        Additionally, you will present the table in a tabular format and provide the SQL query used to extract the relevant rows from the database in Markdown.

        If the table is larger than 10 rows, display the most important rows up to 10 rows.
        Your answer must be detailed and provide insights based on the question and the available data.

        SQL query:
        {{ sql_query }}

        Table:
        {{ table }}

        Answer:

  display_llm:
    type: deepset_cloud_custom_nodes.generators.deepset_amazon_bedrock_generator.DeepsetAmazonBedrockGenerator
    init_parameters:
      model: anthropic.claude-3-5-sonnet-20240620-v1:0
      aws_region_name: us-east-1
      max_length: 2000
      model_max_length: 200000
      temperature: 0

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm

connections:
- sender: bm25_retriever.documents
  receiver: document_joiner.documents
- sender: query_embedder.embedding
  receiver: embedding_retriever.query_embedding
- sender: embedding_retriever.documents
  receiver: document_joiner.documents
- sender: document_joiner.documents
  receiver: ranker.documents
- sender: ranker.documents
  receiver: sql_prompt_builder.documents
- sender: ranker.documents
  receiver: answer_builder.documents
- sender: sql_prompt_builder.prompt
  receiver: sql_llm.prompt
- sender: sql_llm.replies
  receiver: replies_to_sql.replies
- sender: replies_to_sql.output
  receiver: snowflake_retriever.query
- sender: snowflake_retriever.table
  receiver: display_prompt_builder.table
- sender: replies_to_sql.output
  receiver: display_prompt_builder.sql_query
- sender: display_prompt_builder.prompt
  receiver: display_llm.prompt
- sender: display_prompt_builder.prompt
  receiver: answer_builder.prompt
- sender: display_llm.replies
  receiver: answer_builder.replies

inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
  - "bm25_retriever.query"
  - "query_embedder.text"
  - "ranker.query"
  - "sql_prompt_builder.question"
  - "display_prompt_builder.question"
  - "answer_builder.query"

  filters:  # These components will receive a potential query filter as input
  - "bm25_retriever.filters"
  - "embedding_retriever.filters"

outputs:  # Defines the output of your pipeline
  documents: "ranker.documents"  # The output of the pipeline is the retrieved documents
  answers: "answer_builder.answers"  # The output of the pipeline is the generated answers

max_runs_per_component: 100

metadata: {}