SnowflakeTableRetriever

Use this component to connect to a Snowflake database and execute an SQL query.

Basic Information

  • Type: haystack_integrations.components.retrievers.snowflake.snowflake_table_retriever.SnowflakeTableRetriever
  • Components it can connect with:
    • Typically follows a Generator that turns the natural language query into SQL. The Generator returns its replies as a list of strings, while SnowflakeTableRetriever expects the query as a single string, so add an OutputAdapter between them to convert the list into a single string. For details, see the Usage Example section.
    • SnowflakeTableRetriever can pass its output table to a PromptBuilder, which then includes it in a rendered prompt for a Generator to create the final answer.
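The conversion the OutputAdapter performs here can be sketched in plain Python. This is a minimal mimic of an adapter configured with the template `{{ replies[0] }}` and `output_type: str`; the function name is illustrative, not part of the Haystack API:

```python
# Mimics OutputAdapter(template="{{ replies[0] }}", output_type=str):
# the Generator emits a list of reply strings, but SnowflakeTableRetriever
# expects a single SQL string as its "query" input.
def adapt_replies_to_query(replies: list[str]) -> str:
    if not replies:
        raise ValueError("The Generator returned no replies.")
    return replies[0]

# The Generator's output, e.g. {"replies": ["SELECT ..."]}, becomes:
generator_output = {"replies": ["SELECT name, revenue FROM customers LIMIT 10"]}
query = adapt_replies_to_query(generator_output["replies"])
print(query)  # SELECT name, revenue FROM customers LIMIT 10
```

In the YAML pipeline below, the same conversion is done by the replies_to_sql component.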

Inputs

Required Inputs

  • query (String): An SQL query to execute against a Snowflake database.

Optional Inputs

  • return_markdown (Boolean): Whether to return a Markdown-formatted string of the DataFrame. If not provided, uses the value set in the init parameters.

Outputs

  • table (String): The query result, formatted as a Markdown table.
  • dataframe (pd.DataFrame): A pandas DataFrame (like an Excel spreadsheet) containing the results of your Snowflake database query.
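To illustrate the shape of the table output, here is a rough, hand-rolled sketch of how a query result maps to a Markdown table string. The helper is illustrative only; the exact formatting the component produces may differ:

```python
# Illustrative only: renders rows as a Markdown table, approximating the
# "table" output. In a real run, "dataframe" holds the same data as a
# pandas DataFrame.
def to_markdown_table(columns: list[str], rows: list[tuple]) -> str:
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])

table = to_markdown_table(["name", "revenue"], [("Acme", 1200), ("Globex", 950)])
```

A PromptBuilder can then embed this Markdown string directly in a prompt for a Generator, as shown in the pipeline example below.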

Overview

Snowflake is a cloud-based data warehousing platform that provides an SQL database engine. With SnowflakeTableRetriever, you can pass all the details needed to query your data in Snowflake with your deepset pipeline.

You can use SnowflakeTableRetriever on its own. In that case, you provide an SQL query to it, and you get back the resulting database table formatted as Markdown along with a pandas DataFrame. More commonly, though, SnowflakeTableRetriever is preceded by a Generator instructed to turn the user query into SQL and followed by another Generator that answers the query based on the database tables SnowflakeTableRetriever retrieved.

Usage Example

This is an example of a query pipeline with SnowflakeTableRetriever and two Generators: one to translate a natural language query into SQL and send it to SnowflakeTableRetriever and another to construct an answer.

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/v2.0/docs/create-a-pipeline#create-a-pipeline-using-pipeline-editor.
# This section defines components that you want to use in your pipelines. Each component must have a name and a type. You can also set the component's parameters here.
# The name is up to you; you can give your component a friendly name. You then use these names when specifying the connections in the pipeline.
# Type is the class path of the component. You can check the type on the component's documentation page.
components:
  bm25_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20 # The number of results to return
      fuzziness: 0

  query_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
    init_parameters:
      normalize_embeddings: true
      model: "BAAI/bge-m3"

  embedding_retriever: # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20 # The number of results to return

  document_joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: "BAAI/bge-reranker-v2-m3"
      top_k: 5

  sql_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are a SQL expert working with Snowflake.
        Your task is to create a Snowflake SQL query for the given question.
        Refrain from explaining your answer. Your answer must be the SQL query in plain text format without using Markdown.
        Here are some relevant tables, a description about them, and their columns:
        {% for document in documents %}
        Document [{{ loop.index }}] :
        {{ document.content }}
        {% endfor %}

        User's question: {{ question }}    
        Generated SQL query:

  sql_llm:
    type: >-
      deepset_cloud_custom_nodes.generators.deepset_amazon_bedrock_generator.DeepsetAmazonBedrockGenerator
    init_parameters:
      model: anthropic.claude-3-5-sonnet-20240620-v1:0
      aws_region_name: us-east-1
      max_length: 1000
      model_max_length: 200000
      temperature: 0

  snowflake_retriever:
    type: haystack_integrations.components.retrievers.snowflake.snowflake_table_retriever.SnowflakeTableRetriever
    init_parameters:
      user: "<snowflake-user-identifier>"
      account: "<snowflake-account-identifier>"
      warehouse: "<snowflake-warehouse-identifier>"
      database: "<snowflake-database-identifier>"
      db_schema: "<snowflake-schema-identifier>"

  replies_to_sql:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ replies[0] }}'
      output_type: str

  display_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are an expert data analyst.
        Your role is to answer the user's question {{ question }} using the information in the table.
        You will base your response solely on the information provided in the table(s).
        Do not rely on your knowledge base; only the data that is in the table.
        Refrain from using the term "table" in your response, but instead, use the word "data".

        If the table is blank say:
        "The specific answer can't be found in the database. Try rephrasing your question."

        Additionally, you will present the table in a tabular format and provide the SQL query used to extract the relevant rows from the database in Markdown.

        If the table is larger than 10 rows, display the most important rows up to 10 rows.
        Your answer must be detailed and provide insights based on the question and the available data.

        SQL query:
        {{ sql_query }}

        Table:
        {{ table }}

        Answer:

  display_llm:
    type: deepset_cloud_custom_nodes.generators.deepset_amazon_bedrock_generator.DeepsetAmazonBedrockGenerator
    init_parameters:
      model: anthropic.claude-3-5-sonnet-20240620-v1:0
      aws_region_name: us-east-1
      max_length: 2000
      model_max_length: 200000
      temperature: 0

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm

connections:
- sender: bm25_retriever.documents
  receiver: document_joiner.documents
- sender: query_embedder.embedding
  receiver: embedding_retriever.query_embedding
- sender: embedding_retriever.documents
  receiver: document_joiner.documents
- sender: document_joiner.documents
  receiver: ranker.documents
- sender: ranker.documents
  receiver: sql_prompt_builder.documents
- sender: ranker.documents
  receiver: answer_builder.documents
- sender: sql_prompt_builder.prompt
  receiver: sql_llm.prompt
- sender: sql_llm.replies
  receiver: replies_to_sql.replies
- sender: replies_to_sql.output
  receiver: snowflake_retriever.query
- sender: snowflake_retriever.table
  receiver: display_prompt_builder.table
- sender: replies_to_sql.output
  receiver: display_prompt_builder.sql_query
- sender: display_prompt_builder.prompt
  receiver: display_llm.prompt
- sender: display_prompt_builder.prompt
  receiver: answer_builder.prompt
- sender: display_llm.replies
  receiver: answer_builder.replies

inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
  - "bm25_retriever.query"
  - "query_embedder.text"
  - "ranker.query"
  - "sql_prompt_builder.question"
  - "display_prompt_builder.question"
  - "answer_builder.query"

  filters:  # These components will receive a potential query filter as input
  - "bm25_retriever.filters"
  - "embedding_retriever.filters"

outputs:  # Defines the output of your pipeline
  documents: "ranker.documents"  # The output of the pipeline is the retrieved documents
  answers: "answer_builder.answers"  # The output of the pipeline is the generated answers

max_runs_per_component: 100

metadata: {}

You can use one of our ready-made Text-to-SQL pipeline templates.

For more information, see also Haystack documentation.

Init Parameters

  • user (String): User's Snowflake login. Required.
  • account (String): Snowflake account identifier. Required.
  • api_key (Secret, default: Secret.from_env_var("SNOWFLAKE_API_KEY")): Snowflake account password. Required.
  • database (String, default: None): Name of the database to use. Optional.
  • db_schema (String, default: None): Name of the schema to use. Optional.
  • warehouse (String, default: None): Name of the warehouse to use. Optional.
  • login_timeout (Integer, default: 60): Timeout in seconds for login. Optional.
  • return_markdown (Boolean, possible values: True or False, default: True): Whether to return a Markdown-formatted string of the DataFrame. Required.