Connect to Your Cloud SQL for PostgreSQL Database

Connect your Cloud SQL for PostgreSQL database to deepset Cloud and run pipelines on your data. Keep in mind that with this setup, you are responsible for managing your index in Cloud SQL.

About this Task

deepset Cloud can connect to your Cloud SQL for PostgreSQL database if it has the pgvector extension enabled.

Cloud SQL for PostgreSQL is a fully managed relational database service from Google Cloud. It streamlines database management by handling backups, updates, and scaling so you can focus on your applications. For details, see Google Cloud documentation.

The pgvector extension enables vector similarity search for Postgres. For details, see the pgvector GitHub repository.

To enable deepset Cloud pipelines to query data in your Cloud SQL for PostgreSQL database, you can connect deepset Cloud to your database using the CloudSQLAuthProxy component. When included in your pipeline, CloudSQLAuthProxy creates a secure connection to your Google Cloud SQL database. You only need to add it to the pipeline—no additional connections to other components are required, as its sole role is establishing the database connection.

To learn more about the component, see CloudSQLAuthProxy.

Limitations

The following features don't work in this setup because the database is external and deepset Cloud doesn't have access to this information:

  • Pipeline indexing status. The pipeline will show as partially indexed.
  • The number of indexed documents. They'll show as skipped.
  • Automatic index creation or deletion when deploying or undeploying pipelines. You'll need to manage your index in Cloud SQL. For guidance, refer to Cloud SQL documentation.

Query Your Cloud SQL Database

You only need to add the CloudSQLAuthProxy component to your pipeline. It doesn't connect to any other component; its sole task is establishing a connection to Cloud SQL. Currently, CloudSQLAuthProxy is available only in the YAML editor.

Here is an example of how to add the component to the YAML configuration. You can add it at the beginning of your pipeline. Don't list it in the connections section, because it doesn't connect to other components.

components:
  cloud_sql:
    type: deepset_cloud_custom_nodes.auth.cloud_sql_auth.CloudSQLAuthProxy
    init_parameters:
      url: "https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.13.0/cloud-sql-proxy.darwin.arm64"
      instance_connection_name: "careful-time-421813:us-central1:myinstance"
      # we store json_credentials in the CLOUD_SQL_DATABASE env variable from which they're read by default, so we don't specify them here
      
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

    ....
    
connections:
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  ...
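
The proxy only opens the connection; a component that actually reads from the database still has to be configured elsewhere in the pipeline. As a rough sketch, assuming your pipeline uses PgvectorEmbeddingRetriever and PgvectorDocumentStore from Haystack's pgvector integration (neither is named on this page), a retriever entry under components might look like the following. The component name, table name, embedding dimension, and the use of the PG_CONN_STR environment variable for the connection string are assumptions; adjust them to match your own database.

```yaml
components:
  # Hypothetical retriever entry; the component types assume Haystack's pgvector integration.
  retriever:
    type: haystack_integrations.components.retrievers.pgvector.embedding_retriever.PgvectorEmbeddingRetriever
    init_parameters:
      top_k: 10
      document_store:
        type: haystack_integrations.document_stores.pgvector.document_store.PgvectorDocumentStore
        init_parameters:
          # Assumption: the connection string is read from the PG_CONN_STR environment variable.
          connection_string:
            type: env_var
            env_vars:
              - PG_CONN_STR
            strict: true
          table_name: haystack_documents   # assumed table name
          embedding_dimension: 768         # assumed; must match your embedding model
          vector_function: cosine_similarity
```

As noted under Limitations, you remain responsible for managing the index that this store points to in Cloud SQL.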

Related Links