PgvectorDocumentStore
Use Cloud SQL for PostgreSQL hosted on Google Cloud through the PgvectorDocumentStore.
Basic Information
- Used with the following retrievers:
  - PgvectorEmbeddingRetriever
  - PgvectorKeywordRetriever
- Needs the CloudSQLAuthProxy component to be present in a pipeline to set up the connection with the database.
- Type: haystack_integrations.document_stores.pgvector.document_store.PgvectorDocumentStore
Overview
deepset AI Platform can connect to your Cloud SQL for PostgreSQL with the pgvector extension.
Cloud SQL for PostgreSQL is a fully managed relational database service from Google Cloud. It streamlines database management by handling backups, updates, and scaling so you can focus on your applications. For details, see Google Cloud documentation.
The pgvector extension enables vector similarity search for Postgres. For details, see the pgvector GitHub repository. In deepset AI Platform, the database is represented as the PgvectorDocumentStore, through which your pipelines access your data.
To enable deepset pipelines to query data in your Cloud SQL for PostgreSQL database, you connect deepset AI Platform to your database using the CloudSQLAuthProxy component. When included in your pipeline, CloudSQLAuthProxy creates a secure connection to your Google Cloud SQL database. You only need to add it to the pipeline. No additional connections to other components are required, as its sole role is establishing the database connection.
You then use a Pgvector retriever that accesses the PgvectorDocumentStore and fetches the relevant documents.
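As a rough illustration of the retriever side, the fragment below sketches how a PgvectorEmbeddingRetriever might be declared in a query pipeline. It is a sketch under assumptions, not a complete pipeline: the component name `retriever`, the `top_k` value, and the retriever's module path are assumptions to verify against the Haystack API reference, while the store settings mirror the indexing example on this page. The CloudSQLAuthProxy component would sit alongside it, unconnected, and a text embedder producing 768-dimensional vectors would need to feed the retriever's query embedding input.

```yaml
components:
  retriever:
    # Module path assumed from the Haystack pgvector integration; verify before use.
    type: haystack_integrations.components.retrievers.pgvector.embedding_retriever.PgvectorEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.pgvector.document_store.PgvectorDocumentStore
        init_parameters:
          # These values should match the table written by the indexing pipeline.
          table_name: deepset_test
          embedding_dimension: 768
          vector_function: cosine_similarity
          search_strategy: hnsw
      # Illustrative value, not taken from the original page.
      top_k: 10
```

The sketch leaves out `recreate_table` on purpose, so that a query pipeline does not drop the table the indexing pipeline populated.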
Limitations
The following features don't work in this setup, as it's an external database and deepset AI Platform doesn't have access to this information:
- Pipeline indexing status. The pipeline will show as partially indexed.
- The number of indexed documents. They'll show as skipped.
- Automatic index creation or deletion when deploying or undeploying pipelines. You'll need to manage your index in Cloud SQL. For guidance, refer to Cloud SQL documentation.
See also Haystack documentation on PgvectorDocumentStore.
components:
  cloud_sql:
    type: deepset_cloud_custom_nodes.auth.cloud_sql_auth.CloudSQLAuthProxy
    init_parameters:
      url: "https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.13.0/cloud-sql-proxy.darwin.arm64"
      instance_connection_name: "careful-time-421813:us-central1:myinstance"
      # json_credentials are stored in the CLOUD_SQL_DATABASE env variable, from which they're read by default, so they aren't specified here
  FileTypeRouter:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - text/markdown
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  MarkdownToDocument:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters:
      table_to_single_line: false
      progress_bar: true
  DocumentJoiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      weights: null
      top_k: null
      sort_by_score: true
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function: null
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.pgvector.document_store.PgvectorDocumentStore
        init_parameters:
          table_name: deepset_test
          embedding_dimension: 768
          vector_function: cosine_similarity
          recreate_table: true
          search_strategy: hnsw
      policy: NONE
connections:
  - sender: FileTypeRouter.text/plain
    receiver: TextFileToDocument.sources
  - sender: FileTypeRouter.text/markdown
    receiver: MarkdownToDocument.sources
  - sender: TextFileToDocument.documents
    receiver: DocumentJoiner.documents
  - sender: MarkdownToDocument.documents
    receiver: DocumentJoiner.documents
  - sender: DocumentJoiner.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: DocumentWriter.documents
max_loops_allowed: 100
metadata: {}
inputs:
  files:
    - FileTypeRouter.sources
Init Parameters
Check the PgvectorDocumentStore API reference in Haystack documentation.