VoyageDocumentEmbedder
Computes document embeddings using Voyage AI models and stores them in each document's embedding field for use in indexing pipelines.
Key Features
- Supports Voyage AI embedding models optimized for document retrieval.
- input_type parameter for document-specific embedding optimization.
- Configurable output dimensions for voyage-3-large and voyage-code-3 models.
- Processes documents in batches with an optional progress bar.
- Embeds metadata fields alongside document content.
Configuration
To use this component, connect Haystack Platform with Voyage AI first. For detailed instructions, see Use Voyage AI Models.
- Drag the VoyageDocumentEmbedder component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab:
  - Enter the name of the Voyage AI embedding model to use. See Voyage Embeddings documentation for available models.
- Go to the Advanced tab to configure the API key, input_type, truncation, batch size, and metadata fields to embed.
Connections
VoyageDocumentEmbedder accepts a list of documents as input. It outputs the same documents with embeddings stored in the embedding field.
Use this component in indexing pipelines. Connect a preprocessor like DocumentSplitter to its documents input, and connect its documents output to DocumentWriter.
Usage Example
This is an example index with VoyageDocumentEmbedder for document embedding:
components:
  converter:
    type: haystack.components.converters.multi_file_converter.MultiFileConverter
    init_parameters:
      encoding: utf-8
  cleaner:
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_repeated_substrings: false
      keep_id: false
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
      split_threshold: 0
  document_embedder:
    type: haystack_integrations.components.embedders.voyage_embedders.voyage_document_embedder.VoyageDocumentEmbedder
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - VOYAGE_API_KEY
        strict: false
      model: voyage-3
      input_type: document
      truncate: true
      prefix:
      suffix:
      output_dimension:
      output_dtype: float
      batch_size: 32
      metadata_fields_to_embed:
      embedding_separator: "\n"
      progress_bar: true
      timeout:
      max_retries:
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'default'
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE
connections:
  - sender: converter.documents
    receiver: cleaner.documents
  - sender: cleaner.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
max_runs_per_component: 100
metadata: {}
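If you change the embedding size, the document store's embedding_dim must match the size of the vectors the embedder produces. The fragment below is a minimal sketch of that pairing, not a complete index. It assumes that 512 is an accepted output_dimension for voyage-3-large, so confirm the accepted values in the Voyage Embeddings documentation before using it. Parameters not shown keep their defaults.

```yaml
# Sketch only: shows the two settings that must agree, not a full pipeline.
components:
  document_embedder:
    type: haystack_integrations.components.embedders.voyage_embedders.voyage_document_embedder.VoyageDocumentEmbedder
    init_parameters:
      model: voyage-3-large    # output_dimension is only supported by voyage-3-large and voyage-code-3
      input_type: document
      output_dimension: 512    # assumed value, check the Voyage documentation
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 512   # must equal output_dimension above
      policy: OVERWRITE
```

An index created with a different embedding_dim keeps its original vector size, so you would likely need to write to a new index after changing the dimension.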
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents to embed. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents with embeddings stored in the embedding field. |
| meta | Dict[str, Any] | | Metadata about the embedding operation. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | Secret | Secret.from_env_var('VOYAGE_API_KEY') | The Voyage AI API key. It can be explicitly provided or automatically read from the environment variable VOYAGE_API_KEY. |
| model | str | voyage-3 | The name of the Voyage model to use. See the Voyage Embeddings documentation for available models. |
| input_type | Optional[str] | None | Type of the input text. Set to "document" for indexing documents or "query" for search queries. When set, prepends an appropriate prompt to the text. |
| truncate | bool | True | Whether to truncate the input text to fit within the context length. If False, an error is raised when the text exceeds the context length. |
| prefix | str | "" | A string to add to the beginning of each text. |
| suffix | str | "" | A string to add to the end of each text. |
| output_dimension | Optional[int] | None | The dimension of the output embedding. Only supported by voyage-3-large and voyage-code-3 models. |
| output_dtype | str | float | The data type for the embeddings. Options: "float", "int8", "uint8", "binary", "ubinary". |
| batch_size | int | 32 | Number of documents to encode at once. |
| metadata_fields_to_embed | Optional[List[str]] | None | List of metadata fields to embed along with the document content. |
| embedding_separator | str | "\n" | Separator used to concatenate metadata fields to the document content. |
| progress_bar | bool | True | Whether to show a progress bar during processing. |
| timeout | Optional[int] | None | Timeout for Voyage AI client calls. If not set, it is inferred from the VOYAGE_TIMEOUT environment variable or set to 30. |
| max_retries | Optional[int] | None | Maximum number of retries if Voyage AI returns an internal error. If not set, it is inferred from the VOYAGE_MAX_RETRIES environment variable or set to 5. |
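The metadata options in the table above can be sketched as the following fragment. The field names title and category are hypothetical and stand in for whatever metadata your documents carry. The batch_size of 16 is an arbitrary illustration, not a recommendation.

```yaml
# Sketch only: embeds two metadata fields together with the document content.
components:
  document_embedder:
    type: haystack_integrations.components.embedders.voyage_embedders.voyage_document_embedder.VoyageDocumentEmbedder
    init_parameters:
      model: voyage-3
      input_type: document
      metadata_fields_to_embed:
        - title        # hypothetical metadata field
        - category     # hypothetical metadata field
      embedding_separator: "\n"   # joins the metadata values and the content
      batch_size: 16              # documents encoded per batch
```

With this configuration, the values of the listed metadata fields are concatenated with the document content using the separator, and the combined text is what gets embedded.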
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents to embed. |