CohereDocumentEmbedder
Calculate document embeddings using Cohere models. Use this component in indexing pipelines to embed documents before writing them to a document store.
Key Features
- Uses Cohere models to embed a list of documents.
- Adds the computed embeddings to the document's embedding field.
- Supports multiple Cohere embedding models. For a full list, see the Cohere documentation.
- Configurable batch size and progress bar for large document sets.
- Supports embedding additional metadata fields alongside document content.
Embedding Models in Query Pipelines and Indexes
The embedding model you use to embed documents in your indexing pipeline must be the same as the embedding model you use to embed the query in your query pipeline.
This means the embedders for your indexing and query pipelines must match. For example, if you use CohereDocumentEmbedder to embed your documents, you should use CohereTextEmbedder with the same model to embed your queries.
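As a minimal sketch of this rule, the check below compares the model names in two component configurations. The helper `embedders_match` and both dictionaries are hypothetical and only for illustration. The `CohereTextEmbedder` type path and the `search_query` input type are assumptions based on the integration's naming, not values taken from this page.

```python
def embedders_match(index_embedder: dict, query_embedder: dict) -> bool:
    """Return True when both embedder configs name the same model."""
    return (
        index_embedder["init_parameters"]["model"]
        == query_embedder["init_parameters"]["model"]
    )


# Hypothetical configs mirroring the YAML structure used on this page.
index_embedder = {
    "type": "haystack_integrations.components.embedders.cohere.document_embedder.CohereDocumentEmbedder",
    "init_parameters": {"model": "embed-english-v2.0", "input_type": "search_document"},
}
query_embedder = {
    # Assumed type path and input_type for the query-side embedder.
    "type": "haystack_integrations.components.embedders.cohere.text_embedder.CohereTextEmbedder",
    "init_parameters": {"model": "embed-english-v2.0", "input_type": "search_query"},
}

if not embedders_match(index_embedder, query_embedder):
    raise ValueError("Index and query embedders must use the same model.")
```

A check like this could run before deploying a pipeline pair, so a model mismatch surfaces as a configuration error rather than as poor retrieval results.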
Configuration
- Drag the CohereDocumentEmbedder component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Select the embedding model to use. Make sure Haystack Platform is connected to your Cohere account. For details, see Use Cohere Models.
  - Set the input_type to search_document for indexing pipelines.
- Go to the Advanced tab to configure additional settings such as truncate, timeout, batch_size, meta_fields_to_embed, and embedding_type.
Connections
CohereDocumentEmbedder receives documents from converters such as TextFileToDocument or preprocessors such as DocumentSplitter. It outputs embedded documents through its documents output, which you connect to DocumentWriter to write them into a document store.
Source Code
To check this component's source code, open document_embedder.py in the Haystack Core Integrations repository.
Usage Examples
Basic Configuration
CohereDocumentEmbedder:
  type: haystack_integrations.components.embedders.cohere.document_embedder.CohereDocumentEmbedder
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - COHERE_API_KEY
        - CO_API_KEY
      strict: false
    model: embed-english-v2.0
    input_type: search_document
    api_base_url: https://api.cohere.com
    truncate: END
    use_async_client: false
    timeout: 120
    batch_size: 32
    progress_bar: true
    embedding_separator: \n
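The api_key block lists two environment variables and sets strict to false. The sketch below illustrates one plausible reading of that configuration: the variables are checked in order, the first one that is set wins, and a missing key raises an error only when strict is true. The function `resolve_api_key` is a hypothetical stand-in written for this page, not the library's own code.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_api_key(
    env_vars: Sequence[str],
    strict: bool,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the value of the first environment variable that is set.

    Assumption: with strict=False a missing key yields None instead of an error.
    """
    for name in env_vars:
        value = environ.get(name)
        if value is not None:
            return value
    if strict:
        raise ValueError(f"None of these environment variables are set: {', '.join(env_vars)}")
    return None


# Use an explicit mapping here so the example does not depend on the real environment.
key = resolve_api_key(["COHERE_API_KEY", "CO_API_KEY"], strict=False, environ={"CO_API_KEY": "example-key"})
```

Under this reading, setting either COHERE_API_KEY or CO_API_KEY is enough, and COHERE_API_KEY takes precedence when both are present.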
Using the Component in an Index
In this index, CohereDocumentEmbedder receives documents from DocumentSplitter and embeds them. It then sends the embedded documents to DocumentWriter, which writes them into a document store. The index uses the embed-english-v2.0 model, so the CohereTextEmbedder in the query pipeline must also use embed-english-v2.0.
components:
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function:
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: Standard-Index-English
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      policy: NONE
  CohereDocumentEmbedder:
    type: haystack_integrations.components.embedders.cohere.document_embedder.CohereDocumentEmbedder
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - COHERE_API_KEY
          - CO_API_KEY
        strict: false
      model: embed-english-v2.0
      input_type: search_document
      api_base_url: https://api.cohere.com
      truncate: END
      use_async_client: false
      timeout: 120
      batch_size: 32
      progress_bar: true
      meta_fields_to_embed:
      embedding_separator: \n
      embedding_type:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
connections:
  - sender: DocumentSplitter.documents
    receiver: CohereDocumentEmbedder.documents
  - sender: CohereDocumentEmbedder.documents
    receiver: DocumentWriter.documents
  - sender: TextFileToDocument.documents
    receiver: DocumentSplitter.documents
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - TextFileToDocument.sources
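Each connection in the index above names a sender and a receiver in the form Component.socket. As a rough illustration, the sketch below checks that every connection endpoint refers to a declared component. The helper `undeclared_endpoints` is hypothetical, and the data is copied by hand from the index definition rather than parsed from YAML.

```python
from typing import Dict, List, Set


def undeclared_endpoints(components: Set[str], connections: List[Dict[str, str]]) -> List[str]:
    """Return the endpoints whose component name is not declared under components."""
    missing = []
    for connection in connections:
        for role in ("sender", "receiver"):
            # "DocumentSplitter.documents" -> component "DocumentSplitter"
            component_name = connection[role].split(".", 1)[0]
            if component_name not in components:
                missing.append(connection[role])
    return missing


components = {"TextFileToDocument", "DocumentSplitter", "CohereDocumentEmbedder", "DocumentWriter"}
connections = [
    {"sender": "DocumentSplitter.documents", "receiver": "CohereDocumentEmbedder.documents"},
    {"sender": "CohereDocumentEmbedder.documents", "receiver": "DocumentWriter.documents"},
    {"sender": "TextFileToDocument.documents", "receiver": "DocumentSplitter.documents"},
]

problems = undeclared_endpoints(components, connections)
```

For the index shown here, the list of problems is empty, because all three connections reference components declared in the components section.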
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to embed. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents with their embeddings added to the embedding field. |
| meta | Dict[str, Any] | | Metadata related to the embedding process. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | Secret | Secret.from_env_var(['COHERE_API_KEY', 'CO_API_KEY']) | The Cohere API key. |
| model | str | embed-english-v2.0 | The name of the model to use. Supported models are: "embed-english-v3.0", "embed-english-light-v3.0", "embed-multilingual-v3.0", "embed-multilingual-light-v3.0", "embed-english-v2.0", "embed-english-light-v2.0", "embed-multilingual-v2.0". For details, see the Cohere model documentation. |
| input_type | str | search_document | Specifies the type of input you're giving to the model. Supported values are "search_document", "search_query", "classification", and "clustering". Required for v3 and later models. Models earlier than v3 do not need it. |
| api_base_url | str | https://api.cohere.com | The base URL of the Cohere API. |
| truncate | str | END | Controls how inputs that exceed the model's maximum input token length are truncated. Supported values are "NONE", "START", and "END". "START" discards the start of the input and "END" discards the end. In both cases, input is discarded until the remaining input is exactly the maximum input token length for the model. With "NONE", an input that exceeds the maximum input token length returns an error. |
| timeout | int | 120 | Request timeout in seconds. |
| batch_size | int | 32 | The number of Documents to encode at once. |
| progress_bar | bool | True | Whether to show a progress bar. Disabling it can help keep logs clean in production deployments. |
| meta_fields_to_embed | Optional[List[str]] | None | List of meta fields that should be embedded along with the Document text. |
| embedding_separator | str | \n | Separator used to concatenate the meta fields to the Document text. |
| embedding_type | Optional[EmbeddingTypes] | None | The type of embeddings to return. Defaults to float embeddings. Note that int8, uint8, binary, and ubinary are only valid for v3 models. |
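To make meta_fields_to_embed, embedding_separator, and batch_size more concrete, here is a minimal sketch of how they could shape the text sent to the model. It assumes the selected meta values are placed before the document content and joined with the separator, and that texts are sent in consecutive groups of batch_size. The functions `prepare_text` and `batches` are hypothetical and do not call Cohere.

```python
from typing import Any, Dict, Iterator, List, Optional


def prepare_text(
    content: Optional[str],
    meta: Dict[str, Any],
    meta_fields_to_embed: Optional[List[str]] = None,
    embedding_separator: str = "\n",
) -> str:
    """Join the selected meta values and the content into one string to embed."""
    meta_values = [
        str(meta[field])
        for field in (meta_fields_to_embed or [])
        if meta.get(field) is not None  # skip fields that are missing or empty
    ]
    return embedding_separator.join(meta_values + [content or ""])


def batches(texts: List[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


text = prepare_text("Body text", {"title": "Intro"}, meta_fields_to_embed=["title"])
batch_sizes = [len(batch) for batch in batches(["doc"] * 70, batch_size=32)]
```

With these assumptions, a document with the title Intro is embedded as the title, a newline, and then the body, and 70 documents with a batch size of 32 result in three requests.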
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to embed. |