CohereDocumentEmbedder

Calculate document embeddings using Cohere models. Document embedders are used to embed documents in your index.

Basic Information

Type: haystack_integrations.components.embedders.cohere.document_embedder.CohereDocumentEmbedder
Components it can connect with:
- Converters and Preprocessors: CohereDocumentEmbedder can receive documents to embed from a converter, such as TextFileToDocument, or from a preprocessor, such as DocumentSplitter.
- DocumentWriter: CohereDocumentEmbedder sends embedded documents to DocumentWriter, which writes them into a document store.

Inputs

Parameter	Type	Default	Description
documents	List[Document]		Documents to embed.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Documents with their embeddings added to the `embedding` field.
meta	Dict[str, Any]		Metadata related to the embedding process.

Overview

CohereDocumentEmbedder uses Cohere models to embed a list of documents. It stores each computed embedding in the document's `embedding` field. For a list of supported models, see the Cohere documentation.

Embedding Models in Query Pipelines and Indexes

The embedding model you use to embed documents in your indexing pipeline must be the same as the embedding model you use to embed the query in your query pipeline.

This means the embedders for your indexing and query pipelines must match. For example, if you use CohereDocumentEmbedder to embed your documents, you should use CohereTextEmbedder with the same model to embed your queries.
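
As a rough illustration of this pairing, the fragment below sketches how the query-side embedder might be configured to match an index that uses embed-english-v2.0. It is a minimal sketch, not a complete query pipeline: only the `model` and `input_type` parameters are shown, and the remaining parameters are assumed to keep their defaults.

```yaml
components:
  CohereTextEmbedder:
    type: haystack_integrations.components.embedders.cohere.text_embedder.CohereTextEmbedder
    init_parameters:
      # Must be the same model that CohereDocumentEmbedder uses in the index
      model: embed-english-v2.0
      # Queries are embedded as search queries, documents as search documents
      input_type: search_query
```

If you later change the model in the index, the documents need to be embedded again with the new model, and the query embedder should be updated to the same value.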

Authorization

You need a Cohere API key to use this component. Connect deepset to your Cohere account on the Integrations page.

Add Workspace-Level Integration

1. Click your profile icon and choose Settings.
2. Go to Workspace > Integrations.
3. Find the provider you want to connect and click Connect next to it.
4. Enter the API key and any other required details.
5. Click Connect. You can use this integration in pipelines and indexes in the current workspace.

Add Organization-Level Integration

1. Click your profile icon and choose Settings.
2. Go to Organization > Integrations.
3. Find the provider you want to connect and click Connect next to it.
4. Enter the API key and any other required details.
5. Click Connect. You can use this integration in pipelines and indexes in all workspaces in the current organization.

Usage Example

Initializing the Component

components:
  CohereDocumentEmbedder:
    type: haystack_integrations.components.embedders.cohere.document_embedder.CohereDocumentEmbedder
    init_parameters:

Using the Component in an Index

In this index, CohereDocumentEmbedder receives documents from DocumentSplitter and embeds them. It then sends the embedded documents to DocumentWriter, which writes them into a document store. The index uses the embed-english-v2.0 model, so the CohereTextEmbedder in the query pipeline must also use embed-english-v2.0.

components:
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_threshold: 0
      splitting_function:
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: Standard-Index-English
          max_chunk_bytes: 104857600
          embedding_dim: 4096 # embed-english-v2.0 returns 4096-dimensional vectors
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
          similarity: cosine
      policy: NONE

  CohereDocumentEmbedder:
    type: haystack_integrations.components.embedders.cohere.document_embedder.CohereDocumentEmbedder
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - COHERE_API_KEY
        - CO_API_KEY
        strict: false
      model: embed-english-v2.0
      input_type: search_document
      api_base_url: https://api.cohere.com
      truncate: END
      use_async_client: false
      timeout: 120
      batch_size: 32
      progress_bar: true
      meta_fields_to_embed:
      embedding_separator: "\n" # double-quoted so that YAML reads it as a newline character
      embedding_type:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false

connections:
- sender: TextFileToDocument.documents
  receiver: DocumentSplitter.documents
- sender: DocumentSplitter.documents
  receiver: CohereDocumentEmbedder.documents
- sender: CohereDocumentEmbedder.documents
  receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
  - TextFileToDocument.sources

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
api_key	Secret	Secret.from_env_var(['COHERE_API_KEY', 'CO_API_KEY'])	The Cohere API key.
model	str	embed-english-v2.0	The name of the model to use. Supported models are: `"embed-english-v3.0"`, `"embed-english-light-v3.0"`, `"embed-multilingual-v3.0"`, `"embed-multilingual-light-v3.0"`, `"embed-english-v2.0"`, `"embed-english-light-v2.0"`, `"embed-multilingual-v2.0"`. For the current list, see the Cohere model documentation.
input_type	str	search_document	The type of input you pass to the model. Supported values are "search_document", "search_query", "classification", and "clustering". Optional for models older than v3; required for v3 and later models.
api_base_url	str	https://api.cohere.com	The base URL of the Cohere API.
truncate	str	END	How to handle inputs that exceed the model's maximum input token length. Supported values are "NONE", "START", and "END". "START" discards the start of the input and "END" discards the end of the input. In both cases, input is discarded until the remaining input is exactly the maximum input token length for the model. With "NONE", an input that exceeds the maximum input token length returns an error.
timeout	int	120	Request timeout in seconds.
batch_size	int	32	The number of Documents to encode at once.
progress_bar	bool	True	Whether to show a progress bar. Disabling it can help keep the logs clean in production deployments.
meta_fields_to_embed	Optional[List[str]]	None	List of meta fields that should be embedded along with the Document text.
embedding_separator	str	\n	Separator used to concatenate the meta fields to the Document text.
embedding_type	Optional[EmbeddingTypes]	None	The type of embeddings to return. Defaults to float embeddings. Note that int8, uint8, binary, and ubinary are only valid for v3 models.
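
To show how several of these parameters fit together, here is a hedged sketch of a configuration that embeds a metadata field along with the document text. It assumes your documents carry a `title` meta field, which is a hypothetical name chosen for illustration. The separator is double-quoted so that YAML reads it as a newline character.

```yaml
components:
  CohereDocumentEmbedder:
    type: haystack_integrations.components.embedders.cohere.document_embedder.CohereDocumentEmbedder
    init_parameters:
      model: embed-english-v3.0
      # Required for v3 models; documents in an index are embedded as search documents
      input_type: search_document
      # Inputs longer than the model limit lose their end instead of raising an error
      truncate: END
      batch_size: 32
      # "title" is a hypothetical meta field; replace it with a field your documents have
      meta_fields_to_embed:
      - title
      # Placed between the meta field values and the document text
      embedding_separator: "\n"
```

With this configuration, the text sent to the model for each document would be the title, a newline, and then the document content. Parameters that are left out are assumed to keep the defaults listed in the table above.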

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
documents	List[Document]		Documents to embed.
