SentenceTransformersDocumentEmbedder
Calculate document embeddings using Sentence Transformers models. Use this component in indexing pipelines to embed documents before writing them to a document store.
Key Features
- Calculates dense embeddings for documents using Sentence Transformers models.
- Stores embeddings in the embedding field of each document.
- Supports embedding document metadata fields alongside document text.
- Configurable batch size and progress reporting.
- Supports L2 normalization for consistent embedding comparison.
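As an illustration of the L2 normalization mentioned above, here is a minimal sketch in plain Python. The helper name l2_normalize is hypothetical and is not part of the component; the component applies normalization internally when normalize_embeddings is enabled.

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) norm is 1.

    Hypothetical helper for illustration only; not the component's code.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]


embedding = [3.0, 4.0]
normalized = l2_normalize(embedding)
print(normalized)  # [0.6, 0.8]
```

Once all embeddings have unit length, the dot product of two embeddings equals their cosine similarity, which is why normalized embeddings are easier to compare consistently.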
Configuration
- Drag the SentenceTransformersDocumentEmbedder component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Set model to the Sentence Transformers model to use (for example, sentence-transformers/all-mpnet-base-v2).
  - Toggle normalize_embeddings to enable L2 normalization.
  - Set batch_size to control how many documents are embedded at once.
- Go to the Advanced tab to configure prefix, suffix, progress_bar, trust_remote_code, token, and device.
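To make the effect of batch_size concrete, the following sketch splits a list of texts into consecutive batches. It is a simplified illustration under the assumption that documents are processed in fixed-size chunks; the batched helper is hypothetical and does not reproduce the component's internals.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 70 placeholder texts embedded with the default batch size of 32.
texts = [f"document {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in batched(texts, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

A larger batch size generally means fewer passes through the model at the cost of more memory per pass, so the best value depends on your hardware.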
We recommend using models available through the DeepsetNvidia components instead of Sentence Transformers models. Add a DeepsetNvidia component to your pipeline and choose an appropriate model from the list.
Embedding Models in Query Pipelines and Indexes
The embedding model you use to embed documents in your indexing pipeline must be the same as the embedding model you use to embed the query in your query pipeline.
This means the embedders for your indexing and query pipelines must match. For example, if you use CohereDocumentEmbedder to embed your documents, you should use CohereTextEmbedder with the same model to embed your queries.
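One way to guard against a mismatch is to compare the model setting of both embedders before deploying. The sketch below assumes the two component configurations have already been loaded into dictionaries that mirror the YAML layout used on this page; the models_match helper is hypothetical.

```python
def models_match(index_embedder, query_embedder):
    """Return True if both embedder configs reference the same model."""
    index_model = index_embedder["init_parameters"]["model"]
    query_model = query_embedder["init_parameters"]["model"]
    return index_model == query_model


# Dictionaries mirroring the component definitions of an index and a query pipeline.
index_embedder = {
    "type": "haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder",
    "init_parameters": {"model": "sentence-transformers/all-mpnet-base-v2"},
}
query_embedder = {
    "type": "haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder",
    "init_parameters": {"model": "sentence-transformers/all-mpnet-base-v2"},
}

print(models_match(index_embedder, query_embedder))  # True
```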
When using custom embedding models, enable GPU acceleration in your index settings if your index is slow:
- Go to Indexes and click the index that contains the SentenceTransformersDocumentEmbedder component. You're redirected to the Index Details page.
- Go to Settings and click the GPU Acceleration toggle to turn it on.
For details, see GPU Acceleration.
Connections
SentenceTransformersDocumentEmbedder receives a list of documents through its documents input, typically from a DocumentSplitter or other preprocessor. It outputs a documents list with the embedding field populated. Connect the output to DocumentWriter to write the embedded documents to a document store.
Source Code
To check this component's source code, open sentence_transformers_document_embedder.py in the Haystack repository.
Usage Examples
Basic Configuration
```yaml
document_embedder:
  type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
  init_parameters:
    model: sentence-transformers/all-mpnet-base-v2
    token:
      type: env_var
      env_vars:
        - HF_API_TOKEN
        - HF_TOKEN
      strict: false
    batch_size: 32
    progress_bar: true
    normalize_embeddings: false
    embedding_separator: "\n"
    trust_remote_code: false
```
Using the Component in a Pipeline
This index uses SentenceTransformersDocumentEmbedder to embed documents before writing them to a document store:
```yaml
components:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: sentence
      split_length: 5
      split_overlap: 1
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: sentence-transformers/all-mpnet-base-v2
      token:
        type: env_var
        env_vars:
          - HF_API_TOKEN
          - HF_TOKEN
        strict: false
      prefix: ""
      suffix: ""
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      meta_fields_to_embed:
      embedding_separator: "\n"
      trust_remote_code: false
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
            - ${OPENSEARCH_HOST}
          http_auth:
            - ${OPENSEARCH_USER}
            - ${OPENSEARCH_PASSWORD}
          use_ssl: true
          verify_certs: false
      policy: WRITE
connections:
  - sender: TextFileToDocument.documents
    receiver: DocumentSplitter.documents
  - sender: DocumentSplitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - TextFileToDocument.sources
```
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to embed. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents with embeddings populated. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | sentence-transformers/all-mpnet-base-v2 | The model to use for calculating embeddings. Pass a local path or the ID of the model on Hugging Face. |
| device | Optional[ComponentDevice] | None | The device to use for loading the model. Overrides the default device. |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The API token to download private models from Hugging Face. |
| prefix | str | "" | A string to add at the beginning of each document text. Can be used to prepend an instruction, as required by some embedding models such as E5 and bge. |
| suffix | str | "" | A string to add at the end of each document text. |
| batch_size | int | 32 | Number of documents to embed at once. |
| progress_bar | bool | True | If True, shows a progress bar when embedding documents. |
| normalize_embeddings | bool | False | If True, normalizes the embeddings using L2 normalization so that each embedding has a norm of 1. |
| meta_fields_to_embed | Optional[List[str]] | None | List of metadata fields to embed along with the document text. |
| embedding_separator | str | \n | Separator used to concatenate the metadata fields to the document text. |
| trust_remote_code | bool | False | If False, allows only Hugging Face verified model architectures. If True, allows custom models and scripts. |
| local_files_only | bool | False | If True, does not attempt to download the model from Hugging Face Hub and only looks at local files. |
| truncate_dim | Optional[int] | None | The dimension to truncate sentence embeddings to. None does no truncation. If the model wasn't trained with Matryoshka Representation Learning, truncating embeddings can significantly affect performance. |
| model_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for AutoModelForSequenceClassification.from_pretrained when loading the model. |
| tokenizer_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for AutoTokenizer.from_pretrained when loading the tokenizer. |
| config_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for AutoConfig.from_pretrained when loading the model configuration. |
| precision | Literal['float32', 'int8', 'uint8', 'binary', 'ubinary'] | float32 | The precision to use for the embeddings. All non-float32 precisions are quantized embeddings. Quantized embeddings are smaller and faster to compute, but may have lower accuracy. |
| encode_kwargs | Optional[Dict[str, Any]] | None | Additional keyword arguments for SentenceTransformer.encode when embedding documents. |
| backend | Literal['torch', 'onnx', 'openvino'] | torch | The backend to use for the Sentence Transformers model. Refer to the Sentence Transformers documentation for more information. |
| revision | Optional[str] | None | The specific model version to use. It can be a branch name, a tag name, or a commit ID for a stored model on Hugging Face. |
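The following sketch shows how prefix, suffix, meta_fields_to_embed, and embedding_separator could combine into the text that is passed to the model. It assumes that metadata values are placed before the document text and joined with the separator, which this page does not confirm; the build_text_to_embed helper is hypothetical and not part of the component's API.

```python
def build_text_to_embed(content, meta, meta_fields_to_embed=None,
                        embedding_separator="\n", prefix="", suffix=""):
    """Assemble the string to embed for one document.

    Assumption: selected metadata values come first, then the document text,
    all joined by `embedding_separator` and wrapped in `prefix` and `suffix`.
    """
    fields = meta_fields_to_embed or []
    # Skip metadata fields that are missing or set to None.
    meta_values = [str(meta[key]) for key in fields if meta.get(key) is not None]
    return prefix + embedding_separator.join(meta_values + [content or ""]) + suffix


text = build_text_to_embed(
    content="Embeddings map text to vectors.",
    meta={"title": "Intro", "page": 3},
    meta_fields_to_embed=["title"],
    prefix="passage: ",  # E5-style instruction prefix, used here as an example
)
print(repr(text))  # 'passage: Intro\nEmbeddings map text to vectors.'
```

Embedding a field such as the title alongside the text can help retrieval when chunks on their own lack context, for example after splitting a long document.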
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | Documents to embed. |