SentenceTransformersSparseDocumentEmbedder

Calculates sparse embeddings for documents using Sentence Transformers models.

Basic Information

Type: haystack.components.embedders.sentence_transformers_sparse_document_embedder.SentenceTransformersSparseDocumentEmbedder
Components it can connect with:
- Any component that produces documents. It's usually used in indexes after Preprocessors, like DocumentSplitter.
- Any component that consumes documents, such as DocumentWriter.

Inputs

Parameter	Type	Default	Description
documents	List[Document]		Documents to embed.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Documents with sparse embeddings added in the `sparse_embedding` field.

Overview

SentenceTransformersSparseDocumentEmbedder calculates sparse embeddings for documents using sparse embedding models from Sentence Transformers. It stores each result in the document's `sparse_embedding` field. To embed metadata together with the document text, list the fields in `meta_fields_to_embed`.
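
As a rough illustration of what the component adds, here is a sketch of a document after embedding, written as YAML. It assumes the sparse embedding serializes to a pair of `indices` and `values` lists. The content, the `title` field, and all numbers are made-up placeholders, not output from a real model run.

```yaml
# Illustrative only: the text, indices, and values are placeholders.
content: "Sparse retrieval matches documents on weighted vocabulary terms."
meta:
  title: "Example"              # hypothetical metadata field
sparse_embedding:
  indices: [1012, 2054, 7592]   # positions of the non-zero dimensions
  values: [0.42, 1.37, 0.08]    # weight stored at each of those positions
```

Only the non-zero dimensions are stored, which is what keeps a sparse embedding compact compared with a dense vector.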

Use this component in indexes to embed input documents and send them to DocumentWriter to write into a Document Store.

Embedding Models in Query Pipelines and Indexes

The embedding model you use to embed documents in your indexing pipeline must be the same as the embedding model you use to embed the query in your query pipeline.

This means the embedders for your indexing and query pipelines must match. For example, if you use SentenceTransformersSparseDocumentEmbedder to embed your documents, use SentenceTransformersSparseTextEmbedder with the same model to embed your queries.
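
A minimal sketch of a matching pair, assuming the query-side component is SentenceTransformersSparseTextEmbedder and that its type path mirrors the document embedder's path. Treat the query-side `type` and the component name `SparseTextEmbedder` as assumptions and check them against your Haystack version.

```yaml
# Index: embeds documents
SparseDocumentEmbedder:
  type: haystack.components.embedders.sentence_transformers_sparse_document_embedder.SentenceTransformersSparseDocumentEmbedder
  init_parameters:
    model: prithivida/Splade_PP_en_v2

# Query pipeline: embeds the query (type path assumed by analogy)
SparseTextEmbedder:
  type: haystack.components.embedders.sentence_transformers_sparse_text_embedder.SentenceTransformersSparseTextEmbedder
  init_parameters:
    model: prithivida/Splade_PP_en_v2   # same model as in the index
```

The two fragments belong to separate pipeline files. The point is that the `model` value is identical in both.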

Usage Example

This index uses SentenceTransformersSparseDocumentEmbedder to create sparse embeddings for documents:

components:
  FileTypeRouter:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
      - text/plain
      - application/pdf
      - text/markdown

  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false

  PDFMinerToDocument:
    type: haystack.components.converters.pdfminer.PDFMinerToDocument
    init_parameters:
      store_full_path: false

  MarkdownToDocument:
    type: haystack.components.converters.markdown.MarkdownToDocument
    init_parameters:
      store_full_path: false

  DocumentJoiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  DocumentSplitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  SparseDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_sparse_document_embedder.SentenceTransformersSparseDocumentEmbedder
    init_parameters:
      model: prithivida/Splade_PP_en_v2
      batch_size: 32
      progress_bar: true

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

connections:
- sender: FileTypeRouter.text/plain
  receiver: TextFileToDocument.sources
- sender: FileTypeRouter.application/pdf
  receiver: PDFMinerToDocument.sources
- sender: FileTypeRouter.text/markdown
  receiver: MarkdownToDocument.sources
- sender: TextFileToDocument.documents
  receiver: DocumentJoiner.documents
- sender: PDFMinerToDocument.documents
  receiver: DocumentJoiner.documents
- sender: MarkdownToDocument.documents
  receiver: DocumentJoiner.documents
- sender: DocumentJoiner.documents
  receiver: DocumentSplitter.documents
- sender: DocumentSplitter.documents
  receiver: SparseDocumentEmbedder.documents
- sender: SparseDocumentEmbedder.documents
  receiver: DocumentWriter.documents

max_runs_per_component: 100

metadata: {}

inputs:
  files:
  - FileTypeRouter.sources

Parameters

Init Parameters

These are the parameters you can configure in Builder:

Parameter	Type	Default	Description
model	str	prithivida/Splade_PP_en_v2	The model to use for calculating sparse embeddings. Pass a local path or ID of the model on Hugging Face.
device	Optional[ComponentDevice]	None	The device to use for loading the model. Overrides the default device.
token	Optional[Secret]		The API token to download private models from Hugging Face.
prefix	str	""	A string to add at the beginning of each document text.
suffix	str	""	A string to add at the end of each document text.
batch_size	int	32	Number of documents to embed at once.
progress_bar	bool	True	If True, shows a progress bar when embedding documents.
meta_fields_to_embed	Optional[List[str]]	None	List of metadata fields to embed along with the document text.
embedding_separator	str	"\n"	Separator used to concatenate the metadata fields to the document text.
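trust_remote_code	bool	False	If True, allows custom models and scripts from the model repository to run. If False, only model architectures verified by Hugging Face are allowed.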
local_files_only	bool	False	If True, only looks at local files without downloading from Hugging Face Hub.
model_kwargs	Optional[Dict[str, Any]]	None	Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained` when loading the model. Refer to specific model documentation for available kwargs.
tokenizer_kwargs	Optional[Dict[str, Any]]	None	Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer. Refer to specific model documentation for available kwargs.
config_kwargs	Optional[Dict[str, Any]]	None	Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
backend	Literal["torch", "onnx", "openvino"]	torch	The backend to use for the Sentence Transformers model. Choose from `torch`, `onnx`, or `openvino`. Refer to the Sentence Transformers documentation for more information on acceleration and quantization options.
revision	Optional[str]	None	The specific model version to use. It can be a branch name, a tag name, or a commit ID for a stored model on Hugging Face.
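
To show how several of these parameters fit together, here is a hedged configuration sketch that embeds a metadata field along with the document text. The `title` field is hypothetical, so replace it with a field your documents actually carry. The other values are arbitrary choices for illustration, not recommendations.

```yaml
SparseDocumentEmbedder:
  type: haystack.components.embedders.sentence_transformers_sparse_document_embedder.SentenceTransformersSparseDocumentEmbedder
  init_parameters:
    model: prithivida/Splade_PP_en_v2
    batch_size: 16               # smaller batches lower peak memory use
    progress_bar: false
    meta_fields_to_embed:
    - title                      # hypothetical metadata field
    embedding_separator: "\n"    # double quotes so YAML reads this as a newline
    backend: torch
    local_files_only: false
    trust_remote_code: false
```

With this configuration, the value of `title` is joined to the document text using a newline before the model encodes it. Parameters left out of `init_parameters` keep the defaults listed in the table above.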

Run Method Parameters

These are the parameters you can configure for the component's run() method.

Parameter	Type	Default	Description
documents	List[Document]		Documents to embed.
