FastembedDocumentEmbedder

FastembedDocumentEmbedder computes Document embeddings using Fastembed embedding models.

Basic Information

  • Type: haystack_integrations.components.embedders.fastembed.fastembed_document_embedder.FastembedDocumentEmbedder

Inputs

Parameter | Type | Default | Description
documents | List[Document] | | List of Documents to embed.

Outputs

Parameter | Type | Default | Description
documents | List[Document] | | List of Documents, each with its embedding field set to the computed embedding.

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

The embedding of each Document is stored in that Document's embedding field.

Usage example:

# To use this component, install the "fastembed-haystack" package.
# pip install fastembed-haystack

from haystack_integrations.components.embedders.fastembed import FastembedDocumentEmbedder
from haystack.dataclasses import Document

doc_embedder = FastembedDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5",
    batch_size=256,
)

doc_embedder.warm_up()

# Text taken from PubMed QA Dataset (https://huggingface.co/datasets/pubmed_qa)
document_list = [
    Document(
        content=("Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint "
                 "destruction. Radical species with oxidative activity, including reactive nitrogen species, "
                 "represent mediators of inflammation and cartilage damage."),
        meta={
            "pubid": "25,445,628",
            "long_answer": "yes",
        },
    ),
    Document(
        content=("Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic "
                 "islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion "
                 "and actions are still poorly understood."),
        meta={
            "pubid": "25,445,712",
            "long_answer": "yes",
        },
    ),
]

result = doc_embedder.run(document_list)
print(f"Document Text: {result['documents'][0].content}")
print(f"Document Embedding: {result['documents'][0].embedding}")
print(f"Embedding Dimension: {len(result['documents'][0].embedding)}")

Usage Example

components:
  FastembedDocumentEmbedder:
    type: haystack_integrations.components.embedders.fastembed.fastembed_document_embedder.FastembedDocumentEmbedder
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter | Type | Default | Description
model | str | BAAI/bge-small-en-v1.5 | Local path or name of the model in Hugging Face's model hub, such as BAAI/bge-small-en-v1.5.
cache_dir | Optional[str] | None | The path to the cache directory. Can be set using the FASTEMBED_CACHE_PATH env variable. Defaults to fastembed_cache in the system's temp directory.
threads | Optional[int] | None | The number of threads a single onnxruntime session can use.
prefix | str | "" | A string to add to the beginning of each text.
suffix | str | "" | A string to add to the end of each text.
batch_size | int | 256 | Number of strings to encode at once.
progress_bar | bool | True | If True, displays a progress bar during embedding.
parallel | Optional[int] | None | If > 1, uses data-parallel encoding, which is recommended for offline encoding of large datasets. If 0, uses all available cores. If None, doesn't use data-parallel processing and relies on the default onnxruntime threading instead.
local_files_only | bool | False | If True, only uses the model files in the cache_dir.
meta_fields_to_embed | Optional[List[str]] | None | List of meta fields to embed along with the Document content.
embedding_separator | str | \n | Separator used to concatenate the meta fields to the Document content.
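
To make the interplay of prefix, suffix, meta_fields_to_embed, embedding_separator, and batch_size more concrete, here is a minimal sketch in plain Python. It does not call the Fastembed integration. The helpers build_text_to_embed and split_into_batches are hypothetical names introduced for illustration, and the ordering (meta values first, then content, with prefix and suffix wrapped around the joined string) is an assumption based on the parameter descriptions above, so check the integration's source for the exact behavior.

```python
from typing import Any, Dict, List, Optional


def build_text_to_embed(
    content: Optional[str],
    meta: Dict[str, Any],
    meta_fields_to_embed: Optional[List[str]] = None,
    embedding_separator: str = "\n",
    prefix: str = "",
    suffix: str = "",
) -> str:
    """Hypothetical helper: compose the string that would be sent to the model."""
    fields = meta_fields_to_embed or []
    # Keep only the requested meta fields that are present and not None.
    meta_values = [str(meta[key]) for key in fields if meta.get(key) is not None]
    # Assumed order: meta values first, then the content, joined by the separator.
    body = embedding_separator.join([*meta_values, content or ""])
    return prefix + body + suffix


def split_into_batches(texts: List[str], batch_size: int = 256) -> List[List[str]]:
    """Hypothetical helper: group texts into chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]


text = build_text_to_embed(
    content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake.",
    meta={"pubid": "25,445,712", "long_answer": "yes"},
    meta_fields_to_embed=["long_answer"],
    embedding_separator=" | ",
    prefix="passage: ",
)
print(text)
# passage: yes | Plasma levels of pancreatic polypeptide (PP) rise upon food intake.

batches = split_into_batches(["a", "b", "c"], batch_size=2)
print(batches)
# [['a', 'b'], ['c']]
```

With the default meta_fields_to_embed of None, the sketch embeds only the Document content, which matches the description of that parameter as optional extra context.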

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter | Type | Default | Description
documents | List[Document] | | List of Documents to embed.