InstructorDocumentEmbedder

A component for computing Document embeddings using INSTRUCTOR embedding models.

Basic Information

Type: haystack_integrations.components.embedders.instructor_embedders.instructor_document_embedder.InstructorDocumentEmbedder

Inputs

Parameter	Type	Default	Description
documents	List[Document]		The Documents to embed.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		The input Documents, each with its embedding field populated.

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

A component for computing Document embeddings using INSTRUCTOR embedding models. The embedding of each Document is stored in the embedding field of the Document.

Usage example:

# To use this component, install the "instructor-embedders-haystack" package.
# pip install instructor-embedders-haystack

from haystack_integrations.components.embedders.instructor_embedders import InstructorDocumentEmbedder
from haystack.dataclasses import Document
from haystack.utils import ComponentDevice

doc_embedding_instruction = "Represent the Medical Document for retrieval:"
doc_embedder = InstructorDocumentEmbedder(
    model="hkunlp/instructor-base",
    instruction=doc_embedding_instruction,
    batch_size=32,
    device=ComponentDevice.from_str("cpu"),
)

doc_embedder.warm_up()

# Text taken from PubMed QA Dataset (https://huggingface.co/datasets/pubmed_qa)
document_list = [
    Document(
        content="Oxidative stress generated within inflammatory joints can produce autoimmune phenomena and joint destruction. Radical species with oxidative activity, including reactive nitrogen species, represent mediators of inflammation and cartilage damage.",
        meta={
            "pubid": "25,445,628",
            "long_answer": "yes",
        },
    ),
    Document(
        content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake. Although other pancreatic islet hormones, such as insulin and glucagon, have been extensively investigated, PP secretion and actions are still poorly understood.",
        meta={
            "pubid": "25,445,712",
            "long_answer": "yes",
        },
    ),
]

result = doc_embedder.run(document_list)
print(f"Document Text: {result['documents'][0].content}")
print(f"Document Embedding: {result['documents'][0].embedding}")
print(f"Embedding Dimension: {len(result['documents'][0].embedding)}")

Usage Example

components:
  InstructorDocumentEmbedder:
    type: haystack_integrations.components.embedders.instructor_embedders.instructor_document_embedder.InstructorDocumentEmbedder
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
model	str	hkunlp/instructor-base	Local path or name of the model in Hugging Face's model hub, such as `'hkunlp/instructor-base'`.
device	Optional[ComponentDevice]	None	The device on which the model is loaded. If `None`, the default device is automatically selected.
token	Optional[Secret]	Secret.from_env_var('HF_API_TOKEN', strict=False)	An API token used to download private models from Hugging Face. By default, it is read from the `HF_API_TOKEN` environment variable when that variable is set.
instruction	str	Represent the document	The instruction string used when computing domain-specific embeddings. It follows the unified template "Represent the 'domain' 'text_type' for 'task_objective'", where: "domain" (optional) specifies the domain of the text, such as science, finance, or medicine; "text_type" (required) specifies the encoding unit, such as sentence, document, or paragraph; "task_objective" (optional) specifies the objective of the embedding, such as retrieving a document or classifying a sentence. For example instructions, see the INSTRUCTOR model documentation.
batch_size	int	32	Number of strings to encode at once.
progress_bar	bool	True	If `True`, displays a progress bar during embedding.
normalize_embeddings	bool	False	If `True`, the returned vectors are normalized to unit length (an L2 norm of 1).
meta_fields_to_embed	Optional[List[str]]	None	List of meta fields that should be embedded along with the Document content.
embedding_separator	str	\n	Separator used to concatenate the meta fields to the Document content.
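
To make the interplay of `instruction`, `meta_fields_to_embed`, `embedding_separator`, and `normalize_embeddings` more concrete, here is a minimal, standard-library-only sketch. It is an illustration under stated assumptions, not the component's actual implementation: the helper names `build_instruction`, `prepare_text`, and `l2_normalize` are hypothetical, the trailing colon simply mirrors the instruction used in the usage example, and placing meta values before the content is an assumption.

```python
import math
from typing import Dict, List, Optional


def build_instruction(text_type: str, domain: str = "", task_objective: str = "") -> str:
    """Fill the "Represent the 'domain' 'text_type' for 'task_objective'" template.

    Hypothetical helper: only text_type is required, as in the parameter table.
    """
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append(f"for {task_objective}")
    # Trailing colon follows the style of the usage example (an assumption).
    return " ".join(parts) + ":"


def prepare_text(
    content: str,
    meta: Dict[str, object],
    meta_fields_to_embed: Optional[List[str]] = None,
    embedding_separator: str = "\n",
) -> str:
    """Join the selected meta values and the content with the separator.

    Assumption: meta values come first and missing or None fields are skipped.
    """
    fields = meta_fields_to_embed or []
    meta_values = [str(meta[key]) for key in fields if meta.get(key) is not None]
    return embedding_separator.join(meta_values + [content])


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length, as normalize_embeddings=True is described to do."""
    norm = math.sqrt(sum(x * x for x in vector))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [x / norm for x in vector] if norm else list(vector)


if __name__ == "__main__":
    instruction = build_instruction("Document", domain="Medical", task_objective="retrieval")
    text = prepare_text(
        content="Plasma levels of pancreatic polypeptide (PP) rise upon food intake.",
        meta={"pubid": "25,445,712", "long_answer": "yes"},
        meta_fields_to_embed=["pubid"],
    )
    print(instruction)
    print(text)
    print(l2_normalize([3.0, 4.0]))
```

With unit-length vectors, the dot product of two embeddings equals their cosine similarity, which is one reason to enable `normalize_embeddings` when the downstream store ranks by dot product.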

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
documents	List[Document]		The Documents to embed.
