HuggingFaceAPIDocumentEmbedder

Embed documents using Hugging Face APIs.

Basic Information

Type: haystack.components.embedders.hugging_face_api_document_embedder.HuggingFaceAPIDocumentEmbedder
Components it can connect with:
- PreProcessors: HuggingFaceAPIDocumentEmbedder can receive the documents to embed from a PreProcessor, like DocumentSplitter.
- DocumentWriter: HuggingFaceAPIDocumentEmbedder can send the embedded documents to DocumentWriter that writes them into a document store.

Inputs

Parameter	Type	Default	Description
documents	List[Document]		Documents to embed.

Outputs

Parameter	Type	Default	Description
documents	List[Document]		A list of documents with embeddings.

Overview

Use HuggingFaceAPIDocumentEmbedder with the following Hugging Face APIs:

Free Serverless Inference API
Paid Inference Endpoints
Self-hosted Text Embeddings Inference#### With paid inference endpoints

Embedding Models in Query Pipelines and Indexes

The embedding model you use to embed documents in your indexing pipeline must be the same as the embedding model you use to embed the query in your query pipeline.

This means the embedders for your indexing and query pipelines must match. For example, if you use CohereDocumentEmbedder to embed your documents, you should use CohereTextEmbedder with the same model to embed your queries.

Usage Example

Initializing the Component

components:
  HuggingFaceAPIDocumentEmbedder:
    type: components.embedders.hugging_face_api_document_embedder.HuggingFaceAPIDocumentEmbedder
    init_parameters:

Using the Component in an Index

This is an example index for preprocessing multiple document types. The documents resulting from file conversion as sent to the HuggingFaceAPIDocumentEmbedder which embeds them and sends them to the DocumentWriter that writes them into an OpenSearch document store.

components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
      - text/plain
      - application/pdf
      - text/markdown
      - text/html
      - application/vnd.openxmlformats-officedocument.wordprocessingml.document
      - application/vnd.openxmlformats-officedocument.presentationml.presentation
      - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
      - text/csv

  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  pdf_converter:
    type: haystack.components.converters.pdfminer.PDFMinerToDocument
    init_parameters:
      line_overlap: 0.5
      char_margin: 2
      line_margin: 0.5
      word_margin: 0.1
      boxes_flow: 0.5
      detect_vertical: true
      all_texts: false
      store_full_path: false

  markdown_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
      # For the full list of available arguments, see
      # the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
      extraction_kwargs:
        output_format: markdown # Extract text from HTML. You can also also choose "txt"
        target_language:       # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
        include_tables: true  # If true, includes tables in the output
        include_links: true  # If true, keeps links along with their targets

  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters:
      link_format: markdown

  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}

  xlsx_converter:
    type: haystack.components.converters.XLSXToDocument
    init_parameters: {}

  csv_converter:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8

  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  joiner_xlsx:  # merge split documents with non-split xlsx documents
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

  HuggingFaceAPIDocumentEmbedder:
    type: haystack.components.embedders.hugging_face_api_document_embedder.HuggingFaceAPIDocumentEmbedder
    init_parameters:
      api_type: serverless_inference_api
      api_params:
        model: BAAI/bge-small-en-v1.5
      token:
        type: env_var
        env_vars:
        - HF_API_TOKEN
        - HF_TOKEN
        strict: false
      prefix: ''
      suffix: ''
      truncate: true
      normalize: false
      batch_size: 32
      progress_bar: true
      meta_fields_to_embed:
      embedding_separator: \n

connections:  # Defines how the components are connected
- sender: file_classifier.text/plain
  receiver: text_converter.sources
- sender: file_classifier.application/pdf
  receiver: pdf_converter.sources
- sender: file_classifier.text/markdown
  receiver: markdown_converter.sources
- sender: file_classifier.text/html
  receiver: html_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
  receiver: docx_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
  receiver: pptx_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  receiver: xlsx_converter.sources
- sender: file_classifier.text/csv
  receiver: csv_converter.sources
- sender: text_converter.documents
  receiver: joiner.documents
- sender: pdf_converter.documents
  receiver: joiner.documents
- sender: markdown_converter.documents
  receiver: joiner.documents
- sender: html_converter.documents
  receiver: joiner.documents
- sender: docx_converter.documents
  receiver: joiner.documents
- sender: pptx_converter.documents
  receiver: joiner.documents
- sender: joiner.documents
  receiver: splitter.documents
- sender: splitter.documents
  receiver: joiner_xlsx.documents
- sender: xlsx_converter.documents
  receiver: joiner_xlsx.documents
- sender: csv_converter.documents
  receiver: joiner_xlsx.documents
- sender: joiner_xlsx.documents
  receiver: HuggingFaceAPIDocumentEmbedder.documents
- sender: HuggingFaceAPIDocumentEmbedder.documents
  receiver: writer.documents

inputs:  # Define the inputs for your pipeline
  files:                            # This component will receive the files to index as input
  - file_classifier.sources

max_runs_per_component: 100

metadata: {}

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
api_type	Union[HFEmbeddingAPIType, str]		The type of Hugging Face API to use. Possible values: - `SERVERLESS_INFERENCE_API`: Hugging Face Serverless Inference API. It's a free tier. With this option, you must pass the model and api parameters. - `INFERENCE_ENDPOINTS`: Hugging Face Inference Endpoints. A paid tier that requires URL and api parameters. - `TEXT_EMBEDDINGS_INFERENCE`: Self-hosted text embeddings inference. Requires URL and api parameters.
api_params	Dict[str, str]		A dictionary with the following keys: - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`. - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`.
token	Optional[Secret]	Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False)	The Hugging Face token used to connect deepset AI Platform to your Hugging Face account. Check your HF token in your account settings.
prefix	str		A string to add at the beginning of each text.
suffix	str		A string to add at the end of each text.
truncate	Optional[bool]	True	Truncates the input text to the maximum length supported by the model. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
normalize	Optional[bool]	False	Normalizes the embeddings to unit length. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
batch_size	int	32	Number of documents to process at once.
progress_bar	bool	True	If `True`, shows a progress bar when running.
meta_fields_to_embed	Optional[List[str]]	None	List of metadata fields to embed along with the document text.
embedding_separator	str	\n	Separator used to concatenate the metadata fields to the document text.

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
documents	List[Document]		Documents to embed.

Was this page helpful?

Basic Information​

Inputs​

Outputs​

Overview​

Embedding Models in Query Pipelines and Indexes

Usage Example​

Initializing the Component​

Using the Component in an Index​

Parameters​

Init Parameters​

Run Method Parameters​