HuggingFaceAPIDocumentEmbedder
Embed documents using Hugging Face APIs.
Key Features
- Embeds documents using Hugging Face APIs: free Serverless Inference API, paid Inference Endpoints, and self-hosted Text Embeddings Inference (TEI)
- Batch processing for efficient embedding of large document sets
- Optional metadata field embedding alongside document text
- Stores embeddings in each document's embedding field
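To make the batch processing feature concrete, here is a minimal sketch of how a list of texts can be split into batches before each API call. The helper name `iter_batches` is hypothetical and is not part of Haystack. It only illustrates the idea behind the `batch_size` setting (32 by default).

```python
from typing import Iterator, List


def iter_batches(items: List[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 70 texts with a batch size of 32 give two full batches and one partial batch.
texts = [f"document {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(texts, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

A larger batch size means fewer requests, but each request carries more text, so the best value likely depends on the limits of the API you call.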
Configuration
- Drag the HuggingFaceAPIDocumentEmbedder component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- On the General tab:
  - Select the API type: SERVERLESS_INFERENCE_API, INFERENCE_ENDPOINTS, or TEXT_EMBEDDINGS_INFERENCE.
  - Enter the API parameters: for SERVERLESS_INFERENCE_API, enter the model ID; for INFERENCE_ENDPOINTS or TEXT_EMBEDDINGS_INFERENCE, enter the endpoint URL.
  - Enter your Hugging Face API token. For details, see Use Hugging Face Models.
- Go to the Advanced tab to configure prefix, suffix, truncation, normalization, batch size, and metadata fields.
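The relationship between the API type and the API parameters it needs can be sketched as a small lookup. This is an illustrative check only: `check_api_params` and `REQUIRED_KEY` are hypothetical names, and the URL in the usage lines is a placeholder, not a real endpoint.

```python
from typing import Dict

# Required api_params key for each API type, following the configuration steps.
REQUIRED_KEY = {
    "serverless_inference_api": "model",
    "inference_endpoints": "url",
    "text_embeddings_inference": "url",
}


def check_api_params(api_type: str, api_params: Dict[str, str]) -> str:
    """Hypothetical helper: return the required key, or raise ValueError."""
    key = REQUIRED_KEY.get(api_type.lower())
    if key is None:
        raise ValueError(f"Unknown api_type: {api_type!r}")
    if not api_params.get(key):
        raise ValueError(f"api_type {api_type!r} requires api_params[{key!r}]")
    return key


serverless_key = check_api_params(
    "SERVERLESS_INFERENCE_API", {"model": "BAAI/bge-small-en-v1.5"}
)
# Placeholder URL for a self-hosted Text Embeddings Inference server.
tei_key = check_api_params(
    "text_embeddings_inference", {"url": "http://localhost:8080"}
)
print(serverless_key, tei_key)  # model url
```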
Connections
HuggingFaceAPIDocumentEmbedder accepts a list of Document objects through its documents input. It outputs a list of Document objects with embeddings stored in the embedding field.
Use this component in an indexing pipeline. Connect preprocessors like DocumentSplitter to its documents input, and connect its documents output to DocumentWriter.
Embedding Models in Query Pipelines and Indexes
The embedding model you use to embed documents in your indexing pipeline must be the same as the model you use to embed the query in your query pipeline, so the embedders in the two pipelines must match.
For example, if you use HuggingFaceAPIDocumentEmbedder to embed your documents, use HuggingFaceAPITextEmbedder with the same model to embed your queries.
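One way to catch a mismatch early is to compare the two embedder configurations before deploying. The sketch below assumes the configurations are available as Python dictionaries that mirror the YAML structure used on this page. The function `same_embedding_model` is a hypothetical helper, not a Haystack or platform API.

```python
from typing import Any, Dict


def same_embedding_model(index_embedder: Dict[str, Any],
                         query_embedder: Dict[str, Any]) -> bool:
    """Hypothetical check: both embedders use the same api_type and api_params."""
    index_params = index_embedder.get("init_parameters", {})
    query_params = query_embedder.get("init_parameters", {})
    return (
        index_params.get("api_type") == query_params.get("api_type")
        and index_params.get("api_params") == query_params.get("api_params")
    )


document_embedder = {
    "type": "haystack.components.embedders.hugging_face_api_document_embedder.HuggingFaceAPIDocumentEmbedder",
    "init_parameters": {
        "api_type": "serverless_inference_api",
        "api_params": {"model": "BAAI/bge-small-en-v1.5"},
    },
}
text_embedder = {
    "type": "haystack.components.embedders.hugging_face_api_text_embedder.HuggingFaceAPITextEmbedder",
    "init_parameters": {
        "api_type": "serverless_inference_api",
        "api_params": {"model": "BAAI/bge-small-en-v1.5"},
    },
}
models_match = same_embedding_model(document_embedder, text_embedder)
print(models_match)  # True
```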
Source Code
To check this component's source code, open hugging_face_api_document_embedder.py in the Haystack repository.
Usage Examples
Basic Configuration
HuggingFaceAPIDocumentEmbedder:
  type: haystack.components.embedders.hugging_face_api_document_embedder.HuggingFaceAPIDocumentEmbedder
  init_parameters:
    api_type: serverless_inference_api
    api_params:
      model: BAAI/bge-small-en-v1.5
    token:
      type: env_var
      env_vars:
      - HF_API_TOKEN
      - HF_TOKEN
      strict: false
    prefix: ''
    suffix: ''
    truncate: true
    normalize: false
    batch_size: 32
    progress_bar: true
    embedding_separator: \n
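The `normalize` setting in this configuration is off by default. As a rough illustration of what normalizing to unit length means, the sketch below scales a vector so that its Euclidean norm is 1. It shows only the arithmetic, under the assumption of L2 normalization. In practice the backend performs this step, and the function name is hypothetical.

```python
import math
from typing import List


def normalize_to_unit_length(vector: List[float]) -> List[float]:
    """Hypothetical helper: scale a vector so its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction, so return it unchanged.
        return list(vector)
    return [x / norm for x in vector]


unit = normalize_to_unit_length([3.0, 4.0])
unit_norm = math.sqrt(sum(x * x for x in unit))
print(unit)
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity, which is why normalization is often paired with dot-product search.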
Using the Component in an Index
This is an example index for preprocessing multiple document types. The documents resulting from file conversion are sent to HuggingFaceAPIDocumentEmbedder, which embeds them and passes them to DocumentWriter. DocumentWriter then writes them into an OpenSearch document store.
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
      - text/plain
      - application/pdf
      - text/markdown
      - text/html
      - application/vnd.openxmlformats-officedocument.wordprocessingml.document
      - application/vnd.openxmlformats-officedocument.presentationml.presentation
      - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
      - text/csv
  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  pdf_converter:
    type: haystack.components.converters.pdfminer.PDFMinerToDocument
    init_parameters:
      line_overlap: 0.5
      char_margin: 2
      line_margin: 0.5
      word_margin: 0.1
      boxes_flow: 0.5
      detect_vertical: true
      all_texts: false
      store_full_path: false
  markdown_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
      # For the full list of available arguments, see
      # the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
      extraction_kwargs:
        output_format: markdown # Extract text from HTML. You can also choose "txt"
        target_language: # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
        include_tables: true # If true, includes tables in the output
        include_links: true # If true, keeps links along with their targets
  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters:
      link_format: markdown
  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}
  xlsx_converter:
    type: haystack.components.converters.XLSXToDocument
    init_parameters: {}
  csv_converter:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8
  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false
  joiner_xlsx: # merge split documents with non-split xlsx documents
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 384 # Must match the model's output size; BAAI/bge-small-en-v1.5 produces 384-dimensional vectors
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE
  HuggingFaceAPIDocumentEmbedder:
    type: haystack.components.embedders.hugging_face_api_document_embedder.HuggingFaceAPIDocumentEmbedder
    init_parameters:
      api_type: serverless_inference_api
      api_params:
        model: BAAI/bge-small-en-v1.5
      token:
        type: env_var
        env_vars:
        - HF_API_TOKEN
        - HF_TOKEN
        strict: false
      prefix: ''
      suffix: ''
      truncate: true
      normalize: false
      batch_size: 32
      progress_bar: true
      meta_fields_to_embed:
      embedding_separator: \n
connections: # Defines how the components are connected
- sender: file_classifier.text/plain
  receiver: text_converter.sources
- sender: file_classifier.application/pdf
  receiver: pdf_converter.sources
- sender: file_classifier.text/markdown
  receiver: markdown_converter.sources
- sender: file_classifier.text/html
  receiver: html_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
  receiver: docx_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
  receiver: pptx_converter.sources
- sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  receiver: xlsx_converter.sources
- sender: file_classifier.text/csv
  receiver: csv_converter.sources
- sender: text_converter.documents
  receiver: joiner.documents
- sender: pdf_converter.documents
  receiver: joiner.documents
- sender: markdown_converter.documents
  receiver: joiner.documents
- sender: html_converter.documents
  receiver: joiner.documents
- sender: docx_converter.documents
  receiver: joiner.documents
- sender: pptx_converter.documents
  receiver: joiner.documents
- sender: joiner.documents
  receiver: splitter.documents
- sender: splitter.documents
  receiver: joiner_xlsx.documents
- sender: xlsx_converter.documents
  receiver: joiner_xlsx.documents
- sender: csv_converter.documents
  receiver: joiner_xlsx.documents
- sender: joiner_xlsx.documents
  receiver: HuggingFaceAPIDocumentEmbedder.documents
- sender: HuggingFaceAPIDocumentEmbedder.documents
  receiver: writer.documents
inputs: # Define the inputs for your pipeline
  files: # This component will receive the files to index as input
  - file_classifier.sources
max_runs_per_component: 100
metadata: {}
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Documents to embed. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of documents with embeddings. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_type | Union[HFEmbeddingAPIType, str] | | The type of Hugging Face API to use. Possible values: SERVERLESS_INFERENCE_API (free tier, requires model in api_params), INFERENCE_ENDPOINTS (paid tier, requires url in api_params), TEXT_EMBEDDINGS_INFERENCE (self-hosted, requires url in api_params). |
| api_params | Dict[str, str] | | A dictionary with: model (Hugging Face model ID, required for SERVERLESS_INFERENCE_API), url (URL of the inference endpoint, required for INFERENCE_ENDPOINTS or TEXT_EMBEDDINGS_INFERENCE). |
| token | Optional[Secret] | Secret.from_env_var(['HF_API_TOKEN', 'HF_TOKEN'], strict=False) | The Hugging Face token used to connect Haystack Enterprise Platform to your Hugging Face account. Check your HF token in your account settings. |
| prefix | str | '' | A string to add at the beginning of each text. |
| suffix | str | '' | A string to add at the end of each text. |
| truncate | Optional[bool] | True | Truncates the input text to the maximum length supported by the model. Applicable when api_type is TEXT_EMBEDDINGS_INFERENCE or INFERENCE_ENDPOINTS if the backend uses Text Embeddings Inference. If api_type is SERVERLESS_INFERENCE_API, this parameter is ignored. |
| normalize | Optional[bool] | False | Normalizes the embeddings to unit length. Applicable when api_type is TEXT_EMBEDDINGS_INFERENCE or INFERENCE_ENDPOINTS if the backend uses Text Embeddings Inference. If api_type is SERVERLESS_INFERENCE_API, this parameter is ignored. |
| batch_size | int | 32 | Number of documents to process at once. |
| progress_bar | bool | True | If True, shows a progress bar when running. |
| meta_fields_to_embed | Optional[List[str]] | None | List of metadata fields to embed along with the document text. |
| embedding_separator | str | \n | Separator used to concatenate the metadata fields to the document text. |
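The way prefix, suffix, meta_fields_to_embed, and embedding_separator interact is easier to see in a small sketch. The function below is a hypothetical illustration that works on plain strings and dictionaries rather than Haystack `Document` objects. It assumes that metadata values come before the document text and that missing or empty metadata fields are skipped.

```python
from typing import Any, Dict, List, Optional


def build_text_to_embed(
    content: str,
    meta: Dict[str, Any],
    meta_fields_to_embed: Optional[List[str]] = None,
    embedding_separator: str = "\n",
    prefix: str = "",
    suffix: str = "",
) -> str:
    """Hypothetical sketch: join selected metadata values and the content."""
    fields = meta_fields_to_embed or []
    # Assumption: fields that are missing or None are skipped.
    meta_values = [str(meta[name]) for name in fields if meta.get(name) is not None]
    joined = embedding_separator.join(meta_values + [content])
    return prefix + joined + suffix


text = build_text_to_embed(
    content="Embeddings map text to vectors.",
    meta={"title": "Intro", "page": 3},
    meta_fields_to_embed=["title"],
    prefix="passage: ",
)
print(repr(text))  # 'passage: Intro\nEmbeddings map text to vectors.'
```

Embedding a field such as a title alongside the text can help retrieval when the chunk itself does not repeat the topic it belongs to.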
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Documents to embed. |