HTMLToDocument

Convert HTML files to documents your pipeline can query.

Basic Information

Type: haystack.components.converters.html.HTMLToDocument
Components it can connect with:
- FileTypeRouter: HTMLToDocument can receive HTML files from FileTypeRouter.
- DocumentJoiner: HTMLToDocument can send converted documents to DocumentJoiner. This is useful if you have other converters in your pipeline and want to join their output with HTMLToDocument's output before sending it further down the pipeline.

Inputs

Parameter	Type	Default	Description
sources	List[Union[str, Path, ByteStream]]		List of HTML file paths or ByteStream objects to convert.
meta	Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]	None	Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists are zipped. If `sources` contains ByteStream objects, their `meta` is added to the output Documents.
extraction_kwargs	Optional[Dict[str, Any]]	None	Additional keyword arguments to customize the extraction process.
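
The `meta` rules in the table above can be pictured with a small plain-Python sketch. The helper name `normalize_meta` is hypothetical and is not Haystack's own code. It only mirrors the described behavior: a single dictionary is applied to every source, and a list must contain exactly one entry per source so the two can be zipped. Metadata carried by ByteStream objects is not covered here.

```python
from typing import Any, Dict, List, Optional, Union


def normalize_meta(
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]],
    sources_count: int,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return one metadata dict per source."""
    if meta is None:
        # No metadata supplied: every source gets an empty dict
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dict is copied so each document gets its own instance
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the meta list must match the number of sources.")
    return [dict(m) for m in meta]


sources = ["index.html", "about.html"]  # illustrative file names

# One dict shared by all sources
shared = normalize_meta({"project": "docs"}, len(sources))

# One dict per source, paired with the sources by position
per_source = normalize_meta([{"lang": "en"}, {"lang": "de"}], len(sources))
paired = list(zip(sources, per_source))
```

Passing a list whose length differs from the number of sources raises an error in this sketch, which is the situation the length requirement is meant to prevent.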

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Converted documents

Overview

Use HTMLToDocument to convert HTML files to documents your pipeline can query. This component uses the Trafilatura library to extract text from HTML files. You can pass keyword arguments to customize the extraction process.

You can use HTMLToDocument in indexes, or in query pipelines after LinkContentFetcher, to convert content from websites.

Usage Example

Initializing the Component

components:
  HTMLToDocument:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:

Using the Component in an Index

In this index, HTMLToDocument receives HTML files from FileTypeRouter and sends the converted documents to DocumentJoiner.

components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
      - text/markdown
      - text/html
      - text/csv

  markdown_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
      # For the full list of available arguments, see the Trafilatura documentation:
      # https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract
      extraction_kwargs:
        output_format: markdown  # Format of the extracted text. You can also choose "txt".
        target_language:       # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
        include_tables: true  # If true, includes tables in the output
        include_links: true  # If true, keeps links along with their targets


  csv_converter:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8

  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

connections:  # Defines how the components are connected
- sender: file_classifier.text/markdown
  receiver: markdown_converter.sources
- sender: file_classifier.text/html
  receiver: html_converter.sources
- sender: file_classifier.text/csv
  receiver: csv_converter.sources
- sender: markdown_converter.documents
  receiver: joiner.documents
- sender: html_converter.documents
  receiver: joiner.documents
- sender: joiner.documents
  receiver: splitter.documents
- sender: document_embedder.documents
  receiver: writer.documents
- sender: csv_converter.documents
  receiver: joiner.documents
- sender: splitter.documents
  receiver: document_embedder.documents

inputs:  # Define the inputs for your pipeline
  files:                            # This component will receive the files to index as input
  - file_classifier.sources

max_runs_per_component: 100

metadata: {}

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
extraction_kwargs	Optional[Dict[str, Any]]	None	A dictionary containing keyword arguments to customize the extraction process. These are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see the Trafilatura documentation.
store_full_path	bool	False	If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
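
As a rough illustration of `store_full_path`, the sketch below contrasts keeping the full path with keeping only the file name. The helper `file_path_meta` and the `file_path` metadata key are assumptions made for this example and are not stated on this page.

```python
from pathlib import PurePosixPath
from typing import Dict


def file_path_meta(path: str, store_full_path: bool) -> Dict[str, str]:
    """Hypothetical helper: show which path value would be stored in metadata."""
    p = PurePosixPath(path)
    # Full path when store_full_path is True, otherwise only the file name
    return {"file_path": str(p) if store_full_path else p.name}


name_only = file_path_meta("/data/site/index.html", store_full_path=False)
full_path = file_path_meta("/data/site/index.html", store_full_path=True)
```

Storing only the file name keeps directory layouts out of your document metadata, while the full path helps when several sources share the same file name.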

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
sources	List[Union[str, Path, ByteStream]]		List of HTML file paths or ByteStream objects.
meta	Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]	None	Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If `sources` contains ByteStream objects, their `meta` will be added to the output Documents.
extraction_kwargs	Optional[Dict[str, Any]]	None	Additional keyword arguments to customize the extraction process.
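
Because `extraction_kwargs` can be set both at init time and at run time, it helps to picture how the two might combine. The sketch below assumes that run-time values override init-time values key by key. This page doesn't state the precedence, so treat it as an assumption and check it against your Haystack version. The function name `merge_extraction_kwargs` is hypothetical.

```python
from typing import Any, Dict, Optional


def merge_extraction_kwargs(
    init_kwargs: Optional[Dict[str, Any]],
    run_kwargs: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: combine init-time and run-time extraction kwargs."""
    merged = dict(init_kwargs or {})  # copy so the init-time dict stays untouched
    merged.update(run_kwargs or {})   # assumed precedence: run-time values win
    return merged


init_kwargs = {"output_format": "markdown", "include_tables": True}
run_kwargs = {"output_format": "txt", "include_links": True}
merged = merge_extraction_kwargs(init_kwargs, run_kwargs)
```

Under this assumption, a single request can switch the output format without changing the pipeline's saved configuration.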
