
DeepsetDocumentMetadataPreProcessor

Process and transform document metadata by replacing metadata keys or converting metadata fields into document content.

Key Features

  • Replaces metadata keys with new names across all documents, for example, normalizing inconsistent keys from different sources.
  • Converts selected metadata fields into document content to improve full-text search.
  • Lets you specify which metadata fields to convert, or converts all fields if none are specified.
  • Adds a configurable prefix to each line of converted metadata.
  • Includes a debug mode for inspecting the component's behavior.

Configuration

  1. Drag the DeepsetDocumentMetadataPreProcessor component onto the canvas from the Component Library.
  2. Click on the component to open the configuration panel.
  3. Configure the component settings:
    • Set Replace Fields to define metadata key replacements. For example, presiding_officer: judge renames the presiding_officer key to judge.
    • Toggle Convert Meta to Content to convert metadata fields into document content.
    • Set Meta Fields to Convert to specify which metadata fields to convert. If left empty, all metadata fields are converted.
    • Set Line Prefix to add a prefix to each line of converted metadata (default: '- ', a hyphen followed by a space).
    • Toggle Debug to display debugging information.

Connections

DeepsetDocumentMetadataPreProcessor accepts a list of Document objects and outputs processed Document objects with updated metadata or content.

It works with any component that outputs documents and accepts documents as input, such as converters, rankers, or retrievers.

Usage Examples

Basic Configuration

DeepsetDocumentMetadataPreProcessor:
  type: deepset_cloud_custom_nodes.preprocessors.document_metadata_preprocessor.DeepsetDocumentMetadataPreProcessor
  init_parameters:
    replace_fields:
      - judge_name: judge
      - presiding_officer: judge
    convert_meta_to_content: false
    line_prefix: '- '
    debug: false

Using the Component in an Index

Replacing Metadata Keys

In this index, DeepsetDocumentMetadataPreProcessor normalizes inconsistent metadata keys into a single key. Such an index could hold legal documents from different sources, such as courts, law firms, or regulatory bodies, which often use inconsistent metadata keys. Some documents use judge_name and others use presiding_officer, while the index should use a single key called judge.

# haystack-pipeline
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - text/csv

  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  pdf_converter:
    type: haystack.components.converters.pdfminer.PDFMinerToDocument
    init_parameters:
      line_overlap: 0.5
      char_margin: 2
      line_margin: 0.5
      word_margin: 0.1
      boxes_flow: 0.5
      detect_vertical: true
      all_texts: false
      store_full_path: false

  markdown_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
      # For the full list of available arguments, see
      # the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
      extraction_kwargs:
        output_format: markdown # Extract text from HTML. You can also choose "txt"
        target_language: # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
        include_tables: true # If true, includes tables in the output
        include_links: true # If true, keeps links along with their targets

  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters:
      link_format: markdown

  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}

  xlsx_converter:
    type: haystack.components.converters.xlsx.XLSXToDocument
    init_parameters: {}

  csv_converter:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8

  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  joiner_xlsx: # merge split documents with non-split xlsx documents
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

  DeepsetDocumentMetadataPreProcessor:
    type: deepset_cloud_custom_nodes.preprocessors.document_metadata_preprocessor.DeepsetDocumentMetadataPreProcessor
    init_parameters:
      replace_fields:
        - judge_name: judge
        - presiding_officer: judge
      convert_meta_to_content: false
      meta_fields_to_convert:
      line_prefix: '- '
      debug: false

connections: # Defines how the components are connected
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.application/pdf
    receiver: pdf_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
    receiver: docx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: file_classifier.text/csv
    receiver: csv_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: pdf_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: docx_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: splitter.documents
    receiver: joiner_xlsx.documents
  - sender: xlsx_converter.documents
    receiver: joiner_xlsx.documents
  - sender: csv_converter.documents
    receiver: joiner_xlsx.documents
  - sender: joiner_xlsx.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
  - sender: joiner.documents
    receiver: DeepsetDocumentMetadataPreProcessor.documents
  - sender: DeepsetDocumentMetadataPreProcessor.documents
    receiver: splitter.documents

inputs: # Define the inputs for your pipeline
  files: # This component will receive the files to index as input
    - file_classifier.sources

max_runs_per_component: 100

metadata: {}
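
The effect of the replace_fields setting used in this index can be pictured with a short Python sketch. It is not the component's implementation: rename_meta_keys is a hypothetical helper that works on plain dictionaries instead of Haystack Document objects, the sample metadata is made up, and the choice to leave an already existing target key untouched is an assumption that the real component may handle differently.

```python
from typing import Dict, List


def rename_meta_keys(meta: Dict[str, object], replace_fields: Dict[str, str]) -> Dict[str, object]:
    """Return a copy of `meta` with old keys renamed to their new names."""
    renamed: Dict[str, object] = {}
    # First pass: keep every key that is not scheduled for renaming.
    for key, value in meta.items():
        if key not in replace_fields:
            renamed[key] = value
    # Second pass: move renamed keys without overwriting an existing target (assumption).
    for key, value in meta.items():
        if key in replace_fields:
            renamed.setdefault(replace_fields[key], value)
    return renamed


# Mapping of old key -> new key, mirroring the example configuration.
replace_fields = {"judge_name": "judge", "presiding_officer": "judge"}

# Made-up metadata from two sources that name the same field differently.
court_doc = {"judge_name": "A. Smith", "source": "court"}
firm_doc = {"presiding_officer": "B. Jones", "source": "law_firm"}

normalized: List[Dict[str, object]] = [
    rename_meta_keys(meta, replace_fields) for meta in (court_doc, firm_doc)
]
print(normalized)
```

After the rename, both dictionaries expose the name under judge, so a metadata filter on that single key would match documents from either source.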

Converting Metadata Into Content

This example converts the judge metadata field into document content, which can improve full-text search. The component adds the judge key and its value to the document content and keeps them in the metadata as well:

# haystack-pipeline
components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
        - text/plain
        - application/pdf
        - text/markdown
        - text/html
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - text/csv

  text_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  pdf_converter:
    type: haystack.components.converters.pdfminer.PDFMinerToDocument
    init_parameters:
      line_overlap: 0.5
      char_margin: 2
      line_margin: 0.5
      word_margin: 0.1
      boxes_flow: 0.5
      detect_vertical: true
      all_texts: false
      store_full_path: false

  markdown_converter:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8

  html_converter:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters:
      # A dictionary of keyword arguments to customize how you want to extract content from your HTML files.
      # For the full list of available arguments, see
      # the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
      extraction_kwargs:
        output_format: markdown # Extract text from HTML. You can also choose "txt"
        target_language: # You can define a language (using the ISO 639-1 format) to discard documents that don't match that language.
        include_tables: true # If true, includes tables in the output
        include_links: true # If true, keeps links along with their targets

  docx_converter:
    type: haystack.components.converters.docx.DOCXToDocument
    init_parameters:
      link_format: markdown

  pptx_converter:
    type: haystack.components.converters.pptx.PPTXToDocument
    init_parameters: {}

  xlsx_converter:
    type: haystack.components.converters.xlsx.XLSXToDocument
    init_parameters: {}

  csv_converter:
    type: haystack.components.converters.csv.CSVToDocument
    init_parameters:
      encoding: utf-8

  joiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  joiner_xlsx: # merge split documents with non-split xlsx documents
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate
      sort_by_score: false

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

  DeepsetDocumentMetadataPreProcessor:
    type: deepset_cloud_custom_nodes.preprocessors.document_metadata_preprocessor.DeepsetDocumentMetadataPreProcessor
    init_parameters:
      replace_fields:
      convert_meta_to_content: true
      meta_fields_to_convert:
        - judge
      line_prefix: '- '
      debug: false

connections: # Defines how the components are connected
  - sender: file_classifier.text/plain
    receiver: text_converter.sources
  - sender: file_classifier.application/pdf
    receiver: pdf_converter.sources
  - sender: file_classifier.text/markdown
    receiver: markdown_converter.sources
  - sender: file_classifier.text/html
    receiver: html_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.wordprocessingml.document
    receiver: docx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.presentationml.presentation
    receiver: pptx_converter.sources
  - sender: file_classifier.application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    receiver: xlsx_converter.sources
  - sender: file_classifier.text/csv
    receiver: csv_converter.sources
  - sender: text_converter.documents
    receiver: joiner.documents
  - sender: pdf_converter.documents
    receiver: joiner.documents
  - sender: markdown_converter.documents
    receiver: joiner.documents
  - sender: html_converter.documents
    receiver: joiner.documents
  - sender: docx_converter.documents
    receiver: joiner.documents
  - sender: pptx_converter.documents
    receiver: joiner.documents
  - sender: splitter.documents
    receiver: joiner_xlsx.documents
  - sender: xlsx_converter.documents
    receiver: joiner_xlsx.documents
  - sender: csv_converter.documents
    receiver: joiner_xlsx.documents
  - sender: joiner_xlsx.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents
  - sender: joiner.documents
    receiver: DeepsetDocumentMetadataPreProcessor.documents
  - sender: DeepsetDocumentMetadataPreProcessor.documents
    receiver: splitter.documents

inputs: # Define the inputs for your pipeline
  files: # This component will receive the files to index as input
    - file_classifier.sources

max_runs_per_component: 100

metadata: {}
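
The conversion of metadata into content can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not the component's code: meta_to_content is a hypothetical helper, the sample text and metadata are made up, and both the "key: value" line layout and the placement of the converted lines after the original text are assumptions. Only the field selection rule (no fields specified means all fields) and the '- ' default prefix come from the parameter descriptions.

```python
from typing import Dict, List, Optional


def meta_to_content(
    content: str,
    meta: Dict[str, object],
    meta_fields_to_convert: Optional[List[str]] = None,
    line_prefix: str = "- ",
) -> str:
    """Return `content` followed by one prefixed line per selected metadata field."""
    # None or an empty list selects every metadata field, mirroring the documented default.
    fields = meta_fields_to_convert or list(meta)
    # Skip requested fields that a document does not have.
    lines = [f"{line_prefix}{key}: {meta[key]}" for key in fields if key in meta]
    if not lines:
        return content
    # Assumption: the metadata lines are appended after the original text.
    return content + "\n" + "\n".join(lines)


# Made-up sample document.
meta = {"judge": "A. Smith", "court": "District Court"}
text = "The appeal is dismissed."

only_judge = meta_to_content(text, meta, meta_fields_to_convert=["judge"])
all_fields = meta_to_content(text, meta)
print(only_judge)
print(all_fields)
```

The helper never modifies meta, which matches the behavior described for this example: the converted fields stay available in the metadata for filtering while also becoming searchable as text.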

Parameters

Inputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | Optional[List[Document]] | List of Documents to process. |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | Processed documents. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| replace_fields | Optional[dict] | None | Dictionary with the metadata fields to replace. It must contain the names of the fields to replace and their new values. For example: presiding_officer: judge replaces metadata fields called "presiding_officer" with "judge". |
| convert_meta_to_content | Optional[bool] | False | Converts metadata to document content. |
| meta_fields_to_convert | Optional[List[str]] | None | List of metadata fields to convert to content. If None, all metadata fields are converted. |
| line_prefix | str | '- ' | Prefix to add to each line of the converted metadata. |
| debug | bool | False | Displays debugging information for the component. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Description |
| --- | --- | --- |
| documents | Optional[List[Document]] | List of Documents to process. |
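
The input and output shape described in these tables can be sketched in Python as follows. Doc and MetadataPreProcessorSketch are hypothetical stand-ins, not the real Haystack Document or the deepset component. The sketch assumes that a missing input is treated as an empty list, that the caller's documents are copied instead of changed in place, and that the output is a dictionary keyed by the output name documents. It performs no metadata processing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document with content and metadata."""

    content: str
    meta: Dict[str, object] = field(default_factory=dict)


class MetadataPreProcessorSketch:
    """Illustrative stand-in that mirrors only the documented input and output names."""

    def run(self, documents: Optional[List[Doc]] = None) -> Dict[str, List[Doc]]:
        # Treat a missing input as an empty list (assumption).
        docs = documents or []
        # Return copies so the caller's documents stay unchanged (assumption).
        processed = [Doc(content=d.content, meta=dict(d.meta)) for d in docs]
        return {"documents": processed}


sketch = MetadataPreProcessorSketch()
empty_result = sketch.run()
result = sketch.run(documents=[Doc("The appeal is dismissed.", {"judge": "A. Smith"})])
print(empty_result)
print(result)
```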