LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM).

Basic Information

Type: haystack_integrations.extractors.llm_metadata_extractor.LLMMetadataExtractor

Inputs

Parameter	Type	Default	Description
documents	List[Document]		List of documents to extract metadata from.
page_range	Optional[List[Union[str, int]]]	None	A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list.
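To make the page_range notation concrete, here is a minimal sketch of how a list of page entries could be expanded into individual page numbers. The helper name expand_pages is hypothetical and is not the component's own code; the real implementation may validate or order its input differently.

```python
from typing import List, Union


def expand_pages(page_range: List[Union[str, int]]) -> List[int]:
    """Expand entries such as 3, '5', or '10-12' into a sorted list of page numbers."""
    pages = set()
    for entry in page_range:
        text = str(entry).strip()
        if "-" in text:
            # A range string such as '10-12' is inclusive on both ends.
            start, end = (int(part) for part in text.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(text))
    return sorted(pages)


print(expand_pages(["1-3", "5", "8", "10-12"]))  # [1, 2, 3, 5, 8, 10, 11, 12]
print(expand_pages(["1", "3"]))                  # [1, 3]
```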

Outputs

Parameter	Type	Default	Description
documents	List[Document]		Documents that were successfully updated with the extracted metadata.
failed_documents	List[Document]		Documents for which metadata extraction failed. Their metadata contains the keys "metadata_extraction_error" and "metadata_extraction_response". These documents can be re-run with the extractor to extract metadata.

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

The metadata is extracted by providing a prompt to an LLM that generates the metadata.

The component automatically calls warm_up() at runtime if it hasn't been warmed up yet, ensuring it's ready for use without requiring an explicit warm-up call.

The component takes a list of documents and a prompt as input. The prompt must contain a variable called document, which refers to a single document from the list. For example, to access the content of the document, use {{ document.content }} in the prompt.
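The sketch below illustrates how the document variable is filled in once per document. It is a simplified stand-in that only handles {{ document.<attribute> }} placeholders with a regular expression; it does not reproduce the template engine the component actually uses. The Doc class and render_prompt function are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: str


def render_prompt(prompt: str, document: Doc) -> str:
    """Replace each {{ document.<attribute> }} placeholder with that attribute's value."""
    pattern = re.compile(r"\{\{\s*document\.(\w+)\s*\}\}")
    return pattern.sub(lambda m: str(getattr(document, m.group(1))), prompt)


PROMPT = "Extract all entities as JSON.\ntext: {{ document.content }}\noutput:"

# One rendered prompt per document, each containing only that document's content.
for doc in [Doc("deepset was founded in Berlin"), Doc("Hugging Face was founded in New York")]:
    print(render_prompt(PROMPT, doc))
```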

The component will run the LLM on each document in the list and extract metadata from the document. The metadata will be added to the document's metadata field. If the LLM fails to extract metadata from a document, the document will be added to the failed_documents list. The failed documents will have the keys metadata_extraction_error and metadata_extraction_response in their metadata. These documents can be re-run with another extractor to extract metadata by using the metadata_extraction_response and metadata_extraction_error in the prompt.
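The split between documents and failed_documents can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation: Doc, apply_reply, and split_results are hypothetical names, and treating a missing expected key as a failure is an assumption (the library may only log a warning in that case).

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def apply_reply(doc: Doc, reply: str, expected_keys: Optional[List[str]] = None) -> bool:
    """Merge the parsed LLM reply into doc.meta; on failure, record the error keys instead."""
    try:
        parsed = json.loads(reply)
        if not isinstance(parsed, dict):
            raise ValueError("reply is not a JSON object")
        # Assumption: a missing expected key counts as a failure.
        missing = [key for key in (expected_keys or []) if key not in parsed]
        if missing:
            raise ValueError(f"missing expected keys: {missing}")
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        doc.meta["metadata_extraction_error"] = str(err)
        doc.meta["metadata_extraction_response"] = reply
        return False
    doc.meta.update(parsed)
    return True


def split_results(
    pairs: List[Tuple[Doc, str]], expected_keys: Optional[List[str]] = None
) -> Tuple[List[Doc], List[Doc]]:
    """Return (documents, failed_documents) for a list of (document, LLM reply) pairs."""
    documents: List[Doc] = []
    failed_documents: List[Doc] = []
    for doc, reply in pairs:
        target = documents if apply_reply(doc, reply, expected_keys) else failed_documents
        target.append(doc)
    return documents, failed_documents


ok, failed = split_results(
    [(Doc("first"), '{"entities": []}'), (Doc("second"), "not json")],
    expected_keys=["entities"],
)
print(len(ok), len(failed))    # 1 1
print(sorted(failed[0].meta))  # ['metadata_extraction_error', 'metadata_extraction_response']
```

Keeping the raw response next to the error message is what makes a second pass possible: a follow-up prompt can show the model its earlier reply and the reason it was rejected.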

from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

2. Return output in a single list with all the entities identified in step 1.

-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
---
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library")
]

chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
    meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
          {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers library',
    meta: {'entities': [
            {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
            {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
            ]})
       ],
    'failed_documents': []
   }

Usage Example

components:
  LLMMetadataExtractor:
    type: components.extractors.llm_metadata_extractor.LLMMetadataExtractor
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
prompt	str		The prompt to be used for the LLM.
chat_generator	ChatGenerator		A ChatGenerator instance that represents the LLM. For the component to work, the LLM must be configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass `{"response_format": {"type": "json_object"}}` in the `generation_kwargs`.
expected_keys	Optional[List[str]]	None	The keys expected in the JSON output from the LLM.
page_range	Optional[List[Union[str, int]]]	None	A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list. This parameter is optional and can be overridden in the `run` method.
raise_on_failure	bool	False	Whether to raise an error on failure during the execution of the Generator or validation of the JSON output.
max_workers	int	3	The maximum number of workers to use in the thread pool executor.
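As an illustration of what max_workers controls, the sketch below sends one call per prompt through a thread pool capped at three workers. The names run_concurrently and fake_llm are hypothetical; fake_llm stands in for a real chat generator call, and the component's internal scheduling may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_concurrently(
    prompts: List[str], call_llm: Callable[[str], str], max_workers: int = 3
) -> List[str]:
    """Call call_llm once per prompt with at most max_workers threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(call_llm, prompts))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat generator call."""
    return '{"length": %d}' % len(prompt)


print(run_concurrently(["a", "bb", "ccc", "dddd"], fake_llm))
# ['{"length": 1}', '{"length": 2}', '{"length": 3}', '{"length": 4}']
```

A higher max_workers value shortens wall-clock time for large document lists, but it also raises the number of simultaneous requests sent to the LLM provider, so rate limits are worth keeping in mind.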

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
documents	List[Document]		List of documents to extract metadata from.
page_range	Optional[List[Union[str, int]]]	None	A range of pages to extract metadata from. For example, page_range=['1', '3'] will extract metadata from the first and third pages of each document. It also accepts printable range strings, e.g.: ['1-3', '5', '8', '10-12'] will extract metadata from pages 1, 2, 3, 5, 8, 10, 11, 12. If None, metadata will be extracted from the entire document for each document in the documents list.
