
LinkContentFetcher

Fetches and extracts content from URLs.

Basic Information

  • Type: haystack.components.fetchers.link_content.LinkContentFetcher
  • Components it can connect with:
    • Any component that produces a list of URLs as output, such as SerperDevWebSearch
    • Converters: Sends fetched content (ByteStream) to converters like HTMLToDocument or MarkdownToDocument.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | List[str] | | A list of URLs to fetch content from. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| streams | List[ByteStream] | | ByteStream objects representing the extracted content. |

Overview

LinkContentFetcher fetches and extracts content from URLs. It supports various content types, retries on failures, and automatic user-agent rotation for failed web requests.
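The retry and user-agent rotation behavior can be pictured with a small standard-library sketch. This is not LinkContentFetcher's implementation: `fetch_with_rotation` and `fake_fetch` are hypothetical names, the fetch function is injected so the example runs without network access, and counting one initial try plus `retry_attempts` retries is an assumption.

```python
from typing import Callable, List, Optional, Sequence


def fetch_with_rotation(
    url: str,
    fetch: Callable[[str, str], bytes],
    user_agents: Sequence[str],
    retry_attempts: int = 2,
) -> bytes:
    """Call fetch(url, user_agent); after a failure, retry with the next user agent."""
    if not user_agents:
        raise ValueError("user_agents must contain at least one entry")
    last_error: Optional[Exception] = None
    # Assumption: one initial try plus `retry_attempts` retries.
    for attempt in range(retry_attempts + 1):
        agent = user_agents[attempt % len(user_agents)]  # rotate on every attempt
        try:
            return fetch(url, agent)
        except Exception as exc:  # a real fetcher would catch narrower errors
            last_error = exc
    assert last_error is not None
    raise last_error


# Hypothetical fetch function: rejects the first user agent, accepts the second.
calls: List[str] = []


def fake_fetch(url: str, agent: str) -> bytes:
    calls.append(agent)
    if agent == "agent-a":
        raise ConnectionError("blocked")
    return b"<html>ok</html>"


body = fetch_with_rotation("https://example.com", fake_fetch, ["agent-a", "agent-b"])
```

In this sketch the first attempt fails with `agent-a` and the retry succeeds with `agent-b`, which is the situation rotation is meant to handle.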

Use this component as the data-fetching step in your pipelines and indexes. Convert LinkContentFetcher's output into documents using a converter like HTMLToDocument or MarkdownToDocument.
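To illustrate the handoff from fetcher to converter, here is a minimal standard-library sketch. `FetchedStream` is a hypothetical stand-in for ByteStream (raw bytes plus metadata), and `stream_to_document` only mimics the role of a converter; it is not how HTMLToDocument is implemented. The `url` and `content_type` metadata keys are assumptions chosen to match the `doc.meta.url` lookup in the usage example's prompt template.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List


@dataclass
class FetchedStream:
    """Hypothetical stand-in for a ByteStream: raw bytes plus metadata."""
    data: bytes
    meta: Dict[str, str] = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def stream_to_document(stream: FetchedStream) -> dict:
    """Turn fetched HTML bytes into a document-like dict, carrying the metadata over."""
    parser = _TextExtractor()
    parser.feed(stream.data.decode("utf-8", errors="replace"))
    parser.close()
    return {"content": " ".join(parser.parts), "meta": dict(stream.meta)}


stream = FetchedStream(
    data=b"<html><body><h1>Title</h1><script>x=1</script><p>Hello</p></body></html>",
    meta={"url": "https://example.com", "content_type": "text/html"},
)
doc = stream_to_document(stream)
```

The point of the sketch is that metadata travels with the content, so a downstream prompt template can reference the source URL of each document.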

Usage Example

This pipeline uses SerperDevWebSearch to find relevant links, LinkContentFetcher to fetch their content, and HTMLToDocument to convert that content into documents for the prompt:

components:
  builder:
    init_parameters:
      required_variables: "*"
      template: |-
        {% for doc in docs %}
        {% if doc.content and doc.meta.url|length > 0 %}
        <search-result url="{{ doc.meta.url }}">
        {{ doc.content|truncate(25000) }}
        </search-result>
        {% endif %}
        {% endfor %}
      variables:
    type: haystack.components.builders.prompt_builder.PromptBuilder
  converter:
    init_parameters:
      extraction_kwargs: {}
      store_full_path: false
    type: haystack.components.converters.html.HTMLToDocument
  fetcher:
    init_parameters:
      raise_on_failure: false
      retry_attempts: 2
      timeout: 3
      user_agents:
        - haystack/LinkContentFetcher/2.11.1
    type: haystack.components.fetchers.link_content.LinkContentFetcher
  search:
    init_parameters:
      api_key:
        env_vars:
          - SERPERDEV_API_KEY
        strict: false
        type: env_var
      search_params: {}
      top_k: 10
    type: haystack.components.websearch.serper_dev.SerperDevWebSearch

  AnthropicGenerator:
    type: haystack_integrations.components.generators.anthropic.generator.AnthropicGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - ANTHROPIC_API_KEY
        strict: false
      model: claude-sonnet-4-20250514
      streaming_callback:
      system_prompt:
      generation_kwargs:
      timeout:
      max_retries:
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true

connections:
  - receiver: fetcher.urls
    sender: search.links
  - receiver: converter.sources
    sender: fetcher.streams
  - receiver: builder.docs
    sender: converter.documents

  - sender: builder.prompt
    receiver: AnthropicGenerator.prompt

  - sender: AnthropicGenerator.replies
    receiver: AnswerBuilder.replies

max_runs_per_component: 100

metadata: {}

inputs:
  query:
    - search.query
    - AnswerBuilder.query

outputs:
  answers: AnswerBuilder.answers

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| raise_on_failure | bool | True | If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry to fetch the URL's content. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the 'h2' package to be installed (via pip install httpx[http2]). |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
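The `raise_on_failure` description distinguishes a single URL from several URLs. The sketch below restates that rule in plain Python so the difference is easy to see. `fetch_all` and `fake_fetch_one` are hypothetical names, the fetch function is injected instead of making network calls, and returning an empty list when a single-URL failure is suppressed is an assumption, not documented behavior.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("fetch_sketch")


def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], bytes],
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Fetch every URL, applying the single-URL vs. multi-URL failure rule."""
    if len(urls) == 1:
        try:
            return [fetch_one(urls[0])]
        except Exception as exc:
            if raise_on_failure:
                raise  # a single URL fails loudly when raise_on_failure is True
            logger.warning("Could not fetch %s: %s", urls[0], exc)
            return []  # assumption: nothing is returned for the failed URL

    results: List[bytes] = []
    for url in urls:
        try:
            results.append(fetch_one(url))
        except Exception as exc:
            # With several URLs, failures are logged and the rest is still returned.
            logger.warning("Skipping %s: %s", url, exc)
    return results


# Hypothetical fetch function that fails for any URL containing "bad".
def fake_fetch_one(url: str) -> bytes:
    if "bad" in url:
        raise ConnectionError(url)
    return url.encode()


partial = fetch_all(["https://ok.example", "https://bad.example"], fake_fetch_one)
```

Setting `raise_on_failure: false`, as the usage example does, suits search-driven pipelines where a few unreachable links should not stop the whole run.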

Run Method Parameters

These are the parameters you can configure for the run() method. You can pass these parameters at query time through the API, in Playground, or when running a job.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | List[str] | | A list of URLs to fetch content from. |