LinkContentFetcher

Fetch and extract content from URLs. The component supports various content types, retries on failures, and automatic user-agent rotation for failed web requests.

Use it as the data-fetching step in pipelines that index or process web content.

Key Features

  • Fetches content from a list of URLs and returns it as ByteStream objects.
  • Configurable retry attempts and timeout for resilient fetching.
  • Automatic user-agent rotation to handle sites that block default agents.
  • Optional HTTP/2 support.

Configuration

  1. Drag the LinkContentFetcher component onto the canvas from the Component Library.
  2. Click on the component to open the configuration panel.
  3. Configure the component settings:
    • Set Retry Attempts to specify how many times to retry fetching a URL on failure. The default is two.
    • Set Timeout (in seconds) for each request. The default is three seconds.
    • Set Raise on Failure to control whether to raise an exception if fetching a single URL fails.
    • Set User Agents to provide a custom list of user agent strings for rotation.
    • Enable HTTP/2 support if required (requires the h2 package).
    • Set Client Kwargs for custom httpx client configuration.
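To make the interaction between Retry Attempts, Timeout, and User Agents more concrete, here is a minimal standard-library sketch of retrying with user-agent rotation. It is not the component's implementation: `fetch_with_rotation`, `fake_fetch`, and the agent names are hypothetical, and the sketch assumes that Retry Attempts counts retries after the first attempt, which the actual component may count differently.

```python
from typing import Callable, List, Optional


def fetch_with_rotation(
    url: str,
    fetch: Callable[[str, str, float], bytes],
    user_agents: List[str],
    retry_attempts: int = 2,
    timeout: float = 3.0,
) -> bytes:
    """Call `fetch` up to 1 + retry_attempts times, switching user agent after each failure."""
    if retry_attempts < 0:
        raise ValueError("retry_attempts must be non-negative")
    last_error: Optional[Exception] = None
    for attempt in range(retry_attempts + 1):
        # Cycle through the configured agents, one per attempt.
        agent = user_agents[attempt % len(user_agents)]
        try:
            return fetch(url, agent, timeout)
        except OSError as err:  # stand-in for a network-level failure
            last_error = err
    # Every attempt failed: surface the most recent error.
    raise last_error


# Hypothetical fetch function that blocks the first agent (no real network access).
def fake_fetch(url: str, agent: str, timeout: float) -> bytes:
    if agent == "agent-a":
        raise ConnectionError("blocked")
    return f"content of {url} via {agent}".encode()


body = fetch_with_rotation("https://example.com", fake_fetch, ["agent-a", "agent-b"])
print(body.decode())
```

In this sketch the first attempt fails with the first agent and the retry succeeds with the second one, which is the situation user-agent rotation is meant to handle.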

Connections

LinkContentFetcher accepts a list of URL strings through its urls input. It outputs a list of ByteStream objects.

It typically connects with:

  • Web search components like SerperDevWebSearch: receives URLs from search results.
  • Converters like HTMLToDocument or MarkdownToDocument: sends fetched ByteStream content for conversion to documents.

Source Code

To check this component's source code, open link_content.py in the Haystack repository.

Usage Examples

Basic Configuration

fetcher:
  type: haystack.components.fetchers.link_content.LinkContentFetcher
  init_parameters:
    raise_on_failure: false
    retry_attempts: 2
    timeout: 3
    user_agents:
      - haystack/LinkContentFetcher/2.11.1

Using the Component in a Pipeline

This pipeline uses LinkContentFetcher to fetch the pages returned by a web search and converts them to documents:

# haystack-pipeline
components:
  builder:
    init_parameters:
      required_variables: "*"
      template: |-
        {% for doc in docs %}
        {% if doc.content and doc.meta.url|length > 0 %}
        <search-result url="{{ doc.meta.url }}">
        {{ doc.content|truncate(25000) }}
        </search-result>
        {% endif %}
        {% endfor %}
      variables:
    type: haystack.components.builders.prompt_builder.PromptBuilder
  converter:
    init_parameters:
      extraction_kwargs: {}
      store_full_path: false
    type: haystack.components.converters.html.HTMLToDocument
  fetcher:
    init_parameters:
      raise_on_failure: false
      retry_attempts: 2
      timeout: 3
      user_agents:
        - haystack/LinkContentFetcher/2.11.1
    type: haystack.components.fetchers.link_content.LinkContentFetcher
  search:
    init_parameters:
      api_key:
        env_vars:
          - SERPERDEV_API_KEY
        strict: false
        type: env_var
      search_params: {}
      top_k: 10
    type: haystack.components.websearch.serper_dev.SerperDevWebSearch

  AnthropicGenerator:
    type: haystack_integrations.components.generators.anthropic.generator.AnthropicGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - ANTHROPIC_API_KEY
        strict: false
      model: claude-sonnet-4-20250514
      streaming_callback:
      system_prompt:
      generation_kwargs:
      timeout:
      max_retries:
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true

connections:
  - receiver: fetcher.urls
    sender: search.links
  - receiver: converter.sources
    sender: fetcher.streams
  - receiver: builder.docs
    sender: converter.documents

  - sender: builder.prompt
    receiver: AnthropicGenerator.prompt

  - sender: AnthropicGenerator.replies
    receiver: AnswerBuilder.replies

max_runs_per_component: 100

metadata: {}

inputs:
  query:
    - search.query
    - AnswerBuilder.query

outputs:
  answers: AnswerBuilder.answers
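
A definition like the one above is easy to break with a typo in a component name. As a rough illustration, the following sketch checks that every connection endpoint refers to a declared component. It works on plain Python data that mirrors the YAML by hand rather than parsing the file, and `undeclared_components` is a hypothetical helper, not part of Haystack.

```python
from typing import Iterable, List, Set, Tuple


def undeclared_components(
    components: Set[str], connections: Iterable[Tuple[str, str]]
) -> List[str]:
    """Return the component names used in connections but missing from `components`."""
    missing = set()
    for sender, receiver in connections:
        for endpoint in (sender, receiver):
            # An endpoint looks like "component.socket"; keep the component part.
            name = endpoint.split(".", 1)[0]
            if name not in components:
                missing.add(name)
    return sorted(missing)


# Component names and (sender, receiver) pairs copied from the pipeline definition.
components = {
    "search", "fetcher", "converter", "builder",
    "AnthropicGenerator", "AnswerBuilder",
}
connections = [
    ("search.links", "fetcher.urls"),
    ("fetcher.streams", "converter.sources"),
    ("converter.documents", "builder.docs"),
    ("builder.prompt", "AnthropicGenerator.prompt"),
    ("AnthropicGenerator.replies", "AnswerBuilder.replies"),
]

problems = undeclared_components(components, connections)
print(problems)
```

An empty list means every sender and receiver points at a declared component. The check says nothing about whether the socket names or types match.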

Parameters

Inputs

| Parameter | Type | Description |
|---|---|---|
| urls | List[str] | A list of URLs to fetch content from. |

Outputs

| Parameter | Type | Description |
|---|---|---|
| streams | List[ByteStream] | ByteStream objects representing the extracted content. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|---|---|---|---|
| raise_on_failure | bool | True | If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry fetching the URL's content. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the h2 package to be installed (via pip install httpx[http2]). |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
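
The raise_on_failure description distinguishes between a single URL and multiple URLs. The sketch below illustrates that distinction with standard-library Python only. It is not the component's code: `fetch_all` and `fake_fetch_one` are hypothetical, and returning an empty list for a single failed URL when the flag is False is an assumption, since the table does not say what is returned in that case.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("fetch_sketch")


def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], bytes],
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Fetch every URL, applying the single-URL vs multiple-URL failure rule."""
    if len(urls) == 1:
        try:
            return [fetch_one(urls[0])]
        except Exception as err:
            if raise_on_failure:
                raise  # single URL: propagate the error
            logger.warning("Could not fetch %s: %s", urls[0], err)
            return []  # assumption: nothing is returned for the failed URL

    fetched = []
    for url in urls:
        try:
            fetched.append(fetch_one(url))
        except Exception as err:
            # Multiple URLs: log the failure and keep what succeeded.
            logger.warning("Could not fetch %s: %s", url, err)
    return fetched


# Hypothetical fetch function that fails for URLs containing "bad".
def fake_fetch_one(url: str) -> bytes:
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return url.encode()


partial = fetch_all(["https://ok.example", "https://bad.example"], fake_fetch_one)
print(partial)
```

With two URLs, the failing one is logged and skipped, so only the successful content comes back. Passing the failing URL on its own with raise_on_failure left at True would raise the ConnectionError instead.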

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Description |
|---|---|---|
| urls | List[str] | A list of URLs to fetch content from. |