LinkContentFetcher

Fetch and extract content from URLs. The component supports various content types, retries on failures, and automatic user-agent rotation for failed web requests.

Key Features

  • Fetches content from a list of URLs and returns it as ByteStream objects.
  • Retries failed requests a configurable number of times.
  • Rotates user agents automatically on failed requests.
  • Configurable request timeout.
  • Supports HTTP/2 when the h2 package is installed.
  • Accepts additional keyword arguments for customizing the underlying httpx client.
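The retry and user-agent rotation behavior can be sketched in plain Python. This is an illustration only, not the component's actual implementation: `fetch_with_rotation`, the `send` callable, and the user-agent strings are hypothetical names, and treating `retry_attempts` as the number of extra attempts after the first one is an assumption.

```python
from typing import Callable, List, Optional

# Hypothetical user-agent strings, used only for this illustration.
DEFAULT_USER_AGENTS = ["agent-a/1.0", "agent-b/1.0", "agent-c/1.0"]


def fetch_with_rotation(
    url: str,
    send: Callable[[str, str], bytes],
    user_agents: Optional[List[str]] = None,
    retry_attempts: int = 2,
) -> bytes:
    """Try `url` once, then retry up to `retry_attempts` more times.

    `send(url, user_agent)` is a stand-in for the real HTTP call and is
    expected to raise an exception when the request fails.
    """
    agents = user_agents or DEFAULT_USER_AGENTS
    last_error: Optional[Exception] = None
    for attempt in range(retry_attempts + 1):
        # Each failed attempt moves on to the next user agent in the list.
        agent = agents[attempt % len(agents)]
        try:
            return send(url, agent)
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    raise RuntimeError(f"Failed to fetch {url}") from last_error


# Fake transport that rejects the first user agent and accepts any other.
def fake_send(url: str, user_agent: str) -> bytes:
    if user_agent == "agent-a/1.0":
        raise ConnectionError("blocked")
    return f"{user_agent} fetched {url}".encode()


result = fetch_with_rotation("https://example.com", fake_send)
print(result.decode())  # agent-b/1.0 fetched https://example.com
```

Passing the transport in as a callable keeps the sketch runnable without network access. The real component sends requests through an httpx client.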

Configuration

  1. Drag the LinkContentFetcher component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed.

Connections

LinkContentFetcher accepts a list of URLs (urls) as input and outputs a list of ByteStream objects (streams) representing the fetched content.

Typically, LinkContentFetcher receives URLs from a web search component such as SerperDevWebSearch. Its streams output connects to a converter like HTMLToDocument or MarkdownToDocument to turn the fetched content into documents for further processing.

Usage Example

This pipeline uses LinkContentFetcher to fetch the pages found by a web search, converts them to documents, and passes them to an LLM to generate an answer:

components:
  builder:
    init_parameters:
      required_variables: "*"
      template: |-
        {% for doc in docs %}
        {% if doc.content and doc.meta.url|length > 0 %}
        <search-result url="{{ doc.meta.url }}">
        {{ doc.content|truncate(25000) }}
        </search-result>
        {% endif %}
        {% endfor %}
      variables:
    type: haystack.components.builders.prompt_builder.PromptBuilder
  converter:
    init_parameters:
      extraction_kwargs: {}
      store_full_path: false
    type: haystack.components.converters.html.HTMLToDocument
  fetcher:
    init_parameters:
      raise_on_failure: false
      retry_attempts: 2
      timeout: 3
      user_agents:
        - haystack/LinkContentFetcher/2.11.1
    type: haystack.components.fetchers.link_content.LinkContentFetcher
  search:
    init_parameters:
      api_key:
        env_vars:
          - SERPERDEV_API_KEY
        strict: false
        type: env_var
      search_params: {}
      top_k: 10
    type: haystack.components.websearch.serper_dev.SerperDevWebSearch

  AnthropicGenerator:
    type: haystack_integrations.components.generators.anthropic.generator.AnthropicGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - ANTHROPIC_API_KEY
        strict: false
      model: claude-sonnet-4-20250514
      streaming_callback:
      system_prompt:
      generation_kwargs:
      timeout:
      max_retries:
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true

connections:
  - receiver: fetcher.urls
    sender: search.links
  - receiver: converter.sources
    sender: fetcher.streams
  - receiver: builder.docs
    sender: converter.documents

  - sender: builder.prompt
    receiver: AnthropicGenerator.prompt

  - sender: AnthropicGenerator.replies
    receiver: AnswerBuilder.replies

max_runs_per_component: 100

metadata: {}

inputs:
  query:
    - search.query
    - AnswerBuilder.query

outputs:
  answers: AnswerBuilder.answers
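
To see the order in which data moves through this pipeline, the sender and receiver pairs from the connections section can be sorted topologically. The sketch below uses only the Python standard library and copies the pairs by hand. It is a reading aid, not part of the pipeline definition or of how the pipeline engine schedules components.

```python
from graphlib import TopologicalSorter

# (sender, receiver) pairs copied from the `connections` section above.
connections = [
    ("search.links", "fetcher.urls"),
    ("fetcher.streams", "converter.sources"),
    ("converter.documents", "builder.docs"),
    ("builder.prompt", "AnthropicGenerator.prompt"),
    ("AnthropicGenerator.replies", "AnswerBuilder.replies"),
]

# Map each receiving component to the set of components it depends on.
dependencies = {}
for sender, receiver in connections:
    sender_component = sender.split(".")[0]
    receiver_component = receiver.split(".")[0]
    dependencies.setdefault(receiver_component, set()).add(sender_component)

# static_order() yields every component after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
# search -> fetcher -> converter -> builder -> AnthropicGenerator -> AnswerBuilder
```

The fetcher sits between the web search and the converter: it receives the links found by the search and hands the fetched content on as streams.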

Parameters

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | List[str] | | A list of URLs to fetch content from. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| streams | List[ByteStream] | | ByteStream objects representing the extracted content. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| raise_on_failure | bool | True | If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry fetching the URL's content. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the h2 package to be installed (via pip install httpx[http2]). |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
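
The failure handling described for raise_on_failure can be sketched as follows. This is an illustration under stated assumptions rather than the component's code: `fetch_all` and `fetch_one` are hypothetical names, and returning an empty list for a single failed URL when raise_on_failure is False is an assumption, since the table only describes the True case.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], bytes],
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Fetch every URL, applying the failure rules from the table above.

    `fetch_one` is a hypothetical stand-in for the real request and is
    expected to raise an exception when the request fails.
    """
    if len(urls) == 1:
        try:
            return [fetch_one(urls[0])]
        except Exception:
            if raise_on_failure:
                raise  # single URL: surface the error to the caller
            # Assumption: with raise_on_failure=False, log and return nothing.
            logger.warning("Could not fetch %s", urls[0])
            return []

    # Multiple URLs: log each failure and keep whatever succeeded.
    results: List[bytes] = []
    for url in urls:
        try:
            results.append(fetch_one(url))
        except Exception as error:
            logger.warning("Could not fetch %s: %s", url, error)
    return results


# Fake request function: URLs containing "bad" fail, all others succeed.
def fake_fetch(url: str) -> bytes:
    if "bad" in url:
        raise ConnectionError(url)
    return url.encode()


partial = fetch_all(["https://ok.example", "https://bad.example"], fake_fetch)
print(partial)  # [b'https://ok.example']
```

With several URLs, one failing link does not discard the content fetched from the others, which is usually what you want when the URLs come from a web search.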

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | List[str] | | A list of URLs to fetch content from. |