LinkContentFetcher
Fetch and extract content from URLs. The component supports various content types, retries on failures, and automatic user-agent rotation for failed web requests.
Key Features
- Fetches content from a list of URLs and returns it as ByteStream objects.
- Retries failed requests a configurable number of times.
- Rotates user agents automatically on failed requests.
- Configurable request timeout.
- Supports HTTP/2 when the h2 package is installed.
- Accepts additional keyword arguments for customizing the underlying httpx client.
Configuration
- Drag the LinkContentFetcher component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the parameters as needed.
Connections
LinkContentFetcher accepts a list of URLs (urls) as input and outputs a list of ByteStream objects (streams) representing the fetched content.
Typically, LinkContentFetcher receives URLs from a web search component such as SerperDevWebSearch. Its streams output connects to a converter like HTMLToDocument or MarkdownToDocument to turn the fetched content into documents for further processing.
Usage Example
This query pipeline uses LinkContentFetcher to fetch the pages returned by a web search and convert them to documents, which are then used to generate an answer:
components:
  builder:
    init_parameters:
      required_variables: "*"
      template: |-
        {% for doc in docs %}
        {% if doc.content and doc.meta.url|length > 0 %}
        <search-result url="{{ doc.meta.url }}">
        {{ doc.content|truncate(25000) }}
        </search-result>
        {% endif %}
        {% endfor %}
      variables:
    type: haystack.components.builders.prompt_builder.PromptBuilder
  converter:
    init_parameters:
      extraction_kwargs: {}
      store_full_path: false
    type: haystack.components.converters.html.HTMLToDocument
  fetcher:
    init_parameters:
      raise_on_failure: false
      retry_attempts: 2
      timeout: 3
      user_agents:
      - haystack/LinkContentFetcher/2.11.1
    type: haystack.components.fetchers.link_content.LinkContentFetcher
  search:
    init_parameters:
      api_key:
        env_vars:
        - SERPERDEV_API_KEY
        strict: false
        type: env_var
      search_params: {}
      top_k: 10
    type: haystack.components.websearch.serper_dev.SerperDevWebSearch
  AnthropicGenerator:
    type: haystack_integrations.components.generators.anthropic.generator.AnthropicGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - ANTHROPIC_API_KEY
        strict: false
      model: claude-sonnet-4-20250514
      streaming_callback:
      system_prompt:
      generation_kwargs:
      timeout:
      max_retries:
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true
connections:
- receiver: fetcher.urls
  sender: search.links
- receiver: converter.sources
  sender: fetcher.streams
- receiver: builder.docs
  sender: converter.documents
- sender: builder.prompt
  receiver: AnthropicGenerator.prompt
- sender: AnthropicGenerator.replies
  receiver: AnswerBuilder.replies
max_runs_per_component: 100
metadata: {}
inputs:
  query:
  - search.query
  - AnswerBuilder.query
outputs:
  answers: AnswerBuilder.answers
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| streams | List[ByteStream] | | ByteStream objects representing the extracted content. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| raise_on_failure | bool | True | If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry fetching a URL's content after a failed request. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the 'h2' package to be installed (via pip install httpx[http2]). |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
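As a rough illustration of how these init parameters fit together, the fragment below sketches a fetcher entry for the components section of a pipeline. The specific values, the two user-agent strings, and the follow_redirects entry under client_kwargs are assumptions chosen for the example, not recommended settings; follow_redirects is a standard httpx client argument, but check which arguments your httpx version accepts.

```yaml
# Illustrative fragment only: values below are assumptions, not defaults.
fetcher:
  type: haystack.components.fetchers.link_content.LinkContentFetcher
  init_parameters:
    raise_on_failure: false   # log failed fetches instead of raising
    retry_attempts: 3         # retries per URL
    timeout: 5                # request timeout in seconds
    http2: true               # needs the h2 package (pip install httpx[http2])
    user_agents:              # hypothetical strings, rotated on failed requests
    - my-app/1.0
    - Mozilla/5.0 (compatible; my-app/1.0)
    client_kwargs:            # passed through to the httpx client
      follow_redirects: true
```

Listing more than one user agent gives the component something to rotate through when a request fails. With a single entry, retries reuse the same user agent.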
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |