LinkContentFetcher
Fetch and extract content from URLs. The component supports various content types, retries on failures, and automatic user-agent rotation for failed web requests.
Use LinkContentFetcher as the data-fetching step in pipelines that index or process web content.
Key Features
- Fetches content from a list of URLs and returns it as ByteStream objects.
- Configurable retry attempts and timeout for resilient fetching.
- Automatic user-agent rotation to handle sites that block default agents.
- Optional HTTP/2 support.
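The retry and user-agent rotation features can be pictured with a short sketch. This is an illustration of the general technique only, not the component's actual implementation: `fetch_with_rotation`, `fake_send`, and the agent strings are hypothetical names, and the network call is replaced by a stand-in function so the example runs offline.

```python
from typing import Callable, Sequence


def fetch_with_rotation(
    url: str,
    send: Callable[[str, str], bytes],
    user_agents: Sequence[str],
    retry_attempts: int = 2,
) -> bytes:
    """Try the URL once, then retry with the next user agent after each failure."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    if retry_attempts < 0:
        raise ValueError("retry_attempts must be >= 0")

    last_error = None
    for attempt in range(retry_attempts + 1):  # first try plus the retries
        # Rotate through the agents, wrapping around if there are more attempts than agents.
        agent = user_agents[attempt % len(user_agents)]
        try:
            return send(url, agent)
        except OSError as error:  # stand-in for a failed web request
            last_error = error
    raise last_error


def fake_send(url: str, user_agent: str) -> bytes:
    """Hypothetical transport: rejects one agent, accepts all others."""
    if user_agent == "blocked-agent/1.0":
        raise ConnectionError("request blocked")
    return f"{url} via {user_agent}".encode()


body = fetch_with_rotation(
    "https://example.com",
    fake_send,
    ["blocked-agent/1.0", "friendly-agent/1.0"],
)
```

In this sketch the first attempt fails with the blocked agent and the retry succeeds with the second one. If every attempt fails, the last error is re-raised.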
Configuration
- Drag the LinkContentFetcher component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Set Retry Attempts to specify how many times to retry fetching a URL on failure. The default is two.
  - Set Timeout (in seconds) for each request. The default is three seconds.
  - Set Raise on Failure to control whether to raise an exception if fetching a single URL fails.
  - Set User Agents to provide a custom list of user agent strings for rotation.
  - Enable HTTP/2 support if required (requires the h2 package).
  - Set Client Kwargs for custom httpx client configuration.
Connections
LinkContentFetcher accepts a list of URL strings through its urls input. It outputs a list of ByteStream objects.
It typically connects with:
- Web search components like SerperDevWebSearch: receives URLs from search results.
- Converters like HTMLToDocument or MarkdownToDocument: sends fetched ByteStream content for conversion to documents.
Source Code
To check this component's source code, open link_content.py in the Haystack repository.
Usage Examples
Basic Configuration
fetcher:
  type: haystack.components.fetchers.link_content.LinkContentFetcher
  init_parameters:
    raise_on_failure: false
    retry_attempts: 2
    timeout: 3
    user_agents:
      - haystack/LinkContentFetcher/2.11.1
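For readers who work with Haystack directly in Python, the same configuration can be sketched as keyword arguments. The names `FETCHER_KWARGS` and `fetch` are illustrative and not part of the library. The Haystack import sits inside the function so the settings can be inspected without Haystack installed. Calling `fetch` assumes Haystack is installed and the network is reachable.

```python
from typing import Any, Dict, List

# Keyword arguments mirroring the init_parameters of the YAML configuration.
FETCHER_KWARGS: Dict[str, Any] = {
    "raise_on_failure": False,
    "retry_attempts": 2,
    "timeout": 3,
    "user_agents": ["haystack/LinkContentFetcher/2.11.1"],
}


def fetch(urls: List[str]) -> list:
    """Fetch the given URLs and return the resulting ByteStream objects."""
    # Lazy import: only needed when a fetch is actually performed.
    from haystack.components.fetchers import LinkContentFetcher

    fetcher = LinkContentFetcher(**FETCHER_KWARGS)
    # run() returns a dictionary; the fetched content is under the "streams" key.
    return fetcher.run(urls=urls)["streams"]
```

A call such as `fetch(["https://example.com"])` would then return a list of ByteStream objects, one per URL fetched.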
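Using the Component in a Pipeline
This pipeline searches the web with SerperDevWebSearch, fetches the linked pages with LinkContentFetcher, converts them to documents, and passes them to a generator to produce answers: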
# haystack-pipeline
components:
  builder:
    init_parameters:
      required_variables: "*"
      template: |-
        {% for doc in docs %}
        {% if doc.content and doc.meta.url|length > 0 %}
        <search-result url="{{ doc.meta.url }}">
        {{ doc.content|truncate(25000) }}
        </search-result>
        {% endif %}
        {% endfor %}
      variables:
    type: haystack.components.builders.prompt_builder.PromptBuilder
  converter:
    init_parameters:
      extraction_kwargs: {}
      store_full_path: false
    type: haystack.components.converters.html.HTMLToDocument
  fetcher:
    init_parameters:
      raise_on_failure: false
      retry_attempts: 2
      timeout: 3
      user_agents:
        - haystack/LinkContentFetcher/2.11.1
    type: haystack.components.fetchers.link_content.LinkContentFetcher
  search:
    init_parameters:
      api_key:
        env_vars:
          - SERPERDEV_API_KEY
        strict: false
        type: env_var
      search_params: {}
      top_k: 10
    type: haystack.components.websearch.serper_dev.SerperDevWebSearch
  AnthropicGenerator:
    type: haystack_integrations.components.generators.anthropic.generator.AnthropicGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - ANTHROPIC_API_KEY
        strict: false
      model: claude-sonnet-4-20250514
      streaming_callback:
      system_prompt:
      generation_kwargs:
      timeout:
      max_retries:
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true
connections:
  - receiver: fetcher.urls
    sender: search.links
  - receiver: converter.sources
    sender: fetcher.streams
  - receiver: builder.docs
    sender: converter.documents
  - sender: builder.prompt
    receiver: AnthropicGenerator.prompt
  - sender: AnthropicGenerator.replies
    receiver: AnswerBuilder.replies
max_runs_per_component: 100
metadata: {}
inputs:
  query:
    - search.query
    - AnswerBuilder.query
outputs:
  answers: AnswerBuilder.answers
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| urls | List[str] | A list of URLs to fetch content from. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| streams | List[ByteStream] | ByteStream objects representing the extracted content. |
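Because the fetcher supports various content types, a downstream step often needs to sort the streams before conversion. The sketch below shows one possible way to do that. `FakeByteStream` is a hypothetical stand-in for ByteStream so the example runs without Haystack. It assumes each stream carries its payload in `data` and its metadata in `meta`, with `content_type` and `url` keys.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeByteStream:
    """Hypothetical stand-in for ByteStream: raw bytes plus a metadata dictionary."""

    data: bytes
    meta: dict = field(default_factory=dict)


def group_by_content_type(streams: List[FakeByteStream]) -> Dict[str, List[FakeByteStream]]:
    """Group non-empty streams by the content type recorded in their metadata."""
    groups = defaultdict(list)
    for stream in streams:
        if not stream.data:
            continue  # nothing to convert, so skip empty payloads
        # The "content_type" key is an assumption of this sketch.
        groups[stream.meta.get("content_type", "unknown")].append(stream)
    return dict(groups)


streams = [
    FakeByteStream(b"<html>hi</html>", {"content_type": "text/html", "url": "https://example.com"}),
    FakeByteStream(b"%PDF-1.7", {"content_type": "application/pdf", "url": "https://example.com/a.pdf"}),
    FakeByteStream(b"", {"content_type": "text/html", "url": "https://example.com/empty"}),
]
groups = group_by_content_type(streams)
```

Each group could then be sent to a matching converter, for example the `text/html` group to HTMLToDocument.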
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| raise_on_failure | bool | True | If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry fetching the URL's content. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the h2 package to be installed (via pip install httpx[http2]). |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
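The raise_on_failure behavior described in the table can be sketched as follows. This illustrates the documented semantics only, not the component's code: `fetch_all` and `fake_fetch` are hypothetical names. Returning an empty list when a single URL fails with raise_on_failure set to False is an assumption of this sketch.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], bytes],
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Fetch every URL, raising only when the sole URL in the list fails."""
    results = []
    for url in urls:
        try:
            results.append(fetch_one(url))
        except Exception as error:
            if raise_on_failure and len(urls) == 1:
                raise  # a single URL failed: surface the error to the caller
            # Several URLs (or raise_on_failure=False): log and keep going.
            logger.warning("Skipping %s: %s", url, error)
    return results


def fake_fetch(url: str) -> bytes:
    """Hypothetical fetch function that fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return url.encode()


partial = fetch_all(["https://example.com/ok", "https://example.com/bad"], fake_fetch)
```

Here `partial` holds only the content of the URL that was fetched successfully. Calling `fetch_all(["https://example.com/bad"], fake_fetch)` would raise instead.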
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| urls | List[str] | A list of URLs to fetch content from. |