LinkContentFetcher
Fetches and extracts content from URLs.
Basic Information
- Type: haystack.components.fetchers.link_content.LinkContentFetcher
- Components it can connect with:
  - Any component that produces a list of URLs as output, such as SerperDevWebSearch
  - Converters: Sends fetched content (ByteStream) to converters like HTMLToDocument or MarkdownToDocument.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| streams | List[ByteStream] | | ByteStream objects representing the extracted content. |
Overview
LinkContentFetcher fetches and extracts content from URLs. It supports various content types, retries failed requests, and automatically rotates user agents when a web request fails.
Use this component as the data-fetching step in your pipelines and indexes. Convert LinkContentFetcher's output into documents using a converter like HTMLToDocument or MarkdownToDocument.
Usage Example
This pipeline searches the web, uses LinkContentFetcher to fetch the content of the resulting links, and converts that content to documents:
components:
  builder:
    init_parameters:
      required_variables: "*"
      template: |-
        {% for doc in docs %}
        {% if doc.content and doc.meta.url|length > 0 %}
        <search-result url="{{ doc.meta.url }}">
        {{ doc.content|truncate(25000) }}
        </search-result>
        {% endif %}
        {% endfor %}
      variables:
    type: haystack.components.builders.prompt_builder.PromptBuilder
  converter:
    init_parameters:
      extraction_kwargs: {}
      store_full_path: false
    type: haystack.components.converters.html.HTMLToDocument
  fetcher:
    init_parameters:
      raise_on_failure: false
      retry_attempts: 2
      timeout: 3
      user_agents:
      - haystack/LinkContentFetcher/2.11.1
    type: haystack.components.fetchers.link_content.LinkContentFetcher
  search:
    init_parameters:
      api_key:
        env_vars:
        - SERPERDEV_API_KEY
        strict: false
        type: env_var
      search_params: {}
      top_k: 10
    type: haystack.components.websearch.serper_dev.SerperDevWebSearch
  AnthropicGenerator:
    type: haystack_integrations.components.generators.anthropic.generator.AnthropicGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - ANTHROPIC_API_KEY
        strict: false
      model: claude-sonnet-4-20250514
      streaming_callback:
      system_prompt:
      generation_kwargs:
      timeout:
      max_retries:
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true
connections:
- receiver: fetcher.urls
  sender: search.links
- receiver: converter.sources
  sender: fetcher.streams
- receiver: builder.docs
  sender: converter.documents
- sender: builder.prompt
  receiver: AnthropicGenerator.prompt
- sender: AnthropicGenerator.replies
  receiver: AnswerBuilder.replies
max_runs_per_component: 100
metadata: {}
inputs:
  query:
  - search.query
  - AnswerBuilder.query
outputs:
  answers: AnswerBuilder.answers
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| raise_on_failure | bool | True | If True, raises an exception when fetching a single URL fails. When fetching multiple URLs, it logs errors and returns the content it fetched successfully. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry fetching a URL's content. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the 'h2' package to be installed (via pip install httpx[http2]). |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
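To make the interplay of `retry_attempts`, `user_agents`, and `raise_on_failure` more concrete, the following standard-library sketch models the behavior described in the table. It is an illustration only and not the component's actual implementation: the function names, the injected `fetch` callable, and the choice to count `retry_attempts` as retries after the first attempt are all assumptions, and backoff between attempts is left out.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)

# Hypothetical fetch function: (url, user_agent) -> response body.
Fetch = Callable[[str, str], bytes]


def fetch_with_retries(
    url: str, fetch: Fetch, user_agents: Sequence[str], retry_attempts: int = 2
) -> bytes:
    """Try a URL up to 1 + retry_attempts times, switching user agent after each failure."""
    if retry_attempts < 0 or not user_agents:
        raise ValueError("need retry_attempts >= 0 and at least one user agent")
    last_error: Exception = RuntimeError("no attempt made")
    for attempt in range(retry_attempts + 1):
        # Rotate to the next user agent on every new attempt.
        agent = user_agents[attempt % len(user_agents)]
        try:
            return fetch(url, agent)
        except Exception as exc:  # any failure triggers a retry in this sketch
            last_error = exc
    raise last_error


def fetch_all(
    urls: Sequence[str],
    fetch: Fetch,
    user_agents: Sequence[str],
    retry_attempts: int = 2,
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Single URL: raise if requested. Multiple URLs: log failures, keep successes."""
    if len(urls) == 1 and raise_on_failure:
        return [fetch_with_retries(urls[0], fetch, user_agents, retry_attempts)]
    bodies: List[bytes] = []
    for url in urls:
        try:
            bodies.append(fetch_with_retries(url, fetch, user_agents, retry_attempts))
        except Exception as exc:
            logger.warning("Could not fetch %s: %s", url, exc)
    return bodies
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP client, so the control flow can be exercised with a stub function.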
Run Method Parameters
These are the parameters you can configure for the run() method. You can pass these parameters at query time through the API, in Playground, or when running a job.
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |