LinkContentFetcher

Fetches and extracts content from URLs.

Basic Information

Type: haystack.components.fetchers.link_content.LinkContentFetcher

Inputs

Parameter	Type	Default	Description
urls	List[str]		A list of URLs to fetch content from.

Outputs

Parameter	Type	Default	Description
streams	List[ByteStream]		`ByteStream` objects representing the extracted content.

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

Fetches and extracts content from URLs.

It supports various content types, retries on failures, and automatic user-agent rotation for failed web requests. Use it as the data-fetching step in your pipelines.
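
To make the retry and user-agent rotation behavior more concrete, here is a minimal sketch of the general idea. It is not the component's actual implementation: `fetch_with_retries`, the injected `fetch` callable, and the agent strings are hypothetical names used for illustration only, and the real component may pick user agents differently.

```python
from typing import Callable, List, Optional


def fetch_with_retries(
    url: str,
    fetch: Callable[[str, str], bytes],  # hypothetical: (url, user_agent) -> body, raises on failure
    user_agents: List[str],
    retry_attempts: int = 2,
) -> bytes:
    """Try once, then retry up to `retry_attempts` times, moving to the next user agent after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(retry_attempts + 1):
        # Rotate through the configured user agents, wrapping around if needed.
        agent = user_agents[attempt % len(user_agents)]
        try:
            return fetch(url, agent)
        except Exception as exc:
            last_error = exc
    assert last_error is not None
    raise last_error


# Stand-in for a real HTTP call: the first agent is rejected, the second succeeds.
calls: List[str] = []


def flaky_fetch(url: str, agent: str) -> bytes:
    calls.append(agent)
    if agent == "agent-a":
        raise ConnectionError("blocked")
    return b"<html>ok</html>"


body = fetch_with_retries("https://example.com", flaky_fetch, ["agent-a", "agent-b"])
```

In this sketch the first request fails with `agent-a`, and the retry goes out with `agent-b`.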

You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.

For async usage:

import asyncio
from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())

Usage Example

components:
  LinkContentFetcher:
    type: haystack.components.fetchers.link_content.LinkContentFetcher
    init_parameters:
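
As a fuller illustration, the following sketch spells out the init parameters with the defaults from the Init Parameters table and connects the fetcher to an HTMLToDocument converter. The fully qualified `type` paths and the `sources` input name are assumptions based on the open source Haystack package layout, so check them against your own setup.

```yaml
components:
  LinkContentFetcher:
    type: haystack.components.fetchers.link_content.LinkContentFetcher
    init_parameters:
      raise_on_failure: true
      user_agents: null
      retry_attempts: 2
      timeout: 3
      http2: false
      client_kwargs: null
  HTMLToDocument:
    type: haystack.components.converters.html.HTMLToDocument
    init_parameters: {}

connections:
  # Pass the fetched ByteStream objects to the converter
  - sender: LinkContentFetcher.streams
    receiver: HTMLToDocument.sources
```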

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter	Type	Default	Description
raise_on_failure	bool	True	If `True`, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched.
user_agents	Optional[List[str]]	None	User agents for fetching content. If `None`, a default user agent is used.
retry_attempts	int	2	The number of times to retry fetching a URL's content.
timeout	int	3	Timeout in seconds for the request.
http2	bool	False	Whether to enable HTTP/2 support for requests. Requires the `h2` package, which you can install with `pip install httpx[http2]`.
client_kwargs	Optional[Dict]	None	Additional keyword arguments to pass to the httpx client. If `None`, default values are used.
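
The difference between single-URL and multi-URL failure handling described for `raise_on_failure` can be sketched as follows. This is an illustration under stated assumptions, not the component's code: `fetch_urls` and `fetch_one` are hypothetical names, and the table does not say what happens for a single URL when `raise_on_failure` is `False`, so skipping the URL in that case is an assumption.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("fetch_sketch")


def fetch_urls(
    urls: List[str],
    fetch_one: Callable[[str], bytes],  # hypothetical: returns the body or raises
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Mimic the documented failure handling for one URL versus many."""
    if len(urls) == 1 and raise_on_failure:
        # Single URL: let the exception propagate to the caller.
        return [fetch_one(urls[0])]

    fetched: List[bytes] = []
    for url in urls:
        try:
            fetched.append(fetch_one(url))
        except Exception as exc:
            # Multiple URLs (or raise_on_failure=False, by assumption):
            # log the error and keep whatever was fetched successfully.
            logger.warning("Could not fetch %s: %s", url, exc)
    return fetched


def fake_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP request.
    if "broken" in url:
        raise ConnectionError("unreachable")
    return url.encode()


partial = fetch_urls(
    ["https://example.com/a", "https://example.com/broken"], fake_fetch
)
```

With two URLs, the failing one is logged and only the successful result is returned. Calling `fetch_urls(["https://example.com/broken"], fake_fetch)` raises instead.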

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter	Type	Default	Description
urls	List[str]		A list of URLs to fetch content from.
