LinkContentFetcher
Fetches and extracts content from URLs.
Basic Information
- Type: haystack.components.fetchers.link_content.LinkContentFetcher
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| streams | List[ByteStream] | | ByteStream objects representing the extracted content. |
Overview
Fetches and extracts content from URLs.
It supports various content types, retries on failures, and automatic user-agent rotation for failed web requests. Use it as the data-fetching step in your pipelines.
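The retry and user-agent rotation behavior described above can be pictured with a minimal, standard-library sketch. This is an illustration under stated assumptions, not the component's actual implementation: `fetch_with_rotation`, `send`, and `fake_send` are hypothetical names, and the rule of switching to the next user agent after every failed attempt is assumed for the example.

```python
from typing import Callable, List

def fetch_with_rotation(
    url: str,
    send: Callable[[str, str], bytes],  # hypothetical transport: (url, user_agent) -> body
    user_agents: List[str],
    retry_attempts: int = 2,
) -> bytes:
    """Try the request, moving to the next user agent after each failure."""
    last_error = None
    # One initial attempt plus `retry_attempts` retries.
    for attempt in range(retry_attempts + 1):
        agent = user_agents[attempt % len(user_agents)]
        try:
            return send(url, agent)
        except Exception as error:
            last_error = error
    # Every attempt failed, so surface the most recent error.
    raise last_error

# Fake transport for the demo: rejects the first user agent, accepts the second.
attempted: List[str] = []

def fake_send(url: str, agent: str) -> bytes:
    attempted.append(agent)
    if agent == "agent-a":
        raise ConnectionError("request blocked")
    return f"content of {url}".encode()

body = fetch_with_rotation("https://example.com", fake_send, ["agent-a", "agent-b"])
```

In the component itself, you control this behavior through the `user_agents` and `retry_attempts` init parameters rather than writing the loop yourself.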
You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.

For async usage:
```python
import asyncio

from haystack.components.fetchers import LinkContentFetcher


async def fetch_async():
    # Create the fetcher with default settings and fetch one URL asynchronously.
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]


streams = asyncio.run(fetch_async())
```
Usage Example
```yaml
components:
  LinkContentFetcher:
    type: haystack.components.fetchers.link_content.LinkContentFetcher
    init_parameters:
```
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| raise_on_failure | bool | True | If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry fetching the URL's content. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the 'h2' package, which you can install with pip install httpx[http2]. |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
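The `raise_on_failure` description in the table distinguishes between a single URL and multiple URLs. The following standard-library sketch illustrates that distinction. It is not the component's code: `fetch_all`, `fetch_one`, and `fake_fetch` are hypothetical names, and returning an empty list when a single URL fails with `raise_on_failure=False` is an assumption made only for this example.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)

def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], bytes],  # hypothetical single-URL fetch function
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Mimic the documented behavior: raise for one URL, log and skip for many."""
    if len(urls) == 1:
        try:
            return [fetch_one(urls[0])]
        except Exception as error:
            if raise_on_failure:
                raise
            # Assumption for this sketch: a failed single fetch yields no content.
            logger.warning("Could not fetch %s: %s", urls[0], error)
            return []
    results = []
    for url in urls:
        try:
            results.append(fetch_one(url))
        except Exception as error:
            # With several URLs, log the error and keep what was fetched.
            logger.warning("Skipping %s: %s", url, error)
    return results

# Fake fetch function for the demo: only two of the three URLs are reachable.
pages = {"https://a.example": b"A", "https://c.example": b"C"}

def fake_fetch(url: str) -> bytes:
    if url not in pages:
        raise ConnectionError(f"cannot reach {url}")
    return pages[url]

fetched = fetch_all(
    ["https://a.example", "https://b.example", "https://c.example"], fake_fetch
)

try:
    fetch_all(["https://b.example"], fake_fetch)
    single_failure_raised = False
except ConnectionError:
    single_failure_raised = True
```

Here the three-URL call returns the two pages that could be fetched, while the single failing URL raises because `raise_on_failure` is left at its default of True.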
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |