
LinkContentFetcher

Fetches and extracts content from URLs.

Basic Information

  • Type: haystack.components.fetchers.link_content.LinkContentFetcher

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |

Outputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| streams | List[ByteStream] | | ByteStream objects representing the extracted content. |

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

Fetches and extracts content from URLs.

It supports various content types, retries on failures, and automatic user-agent rotation for failed web requests. Use it as the data-fetching step in your pipelines.
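To make the retry and user-agent rotation behavior concrete, here is a minimal standard-library sketch of the general pattern. It is not the component's actual implementation: `fetch_with_rotation`, `send`, `fake_send`, and the user-agent strings are hypothetical names chosen for illustration, and the transport is a fake so the example runs without network access.

```python
from typing import Callable, List, Optional

# Hypothetical user agents; the component's real defaults are not shown here.
EXAMPLE_USER_AGENTS = ["agent-a/1.0", "agent-b/1.0"]


def fetch_with_rotation(
    url: str,
    send: Callable[[str, str], bytes],
    user_agents: List[str] = EXAMPLE_USER_AGENTS,
    retry_attempts: int = 2,
) -> bytes:
    """Call send(url, user_agent); on failure, retry with the next user agent."""
    last_error: Optional[Exception] = None
    # One initial attempt plus `retry_attempts` retries.
    for attempt in range(retry_attempts + 1):
        agent = user_agents[attempt % len(user_agents)]
        try:
            return send(url, agent)
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    raise RuntimeError(f"Could not fetch {url}") from last_error


def fake_send(url: str, user_agent: str) -> bytes:
    """Fake transport that rejects the first user agent (assumption for the demo)."""
    if user_agent == "agent-a/1.0":
        raise ConnectionError("blocked")
    return f"{url} via {user_agent}".encode()


content = fetch_with_rotation("https://example.com", fake_send)
print(content)
```

In this sketch the first attempt fails, so the second attempt switches to the next user agent and succeeds. The real component exposes the equivalent knobs as the `user_agents` and `retry_attempts` init parameters.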

You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.

For async usage:

import asyncio
from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())
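The conversion step mentioned earlier (turning fetched streams into documents) can be pictured with a small standard-library sketch. This only illustrates the idea and is not how HTMLToDocument works internally: `FakeByteStream`, `streams_to_documents`, and the dictionary-shaped "documents" are hypothetical stand-ins, and in a real pipeline you would pass the fetcher's streams to the HTMLToDocument converter instead.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List


@dataclass
class FakeByteStream:
    """Hypothetical stand-in for a fetched stream: raw bytes plus metadata."""
    data: bytes
    meta: Dict[str, str] = field(default_factory=dict)


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def streams_to_documents(streams: List[FakeByteStream]) -> List[Dict[str, object]]:
    """Decode each stream as UTF-8 HTML and keep its text together with its metadata."""
    documents = []
    for stream in streams:
        parser = TextExtractor()
        parser.feed(stream.data.decode("utf-8", errors="replace"))
        parser.close()
        documents.append({"content": " ".join(parser.parts), "meta": stream.meta})
    return documents


html = b"<html><head><style>p{}</style></head><body><h1>Title</h1><p>Hello</p></body></html>"
docs = streams_to_documents([FakeByteStream(html, {"url": "https://example.com"})])
print(docs)
```

The point of the sketch is the data flow: the fetcher produces raw bytes with metadata such as the source URL, and a converter turns each stream into a text document that keeps that metadata.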

Usage Example

components:
  LinkContentFetcher:
    type: haystack.components.fetchers.link_content.LinkContentFetcher
    init_parameters:

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|---|---|---|---|
| raise_on_failure | bool | True | If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. |
| user_agents | Optional[List[str]] | None | User agents for fetching content. If None, a default user agent is used. |
| retry_attempts | int | 2 | The number of times to retry fetching the URL's content. |
| timeout | int | 3 | Timeout in seconds for the request. |
| http2 | bool | False | Whether to enable HTTP/2 support for requests. Requires the 'h2' package to be installed (via pip install httpx[http2]). |
| client_kwargs | Optional[Dict] | None | Additional keyword arguments to pass to the httpx client. If None, default values are used. |
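The `raise_on_failure` description combines two behaviors, so a short sketch may help. The code below is a standard-library illustration of the rule as stated in the table, not the component's source: `fetch_many`, `fetch_one`, and `fake_fetch` are hypothetical names, and the fake fetcher avoids any network access.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def fetch_many(
    urls: List[str],
    fetch_one: Callable[[str], bytes],
    raise_on_failure: bool = True,
) -> Dict[str, bytes]:
    """Return fetched content keyed by URL, applying the single-URL failure rule."""
    # Only a single-URL request propagates the exception to the caller.
    propagate = raise_on_failure and len(urls) == 1
    fetched: Dict[str, bytes] = {}
    for url in urls:
        try:
            fetched[url] = fetch_one(url)
        except Exception as error:
            if propagate:
                raise
            # With several URLs, log the failure and keep the successful results.
            logger.warning("Could not fetch %s: %s", url, error)
    return fetched


def fake_fetch(url: str) -> bytes:
    """Fake fetcher that fails for URLs containing 'broken' (assumption for the demo)."""
    if "broken" in url:
        raise ConnectionError("unreachable")
    return b"ok"


partial = fetch_many(["https://example.com", "https://broken.example"], fake_fetch)
print(partial)
```

With two URLs, the failing one is logged and only the successful content is returned. Calling `fetch_many(["https://broken.example"], fake_fetch)` raises instead, which mirrors the single-URL case described in the table.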

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | A list of URLs to fetch content from. |