
DeepsetFirecrawlWebScraper

Use the Firecrawl service to crawl websites and return the crawled documents.

Key Features

  • Crawls websites using the paid Firecrawl service.
  • Returns crawled content in Markdown format, optimized for use with LLMs.
  • Accepts URLs from a CSV file processed through DeepsetCSVRowsToDocumentsConverter and OutputAdapter.

Page Crawl Limits

When using this component, set a page crawl limit using the limit parameter. Without a limit, Firecrawl crawls all subpages, which can result in high charges.

Configuration

  1. Drag the DeepsetFirecrawlWebScraper component onto the canvas from the Component Library.
  2. Click on the component to open the configuration panel.
  3. On the General tab:
    1. Set your Firecrawl API key. Add FIRECRAWL_API_KEY as a secret in Haystack Enterprise Platform. For more information, see Add Secrets to Connect to Third Party Providers.
    2. Configure params with at least a limit value to cap the number of pages crawled. For a full list of accepted parameters, see the Firecrawl crawl-post endpoint documentation.
  4. Go to the Advanced tab to configure additional crawl parameters.

Passing the URLs

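DeepsetFirecrawlWebScraper takes the URLs to crawl from a CSV file. The file must contain a column called urls, with one URL per row (a header row reading urls, followed by one URL on each line). Because the component itself accepts a list of strings rather than a file, structure your indexing pipeline as follows: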

  1. Start with DeepsetCSVRowsToDocumentsConverter. It converts each row in a CSV file into a document object.
  2. Connect DeepsetCSVRowsToDocumentsConverter to OutputAdapter with its template set to turn a list of documents into a list of strings. This is an example template you can use:
     OutputAdapter:
       type: haystack.components.converters.output_adapter.OutputAdapter
       init_parameters:
         template: |-
           {% set ns = namespace(str_list=[]) %}
           {% for document in documents %}
             {% set _ = ns.str_list.append(document.content) %}
           {% endfor %}
           {{ ns.str_list }}
         output_type: typing.List[str]
  3. Send the resulting list of strings to DeepsetFirecrawlWebScraper.
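
For step 1, the converter needs to know which CSV column holds the URLs. A minimal sketch of that configuration, using the same values as the full pipeline example on this page (the content_column value assumes your CSV column is named urls, as described above):

```yaml
DeepsetCSVRowsToDocumentsConverter:
  type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
  init_parameters:
    content_column: urls  # each row's value in this column becomes the document content
    encoding: utf-8
```

With this setup, each document's content is a single URL, which is what the OutputAdapter template collects into a list of strings.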

Connections

DeepsetFirecrawlWebScraper receives a list of URL strings from OutputAdapter through its urls input. It returns the scraped content as a list of documents through its documents output. Connect this output to DocumentWriter to write the documents into a document store.

Usage Examples

Basic Configuration

DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - FIRECRAWL_API_KEY
      strict: false
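
Because a crawl without a limit can become expensive, you may also want to set the limit in the component's YAML. The following is a minimal sketch that extends the basic configuration. The value 10 is an arbitrary illustration, not a recommended setting, and placing limit directly under params is an assumption based on the params description on this page:

```yaml
DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - FIRECRAWL_API_KEY
      strict: false
    params:
      limit: 10  # illustrative cap on the number of pages crawled; choose your own value
```

For other keys you can place under params, check the Firecrawl crawl-post endpoint documentation rather than relying on this sketch.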

Using the Component in a Pipeline

This example shows an index that scrapes content from websites using Firecrawl, converts it into documents, and writes them to an OpenSearch document store for access by the query pipeline.

The index starts with a CSV file containing the URLs to crawl. This file is sent to DeepsetCSVRowsToDocumentsConverter, which converts each row into a document and sends it to OutputAdapter.

Using a Jinja2 template, OutputAdapter transforms these documents into a list of strings that DeepsetFirecrawlWebScraper can process. DeepsetFirecrawlWebScraper then scrapes content from each URL, converts it into documents, and sends them to DocumentWriter, which writes the documents to the OpenSearch store.

components:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(str_list=[]) %}
        {% for document in documents %}
          {% set _ = ns.str_list.append(document.content) %}
        {% endfor %}
        {{ ns.str_list }}
      output_type: typing.List[str]
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: NONE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: urls
      encoding: utf-8
  DeepsetFirecrawlWebScraper:
    type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
    init_parameters:
      params: null

connections: # Defines how the components are connected
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: OutputAdapter.documents
  - sender: OutputAdapter.output
    receiver: DeepsetFirecrawlWebScraper.urls
  - sender: DeepsetFirecrawlWebScraper.documents
    receiver: DocumentWriter.documents

max_loops_allowed: 100
metadata: {}
inputs:
  files: DeepsetCSVRowsToDocumentsConverter.sources

Parameters

Inputs

Parameter  Type         Default  Description
urls       List[str]             URLs of the websites to crawl.
params     Dict | None  None     Parameters for the crawl request.

Outputs

Parameter  Type            Default  Description
documents  List[Document]           List of Documents containing the crawled content.

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter  Type         Default                                   Description
params     Dict | None                                            For a list of accepted parameters, see the Body section of the crawl-post endpoint documentation. It's important to set the limit parameter defining the maximum number of pages to crawl. Otherwise, Firecrawl crawls all subpages, which results in high charges.
api_key    Secret       Secret.from_env_var('FIRECRAWL_API_KEY')  API key for Firecrawl. If not provided, it's read from the FIRECRAWL_API_KEY environment variable.

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter  Type         Default  Description
urls       List[str]             URLs of the websites to crawl.
params     Dict | None  None     Parameters for the crawl request.