DeepsetFirecrawlWebScraper

Use the Firecrawl service to crawl websites and return the crawled documents.

Basic Information

  • Type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  • Components it can connect with:
    • DeepsetFirecrawlWebScraper can receive URLs from DeepsetCSVRowsToDocumentsConverter through OutputAdapter, which transforms the list of documents the converter produces into a list of strings the scraper can accept.
    • DeepsetFirecrawlWebScraper can send the scraped documents to DocumentWriter to write them into the document store. Both connections are sketched in the YAML fragment below.
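For reference, here's how those connections look in pipeline YAML, using the component names from the usage example further down (a fragment, not a complete pipeline):

connections:
- sender: DeepsetCSVRowsToDocumentsConverter.documents
  receiver: OutputAdapter.documents
- sender: OutputAdapter.output
  receiver: DeepsetFirecrawlWebScraper.urls
- sender: DeepsetFirecrawlWebScraper.documents
  receiver: DocumentWriter.documents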

Inputs

| Name | Type | Description |
| --- | --- | --- |
| urls | List of strings | The URLs to crawl. Required. |
| params | Dictionary of string and any | The parameters for the crawl request. |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| documents | List of Document objects | Documents containing the crawled content. |

Overview

The DeepsetFirecrawlWebScraper crawls websites using the paid Firecrawl service. Firecrawl returns the content in Markdown format, optimized for use with LLMs. For more details, see Firecrawl.

📘

Page Crawl Limits

When using this component, be sure to set a page crawl limit, as Firecrawl will otherwise crawl all subpages, potentially leading to high charges. You can do this using the limit parameter. For details, see the Init Parameters section below.
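For example, here's what the DeepsetFirecrawlWebScraper definition could look like with an explicit page limit. The limit of 10 is only an illustration; the parameter names match the component's default params shown in Init Parameters:

DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    params:
      limit: 10  # maximum number of pages Firecrawl crawls
      scrapeOptions:
        formats:
        - markdown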

Passing the URLs

DeepsetFirecrawlWebScraper needs to receive the URLs to crawl in a CSV file. The file must contain a column called urls, where each row holds one URL to crawl (there's a sample file after the steps below). For DeepsetFirecrawlWebScraper to receive these URLs, structure your indexing pipeline as follows:

  1. Start with DeepsetCSVRowsToDocumentsConverter. It converts each row in a CSV file into a document object.
  2. Connect DeepsetCSVRowsToDocumentsConverter to OutputAdapter, with its template set to turn a list of documents into a list of strings. This is an example template you can use:
    OutputAdapter:
      type: haystack.components.converters.output_adapter.OutputAdapter
      init_parameters:
        template: |-
          {% set ns = namespace(str_list=[]) %}
          {% for document in documents %}
            {% set _ = ns.str_list.append(document.content) %}
          {% endfor %}
          {{ ns.str_list }}
        output_type: typing.List[str]
  3. Send the resulting list of strings to DeepsetFirecrawlWebScraper.
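For reference, a minimal CSV file for this setup looks like the following. The header names the required urls column; the URLs themselves are placeholders:

urls
https://example.com/docs
https://example.com/blog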

Authorization

To use this component, you must have an active Firecrawl API key. Set this key as the FIRECRAWL_API_KEY secret in deepset AI Platform. For more information about secrets, see Add Secrets to Connect to Third Party Providers.

Usage Example

This example shows an index that scrapes content from websites using Firecrawl, converts it into documents, and writes them to an OpenSearch document store for access by the query pipeline.

The index starts with a CSV file containing the URLs to crawl. This file is sent to DeepsetCSVRowsToDocumentsConverter, which converts each row into a document and sends it to OutputAdapter.

Using a Jinja2 template, OutputAdapter transforms these documents into a list of strings that DeepsetFirecrawlWebScraper can process. DeepsetFirecrawlWebScraper then scrapes content from each URL, converts it into documents, and sends them to DocumentWriter, which writes the documents to the OpenSearch store.

components:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(str_list=[]) %}
        {% for document in documents %}
          {% set _ = ns.str_list.append(document.content) %}
        {% endfor %}
        {{ ns.str_list }}
      output_type: typing.List[str]
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: NONE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: urls
      encoding: utf-8
  DeepsetFirecrawlWebScraper:
    type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
    init_parameters:
      params: null

connections:  # Defines how the components are connected
- sender: DeepsetCSVRowsToDocumentsConverter.documents
  receiver: OutputAdapter.documents
- sender: OutputAdapter.output
  receiver: DeepsetFirecrawlWebScraper.urls
- sender: DeepsetFirecrawlWebScraper.documents
  receiver: DocumentWriter.documents

max_loops_allowed: 100
metadata: {}
inputs:
  files: DeepsetCSVRowsToDocumentsConverter.sources

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:


| Parameter | Type | Possible values | Description |
| --- | --- | --- | --- |
| params | Dictionary | Default: {"limit": 1, "scrapeOptions": {"formats": ["markdown"]}} | Parameters for the crawl request. For a list of accepted parameters, see the Body section of the crawl-post endpoint documentation. Be sure to set the limit parameter, which defines the maximum number of pages to crawl; otherwise, Firecrawl crawls all subpages, resulting in high charges. Required. |
| api_key | Secret | Default: Secret.from_env_var("FIRECRAWL_API_KEY") | The API key for Firecrawl. By default, it's read from the FIRECRAWL_API_KEY environment variable. Required. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Run() method parameters take precedence over initialization parameters.


| Parameter | Type | Description |
| --- | --- | --- |
| urls | List of strings | URLs of the websites to crawl. Required. |
| params | Dictionary | Parameters for the crawl request. Required. |
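For example, assuming the standard request structure described in Modify Pipeline Parameters at Query Time (component name as the key, its parameters as the value), a request body that caps the crawl at 5 pages for a single run could look like this. The structure and the limit value are illustrative, not verified against your deployment:

{
  "params": {
    "DeepsetFirecrawlWebScraper": {
      "params": {"limit": 5}
    }
  }
}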