DeepsetFirecrawlWebScraper
Use the Firecrawl service to crawl websites and return the crawled documents.
Basic Information
- Pipeline type: Indexing
- Type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
- Components it can connect with:
- DeepsetFirecrawlWebScraper can receive URLs from DeepsetCSVRowsToDocumentsConverter through OutputAdapter, which transforms the list of documents the converter produces into a list of strings the scraper can accept.
- DeepsetFirecrawlWebScraper can send the scraped documents to the DocumentWriter to write them into the document store.
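In pipeline YAML, these connections look like this (an excerpt from the full example in the Usage Example section below):

```yaml
connections:
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: OutputAdapter.documents
  - sender: OutputAdapter.output
    receiver: DeepsetFirecrawlWebScraper.urls
  - sender: DeepsetFirecrawlWebScraper.documents
    receiver: DocumentWriter.documents
```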
Inputs
| Name | Type | Description |
| --- | --- | --- |
| urls | List of strings | The URLs to crawl. Required. |
| params | Dictionary of string and any | The parameters for the crawl request. |
Outputs
| Name | Type | Description |
| --- | --- | --- |
| documents | List of Document objects | Documents containing the crawled content. |
Overview
The DeepsetFirecrawlWebScraper crawls websites using the paid Firecrawl service. Firecrawl returns the content in Markdown format, optimized for use with LLMs. For more details, see Firecrawl.
Page Crawl Limits
When using this component, be sure to set a page crawl limit; otherwise, Firecrawl crawls all subpages, which can lead to high charges. You can set the limit using the limit parameter. For details, see the Init Parameters section below.
Passing the URLs
DeepsetFirecrawlWebScraper needs to receive the URLs to crawl in a CSV file. The file must contain a column called urls, where each row contains a URL to crawl.
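For example, a minimal CSV file might look like this (the URLs are placeholders):

```csv
urls
https://example.com/docs
https://example.com/pricing
```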
For DeepsetFirecrawlWebScraper to receive the URLs in a format it can accept, structure your indexing pipeline as follows:
- Start with DeepsetCSVRowsToDocumentsConverter. It converts each row in a CSV file into a document object.
- Connect DeepsetCSVRowsToDocumentsConverter to OutputAdapter, with its template set to turn a list of documents into a list of strings. This is an example template you can use:

```yaml
OutputAdapter:
  type: haystack.components.converters.output_adapter.OutputAdapter
  init_parameters:
    template: |-
      {% set ns = namespace(str_list=[]) %}
      {% for document in documents %}
      {% set _ = ns.str_list.append(document.content) %}
      {% endfor %}
      {{ ns.str_list }}
```
- Send the resulting list of strings to DeepsetFirecrawlWebScraper.
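For illustration, if the CSV contains the two placeholder URLs shown above, DeepsetCSVRowsToDocumentsConverter produces two documents whose content fields hold those URLs, and the template renders them as the list ['https://example.com/docs', 'https://example.com/pricing'], which OutputAdapter passes to the scraper's urls input.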
Authorization
To use this component, you must have an active Firecrawl API key. Set this key as the FIRECRAWL_API_KEY secret in deepset Cloud. For more information about secrets, see Add Secrets to Connect to Third Party Providers.
Usage Example
This example shows an indexing pipeline that scrapes content from websites using Firecrawl, converts it into documents, and writes them to an OpenSearch document store for access by the query pipeline.
The pipeline starts with a CSV file containing the URLs to crawl. This file is sent to DeepsetCSVRowsToDocumentsConverter, which converts each row into a document and sends it to OutputAdapter.
Using a Jinja2 template, OutputAdapter transforms these documents into a list of strings that DeepsetFirecrawlWebScraper can process. DeepsetFirecrawlWebScraper then scrapes content from each URL, converts it into documents, and sends them to DocumentWriter, which writes the documents to the OpenSearch store.
```yaml
components:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(str_list=[]) %}
        {% for document in documents %}
        {% set _ = ns.str_list.append(document.content) %}
        {% endfor %}
        {{ ns.str_list }}
      output_type: typing.List[str]
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: NONE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: urls
      encoding: utf-8
  DeepsetFirecrawlWebScraper:
    type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
    init_parameters:
      params: null
connections: # Defines how the components are connected
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: OutputAdapter.documents
  - sender: OutputAdapter.output
    receiver: DeepsetFirecrawlWebScraper.urls
  - sender: DeepsetFirecrawlWebScraper.documents
    receiver: DocumentWriter.documents
max_loops_allowed: 100
metadata: {}
inputs:
  files: DeepsetCSVRowsToDocumentsConverter.sources
```
Init Parameters
| Parameter | Type | Possible values | Description |
| --- | --- | --- | --- |
| params | Dictionary | Default: {"limit": 1, "scrapeOptions": {"formats": ["markdown"]}} | Parameters for the crawl request. For a list of accepted parameters, see the Body section of the crawl-post endpoint documentation. It's important to set the limit parameter, which defines the maximum number of pages to crawl. Otherwise, Firecrawl crawls all subpages, which can result in high charges. Required. |
| api_key | Secret | Default: Secret.from_env_var("FIRECRAWL_API_KEY") | The API key for Firecrawl. By default, it's read from the FIRECRAWL_API_KEY environment variable. Required. |
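For example, to allow Firecrawl to crawl up to five pages, you could override params in the pipeline YAML like this (the limit of 5 is illustrative; choose a value that fits your budget):

```yaml
DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    params:
      limit: 5                 # Maximum number of pages Firecrawl crawls
      scrapeOptions:
        formats: ["markdown"]  # Return crawled content as Markdown
```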