DeepsetFirecrawlWebScraper

Use the Firecrawl service to crawl websites and return the crawled documents.

Basic Information

  • Pipeline type: Indexing
  • Type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  • Components it can connect with:
    • DeepsetFirecrawlWebScraper can receive URLs from DeepsetCSVRowsToDocumentsConverter through OutputAdapter, which transforms the list of documents the converter produces into a list of strings the scraper can accept.
    • DeepsetFirecrawlWebScraper can send the scraped documents to the DocumentWriter to write them into the document store.

Inputs

Name   | Type                         | Description
urls   | List of strings              | The URLs to crawl. Required.
params | Dictionary of string and any | The parameters for the crawl request.

Outputs

Name      | Type                     | Description
documents | List of Document objects | Documents containing the crawled content.

Overview

The DeepsetFirecrawlWebScraper crawls websites using the paid Firecrawl service. Firecrawl returns the content in Markdown format, optimized for use with LLMs. For more details, see Firecrawl.

📘

Page Crawl Limits

When using this component, be sure to set a page crawl limit, as Firecrawl will otherwise crawl all subpages, potentially leading to high charges. You can do this using the limit parameter. For details, see the Init Parameters section below.
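
As a minimal sketch, this is how you could cap the crawl in the pipeline YAML; the limit value of 10 is illustrative:

    DeepsetFirecrawlWebScraper:
      type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
      init_parameters:
        params:
          limit: 10  # maximum number of pages Firecrawl crawls; illustrative value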

Passing the URLs

DeepsetFirecrawlWebScraper needs to receive the URLs to crawl from a CSV file. The file must contain a column called urls where each row contains a URL to crawl (see the example file after the steps below). For DeepsetFirecrawlWebScraper to receive the URLs, structure your indexing pipeline as follows:

  1. Start with DeepsetCSVRowsToDocumentsConverter. It converts each row in a CSV file into a document object.
  2. Connect DeepsetCSVRowsToDocumentsConverter to OutputAdapter with its template set to turn a list of documents into a list of strings. This is an example template you can use:
    OutputAdapter:
      type: haystack.components.converters.output_adapter.OutputAdapter
      init_parameters:
        template: |-
          {% set ns = namespace(str_list=[]) %}
          {% for document in documents %}
            {% set _ = ns.str_list.append(document.content) %}
          {% endfor %}
          {{ ns.str_list }}
        output_type: typing.List[str]

  3. Send the resulting list of strings to DeepsetFirecrawlWebScraper.
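
For reference, an input CSV file in the expected shape might look like this (the URLs are illustrative):

    urls
    https://www.example.com/docs
    https://www.example.com/blog

DeepsetCSVRowsToDocumentsConverter turns each row into a document whose content is the URL, and the OutputAdapter template above collects these contents into a plain list of strings, such as ['https://www.example.com/docs', 'https://www.example.com/blog'], which matches the urls input of DeepsetFirecrawlWebScraper.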

Authorization

To use this component, you must have an active Firecrawl API key. Set this key as the FIRECRAWL_API_KEY secret in deepset Cloud. For more information about secrets, see Add Secrets to Connect to Third Party Providers.
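
You don't need to declare the key in the pipeline YAML, since the component reads the FIRECRAWL_API_KEY environment variable by default. If you want to point it at a different variable, a sketch using Haystack's serialized env_var Secret format could look like this (MY_FIRECRAWL_KEY is a hypothetical variable name):

    DeepsetFirecrawlWebScraper:
      type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
      init_parameters:
        api_key:
          type: env_var            # Haystack's environment-variable Secret type
          env_vars:
            - MY_FIRECRAWL_KEY     # hypothetical name; FIRECRAWL_API_KEY is the default
          strict: true             # fail if the variable is not set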

Usage Example

This example shows an indexing pipeline that scrapes content from websites using Firecrawl, converts it into documents, and writes them to an OpenSearch document store for access by the query pipeline.

The pipeline starts with a CSV file containing the URLs to crawl. This file is sent to DeepsetCSVRowsToDocumentsConverter, which converts each row into a document and sends it to OutputAdapter.

Using a Jinja2 template, OutputAdapter transforms these documents into a list of strings that DeepsetFirecrawlWebScraper can process. DeepsetFirecrawlWebScraper then scrapes content from each URL, converts it into documents, and sends them to DocumentWriter, which writes the documents to the OpenSearch store.

components:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(str_list=[]) %}
        {% for document in documents %}
          {% set _ = ns.str_list.append(document.content) %}
        {% endfor %}
        {{ ns.str_list }}
      output_type: typing.List[str]

  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: NONE

  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: urls
      encoding: utf-8

  DeepsetFirecrawlWebScraper:
    type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
    init_parameters:
      params: null

connections:  # Defines how the components are connected
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: OutputAdapter.documents
  - sender: OutputAdapter.output
    receiver: DeepsetFirecrawlWebScraper.urls
  - sender: DeepsetFirecrawlWebScraper.documents
    receiver: DocumentWriter.documents

max_loops_allowed: 100
metadata: {}
inputs:
  files: DeepsetCSVRowsToDocumentsConverter.sources

Init Parameters

Parameter | Type       | Possible values                                                    | Description
params    | Dictionary | Default: {"limit": 1, "scrapeOptions": {"formats": ["markdown"]}} | Parameters for the crawl request. For a list of accepted parameters, see the Body section of the crawl-post endpoint documentation. Be sure to set the limit parameter, which defines the maximum number of pages to crawl; otherwise, Firecrawl crawls all subpages, which results in high charges. Required.
api_key   | Secret     | Default: Secret.from_env_var("FIRECRAWL_API_KEY")                 | The API key for Firecrawl. By default, it's read from the FIRECRAWL_API_KEY environment variable. Required.
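
As a sketch, a richer params dictionary could look like this in the pipeline YAML. Only limit and scrapeOptions appear in this component's defaults; maxDepth is a Firecrawl crawl option at the time of writing, so verify it against the linked Body documentation:

    DeepsetFirecrawlWebScraper:
      type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
      init_parameters:
        params:
          limit: 25                # hard cap on the number of crawled pages
          maxDepth: 2              # how many links deep to follow; verify against Firecrawl's docs
          scrapeOptions:
            formats: ["markdown"]  # Markdown output, same as the default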