DeepsetFirecrawlWebScraper
Crawls websites using the Firecrawl service and returns the crawled content as documents in Markdown format.
When using this component, be sure to set a page crawl limit, as Firecrawl will otherwise crawl all subpages, potentially leading to high charges. You can do this using the limit parameter. For details, see the Init Parameters section below.
Key Features
- Crawls websites using the paid Firecrawl service.
- Returns crawled content as Document objects in Markdown format, optimized for use with LLMs.
- Accepts a list of URLs to crawl.
- Configurable crawl parameters including page limits and crawl options.
- Reads the Firecrawl API key from the FIRECRAWL_API_KEY secret automatically.
- Designed to work with DeepsetCSVRowsToDocumentsConverter and OutputAdapter to process URLs from CSV files.
Configuration
To use this component, set your Firecrawl API key as the FIRECRAWL_API_KEY secret in Haystack Enterprise Platform. For more information about secrets, see Add Secrets to Connect to Third Party Providers.
- Drag the DeepsetFirecrawlWebScraper component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- Configure the parameters as needed.
Connections
DeepsetFirecrawlWebScraper accepts a list of URL strings (urls) and optional crawl parameters (params) as inputs. It outputs a list of Document objects (documents) containing the scraped content.
In a pipeline, provide the URLs to this component in a CSV file. The CSV must have a urls column where each row contains one URL. Structure your pipeline as follows:
- Use DeepsetCSVRowsToDocumentsConverter to convert each CSV row into a document.
- Connect it to OutputAdapter with a Jinja2 template that transforms the document list into a list of strings.
- Connect the OutputAdapter output to the urls input of DeepsetFirecrawlWebScraper.
- Connect the documents output to DocumentWriter to store the scraped content.
Usage Example
Using the Component in a Pipeline
This example shows an index that scrapes content from websites using Firecrawl, converts it into documents, and writes them to an OpenSearch document store for access by the query pipeline.
The index starts with a CSV file containing the URLs to crawl. This file is sent to DeepsetCSVRowsToDocumentsConverter, which converts each row into a document and sends it to OutputAdapter.
Using a Jinja2 template, OutputAdapter transforms these documents into a list of strings that DeepsetFirecrawlWebScraper can process. DeepsetFirecrawlWebScraper then scrapes content from each URL, converts it into documents, and sends them to DocumentWriter, which writes the documents to the OpenSearch store.
components:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(str_list=[]) %}
        {% for document in documents %}
        {% set _ = ns.str_list.append(document.content) %}
        {% endfor %}
        {{ ns.str_list }}
      output_type: typing.List[str]
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: NONE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: urls
      encoding: utf-8
  DeepsetFirecrawlWebScraper:
    type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
    init_parameters:
      params: null
connections: # Defines how the components are connected
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: OutputAdapter.documents
  - sender: OutputAdapter.output
    receiver: DeepsetFirecrawlWebScraper.urls
  - sender: DeepsetFirecrawlWebScraper.documents
    receiver: DocumentWriter.documents
max_loops_allowed: 100
metadata: {}
inputs:
  files: DeepsetCSVRowsToDocumentsConverter.sources
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | URLs of the websites to crawl. |
| params | Dict \| None | None | Parameters for the crawl request. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing the crawled content. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| params | Dict | None | For a list of accepted parameters, see the Body section of the crawl-post endpoint documentation. It's important to set the limit parameter defining the maximum number of pages to crawl. Otherwise, Firecrawl crawls all subpages, which results in high charges. |
| api_key | Secret | Secret.from_env_var('FIRECRAWL_API_KEY') | API key for Firecrawl. If not provided, it's read from the FIRECRAWL_API_KEY environment variable. |
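As a minimal sketch of the page limit described in the table above, the component's entry in the pipeline YAML could look like the fragment below. The value 10 is an arbitrary illustration, not a recommended setting, and limit is the only crawl option shown because it is the one this page names. Check the Firecrawl crawl endpoint documentation before adding other options.

```yaml
DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    params:
      limit: 10  # illustrative value: crawl at most 10 pages
```

With api_key left out, the component falls back to its default and reads the key from the FIRECRAWL_API_KEY secret.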
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | URLs of the websites to crawl. |
| params | Dict | None | Parameters for the crawl request. |