DeepsetFirecrawlWebScraper
Use the Firecrawl service to crawl websites and return the crawled documents.
Key Features
- Crawls websites using the paid Firecrawl service.
- Returns crawled content in Markdown format, optimized for use with LLMs.
- Accepts URLs from a CSV file processed through DeepsetCSVRowsToDocumentsConverter and OutputAdapter.
When using this component, set a page crawl limit using the limit parameter. Without a limit, Firecrawl crawls all subpages, which can result in high charges.
Configuration
- Drag the DeepsetFirecrawlWebScraper component onto the canvas from the Component Library.
- Click the component to open the configuration panel.
- On the General tab:
  - Set your Firecrawl API key. Add FIRECRAWL_API_KEY as a secret in Haystack Enterprise Platform. For more information, see Add Secrets to Connect to Third Party Providers.
  - Configure params with at least a limit value to cap the number of pages crawled. For a full list of accepted parameters, see the Firecrawl crawl-post endpoint documentation.
- Go to the Advanced tab to configure additional crawl parameters.
Passing the URLs
DeepsetFirecrawlWebScraper needs to receive URLs to crawl in a CSV file. The file must contain a column called urls where each row contains a URL to crawl. Structure your indexing pipeline as follows:
- Start with DeepsetCSVRowsToDocumentsConverter. It converts each row in a CSV file into a document object.
- Connect DeepsetCSVRowsToDocumentsConverter to OutputAdapter with its template set to turn a list of documents into a list of strings. This is an example template you can use:
OutputAdapter:
  type: haystack.components.converters.output_adapter.OutputAdapter
  init_parameters:
    template: |-
      {% set ns = namespace(str_list=[]) %}
      {% for document in documents %}
        {% set _ = ns.str_list.append(document.content) %}
      {% endfor %}
      {{ ns.str_list }}
- Send the resulting list of strings to DeepsetFirecrawlWebScraper.
Connections
DeepsetFirecrawlWebScraper receives a list of URL strings from OutputAdapter. It outputs a list of documents containing the scraped content through its documents output, which you connect to DocumentWriter to write them into a document store.
Usage Examples
Basic Configuration
DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - FIRECRAWL_API_KEY
      strict: false
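The basic configuration above does not set params, so it defines no page limit. The following is a minimal sketch of the same component with a crawl cap, assuming params accepts a dictionary with a limit key as described in the configuration steps. The value 10 is an illustrative assumption, not a recommended default.

```yaml
DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    api_key:
      type: env_var
      env_vars:
        - FIRECRAWL_API_KEY
      strict: false
    # Assumption: 10 is an illustrative cap on crawled pages; choose a value that fits your budget.
    params:
      limit: 10
```

Other keys accepted in params depend on the Firecrawl crawl-post endpoint, so check its documentation before adding more.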
Using the Component in a Pipeline
This example shows an index that scrapes content from websites using Firecrawl, converts it into documents, and writes them to an OpenSearch document store for access by the query pipeline.
The index starts with a CSV file containing the URLs to crawl. This file is sent to DeepsetCSVRowsToDocumentsConverter, which converts each row into a document and sends it to OutputAdapter.
Using a Jinja2 template, OutputAdapter transforms these documents into a list of strings that DeepsetFirecrawlWebScraper can process. DeepsetFirecrawlWebScraper then scrapes content from each URL, converts it into documents, and sends them to DocumentWriter, which writes the documents to the OpenSearch store.
components:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(str_list=[]) %}
        {% for document in documents %}
          {% set _ = ns.str_list.append(document.content) %}
        {% endfor %}
        {{ ns.str_list }}
      output_type: typing.List[str]
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: NONE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: urls
      encoding: utf-8
  DeepsetFirecrawlWebScraper:
    type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
    init_parameters:
      params: null
connections: # Defines how the components are connected
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: OutputAdapter.documents
  - sender: OutputAdapter.output
    receiver: DeepsetFirecrawlWebScraper.urls
  - sender: DeepsetFirecrawlWebScraper.documents
    receiver: DocumentWriter.documents
max_loops_allowed: 100
metadata: {}
inputs:
  files: DeepsetCSVRowsToDocumentsConverter.sources
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | URLs of the websites to crawl. |
| params | Dict \| None | None | Parameters for the crawl request. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents containing the crawled content. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| params | Dict \| None | None | Parameters for the crawl request. For a list of accepted parameters, see the Body section of the crawl-post endpoint documentation. Set the limit parameter to define the maximum number of pages to crawl. Without it, Firecrawl crawls all subpages, which can result in high charges. |
| api_key | Secret | Secret.from_env_var('FIRECRAWL_API_KEY') | API key for Firecrawl. If not provided, it's read from the FIRECRAWL_API_KEY environment variable. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | List[str] | | URLs of the websites to crawl. |
| params | Dict \| None | None | Parameters for the crawl request. |