DeepsetFirecrawlWebScraper
Use the Firecrawl service to crawl websites and return the crawled documents.
Basic Information
- Pipeline type: Indexing
- Type: `deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper`
- Components it can connect with:
- DeepsetFirecrawlWebScraper can receive URLs from DeepsetCSVRowsToDocumentsConverter through OutputAdapter, which transforms the list of documents the converter produces into a list of strings the scraper can accept.
- DeepsetFirecrawlWebScraper can send the scraped documents to the DocumentWriter to write them into the document store.
Inputs
Name | Type | Description |
---|---|---|
urls | List of strings | The URLs to crawl. Required. |
params | Dictionary of string and any | The parameters for the crawl request. |
Outputs
Name | Type | Description |
---|---|---|
documents | List of Document objects | Documents containing the crawled content. |
Overview
The DeepsetFirecrawlWebScraper crawls websites using the paid Firecrawl service. Firecrawl returns the content in Markdown format, optimized for use with LLMs. For more details, see Firecrawl.
Page Crawl Limits
When using this component, be sure to set a page crawl limit, as Firecrawl will otherwise crawl all subpages, potentially leading to high charges. You can do this using the `limit` parameter. For details, see the Init Parameters section below.
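For example, to cap a crawl at 10 pages while keeping the default Markdown output, you could pass `params` like this (the limit of 10 is illustrative; choose a value that fits your budget):
```yaml
DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    params:
      limit: 10               # maximum number of pages Firecrawl crawls
      scrapeOptions:
        formats: ["markdown"] # the default output format
```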
Passing the URLs
DeepsetFirecrawlWebScraper needs to receive the URLs to crawl from a CSV file. The file must contain a column called `urls`, where each row contains a URL to crawl (a minimal example file is shown after the list below). For DeepsetFirecrawlWebScraper to be able to take in the URLs, structure your indexing pipeline as follows:
- Start with DeepsetCSVRowsToDocumentsConverter. It converts each row in a CSV file into a document object.
- Connect DeepsetCSVRowsToDocumentsConverter to OutputAdapter, with its template set to turn a list of documents into a list of strings. This is an example template you can use:
```yaml
OutputAdapter:
  type: haystack.components.converters.output_adapter.OutputAdapter
  init_parameters:
    template: |-
      {% set ns = namespace(str_list=[]) %}
      {% for document in documents %}
      {% set _ = ns.str_list.append(document.content) %}
      {% endfor %}
      {{ ns.str_list }}
```
- Send the resulting list of strings to DeepsetFirecrawlWebScraper.
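For illustration, a minimal CSV file for this setup could look like the following, with a single `urls` column and one URL per row (the URLs are placeholders):
```csv
urls
https://example.com/
https://example.com/docs/
```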
Authorization
To use this component, you must have an active Firecrawl API key. Set this key as the `FIRECRAWL_API_KEY` secret in deepset Cloud. For more information about secrets, see Add Secrets to Connect to Third Party Providers.
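The component reads the key from the `FIRECRAWL_API_KEY` environment variable by default, so you typically don't need to reference it in your pipeline YAML. If you want to make the lookup explicit, a sketch of the configuration, assuming the standard Haystack secret serialization format, might look like this:
```yaml
DeepsetFirecrawlWebScraper:
  type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
  init_parameters:
    api_key:            # serialized Secret read from an environment variable
      type: env_var
      env_vars:
        - FIRECRAWL_API_KEY
      strict: true      # fail if the variable is not set
```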
Usage Example
This example shows an indexing pipeline that scrapes content from websites using Firecrawl, converts it into documents, and writes them to an OpenSearch document store for access by the query pipeline.
The pipeline starts with a CSV file containing the URLs to crawl. This file is sent to DeepsetCSVRowsToDocumentsConverter, which converts each row into a document and sends it to OutputAdapter.
Using a Jinja2 template, OutputAdapter transforms these documents into a list of strings that DeepsetFirecrawlWebScraper can process. DeepsetFirecrawlWebScraper then scrapes content from each URL, converts it into documents, and sends them to DocumentWriter, which writes the documents to the OpenSearch store.
```yaml
components:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(str_list=[]) %}
        {% for document in documents %}
        {% set _ = ns.str_list.append(document.content) %}
        {% endfor %}
        {{ ns.str_list }}
      output_type: typing.List[str]
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: NONE
  DeepsetCSVRowsToDocumentsConverter:
    type: deepset_cloud_custom_nodes.converters.csv_rows_to_documents.DeepsetCSVRowsToDocumentsConverter
    init_parameters:
      content_column: urls
      encoding: utf-8
  DeepsetFirecrawlWebScraper:
    type: deepset_cloud_custom_nodes.crawler.firecrawl.DeepsetFirecrawlWebScraper
    init_parameters:
      params: null

connections: # Defines how the components are connected
  - sender: DeepsetCSVRowsToDocumentsConverter.documents
    receiver: OutputAdapter.documents
  - sender: OutputAdapter.output
    receiver: DeepsetFirecrawlWebScraper.urls
  - sender: DeepsetFirecrawlWebScraper.documents
    receiver: DocumentWriter.documents

max_loops_allowed: 100
metadata: {}

inputs:
  files: DeepsetCSVRowsToDocumentsConverter.sources
```
Init Parameters
Parameter | Type | Possible values | Description |
---|---|---|---|
params | Dictionary | Default: `{"limit": 1, "scrapeOptions": {"formats": ["markdown"]}}` | Parameters for the crawl request. For a list of accepted parameters, see the Body section of the crawl-post endpoint documentation. Be sure to set the `limit` parameter, which defines the maximum number of pages to crawl. Otherwise, Firecrawl crawls all subpages, which results in high charges. Required. |
api_key | Secret | Default: `Secret.from_env_var("FIRECRAWL_API_KEY")` | The API key for Firecrawl. By default, it's read from the `FIRECRAWL_API_KEY` environment variable. Required. |