CacheChecker
Check if documents exist in a Document Store based on a specified cache field in their metadata.
Basic Information
- Type: haystack.components.caching.cache_checker.CacheChecker
- Components it can connect with:
  - LinkContentFetcher: CacheChecker can receive a list of items to check from a LinkContentFetcher.
  - Any component that accepts a list of documents or a list of items to process.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| items | List[Any] | | Values to be checked against the cache field. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| hits | List[Document] | | Documents that matched at least one of the items. |
| misses | List | | Items that were not present in any documents. |
Overview
CacheChecker determines which items are already cached (stored) in a document store and which still need to be processed. For each item, it searches the document store for documents whose specified metadata field (cache_field) equals that item. If it finds matching documents, it returns them as "hits". If it finds none, it returns the item as a "miss".
CacheChecker returns two lists as output:
- hits: Documents that were already in the cache.
- misses: Items that weren't found and need to be processed.
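The hit-and-miss split described above can be sketched in plain Python. This is only an illustration of the logic, not the component's actual implementation: the `Doc` class, the `check_cache` function, and the `url` metadata field are hypothetical stand-ins, and a plain list takes the place of a real document store.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document with metadata."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def check_cache(stored: List[Doc], cache_field: str, items: List[Any]) -> Dict[str, list]:
    """Split items into hits (matching documents) and misses (unmatched items)."""
    hits: List[Doc] = []
    misses: List[Any] = []
    for item in items:
        # Collect every stored document whose cache field equals this item.
        found = [
            doc for doc in stored
            if cache_field in doc.meta and doc.meta[cache_field] == item
        ]
        if found:
            hits.extend(found)
        else:
            misses.append(item)
    return {"hits": hits, "misses": misses}


# Assumed sample data: two pages already stored, keyed by a "url" metadata field.
stored = [
    Doc("page A", {"url": "https://example.com/a"}),
    Doc("page B", {"url": "https://example.com/b"}),
]
result = check_cache(stored, "url", ["https://example.com/a", "https://example.com/c"])
```

In this sketch, `hits` holds documents while `misses` holds the original items, which mirrors the two output types listed in the Outputs table.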
Usage Example
Initializing the Component
components:
  CacheChecker:
    type: haystack.components.caching.cache_checker.CacheChecker
    init_parameters:
Using the Component in an Index
In this example, CacheChecker checks whether documents for the given URLs have already been scraped and stored in a document store. URLs that were already processed skip scraping, while new URLs (misses) are sent to LinkContentFetcher:
components:
  LinkContentFetcher:
    type: haystack.components.fetchers.link_content.LinkContentFetcher
    init_parameters:
      raise_on_failure: true
      user_agents:
      retry_attempts: 2
      timeout: 3
      http2: false
      client_kwargs:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ sources[0] }}'
      output_type: List[str]
      custom_filters:
      unsafe: false
  CacheChecker:
    type: haystack.components.caching.cache_checker.CacheChecker
    init_parameters:
      cache_field: code_of_conduct
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: Standard-Index-English
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
connections:
  - sender: CacheChecker.misses
    receiver: OutputAdapter.sources
  - sender: OutputAdapter.output
    receiver: LinkContentFetcher.urls
  - sender: LinkContentFetcher.streams
    receiver: TextFileToDocument.sources
max_runs_per_component: 100
metadata: {}
inputs:
  files:
    - CacheChecker.items
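To make the data flow of this configuration easier to follow, here is a rough plain-Python sketch of the same idea: only the URLs that miss the cache reach the fetch step. It does not use the pipeline components themselves. The `fetch_missing` function, the `fake_fetch` stub, and the dictionary used as a cache are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List


def fetch_missing(
    urls: List[str],
    cached_pages: Dict[str, str],
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Fetch only the URLs that are not cached and return the new pages."""
    # Cache check: keep the URLs with no stored page (the misses).
    misses = [url for url in urls if url not in cached_pages]
    # Fetch step: runs for misses only, so cached URLs skip scraping.
    return {url: fetch(url) for url in misses}


fetched_log: List[str] = []


def fake_fetch(url: str) -> str:
    """Stub fetcher that records which URLs were requested (no network access)."""
    fetched_log.append(url)
    return f"content of {url}"


# Assumed sample data: one URL is already cached, one is new.
cache = {"https://example.com/a": "content of https://example.com/a"}
new_pages = fetch_missing(
    ["https://example.com/a", "https://example.com/b"], cache, fake_fetch
)
```

Here the stub is called once, for the uncached URL only, which is the behavior the `CacheChecker.misses` connection is meant to provide.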
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | DocumentStore | | Document Store to check for the presence of specific documents. |
| cache_field | str | | Name of the document's metadata field to check for cache hits. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| items | List[Any] | | Values to be checked against the cache field. |