CacheChecker

Check if documents exist in a Document Store based on a specified cache field in their metadata.

Basic Information

  • Type: haystack.components.caching.cache_checker.CacheChecker
  • Components it can connect with:
    • LinkContentFetcher: CacheChecker can receive a list of items to check from a LinkContentFetcher.
    • Any component that accepts a list of documents or a list of items to process.

Inputs

Parameter | Type | Default | Description
items | List[Any] | | Values to be checked against the cache field.

Outputs

Parameter | Type | Default | Description
hits | List[Document] | | Documents that matched at least one of the items.
misses | List | | Items that were not present in any documents.

Overview

CacheChecker determines which items are already cached (stored) in a document store and which still need to be processed. For each item, it searches the document store for documents whose specified metadata field (cache_field) equals that item. If it finds matching documents, it returns them as hits. If it finds none, it returns the item as a miss.

CacheChecker returns two lists as output:

  • hits: Documents that were already in the cache.
  • misses: Items that weren't found and need to be processed.
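
The hit-or-miss logic described above can be sketched in plain Python. This is a simplified model for illustration only, not the component's actual implementation: the Doc class, the check_cache function, and the url metadata field are hypothetical names, and a plain list stands in for the document store.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document with metadata."""
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def check_cache(stored: list[Doc], cache_field: str, items: list[Any]) -> dict[str, list]:
    """Split items into cached documents (hits) and uncached items (misses)."""
    hits: list[Doc] = []
    misses: list[Any] = []
    for item in items:
        # A document matches when its metadata has cache_field set to this item.
        matches = [
            d for d in stored
            if cache_field in d.meta and d.meta[cache_field] == item
        ]
        if matches:
            hits.extend(matches)   # hits are the documents themselves
        else:
            misses.append(item)    # misses are the original items
    return {"hits": hits, "misses": misses}


# Assumed sample data: two pages already stored, keyed by a "url" field.
stored = [
    Doc("page one", {"url": "https://example.com/1"}),
    Doc("page two", {"url": "https://example.com/2"}),
]
result = check_cache(stored, "url", ["https://example.com/1", "https://example.com/3"])
print([d.content for d in result["hits"]])
print(result["misses"])
```

Note the asymmetry in the outputs: hits contains documents, while misses contains the original items, so a downstream component connected to misses receives values such as URLs rather than documents.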

Usage Example

Initializing the Component

components:
  CacheChecker:
    type: haystack.components.caching.cache_checker.CacheChecker
    init_parameters:

Using the Component in an Index

In this example, CacheChecker checks whether documents for the given URLs have already been scraped and stored in a document store. URLs that were already processed skip scraping, while new URLs (misses) are sent to LinkContentFetcher:

components:
  LinkContentFetcher:
    type: haystack.components.fetchers.link_content.LinkContentFetcher
    init_parameters:
      raise_on_failure: true
      user_agents:
      retry_attempts: 2
      timeout: 3
      http2: false
      client_kwargs:
  OutputAdapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ sources[0] }}'
      output_type: List[str]
      custom_filters:
      unsafe: false
  CacheChecker:
    type: haystack.components.caching.cache_checker.CacheChecker
    init_parameters:
      cache_field: code_of_conduct
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: Standard-Index-English
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false

connections:
  - sender: CacheChecker.misses
    receiver: OutputAdapter.sources
  - sender: OutputAdapter.output
    receiver: LinkContentFetcher.urls
  - sender: LinkContentFetcher.streams
    receiver: TextFileToDocument.sources

max_runs_per_component: 100

metadata: {}

inputs:
  files:
    - CacheChecker.items

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

Parameter | Type | Default | Description
document_store | DocumentStore | | Document Store to check for the presence of specific documents.
cache_field | str | | Name of the document's metadata field to check for cache hits.

Run Method Parameters

These are the parameters you can configure for the component's run() method. You can pass them at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

Parameter | Type | Default | Description
items | List[Any] | | Values to be checked against the cache field.