DeepsetOpenSearchFilterParser

Parse and combine filters for use with OpenSearch. This component takes filters from the Generator's output and combines them with the filters provided at runtime.

Basic Information

  • Pipeline type: Query
  • Type: deepset_cloud_custom_nodes.augmenters.deepset_filter_parser.DeepsetOpenSearchFilterParser
  • Components it can connect with:
    • Generators: It parses filters from the Generator's output and combines them with the filters provided at runtime.
    • Any component that accepts filters as input, such as Retrievers.

Inputs

Required Inputs

Name | Type | Description
replies | List of strings | The output of the Generator that DeepsetOpenSearchFilterParser is connected to.

Optional Inputs

Name | Type | Possible values | Description
filters | Dictionary of string and any | Default: None | Optional filters to narrow down the search space to the documents whose metadata meet the filter conditions.
logical_operator | Literal | AND, OR, NOT. Default: None | The logical operator to use when combining the parsed filters with the existing filters.
raise_on_failure | Boolean | True, False. Default: None | Raises an error if the filter can't be parsed or converted to the OpenSearch format. If set to None, uses the value provided in the pipeline configuration.

Outputs

Name | Type | Description
filters | Dictionary | The combined filters.

Overview

DeepsetOpenSearchFilterParser combines runtime filters with filters from the Generator as follows:

  • Two logical filters are combined by merging their conditions.
  • A comparison filter and a logical filter are combined based on the specified logical operator.
  • Two comparison filters are combined into a new logical filter.
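
The rules above can be sketched in a few lines of Python. This is a simplified illustration of the described behavior, not the component's actual implementation; the `combine_filters` helper and its merging details are assumptions:

```python
def is_logical(f: dict) -> bool:
    # A logical filter has an operator (AND/OR/NOT) and a list of conditions;
    # a comparison filter has field, operator, and value instead.
    return "conditions" in f

def combine_filters(a: dict, b: dict, logical_operator: str = "AND") -> dict:
    # Two logical filters with the same operator: merge their condition lists.
    if is_logical(a) and is_logical(b) and a["operator"] == b["operator"]:
        return {
            "operator": a["operator"],
            "conditions": a["conditions"] + b["conditions"],
        }
    # Otherwise (comparison + logical, or two comparison filters):
    # group both under the configured logical operator.
    return {"operator": logical_operator, "conditions": [a, b]}
```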

Once DeepsetOpenSearchFilterParser receives the filters, it validates them to ensure they conform to the expected format. Each filter from the Generator must contain one comparison filter (for example, {"field": "meta.type", "operator": "==", "value": "article"}) or one logical filter (for example, {"operator": "AND", "conditions": [{"field": "meta.type", "operator": "==", "value": "article"}]}).
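
A Generator reply can be checked against this expected shape with a short validation sketch. This is illustrative only; `is_valid_filter` is a hypothetical helper, and the component's own validation may be stricter:

```python
import json

def is_valid_filter(reply: str) -> bool:
    """Return True if the reply parses to a comparison or logical filter."""
    try:
        f = json.loads(reply)
    except json.JSONDecodeError:
        return False
    if not isinstance(f, dict):
        return False
    # Comparison filter: field + operator + value.
    if {"field", "operator", "value"} <= f.keys():
        return True
    # Logical filter: AND/OR/NOT operator + a list of conditions.
    return f.get("operator") in {"AND", "OR", "NOT"} and isinstance(
        f.get("conditions"), list
    )
```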

Usage Example

This is an example pipeline where we first use an LLM to extract the date from the query, then pass it to DeepsetOpenSearchFilterParser to ensure the date conforms to the OpenSearch filter format, and finally pass the resulting filter to the Retriever for filtering:

components:
  chat_summary_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        Rewrite the following question so that it is suitable for web search.
        Be cautious when reformulating. Strong changes distort the meaning of the question, which is undesirable.
        It is possible that the question does not need any changes.
        The chat history can help to incorporate context into the reformulated question.
        Make sure to incorporate that chat history into the revised question if needed.
        The meaning of the question must remain the same as before.
        You cannot change or dismiss keywords in the original question.
        If you do not want to make changes, just output the original question.
        Chat History: {{question}}
        Revised Question:

  chat_summary_llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      model: gpt-4o
      generation_kwargs:
        max_tokens: 650
        temperature: 0
        seed: 0

  replies_to_query:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ replies[0] }}'
      output_type: str

  extract_date_template:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are part of an information system that processes user queries. 
        When given a user query, extract dates from it and use them as values for the metadata field 'post_date_gmt'.
        The extracted information must be in JSON format and align with the OpenSearch filter structure, similar to the examples below. The extracted metadata fields must be in double quotes.
        If you are unable to extract the dates from the query, return {}. 
        The extracted metadata fields will be used as filters to narrow down the search space when querying an index.  
        The extracted metadata fields must be in double quotes, and dates must follow the following format: YYYY-MM-DD.  
        If the question requires you to know today's date for any calculations or references, use {current_datetime(format='%Y-%m-%d')} as the current date. 
        This is particularly useful for tasks such as determining past or future dates. 
        Note also that $gte means greater than or equal to and $lte means less than or equal to.
        Example 1: 
        Query: What was the revenue of Nvidia in 2022? 

        Extracted metadata fields: {'operator': 'AND', 'conditions': [ {'field': 'post_date_gmt', 'operator': '>=', 'value': '2022-01-01'}, {'field': 'post_date_gmt', 'operator': '<=', 'value': '2022-12-31'}]}
        Example 2: 
        Query: What were the most influential publications between 2020 and 2023? 
        Extracted metadata fields: {'operator': 'AND', 'conditions': [ {'field': 'post_date_gmt', 'operator': '>=', 'value': '2020-01-01'}, {'field': 'post_date_gmt', 'operator': '<=', 'value': '2023-12-31'}]}

        Example 3: 
        Query: How did the stock market perform in the 10 days following the crash on October 29, 1929? 
        Extracted metadata fields: {'operator': 'AND', 'conditions': [ {'field': 'post_date_gmt', 'operator': '>=', 'value': '1929-10-29'}, {'field': 'post_date_gmt', 'operator': '<=', 'value': '1929-11-08'}]}

        Example 4: 
        Query: What were the key activities during the Apollo 11 mission from launch on July 16 to splashdown on July 24, 1969? 
        Extracted metadata fields: {'operator': 'AND', 'conditions': [ {'field': 'post_date_gmt', 'operator': '>=', 'value': '1969-07-16'}, {'field': 'post_date_gmt', 'operator': '<=', 'value': '1969-07-24'}]}

        Example 5:  
        Query: What were the major social events during the 2000s?  
        Extracted metadata fields: {'operator': 'AND', 'conditions': [ {'field': 'post_date_gmt', 'operator': '>=', 'value': '2000-01-01'}, {'field': 'post_date_gmt', 'operator': '<=', 'value': '2009-12-31'}]}

        Example 6: 
        Query: After the 2008 US presidential election, what were the major US economic changes that occurred?
        Extracted metadata fields: {'field': 'post_date_gmt', 'operator': '>=', 'value': '2008-01-01'}

        Query: {{question}}
        Extracted metadata fields:

  date_extract_llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      model: "gpt-4o"
      generation_kwargs:
        max_tokens: 250
        temperature: 0.0
        seed: 0
        response_format:
          type: "json_object"

  date_parser:
    type: deepset_cloud_custom_nodes.augmenters.deepset_filter_parser.DeepsetOpenSearchFilterParser

  query_embedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: "intfloat/e5-large-v2"
      device: null

  embedding_retriever:
    # Selects the most similar documents from the document store
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      filters:
        'field': 'post_type'
        'operator': '!='
        'value': 'wp_ypulse_brand'
      document_store:
        init_parameters:
          embedding_dim: 1024
          use_ssl: True
          verify_certs: False
          http_auth:
            - "${OPENSEARCH_USER}"
            - "${OPENSEARCH_PASSWORD}"
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
      top_k: 30 # The number of results to return

  reranker:
    type: haystack.components.rankers.transformers_similarity.TransformersSimilarityRanker
    init_parameters:
      model: "intfloat/simlm-msmarco-reranker"
      top_k: 15
      device: null
      model_kwargs:
        torch_dtype: "torch.float16"

  recency_ranker:
    type: haystack.components.rankers.meta_field.MetaFieldRanker
    init_parameters:
      meta_field: "post_date_gmt"
      weight: 0.8
      top_k: 8
      ranking_mode: linear_score
      sort_order: descending
      missing_meta: bottom

  qa_prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: |-
        You are a consumer insights specialist and an expert on helping brands authentically understand young people.
        Your goal is to help brands better understand young people and their behaviors. 
        You want to help brands strategically leverage the information from our corpus of content to better engage with young people.
        You answer questions truthfully based on provided documents.
        Analyze the given question, considering all relevant implications and perspectives. Provide a nuanced response that incorporates contextual background, potential outcomes, ethical considerations, and supporting evidence.
        You understand that other terms for young people in the documents are youth and young consumers.
        You understand that Gen Z and Millennials are considered young people in the documents.
        You understand the following rules:
        - 18 to 24 year olds are considered Young Adults.
        - 13 to 17 year olds are considered Teens.
        - 8 to 12 year olds are considered Tweens.
        Gen Z typically refers to young people ages 13-22 years old.
        You understand that young POC and BIPOC refer to people that are considered
        Young People of Color in the documents.
        If asked a question about Hispanics, provide insights on young POC to answer the questions.
        If asked a question about Blacks or African-Americans, provide insights on young POC to answer the questions.
        If relevant information on Hispanics or African-Americans isn't available
        in the documents but is available for POC, provide the POC insights and say: 'Exact insights on young Hispanics and African-Americans can be found in the YPulse Data Files'.
        You understand that YPulse does not have any information on Gen X, Boomers, or people over the age of 40.
        If information on a specific audience is not available in the documents, answer with the most relevant information for either Gen Z, Millennials, 13-39-year-olds, or all young people.
        Always state the names and age ranges of the audiences you reference in your response. Here's an example for how to respond:
        - Teens (13-17-year-olds) report using Tiktok everyday whereas Millennials only use the app twice per week.
        You should use the most relevant information on Gen Z if relevant information on
        Teens or Young Adults is not available in the documents.
        You should use the most relevant information on Teens or Young Adults if relevant information on
        Gen Z is not available in the documents.
        If relevant information on Students isn't available in the documents but is available for Teens or Gen Z, provide the Teens or Gen Z insights and say: 'Exact insights on Students can be found in the YPulse Data Files'.
        If relevant information on Parents isn't available in the documents but is available for Millennials, provide the Millennial insights and say: 'Exact insights on Parents can always be found in the YPulse Data Files'.
        If referencing insights captured from the YPulse Brand Tracker, cite the date range for which the data was collected.
        For each document check whether it is related to the question.
        Only use documents that are related to the question to answer it.
        Ignore documents that are not related to the question.
        If the answer exists in several documents, summarize them.
        Only answer based on the documents provided. Don't make things up.
        Use data and statistics to support your answers where possible.
        Highlight emerging trends in consumer behavior where appropriate.
        Include examples of brands and products where appropriate.
        Ensure examples are relevant and provide clear learning points or insights.
        If making predictions or speculations that do not exist directly in the text, always say 'YPulse AI predicts'.
        If making a recommendation that does not exist directly in the text, always say 'YPulse AI recommends'.
        Recommendations should be actionable and practical considering the scale and resources for businesses or marketers.
        If referencing a prediction that directly exists in the text, always say 'YPulse Editorial predicts'.
        If referencing a recommendation that directly exists in the text, always say 'YPulse Editorial recommends'.
        If the question can be answered completely with a concise response and relevant statistics, you may do that.
        If the question requires a complex response, please provide a detailed explanation of key points on the given topic. Please format the response as follows:
        1. Start with a concise thesis statement as an introduction. Please decide what makes most sense here.
        2. Each key point should have a title, but do not limit it to three points.
        3. Follow the title with a well-organized paragraph explaining the point in detail.
        4. Ensure the content is neatly organized in paragraphs.
        Here is the structure to follow:
        Write a short introductory paragraph teasing at the findings.
        **Title of Key Point One**: Write a clear and concise paragraph that elaborates on the first key point, including any relevant examples or additional information.
        **Title of Key Point Two**: Write a clear and concise paragraph that elaborates on the second key point, including any relevant examples or additional information.
        **Title of Key Point Three**: Write a clear and concise paragraph that elaborates on the third key point, including any relevant examples or additional information.
        (Repeat the format above for additional key points as necessary.)
        Ensure each section is well-structured, informative, and easy to read.
        You are expected to follow strict American English grammar and punctuation rules in all responses. Please ensure that you:
        - Use proper sentence structure and verb tenses.
        - Employ correct spelling and punctuation.
        - Apply appropriate formatting for lists, using bullet points where necessary.
        - Ensure that paragraphs are well-organized and that each new point begins on a new line.
        Always use references in the form [NUMBER OF DOCUMENT] when using information from a document. e.g. [3], for Document[3].
        The reference must only refer to the number that comes in square brackets after passage.
        Otherwise, do not use brackets in your answer and reference ONLY the number of the passage without mentioning the word passage.
        If the documents can't answer the question or you are unsure say:
        'The specific answer can't be found in the YPulse text. Try rephrasing your question or check out the sources below for related content that should help.'.
        If you are asked about questions or survey questions that you have available or that YPulse asks in its surveys, use information from documents with post_type of wp_ypulse_question_list and please respond using the following rules:
        - List the survey questions in their exact format and do not summarize or paraphrase the text
        - Always respond with all of the questions available that match the user's query
        - Always include the question number or label associated with each survey question. Please use this example as reference: e.g. [H145. To what extent do you agree or disagree with the following statements about working out / fitness?]
        - Always use titles in the form [TITLE OF DOCUMENT] when using information from a document. e.g. NA-2024-06-12-YPulse-Behavioral-Report-Health-And-Fitness-Report, for Document[3].
        - Always include the report name of the document where each question is from. Please follow this example:
        Here are the survey questions asked for NA-2024-06-12-YPulse-Behavioral-Report-Health-And-Fitness-Report: 
        H145. To what extent do you agree or disagree with the following statements about working out / fitness?
        If referencing information published before 2024, you understand the following rules:
        - you must always cite the year the information was published.
        - you must refer to the information in past tense.
        Today's date is {current_datetime(format='%Y-%m-%d')}.
        Events before {current_datetime(format='%Y-%m-%d')} are in the past.
        Events after {current_datetime(format='%Y-%m-%d')} are in the future.
        Prioritize newer information over older information to answer the question.
        The publication date of a document is at the beginning of every document, compare this with today's date ({current_datetime(format='%Y-%m-%d')}) to determine the age of the document.
        These are the documents:
        {% for document in documents %}
        Document[{{ loop.index }}]:
        Title: {{ document.meta["post_title"] }}
        Page: {{ document.meta["page"] }}
        Publication Date: {{ document.meta["post_date_gmt"] }}
        Post type: {{ document.meta["post_type"] }}
        {{ document.content }}
        {% endfor %}
        Question: {{ question }}
        Answer:

  qa_llm:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      model: "gpt-4o"
      generation_kwargs:
        max_tokens: 1500
        temperature: 0.0
        seed: 0



  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder

connections:
  - sender: chat_summary_prompt_builder.prompt
    receiver: chat_summary_llm.prompt

  - sender: chat_summary_llm.replies
    receiver: replies_to_query.replies

  - sender: replies_to_query.output
    receiver: query_embedder.text

  - sender: query_embedder.embedding
    receiver: embedding_retriever.query_embedding

  - sender: replies_to_query.output
    receiver: extract_date_template.question

  - sender: extract_date_template.prompt
    receiver: date_extract_llm.prompt

  - sender: date_extract_llm.replies
    receiver: date_parser.replies

  - sender: date_parser.filters
    receiver: embedding_retriever.filters

  - sender: replies_to_query.output
    receiver: reranker.query

  - sender: embedding_retriever.documents
    receiver: reranker.documents

  - sender: reranker.documents
    receiver: recency_ranker.documents

  - sender: recency_ranker.documents
    receiver: qa_prompt_builder.documents

  - sender: replies_to_query.output
    receiver: qa_prompt_builder.question

  - sender: qa_prompt_builder.prompt
    receiver: qa_llm.prompt

  - sender: qa_llm.replies
    receiver: answer_builder.replies

  - sender: qa_prompt_builder.prompt
    receiver: answer_builder.prompt


max_loops_allowed: 100
metadata: {}
inputs:
  query:
    - "chat_summary_prompt_builder.question"

outputs:
  answers: "answer_builder.answers"
  documents: "recency_ranker.documents"
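
To make the data flow concrete, here is roughly what the filters look like at the date_parser stage for a query such as "What was the revenue of Nvidia in 2022?". The combined shape is illustrative, assumes a runtime filter is passed to date_parser's optional filters input, and uses the default AND operator; it is not the component's exact output:

```python
# Filter parsed from the date_extract_llm reply (see the prompt examples above).
parsed = {
    "operator": "AND",
    "conditions": [
        {"field": "post_date_gmt", "operator": ">=", "value": "2022-01-01"},
        {"field": "post_date_gmt", "operator": "<=", "value": "2022-12-31"},
    ],
}

# A runtime comparison filter supplied alongside the reply.
runtime = {"field": "post_type", "operator": "!=", "value": "wp_ypulse_brand"}

# With logical_operator AND, the parser would emit a single logical filter
# covering all conditions (illustrative shape):
combined = {"operator": "AND", "conditions": [runtime, *parsed["conditions"]]}
```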
  

Init Parameters

Parameter | Type | Possible Values | Description
logical_operator | Literal | AND, OR, NOT. Default: AND | The logical operator to use when combining the parsed filter with existing filters. Required.
raise_on_failure | Boolean | True, False. Default: False | Raises an error if the filter can't be parsed or converted to the OpenSearch format. Optional.
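
A minimal init configuration for the component, using the type path and the parameters from the table above:

```yaml
components:
  filter_parser:
    type: deepset_cloud_custom_nodes.augmenters.deepset_filter_parser.DeepsetOpenSearchFilterParser
    init_parameters:
      logical_operator: AND     # AND, OR, or NOT; default: AND
      raise_on_failure: false   # default: false
```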