Multimodal Systems

Multimodal systems can process, understand, and generate information across multiple data types, such as text, images, audio, and video. Learn what's possible in the deepset AI Platform.

Overview

You can build systems that combine multiple data types and formats. These can range from simple setups (such as transcribing speech to text or generating image captions) to more advanced ones that process and analyze videos. Such systems have a variety of applications: they can give your AI assistants new capabilities and also help people with disabilities.

Types of Multimodal Systems

Below are common types of multimodal systems you can build with deepset using existing components.

Audio-Based Systems

With the deepset AI Platform, you can build speech-to-text systems that take audio input and return text answers. Building such a system involves:

  1. Uploading audio files to a deepset workspace.
  2. Preprocessing audio files with a transcriber component, like RemoteWhisperTranscriber, that converts the audio into text documents.
  3. Writing the resulting documents into a document store so that your query pipeline can retrieve them.
  4. Building a query pipeline that answers questions based on the transcribed documents.

Example

This is an example index that transcribes audio files using RemoteWhisperTranscriber and writes the transcribed documents into a document store:


components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
      - text/plain
      - application/pdf
      - audio/wav
  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: true
      language: en

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE

  RemoteWhisperTranscriber:
    type: haystack.components.audio.whisper_remote.RemoteWhisperTranscriber
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - OPENAI_API_KEY
        strict: false
      model: whisper-1
      api_base_url:
      organization:
      http_client_kwargs:

connections:  # Defines how the components are connected
- sender: document_embedder.documents
  receiver: writer.documents
- sender: file_classifier.audio/wav
  receiver: RemoteWhisperTranscriber.sources
- sender: splitter.documents
  receiver: document_embedder.documents
- sender: RemoteWhisperTranscriber.documents
  receiver: splitter.documents

inputs:  # Define the inputs for your pipeline
  files:                            # This component will receive the files to index as input
  - file_classifier.sources

max_runs_per_component: 100

metadata: {}
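
The index above covers steps 1 through 3. For step 4, the query pipeline can follow the standard RAG pattern over the transcribed documents. Below is a minimal, untested sketch: it embeds the query, retrieves matching transcriptions from the same document store, and prompts an LLM for an answer. The index name, retriever settings, prompt, and model choices are placeholders to adapt to your setup, and the embedding model must match the one used in the index above.

components:
  query_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  retriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''            # use the same index as the indexing pipeline above
          embedding_dim: 768
      top_k: 10

  prompt_builder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      required_variables: '*'
      template: |
        {%- message role="user" -%}
        Answer the question based on the transcribed audio below.

        Question: {{ question }}

        Transcriptions:
        {%- for document in documents %}
        {{ document.content }}
        {%- endfor %}
        {%- endmessage -%}

  llm:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": false}
      model: gpt-4o
      generation_kwargs:
        temperature: 0

  adapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ [(messages|last).text] }}'
      output_type: List[str]

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters: {}

connections:
- sender: query_embedder.embedding
  receiver: retriever.query_embedding
- sender: retriever.documents
  receiver: prompt_builder.documents
- sender: retriever.documents
  receiver: answer_builder.documents
- sender: prompt_builder.prompt
  receiver: llm.messages
- sender: llm.replies
  receiver: adapter.messages
- sender: adapter.output
  receiver: answer_builder.replies

inputs:
  query:
  - query_embedder.text
  - prompt_builder.question
  - answer_builder.query

outputs:
  documents: retriever.documents
  answers: answer_builder.answers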

Image-Based Systems

You can easily create systems that process, analyze, or generate images. Some examples include:

  • Visual question answering (ask questions about image content, including scanned documents)
  • Image generation (create images from textual descriptions)
  • Image analysis or classification (compare, classify, or interpret images)

Working with images often requires specialized models, such as DALL-E for image generation. The deepset AI Platform is model-agnostic, so you can easily try out different models.

To work with images, you may need a special index. deepset offers two index templates designed specifically for visual search that you can use out-of-the-box:

  • Image-to-Text: Uses Azure's Document Intelligence OCR service to extract text from PDF files. Use this template if you want to run OCR on your PDFs (a minimal sketch of this approach follows below).
  • Visual Search: Processes images by extracting text descriptions of their content. It also processes PDF files by splitting each PDF by page and checking each page for text content that can be extracted.
    • If there is no text content, the page is sent to a vision LLM that extracts its content. The extracted content is then sent to the Embedder and indexed into the document store.
    • If there is text content, it's sent directly to the Embedder and indexed into the document store.

Both templates are available for English and German.
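
If you prefer to define an OCR-based index yourself, here's a minimal sketch of the Image-to-Text idea. It assumes Haystack's AzureOCRDocumentConverter (haystack.components.converters.azure.AzureOCRDocumentConverter), with the Document Intelligence endpoint and index name left as placeholders. The template deepset ships may be configured differently, so treat this as an illustration rather than the exact template.

components:
  file_classifier:
    type: haystack.components.routers.file_type_router.FileTypeRouter
    init_parameters:
      mime_types:
      - application/pdf

  ocr_converter:
    type: haystack.components.converters.azure.AzureOCRDocumentConverter
    init_parameters:
      endpoint: ''            # your Azure Document Intelligence endpoint
      api_key:
        type: env_var
        env_vars:
        - AZURE_AI_API_KEY
        strict: false
      model_id: prebuilt-read

  splitter:
    type: haystack.components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30

  document_embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
    init_parameters:
      normalize_embeddings: true
      model: intfloat/e5-base-v2

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          embedding_dim: 768
      policy: OVERWRITE

connections:
- sender: file_classifier.application/pdf
  receiver: ocr_converter.sources
- sender: ocr_converter.documents
  receiver: splitter.documents
- sender: splitter.documents
  receiver: document_embedder.documents
- sender: document_embedder.documents
  receiver: writer.documents

inputs:
  files:
  - file_classifier.sources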

Once your data is indexed, you can build a query pipeline that prompts an LLM to operate on the images.

Example: Visual Question Answering

Here's how you could build a system to answer questions about images:

  1. First, create an index using the Visual Search template.
  2. Then, build a query pipeline using one of the Visual RAG Question Answering templates.

This is an example index that prepares files for visual search. It splits PDF files by page and uses a vision LLM to extract the content of pages without extractable text, as well as the content of image files. The resulting documents are embedded and written into the OpenSearch document store. Images are not split.

  components:
    FileTypeRouter:
      type: haystack.components.routers.file_type_router.FileTypeRouter
      init_parameters:
        mime_types:
        - application/pdf
        - image/jpg
        - image/jpeg
        - image/png
        - image/gif

    PDFConverter:
      type: haystack.components.converters.pdfminer.PDFMinerToDocument
      init_parameters:
        line_overlap: 0.5
        char_margin: 2
        line_margin: 0.5
        word_margin: 0.1
        boxes_flow: 0.5
        detect_vertical: true
        all_texts: false
        store_full_path: false

    PageSplitter:
      type: haystack.components.preprocessors.document_splitter.DocumentSplitter
      init_parameters:
        split_by: page
        split_length: 1
        split_overlap: 0
        respect_sentence_boundary: false
        language: en
        use_split_rules: false
        extend_abbreviations: false

    ContentFilter:
      type: haystack.components.routers.document_length_router.DocumentLengthRouter
      init_parameters:
        threshold: 1

    ImageSourceListJoiner:
      type: haystack.components.joiners.list_joiner.ListJoiner
      init_parameters:
        list_type_: List[Union[str, pathlib.Path, haystack.dataclasses.ByteStream]]

    ImageFileToDocument:
      type: haystack.components.converters.image.file_to_document.ImageFileToDocument
      init_parameters:
        store_full_path: true                              

    DocumentJoinerForExtraction:
      type: haystack.components.joiners.document_joiner.DocumentJoiner
      init_parameters:
        join_mode: concatenate

    FileDownloader:
      type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
      init_parameters:
        file_extensions:
        sources_target_type: str
        max_cache_size: 100

    LLMDocumentContentExtractor:
      type: haystack.components.extractors.image.llm_document_content_extractor.LLMDocumentContentExtractor
      init_parameters:
        chat_generator:
          type: haystack.components.generators.chat.openai.OpenAIChatGenerator
          init_parameters:
            model: gpt-4o
            timeout: 120
            generation_kwargs:
              max_tokens: 16384
              temperature: 0
        prompt: |
                You are part of an information extraction pipeline that extracts the content of image-based documents.
                Extract the content from the provided image.
                You need to extract the content exactly.
                Format everything as markdown.
                Make sure to retain the reading order of the document.

                **Headers- and Footers**
                Remove repeating page headers or footers that disrupt the reading order.
                Place letter heads that appear at the side of a document at the top of the page.

                **Visual Elements**
                Do not extract figures, drawings, maps, graphs or any other visual elements.
                Instead, add a caption that describes briefly what you see in the visual element.
                You must describe each visual element.
                If you only see a visual element without other content, you must describe this visual element.
                Enclose each image caption with [img-caption][/img-caption]

                **Tables**
                Make sure to format the table in markdown.
                Add a short caption below the table that describes the table's content.
                Enclose each table caption with [table-caption][/table-caption].
                The caption must be placed below the extracted table.

                **Forms**
                Reproduce checkbox selections with markdown.

                Go ahead and extract!
                
                Document:
        
        file_path_meta_field: file_path
        root_path:
        detail:
        size:
        raise_on_failure: true
        max_workers: 4

    DocumentJoiner:
      type: haystack.components.joiners.document_joiner.DocumentJoiner
      init_parameters:
        join_mode: concatenate        

    Embedder:
      type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
      init_parameters:
        model: BAAI/bge-m3
        normalize_embeddings: true

    DocumentWriter:
      type: haystack.components.writers.document_writer.DocumentWriter
      init_parameters:
        document_store:
          type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
          init_parameters:
            embedding_dim: 1024
        policy: OVERWRITE
    

  connections:  # Defines how the components are connected
  - sender: FileTypeRouter.application/pdf
    receiver: PDFConverter.sources
  - sender: PDFConverter.documents
    receiver: PageSplitter.documents
  - sender: PageSplitter.documents
    receiver: ContentFilter.documents
  - sender: ContentFilter.long_documents
    receiver: DocumentJoiner.documents
  - sender: FileTypeRouter.image/jpg
    receiver: ImageSourceListJoiner.values
  - sender: FileTypeRouter.image/jpeg
    receiver: ImageSourceListJoiner.values
  - sender: FileTypeRouter.image/png
    receiver: ImageSourceListJoiner.values
  - sender: FileTypeRouter.image/gif
    receiver: ImageSourceListJoiner.values
  - sender: DocumentJoiner.documents
    receiver: Embedder.documents
  - sender: Embedder.documents
    receiver: DocumentWriter.documents
  - sender: ImageSourceListJoiner.values
    receiver: ImageFileToDocument.sources
  - sender: ImageFileToDocument.documents
    receiver: DocumentJoinerForExtraction.documents
  - sender: ContentFilter.short_documents
    receiver: DocumentJoinerForExtraction.documents
  - sender: LLMDocumentContentExtractor.documents
    receiver: DocumentJoiner.documents
  - sender: DocumentJoinerForExtraction.documents
    receiver: FileDownloader.documents
  - sender: FileDownloader.documents
    receiver: LLMDocumentContentExtractor.documents

  inputs:  # Define the inputs for your pipeline
    files:  # These components will receive the files to index as input
    - FileTypeRouter.sources

This is an example of a Visual RAG Question Answering pipeline with GPT-4o that uses the files indexed with the template above to answer queries about images. It combines keyword and semantic retrieval to fetch matching documents, operating on the textual versions of the documents (for images, these are the descriptions created during indexing). It then uses MetaFieldGroupingRanker to group all documents that come from the same file based on their metadata, replaces the textual documents with the actual images, and sends them to the LLM in the prompt.

components:
  BM25Retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Visual-Search-en'
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20
      fuzziness: 0

  Embedder:
    type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
    init_parameters:
      normalize_embeddings: true
      model: BAAI/bge-m3

  EmbeddingRetriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Visual-Hybrid-Retrieval-GPT-4o-en'
          max_chunk_bytes: 104857600
          embedding_dim: 1024
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      top_k: 20

  DocumentJoiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: concatenate

  Ranker:
    type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
    init_parameters:
      model: BAAI/bge-reranker-v2-m3
      top_k: 5

  MetaFieldGroupingRanker:
    type: haystack.components.rankers.meta_field_grouping_ranker.MetaFieldGroupingRanker
    init_parameters:
      group_by: file_id
      sort_docs_by: split_id

  FileDownloader:
    type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
    init_parameters:
      file_extensions:
      - .pdf
      - .png
      - .jpeg
      - .jpg
      - .gif

  DocumentToImageContent:
    type: haystack.components.converters.image.document_to_image.DocumentToImageContent
    init_parameters:
      detail: auto

  ChatPromptBuilder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      required_variables: '*'
      template: |
        {%- message role="user" -%}
        Answer the questions briefly and precisely using the images provided.

        Question: {{ question }}

        {%- if image_contents|length > 0 %}
        {%- for img in image_contents -%}
          {{ img | templatize_part }}
        {%- endfor -%}
        {% endif %}
        {%- endmessage -%}

  LLM:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": false}
      model: gpt-4o
      generation_kwargs:
        max_tokens: 650
        temperature: 0
        seed: 0

  Adapter:
    init_parameters:
      custom_filters: {}
      output_type: List[str]
      template: '{{ [(messages|last).text] }}'
      unsafe: false
    type: haystack.components.converters.output_adapter.OutputAdapter

  AnswerBuilder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters:
      reference_pattern: acm

connections:  # Defines how the components are connected
- sender: BM25Retriever.documents
  receiver: DocumentJoiner.documents
- sender: EmbeddingRetriever.documents
  receiver: DocumentJoiner.documents
- sender: Embedder.embedding
  receiver: EmbeddingRetriever.query_embedding
- sender: DocumentJoiner.documents
  receiver: Ranker.documents
- sender: Ranker.documents
  receiver: MetaFieldGroupingRanker.documents
- sender: MetaFieldGroupingRanker.documents
  receiver: FileDownloader.documents
- sender: DocumentToImageContent.image_contents
  receiver: ChatPromptBuilder.image_contents
- sender: FileDownloader.documents
  receiver: AnswerBuilder.documents
- sender: FileDownloader.documents
  receiver: DocumentToImageContent.documents
- sender: ChatPromptBuilder.prompt
  receiver: LLM.messages
- sender: LLM.replies
  receiver: Adapter.messages
- sender: Adapter.output
  receiver: AnswerBuilder.replies

inputs:  # Define the inputs for your pipeline
  query:  # These components will receive the query as input
  - "BM25Retriever.query"
  - "ChatPromptBuilder.question"
  - "AnswerBuilder.query"
  - Embedder.text
  - Ranker.query
  filters:  # These components will receive a potential query filter as input
  - "BM25Retriever.filters"
  - "EmbeddingRetriever.filters"
  files:
  - FileDownloader.sources

outputs:  # Defines the output of your pipeline
  documents: "FileDownloader.documents"           # The output of the pipeline is the retrieved documents
  answers: "AnswerBuilder.answers"   # The output of the pipeline is the generated answers

max_runs_per_component: 100

metadata: {}

Example: Image Generation

This system generates images directly from user prompts, so it doesn't need any indexed data. The pipeline must include an image generation model, such as DALL-E.

The easiest way to build a pipeline that can generate images is by using the DallE-Image-Generator template. Deploy the pipeline and that's it.

components:
  prompt_builder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: '{{query}}'
  dalle_image_generator:
    type: haystack.components.generators.openai_dalle.DALLEImageGenerator
    init_parameters:
      model: dall-e-3
      quality: standard
      size: 1024x1024
      response_format: url
      timeout: 60
  answer_formatter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: |-
        {% set ns = namespace(doc_string='') %}
        {% set ns.doc_string = ns.doc_string + '## Query:\n' + query + '\n\n' %}
        {% set ns.doc_string = ns.doc_string + '## OpenAIs Revised Prompt:\n' + revised_prompt + '\n\n' %}
        {% set ns.doc_string = ns.doc_string + '![](' + images[0] + ')' + '\n\n' %}
        {% set answer = [ns.doc_string] %}
        {{ answer }}
      output_type: List[str]
  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters: {}

connections:
- sender: prompt_builder.prompt
  receiver: dalle_image_generator.prompt
- sender: dalle_image_generator.revised_prompt
  receiver: answer_formatter.revised_prompt
- sender: dalle_image_generator.images
  receiver: answer_formatter.images
- sender: answer_formatter.output
  receiver: answer_builder.replies
- sender: prompt_builder.prompt
  receiver: answer_builder.prompt

max_runs_per_component: 100

metadata: {}

inputs:
  query:
  - prompt_builder.query
  - answer_formatter.query
  - answer_builder.query

outputs:
  answers: answer_builder.answers

Combining Modalities

Finally, you can build systems that mix different data types. For instance:

  • Audio to image: Accept audio as input, then generate images based on the audio (see the sketch after this list).
  • Image + text: Process images and feed results as context to a text-based query.
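
As an illustration of the audio-to-image case, here's a minimal, untested sketch of a query pipeline that reuses components from the examples above: RemoteWhisperTranscriber turns the uploaded audio into text, an OutputAdapter converts the transcription into a prompt, and DALLEImageGenerator generates the image. How query-time files reach the transcriber and how you present the result depend on your setup (for example, you may need a DeepsetFileDownloader in front of the transcriber), so treat the inputs and the answer formatting as placeholders.

components:
  transcriber:
    type: haystack.components.audio.whisper_remote.RemoteWhisperTranscriber
    init_parameters:
      api_key:
        type: env_var
        env_vars:
        - OPENAI_API_KEY
        strict: false
      model: whisper-1

  prompt_adapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ documents[0].content }}'
      output_type: str

  dalle_image_generator:
    type: haystack.components.generators.openai_dalle.DALLEImageGenerator
    init_parameters:
      model: dall-e-3
      size: 1024x1024
      response_format: url

  answer_formatter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ ["![](" + images[0] + ")"] }}'
      output_type: List[str]

  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters: {}

connections:
- sender: transcriber.documents
  receiver: prompt_adapter.documents
- sender: prompt_adapter.output
  receiver: dalle_image_generator.prompt
- sender: dalle_image_generator.images
  receiver: answer_formatter.images
- sender: answer_formatter.output
  receiver: answer_builder.replies

inputs:
  query:
  - answer_builder.query
  files:
  - transcriber.sources

outputs:
  answers: answer_builder.answers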