Multimodal Systems
Multimodal systems can process, understand, and generate information across multiple data types, such as text, images, audio, and video. Learn what's possible in the deepset AI Platform.
Overview
You can build systems that combine multiple data types and formats. These can range from simple setups (such as transcribing speech to text or generating image captions) to more advanced ones that process and analyze videos. Such systems have a variety of applications: they can give your AI assistants new capabilities, and they can also make content accessible to people with disabilities.
Types of Multimodal Systems
Below are common types of multimodal systems you can build with deepset using existing components.
Audio-Based Systems
With the deepset AI Platform, you can build speech-to-text systems that take audio input and return textual answers. The steps to building such a system include:
- Uploading audio files to a deepset workspace.
- Preprocessing the audio files with a transcriber component, like RemoteWhisperTranscriber, that converts the audio into text documents.
- Writing the resulting documents into a document store so that your query pipeline can retrieve them.
- Building a query pipeline that answers questions based on the transcribed documents (see the query pipeline sketch after the index example below).
Example
This is an example index that transcribes audio files using RemoteWhisperTranscriber and writes the transcribed documents into a document store:
components:
file_classifier:
type: haystack.components.routers.file_type_router.FileTypeRouter
init_parameters:
mime_types:
- text/plain
- application/pdf
- audio/wav
splitter:
type: haystack.components.preprocessors.document_splitter.DocumentSplitter
init_parameters:
split_by: word
split_length: 250
split_overlap: 30
respect_sentence_boundary: true
language: en
document_embedder:
type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
init_parameters:
normalize_embeddings: true
model: intfloat/e5-base-v2
writer:
type: haystack.components.writers.document_writer.DocumentWriter
init_parameters:
document_store:
type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
init_parameters:
hosts:
index: ''
max_chunk_bytes: 104857600
embedding_dim: 768
return_embedding: false
method:
mappings:
settings:
create_index: true
http_auth:
use_ssl:
verify_certs:
timeout:
policy: OVERWRITE
RemoteWhisperTranscriber:
type: haystack.components.audio.whisper_remote.RemoteWhisperTranscriber
init_parameters:
api_key:
type: env_var
env_vars:
- OPENAI_API_KEY
strict: false
model: whisper-1
api_base_url:
organization:
http_client_kwargs:
connections: # Defines how the components are connected
- sender: document_embedder.documents
receiver: writer.documents
- sender: file_classifier.audio/wav
receiver: RemoteWhisperTranscriber.sources
- sender: splitter.documents
receiver: document_embedder.documents
- sender: RemoteWhisperTranscriber.documents
receiver: splitter.documents
inputs: # Define the inputs for your pipeline
files: # This component will receive the files to index as input
- file_classifier.sources
max_runs_per_component: 100
metadata: {}
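To complete the last step of the list above, here's a minimal sketch of a matching query pipeline that answers questions over the transcribed documents. Treat it as an illustration rather than a ready-made template: the component names are ours, gpt-4o is an assumed model choice, and the document store settings must match the index above.
components:
  bm25_retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: '' # must point to the index created above
      top_k: 10
  prompt_builder:
    type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
    init_parameters:
      required_variables: '*'
      template: |
        {%- message role="user" -%}
        Answer the question using the transcribed audio passages below.
        Question: {{ query }}
        Passages:
        {% for document in documents %}
        {{ document.content }}
        {% endfor %}
        {%- endmessage -%}
  llm:
    type: haystack.components.generators.chat.openai.OpenAIChatGenerator
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": false}
      model: gpt-4o
  adapter: # converts the chat reply into a list of strings for the answer builder
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ [(messages|last).text] }}'
      output_type: List[str]
  answer_builder:
    type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
    init_parameters: {}
connections:
  - sender: bm25_retriever.documents
    receiver: prompt_builder.documents
  - sender: bm25_retriever.documents
    receiver: answer_builder.documents
  - sender: prompt_builder.prompt
    receiver: llm.messages
  - sender: llm.replies
    receiver: adapter.messages
  - sender: adapter.output
    receiver: answer_builder.replies
inputs:
  query:
    - bm25_retriever.query
    - prompt_builder.query
    - answer_builder.query
outputs:
  answers: answer_builder.answers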
Image-Based Systems
You can easily create systems that process, analyze, or generate images. Some examples include:
- Visual question answering (ask questions about image content, including scanned documents)
- Image generation (create images from textual descriptions)
- Image analysis or classification (compare, classify, or interpret images)
Working with images often requires specialized models, such as DALL-E for image generation. The deepset AI Platform is model-agnostic, so you can easily try out different models.
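For example, swapping models usually comes down to changing the generator's model parameter. A hypothetical fragment (the model names are illustrative):
llm:
  type: haystack.components.generators.chat.openai.OpenAIChatGenerator
  init_parameters:
    model: gpt-4o # swap in another vision-capable model, for example gpt-4o-mini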
To work with images, you may need a special index. deepset offers two index templates designed specifically for visual search that you can use out of the box:
- Image-to-Text: Uses the Azure Document Intelligence OCR service to extract text from PDF files. Use this template if you want to run OCR on your PDFs (see the OCR sketch below).
- Visual Search: Processes images by generating text descriptions of them. It also processes PDF files by splitting each PDF by page and checking each page for text content that can be extracted.
  - If there is no text content, the page is sent to a vision LLM that extracts its content. The extracted content is then sent to the Embedder and indexed into the document store.
  - If there is text content, it's sent to the Embedder and indexed into the document store.
Both templates are available for English and German.
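If you'd rather assemble an OCR index yourself instead of using the Image-to-Text template, the conversion step could look like the fragment below. This is a sketch using Haystack's AzureOCRDocumentConverter, not the template's exact configuration, and the endpoint is something you supply:
ocr_converter:
  type: haystack.components.converters.azure.AzureOCRDocumentConverter
  init_parameters:
    endpoint: '' # your Azure Document Intelligence endpoint
    api_key:
      type: env_var
      env_vars:
        - AZURE_AI_API_KEY
      strict: false
    model_id: prebuilt-read # Azure's general-purpose read model
Connect its documents output to a splitter and an embedder, as in the audio example above.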
Once your data is indexed, you can build a query pipeline that prompts an LLM to operate on the images.
Example: Visual Question Answering
Here's how you could build a system to answer questions about images:
- First, create an index using the Visual Search template.
- Then, build a query pipeline using one of the Visual RAG Question Answering templates.
This is an example index that prepares files for visual search. It uses a vision LLM to extract the content of image files and of PDF pages that contain no extractable text. The documents resulting from PDFs are split by page and written into the OpenSearch document store. Images aren't split.
components:
FileTypeRouter:
type: haystack.components.routers.file_type_router.FileTypeRouter
init_parameters:
mime_types:
- application/pdf
- image/jpg
- image/jpeg
- image/png
- image/gif
PDFConverter:
type: haystack.components.converters.pdfminer.PDFMinerToDocument
init_parameters:
line_overlap: 0.5
char_margin: 2
line_margin: 0.5
word_margin: 0.1
boxes_flow: 0.5
detect_vertical: true
all_texts: false
store_full_path: false
PageSplitter:
type: haystack.components.preprocessors.document_splitter.DocumentSplitter
init_parameters:
split_by: page
split_length: 1
split_overlap: 0
respect_sentence_boundary: false
language: en
use_split_rules: false
extend_abbreviations: false
ContentFilter:
type: haystack.components.routers.document_length_router.DocumentLengthRouter
init_parameters:
threshold: 1
ImageSourceListJoiner:
type: haystack.components.joiners.list_joiner.ListJoiner
init_parameters:
list_type_: List[Union[str, pathlib.Path, haystack.dataclasses.ByteStream]]
ImageFileToDocument:
type: haystack.components.converters.image.file_to_document.ImageFileToDocument
init_parameters:
store_full_path: true
DocumentJoinerForExtraction:
type: haystack.components.joiners.document_joiner.DocumentJoiner
init_parameters:
join_mode: concatenate
FileDownloader:
type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
init_parameters:
file_extensions:
sources_target_type: str
max_cache_size: 100
LLMDocumentContentExtractor:
type: haystack.components.extractors.image.llm_document_content_extractor.LLMDocumentContentExtractor
init_parameters:
chat_generator:
type: haystack.components.generators.chat.openai.OpenAIChatGenerator
init_parameters:
model: gpt-4o
timeout: 120
generation_kwargs:
max_tokens: 16384
temperature: 0
prompt: |
You are part of an information extraction pipeline that extracts the content of image-based documents.
Extract the content from the provided image.
You need to extract the content exactly.
Format everything as markdown.
Make sure to retain the reading order of the document.
        **Headers and Footers**
Remove repeating page headers or footers that disrupt the reading order.
Place letter heads that appear at the side of a document at the top of the page.
**Visual Elements**
Do not extract figures, drawings, maps, graphs or any other visual elements.
Instead, add a caption that describes briefly what you see in the visual element.
You must describe each visual element.
If you only see a visual element without other content, you must describe this visual element.
Enclose each image caption with [img-caption][/img-caption]
**Tables**
Make sure to format the table in markdown.
Add a short caption below the table that describes the table's content.
Enclose each table caption with [table-caption][/table-caption].
The caption must be placed below the extracted table.
**Forms**
Reproduce checkbox selections with markdown.
Go ahead and extract!
Document:
file_path_meta_field: file_path
root_path:
detail:
size:
raise_on_failure: true
max_workers: 4
DocumentJoiner:
type: haystack.components.joiners.document_joiner.DocumentJoiner
init_parameters:
join_mode: concatenate
Embedder:
type: deepset_cloud_custom_nodes.embedders.nvidia.document_embedder.DeepsetNvidiaDocumentEmbedder
init_parameters:
model: BAAI/bge-m3
normalize_embeddings: true
DocumentWriter:
type: haystack.components.writers.document_writer.DocumentWriter
init_parameters:
document_store:
type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
init_parameters:
embedding_dim: 1024
policy: OVERWRITE
connections: # Defines how the components are connected
- sender: FileTypeRouter.application/pdf
receiver: PDFConverter.sources
- sender: PDFConverter.documents
receiver: PageSplitter.documents
- sender: PageSplitter.documents
receiver: ContentFilter.documents
- sender: ContentFilter.long_documents
receiver: DocumentJoiner.documents
- sender: FileTypeRouter.image/jpg
receiver: ImageSourceListJoiner.values
- sender: FileTypeRouter.image/jpeg
receiver: ImageSourceListJoiner.values
- sender: FileTypeRouter.image/png
receiver: ImageSourceListJoiner.values
- sender: FileTypeRouter.image/gif
receiver: ImageSourceListJoiner.values
- sender: DocumentJoiner.documents
receiver: Embedder.documents
- sender: Embedder.documents
receiver: DocumentWriter.documents
- sender: ImageSourceListJoiner.values
receiver: ImageFileToDocument.sources
- sender: ImageFileToDocument.documents
receiver: DocumentJoinerForExtraction.documents
- sender: ContentFilter.short_documents
receiver: DocumentJoinerForExtraction.documents
- sender: LLMDocumentContentExtractor.documents
receiver: DocumentJoiner.documents
- sender: DocumentJoinerForExtraction.documents
receiver: FileDownloader.documents
- sender: FileDownloader.documents
receiver: LLMDocumentContentExtractor.documents
inputs: # Define the inputs for your pipeline
files: # These components will receive the files to index as input
- FileTypeRouter.sources
This is an example of a Visual RAG QA pipeline with GPT-4o that uses the files indexed with the template above to answer queries related to images. It uses both keyword and semantic retrieval to fetch matching documents. For retrieval, it uses the textual versions of documents (for images, these are the captions created during indexing). It then groups all documents resulting from one file based on their metadata using the MetaFieldGroupingRanker, replaces the textual documents with the actual images, and sends them to the LLM in the prompt.
components:
BM25Retriever:
type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
init_parameters:
document_store:
type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
init_parameters:
hosts:
index: 'Visual-Search-en'
max_chunk_bytes: 104857600
embedding_dim: 1024
return_embedding: false
method:
mappings:
settings:
create_index: true
http_auth:
use_ssl:
verify_certs:
timeout:
top_k: 20
fuzziness: 0
Embedder:
type: deepset_cloud_custom_nodes.embedders.nvidia.text_embedder.DeepsetNvidiaTextEmbedder
init_parameters:
normalize_embeddings: true
model: BAAI/bge-m3
EmbeddingRetriever:
type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
init_parameters:
document_store:
type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
init_parameters:
hosts:
          index: 'Visual-Search-en' # must match the index queried by the BM25Retriever
max_chunk_bytes: 104857600
embedding_dim: 1024
return_embedding: false
method:
mappings:
settings:
create_index: true
http_auth:
use_ssl:
verify_certs:
timeout:
top_k: 20
DocumentJoiner:
type: haystack.components.joiners.document_joiner.DocumentJoiner
init_parameters:
join_mode: concatenate
Ranker:
type: deepset_cloud_custom_nodes.rankers.nvidia.ranker.DeepsetNvidiaRanker
init_parameters:
model: BAAI/bge-reranker-v2-m3
top_k: 5
MetaFieldGroupingRanker:
type: haystack.components.rankers.meta_field_grouping_ranker.MetaFieldGroupingRanker
init_parameters:
group_by: file_id
sort_docs_by: split_id
FileDownloader:
type: deepset_cloud_custom_nodes.augmenters.deepset_file_downloader.DeepsetFileDownloader
init_parameters:
file_extensions:
- .pdf
- .png
- .jpeg
- .jpg
- .gif
DocumentToImageContent:
type: haystack.components.converters.image.document_to_image.DocumentToImageContent
init_parameters:
detail: auto
ChatPromptBuilder:
type: haystack.components.builders.chat_prompt_builder.ChatPromptBuilder
init_parameters:
required_variables: '*'
template: |
{%- message role="user" -%}
Answer the questions briefly and precisely using the images provided.
Question: {{ question }}
{%- if image_contents|length > 0 %}
{%- for img in image_contents -%}
{{ img | templatize_part }}
{%- endfor -%}
{% endif %}
{%- endmessage -%}
LLM:
type: haystack.components.generators.chat.openai.OpenAIChatGenerator
init_parameters:
api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": false}
model: gpt-4o
generation_kwargs:
max_tokens: 650
temperature: 0
seed: 0
  Adapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      custom_filters: {}
      output_type: List[str]
      template: '{{ [(messages|last).text] }}'
      unsafe: false
AnswerBuilder:
type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
init_parameters:
reference_pattern: acm
connections: # Defines how the components are connected
- sender: BM25Retriever.documents
receiver: DocumentJoiner.documents
- sender: EmbeddingRetriever.documents
receiver: DocumentJoiner.documents
- sender: Embedder.embedding
receiver: EmbeddingRetriever.query_embedding
- sender: DocumentJoiner.documents
receiver: Ranker.documents
- sender: Ranker.documents
receiver: MetaFieldGroupingRanker.documents
- sender: MetaFieldGroupingRanker.documents
receiver: FileDownloader.documents
- sender: DocumentToImageContent.image_contents
receiver: ChatPromptBuilder.image_contents
- sender: FileDownloader.documents
receiver: AnswerBuilder.documents
- sender: FileDownloader.documents
receiver: DocumentToImageContent.documents
- sender: ChatPromptBuilder.prompt
receiver: LLM.messages
- sender: LLM.replies
receiver: Adapter.messages
- sender: Adapter.output
receiver: AnswerBuilder.replies
inputs: # Define the inputs for your pipeline
query: # These components will receive the query as input
- "BM25Retriever.query"
- "ChatPromptBuilder.question"
- "AnswerBuilder.query"
- Embedder.text
- Ranker.query
filters: # These components will receive a potential query filter as input
- "BM25Retriever.filters"
- "EmbeddingRetriever.filters"
files:
- FileDownloader.sources
outputs: # Defines the output of your pipeline
documents: "FileDownloader.documents" # The output of the pipeline is the retrieved documents
answers: "AnswerBuilder.answers" # The output of the pipeline is the generated answers
max_runs_per_component: 100
metadata: {}
Example: Image Generation
This system generates images directly from user prompts, so it doesn't need any indexed data. It's important to use an image generation model, such as DALL-E, in the pipeline.
The easiest way to build a pipeline that generates images is to use the DallE-Image-Generator template. Just deploy the pipeline, and you're done.
components:
prompt_builder:
type: haystack.components.builders.prompt_builder.PromptBuilder
init_parameters:
template: '{{query}}'
dalle_image_generator:
type: haystack.components.generators.openai_dalle.DALLEImageGenerator
init_parameters:
model: dall-e-3
quality: standard
size: 1024x1024
response_format: url
timeout: 60
answer_formatter:
type: haystack.components.converters.output_adapter.OutputAdapter
init_parameters:
template: |-
{% set ns = namespace(doc_string='') %}
{% set ns.doc_string = ns.doc_string + '## Query:\n' + query + '\n\n' %}
{% set ns.doc_string = ns.doc_string + '## OpenAIs Revised Prompt:\n' + revised_prompt + '\n\n' %}
        {% set ns.doc_string = ns.doc_string + '![Generated image](' + images[0] + ')' + '\n\n' %}
{% set answer = [ns.doc_string] %}
{{ answer }}
output_type: List[str]
answer_builder:
type: deepset_cloud_custom_nodes.augmenters.deepset_answer_builder.DeepsetAnswerBuilder
init_parameters: {}
connections:
- sender: prompt_builder.prompt
receiver: dalle_image_generator.prompt
- sender: dalle_image_generator.revised_prompt
receiver: answer_formatter.revised_prompt
- sender: dalle_image_generator.images
receiver: answer_formatter.images
- sender: answer_formatter.output
receiver: answer_builder.replies
- sender: prompt_builder.prompt
receiver: answer_builder.prompt
max_runs_per_component: 100
metadata: {}
inputs:
query:
- prompt_builder.query
- answer_formatter.query
- answer_builder.query
outputs:
answers: answer_builder.answers
Combining Modalities
Finally, you can build systems that mix different data types. For instance:
- Audio to image: Accept audio as input, then generate images based on it (see the sketch below).
- Image + text: Process images and feed the results as context to a text-based query.
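As an illustration of the audio-to-image idea, here's a minimal sketch that chains a transcriber into the image generator from the previous example. The component names are ours, and it assumes the pipeline receives the audio as a file input; output formatting would follow the image generation example above.
components:
  transcriber:
    type: haystack.components.audio.whisper_remote.RemoteWhisperTranscriber
    init_parameters:
      api_key: {"type": "env_var", "env_vars": ["OPENAI_API_KEY"], "strict": false}
      model: whisper-1
  prompt_adapter:
    type: haystack.components.converters.output_adapter.OutputAdapter
    init_parameters:
      template: '{{ documents[0].content }}' # use the transcript as the image prompt
      output_type: str
  image_generator:
    type: haystack.components.generators.openai_dalle.DALLEImageGenerator
    init_parameters:
      model: dall-e-3
      response_format: url
connections:
  - sender: transcriber.documents
    receiver: prompt_adapter.documents
  - sender: prompt_adapter.output
    receiver: image_generator.prompt
inputs:
  files:
    - transcriber.sources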