Multimodal AI
Systems that see, hear, and act: vision models, audio AI, and multimodal safety.
How Vision-Language Models Work
A vision-language model (VLM) combines a visual encoder with a language model: images are converted to token-like embeddings and fed directly into the same context window as text. Understanding this architecture explains why images consume more context tokens than their on-screen size suggests, and why resolution and tiling choices matter in production.
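The tiling point can be made concrete with a back-of-envelope token estimator. This is a sketch of the tile-based accounting scheme many VLM APIs use; the specific constants (512-pixel tiles, 170 tokens per tile, an 85-token low-resolution overview pass) are illustrative assumptions, not any particular vendor's numbers.

```python
import math

def estimate_image_tokens(width: int, height: int,
                          tile_size: int = 512,
                          tokens_per_tile: int = 170,
                          base_tokens: int = 85) -> int:
    """Estimate the context-token cost of one image under a
    tile-based encoding scheme: the image is cut into a grid of
    fixed-size tiles, each tile encodes to a fixed number of
    embeddings, plus one low-resolution overview pass.
    All constants are illustrative, not a real provider's pricing.
    """
    tiles_x = math.ceil(width / tile_size)
    tiles_y = math.ceil(height / tile_size)
    return base_tokens + tiles_x * tiles_y * tokens_per_tile

# A 1024x768 image becomes a 2x2 tile grid:
print(estimate_image_tokens(1024, 768))  # 85 + 4 * 170 = 765
```

Note how a small bump in resolution can cross a tile boundary and add an entire row or column of tiles, which is why resizing images before upload is a common cost control.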
Working with Images in Production
Sending an image to a VLM is trivial; building a production image pipeline that handles validation, preprocessing, output parsing, and failure modes is not. This module covers the full ingestion pipeline from receipt to parsed output, with emphasis on the silent failure modes that catch teams by surprise.
Audio and Speech AI
The audio AI stack spans automatic speech recognition (ASR), text-to-speech (TTS), and the orchestration layer that connects them to language models. This module covers the key components, their production metrics, and the voice AI pipeline pattern that powers real-time conversational applications.
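The orchestration layer can be sketched as a single conversational turn through the ASR, LLM, and TTS stages. This is a minimal sketch with each stage injected as a callable (the stubs below stand in for real services); the per-stage timings it records are the latency numbers that matter for real-time voice.

```python
import time
from typing import Callable

def voice_turn(audio: bytes,
               asr: Callable[[bytes], str],
               llm: Callable[[str], str],
               tts: Callable[[str], bytes]) -> tuple[bytes, dict]:
    """One turn of the ASR -> LLM -> TTS pipeline. Stages are
    injected so real services can be swapped in; wall-clock timing
    per stage is returned because stage latency, not just total
    latency, is what you tune in a voice pipeline."""
    timings: dict[str, float] = {}

    t0 = time.perf_counter()
    transcript = asr(audio)
    timings["asr_s"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    reply_text = llm(transcript)
    timings["llm_s"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    speech = tts(reply_text)
    timings["tts_s"] = time.perf_counter() - t0
    return speech, timings

# Lambdas stand in for real ASR/LLM/TTS services.
speech, timings = voice_turn(
    b"fake-pcm-audio",
    asr=lambda audio: "hello",
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode(),
)
```

Production systems stream each stage rather than running them sequentially, but the staged structure and per-stage metrics carry over directly.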
Multimodal Agents
Multimodal agents extend the standard agent loop with perception across images and audio, and with actions that produce visual or spoken output. This module covers GUI agents, vision as a tool call, multimodal memory, and the specific failure modes that multimodal perception introduces into agent systems.
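"Vision as a tool call" means the agent loop treats looking at an image like any other tool invocation. A minimal sketch under assumptions: the tool registry, the action schema, and the scripted stand-in model below are all hypothetical, chosen to show the dispatch structure rather than any real API.

```python
from typing import Callable

# Hypothetical registry: vision exposed as an ordinary tool the
# agent can call. A real entry would invoke a VLM on stored bytes.
TOOLS: dict[str, Callable[[dict], str]] = {
    "describe_image": lambda args: f"stub caption for {args['image_id']}",
}

def run_agent(model_step: Callable[[list], dict],
              user_msg: str, max_steps: int = 5) -> str:
    """Minimal agent loop: each step, the model either answers or
    emits a tool call whose result is appended to the history."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        action = model_step(history)
        if action["type"] == "answer":
            return action["content"]
        result = TOOLS[action["name"]](action["args"])
        history.append({"role": "tool", "name": action["name"],
                        "content": result})
    raise RuntimeError("agent exceeded step budget")

# Scripted stand-in for the model: first it calls the vision tool,
# then it answers with what the tool returned.
def scripted_model(history: list) -> dict:
    if history[-1]["role"] == "user":
        return {"type": "tool", "name": "describe_image",
                "args": {"image_id": "img-1"}}
    return {"type": "answer", "content": history[-1]["content"]}

print(run_agent(scripted_model, "What is in img-1?"))
```

The step budget matters: perception failures (a wrong caption, a missed UI element) tend to send agents into retry loops, so bounding the loop is a basic safeguard.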
Multimodal Safety
Images and audio introduce attack surfaces that text-only safety systems do not cover: injected instructions inside images, adversarial visual inputs, deepfakes, and PII embedded in non-text modalities. This module covers the threat model for multimodal inputs and the defensive patterns that close the gaps.
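One defensive pattern against instructions injected inside images is to run OCR on the image and scan the recovered text before it reaches the model. A sketch under assumptions: it presumes OCR text has already been extracted, and the patterns below are illustrative only; real systems pair heuristics like these with model-based classifiers.

```python
import re

# Illustrative instruction-like patterns; not an exhaustive filter.
INJECTION_PATTERNS = [
    r"ignore (all|any|previous|prior) instructions",
    r"system prompt",
    r"you are now",
    r"disregard .{0,40}(rules|guidelines)",
]

def scan_ocr_text(ocr_text: str) -> list[str]:
    """Flag instruction-like strings recovered from an image via
    OCR. Returns the matched patterns so the caller can log,
    block, or down-weight the image accordingly."""
    lowered = ocr_text.lower()
    return [pat for pat in INJECTION_PATTERNS
            if re.search(pat, lowered)]
```

Pattern scanning catches only the crudest attacks; it establishes the pipeline hook where stronger detectors slot in.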
Multimodal Evaluation
Evaluating multimodal AI is harder than evaluating text: there is no ground truth for 'describe this image', visual hallucinations are invisible without the source image, and labelling image datasets is expensive. This module covers evaluation approaches by task type, reference datasets, hallucination detection, and how to build a practical multimodal eval pipeline.
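For tasks that do have labels, object hallucination can be measured directly: compare the set of objects the model claims to see against the annotated set. A minimal sketch in the spirit of object-hallucination benchmarks; the function name and metric names are assumptions for illustration.

```python
def object_hallucination_metrics(predicted: set[str],
                                 ground_truth: set[str]) -> dict:
    """For labeled detection-style tasks, split errors into
    hallucinated objects (claimed but absent) and missed objects
    (present but unmentioned), reported as rates."""
    hallucinated = predicted - ground_truth
    missed = ground_truth - predicted
    return {
        "hallucination_rate":
            len(hallucinated) / len(predicted) if predicted else 0.0,
        "miss_rate":
            len(missed) / len(ground_truth) if ground_truth else 0.0,
    }

m = object_hallucination_metrics(
    predicted={"dog", "frisbee", "car"},
    ground_truth={"dog", "frisbee", "grass"},
)
print(m)  # {'hallucination_rate': 0.333..., 'miss_rate': 0.333...}
```

Open-ended tasks like free-form description have no such label set, which is where LLM-as-judge scoring against the source image comes in.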
Serving Multimodal Models
Serving a vision-language model is not the same as serving a text-only LLM: the vision encoder adds VRAM, image preprocessing adds latency, and variable image sizes complicate batching. This module covers the serving stack for VLMs and audio models, including the VRAM estimation mistakes that cause production OOMs.
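The classic OOM comes from sizing VRAM for the language model's weights alone. A back-of-envelope sketch that also counts the vision encoder, KV cache, and activation overhead; every default here (fp16 weights, a 4 GB KV-cache budget, 20% activation overhead) is an illustrative assumption to be replaced with measured numbers.

```python
def estimate_vlm_vram_gb(llm_params_b: float,
                         vision_params_b: float,
                         bytes_per_param: int = 2,      # fp16/bf16
                         kv_cache_gb: float = 4.0,      # assumed budget
                         activation_overhead: float = 0.2) -> float:
    """Back-of-envelope VRAM estimate for serving a VLM.
    Counts BOTH the language model and the vision encoder weights,
    plus a KV-cache budget (image tokens inflate it) and a rough
    activation/fragmentation overhead. Defaults are illustrative.
    """
    total_params = (llm_params_b + vision_params_b) * 1e9
    weights_gb = total_params * bytes_per_param / 1024**3
    return weights_gb * (1 + activation_overhead) + kv_cache_gb

# A 7B LLM with a 0.4B vision encoder at fp16:
print(round(estimate_vlm_vram_gb(7.0, 0.4), 1))
```

Sizing for "7B at fp16 is about 14 GB" and picking a 16 GB card is exactly the mistake this estimate is meant to catch: the encoder, cache, and overhead push the real footprint well past the weights.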
The Multimodal Frontier
Multimodal AI is advancing faster than any other part of the field: native multimodality, video understanding, and real-time audio-visual interaction are moving from research to production on a timescale of months. This module covers where the field is heading and, more importantly, what durable knowledge to invest in when specific capabilities become outdated within a year.