🤖 AI Explained

Multimodal Agents

Multimodal agents extend the standard agent loop with perception across images and audio, and with actions that produce visual or spoken output. This module covers GUI agents, vision as a tool call, multimodal memory, and the specific failure modes that multimodal perception introduces into agent systems.

Layer 1: Surface

A standard agent loop has the model read text, reason, and call text-based tools. A multimodal agent adds perception channels (it can see screenshots, read images, and hear audio) and potentially output channels (speaking, generating images, clicking UI elements). This broader set of inputs and outputs makes the agent more capable, but each new modality introduces new failure modes.

The most practically important class of multimodal agent today is the GUI agent (also called a computer use agent): an agent that interacts with software user interfaces by seeing the screen, deciding what to do, and executing mouse and keyboard actions. The loop is: take screenshot → send to VLM → receive action instruction (click [x,y], type “text”, scroll down) → execute → take new screenshot → repeat.

Grounding is the challenge that unifies most multimodal agent problems: linking a model’s language-level output (“click the Submit button”) to the specific physical location in an image or timestamp in audio where that entity actually exists. Grounding fails when the model hallucinates the presence of an element that doesn’t exist, or when it correctly identifies an element but localises it to the wrong position.

Visual hallucination (describing objects, text, or UI elements that are not present in the image) is more dangerous in agents than in assistants, because an agent acts on what it perceives. If the model says “I see the OK button” when it isn’t there, the agent may click the wrong thing or enter an infinite wait loop.

Why it matters

GUI agents and multimodal automation are among the highest-leverage applications of AI in 2025–2026, but they are also among the highest-failure-rate when deployed carelessly. The surface area for silent failures is larger than for text-only agents: a text agent that misunderstands a tool call typically produces an error; a GUI agent that clicks the wrong button may complete an irreversible action without any error signal.

Production Gotcha

Multimodal agents that interact with GUIs are highly sensitive to UI changes: a redesign, a font-size change, or a new modal dialog can cause the agent to fail silently by clicking the wrong element or never finding the expected UI state. Version-pin the UI or build explicit UI state validation into the agent loop.

GUI agents are brittle to UI changes because they localise elements visually rather than structurally (e.g., by DOM element ID). A button that moves 10 pixels, a dialog that appears on some user accounts but not others, or a responsive layout change on a different browser width can all cause the agent to click the wrong target with high confidence. Version-pinning the UI (a dedicated test environment with a frozen UI version) and adding explicit state assertions (“verify the expected dialog is visible before proceeding”) substantially reduces this failure mode.


Layer 2: Guided

Vision as a tool call

The cleanest way to add visual perception to an agent is to wrap image analysis as a tool. The agent calls the tool when it needs to see something; the tool returns structured data that the agent can reason about.

from dataclasses import dataclass
from typing import Optional
import base64


@dataclass
class ScreenElement:
    element_type: str          # "button", "text_field", "label", "link", etc.
    label: str                 # visible text or accessible name
    x: int                     # approximate x center coordinate
    y: int                     # approximate y center coordinate
    confidence: str            # "high", "medium", "low"


@dataclass
class ScreenAnalysis:
    description: str
    elements: list[ScreenElement]
    page_state: str            # "login", "dashboard", "error", "loading", etc.
    action_suggestions: list[str]


SCREEN_ANALYSIS_PROMPT = """
Analyse this screenshot and return a JSON object with exactly this structure:
{
  "description": "one sentence describing what is shown",
  "page_state": "one of: login, form, dashboard, error, loading, confirmation, other",
  "elements": [
    {
      "element_type": "button|text_field|label|link|checkbox|dropdown|other",
      "label": "the visible text or accessible name",
      "x": integer_x_coordinate,
      "y": integer_y_coordinate,
      "confidence": "high|medium|low"
    }
  ],
  "action_suggestions": ["list of possible next actions as strings"]
}
Only include interactive elements in 'elements'. Return only the JSON object.
"""


def analyse_screen(screenshot_bytes: bytes) -> ScreenAnalysis:
    """
    Tool: analyse a screenshot and return structured UI element information.
    This function is exposed to the agent as a tool call.
    """
    import json
    b64 = base64.b64encode(screenshot_bytes).decode("utf-8")

    response = llm.chat(
        model="frontier",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": b64},
                    "detail": "high",
                },
                {"type": "text", "text": SCREEN_ANALYSIS_PROMPT},
            ],
        }],
    )

    text = response.text.strip()
    if text.startswith("```"):
        lines = text.split("\n")
        text = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])

    data = json.loads(text)
    elements = [ScreenElement(**e) for e in data.get("elements", [])]
    return ScreenAnalysis(
        description=data["description"],
        elements=elements,
        page_state=data["page_state"],
        action_suggestions=data.get("action_suggestions", []),
    )
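
A cheap first defence against bad grounding is to reject elements whose reported coordinates fall outside the screenshot before the agent ever acts on them. A minimal sketch (Element and filter_in_bounds are illustrative names, with a stand-in dataclass so the check is self-contained):

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Minimal stand-in for ScreenElement: just a label and a centre point."""
    label: str
    x: int
    y: int


def filter_in_bounds(elements: list[Element], width: int, height: int) -> list[Element]:
    """Drop elements whose reported centre lies outside the screenshot.

    A model that hallucinates or mis-localises an element often reports
    coordinates outside the actual image, so this is a cheap sanity
    check to run before any click is issued.
    """
    return [e for e in elements if 0 <= e.x < width and 0 <= e.y < height]


kept = filter_in_bounds(
    [Element("OK", 512, 300), Element("Ghost", 1400, -20)],
    width=1280, height=720,
)
# Only the in-bounds "OK" element survives the filter.
```

Bounds checking does not catch an element hallucinated at a plausible position; for that, cross-checking against the accessibility tree (Layer 3) is the stronger mitigation.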

The GUI agent loop

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    WAIT = "wait"
    DONE = "done"
    FAIL = "fail"


@dataclass
class AgentAction:
    action_type: ActionType
    x: Optional[int] = None         # for click, scroll
    y: Optional[int] = None         # for click, scroll
    text: Optional[str] = None      # for type
    direction: Optional[str] = None # for scroll: "up" or "down"
    reason: str = ""


@dataclass
class AgentStep:
    step_num: int
    screenshot_description: str
    action_taken: AgentAction
    outcome: str


def decide_next_action(
    task: str,
    screen: ScreenAnalysis,
    history: list[AgentStep],
) -> AgentAction:
    """
    Ask the LLM what to do next given the current screen state.
    The model receives the task, the screen analysis, and action history.
    """
    history_text = "\n".join(
        f"Step {s.step_num}: {s.action_taken.action_type.value} — {s.action_taken.reason} → {s.outcome}"
        for s in history[-5:]   # last 5 steps to keep context manageable
    )
    elements_text = "\n".join(
        f"  [{e.element_type}] '{e.label}' at ({e.x},{e.y}) confidence={e.confidence}"
        for e in screen.elements
    )

    prompt = f"""
Task: {task}

Current screen: {screen.description}
Page state: {screen.page_state}

Interactive elements:
{elements_text}

Recent action history:
{history_text if history_text else "No actions taken yet."}

What is the single best next action to make progress toward the task?
Respond with a JSON object:
{{
  "action_type": "click|type|scroll|wait|done|fail",
  "x": integer_or_null,
  "y": integer_or_null,
  "text": "string_or_null",
  "direction": "up|down|null",
  "reason": "one sentence explaining why"
}}
If the task is complete, use "done". If it is impossible to proceed, use "fail".
Return only the JSON object.
"""
    import json
    response = llm.chat(
        model="frontier",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.text.strip()
    if text.startswith("```"):
        # The model may wrap the JSON in a code fence despite the
        # instruction not to; strip it, as analyse_screen does.
        lines = text.split("\n")
        text = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])
    data = json.loads(text)
    return AgentAction(
        action_type=ActionType(data["action_type"]),
        x=data.get("x"),
        y=data.get("y"),
        text=data.get("text"),
        direction=data.get("direction"),
        reason=data.get("reason", ""),
    )


def run_gui_agent(
    task: str,
    take_screenshot: callable,    # () -> bytes
    execute_action: callable,     # (AgentAction) -> str (outcome description)
    max_steps: int = 20,
) -> dict:
    """
    Run the GUI agent loop until task completion, failure, or step limit.
    take_screenshot: function that returns current screen as PNG bytes
    execute_action: function that performs the action and returns an outcome string
    """
    history: list[AgentStep] = []

    for step_num in range(1, max_steps + 1):
        screenshot = take_screenshot()
        screen = analyse_screen(screenshot)

        action = decide_next_action(task, screen, history)

        if action.action_type == ActionType.DONE:
            return {"status": "success", "steps": step_num, "history": history}

        if action.action_type == ActionType.FAIL:
            return {"status": "failed", "reason": action.reason, "steps": step_num}

        outcome = execute_action(action)
        history.append(AgentStep(
            step_num=step_num,
            screenshot_description=screen.description,
            action_taken=action,
            outcome=outcome,
        ))

    return {"status": "max_steps_reached", "steps": max_steps, "history": history}
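
One gap in the loop above: an action can be schema-valid JSON yet still malformed for its type, such as a click with no coordinates or a type with no text. A small validator run before execute_action fails fast instead of executing a broken action. A sketch over the decoded dict (the field requirements mirror AgentAction; names are illustrative):

```python
# Fields each action type needs before it can be executed.
REQUIRED_FIELDS = {
    "click": ("x", "y"),
    "type": ("text",),
    "scroll": ("direction",),
    "wait": (),
    "done": (),
    "fail": (),
}


def validate_action(data: dict) -> list[str]:
    """Return a list of problems with a decoded action dict (empty = valid).

    Checks that the action type is known and that the fields that type
    needs are present, so a malformed model response is rejected before
    any mouse or keyboard event is issued.
    """
    atype = data.get("action_type")
    if atype not in REQUIRED_FIELDS:
        return [f"unknown action_type: {atype!r}"]
    return [f"missing field for {atype}: {f!r}"
            for f in REQUIRED_FIELDS[atype] if data.get(f) is None]
```

In the loop, a non-empty problem list would be fed back to the model as the step outcome rather than executed.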

UI state validation

Build explicit assertions into the agent loop to catch UI surprises before they cause wrong actions.

def assert_ui_state(
    screen: ScreenAnalysis,
    expected_state: str,
    required_elements: list[str],
) -> tuple[bool, str]:
    """
    Validate that the UI is in the expected state before proceeding.
    Returns (is_valid, error_message).
    """
    if screen.page_state != expected_state:
        return False, (
            f"Expected page state '{expected_state}', got '{screen.page_state}'. "
            f"Screen description: {screen.description}"
        )

    element_labels = {e.label.lower() for e in screen.elements}
    missing = [req for req in required_elements if req.lower() not in element_labels]
    if missing:
        return False, f"Required UI elements not found: {missing}"

    return True, ""
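
In a dynamic UI, a failed assertion often just means the page has not finished rendering, so it is worth retrying the capture before concluding the state is wrong. A sketch (get_screen and check are hypothetical callbacks: a capture-and-analyse function, and an assert_ui_state-style validator):

```python
import time


def wait_for_state(get_screen, expected_state, required_elements,
                   check, retries=3, delay_s=1.0):
    """Re-capture and re-check until the UI reaches the expected state.

    get_screen: () -> screen analysis (hypothetical capture callback)
    check: validator returning (is_valid, error_message), in the style
           of assert_ui_state above
    Retrying before failing distinguishes "still loading" from
    "genuinely wrong screen".
    """
    last_error = ""
    for _ in range(retries):
        screen = get_screen()
        ok, last_error = check(screen, expected_state, required_elements)
        if ok:
            return screen
        time.sleep(delay_s)
    raise RuntimeError(f"UI never reached '{expected_state}': {last_error}")
```

This directly addresses the async-rendering failure mode: the loop waits out spinners instead of acting on a half-loaded screen.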

Layer 3: Deep Dive

Grounding: linking language to spatial coordinates

Grounding is the process of mapping a high-level language reference (“the blue Submit button”) to a specific location in the image. This is harder than it appears:

  • Referring expression comprehension: the model must identify which element the description refers to, not just detect that a button exists.
  • Coordinate precision: clicking at the right pixel matters. An error of even 20px can result in clicking the wrong element if elements are densely packed.
  • Out-of-viewport elements: if the target element has been scrolled off-screen, it does not appear in the screenshot. The model must recognise this and scroll before clicking.

Current VLMs produce bounding box coordinates that have meaningful errors for small targets, densely packed UIs, and scaled/zoomed displays. Strategies to improve grounding accuracy:

  • Crop and zoom: crop the region of interest before sending it to the model. Tradeoff: requires knowing the region in advance.
  • Set-of-mark prompting: overlay numbered labels on each detected element and ask the model to name the number. Tradeoff: requires a preliminary element detection step.
  • Accessibility tree: parse the DOM/accessibility API instead of relying on vision. Tradeoff: only works for accessible applications.
  • Ensemble grounding: ask the model multiple times and take the modal coordinate. Tradeoff: higher latency and cost.
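
The ensemble strategy reduces to a pure aggregation step: query the model several times for the same element, cluster the returned points, and click the centre of the largest cluster. A sketch (the function name and the 10-pixel tolerance are illustrative):

```python
def modal_coordinate(candidates: list[tuple[int, int]],
                     tolerance: int = 10) -> tuple[int, int]:
    """Pick the most-agreed-upon click point from repeated model queries.

    candidates: (x, y) points returned by independent grounding queries.
    Points within `tolerance` pixels of a cluster's first point (on both
    axes) count as votes for the same location; the centroid of the
    largest cluster is returned, so a single outlier query cannot
    redirect the click.
    """
    clusters: list[list[tuple[int, int]]] = []
    for (x, y) in candidates:
        for cluster in clusters:
            cx, cy = cluster[0]
            if abs(x - cx) <= tolerance and abs(y - cy) <= tolerance:
                cluster.append((x, y))
                break
        else:
            clusters.append([(x, y)])
    best = max(clusters, key=len)
    return (sum(p[0] for p in best) // len(best),
            sum(p[1] for p in best) // len(best))
```

With three queries agreeing near (100, 200) and one outlier at (400, 50), the outlier is simply outvoted.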

Multimodal memory

Agents that operate over long sessions need to store and retrieve information from prior steps. For multimodal agents, memory includes both text (summaries of observations, action history) and visual references (screenshots of key states, extracted image regions).

@dataclass
class MultimodalMemoryEntry:
    entry_id: str
    step_num: int
    text_summary: str              # text description of what was observed
    screenshot_thumbnail: bytes    # compressed thumbnail for later reference
    embedding: list[float]         # embedding of the text summary for retrieval
    tags: list[str]                # e.g., ["login_screen", "error", "confirmation"]


def store_observation(
    screen: ScreenAnalysis,
    action: AgentAction,
    step_num: int,
    memory_store: list[MultimodalMemoryEntry],
) -> None:
    """Store a compressed record of this step for later retrieval."""
    summary = f"Step {step_num}: Saw {screen.description}. Took action: {action.reason}"
    embedding = llm.embed(model="embedding", input=summary).embedding

    memory_store.append(MultimodalMemoryEntry(
        entry_id=f"step_{step_num}",
        step_num=step_num,
        text_summary=summary,
        screenshot_thumbnail=b"",   # in production: compress screenshot to thumbnail
        embedding=embedding,
        tags=[screen.page_state],
    ))
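
Retrieval over this store can stay text-to-text: embed the query with the same text embedding model used in store_observation and rank entries by cosine similarity over their text_summary embeddings, which sidesteps the comparability problems of mixing image and text vectors in one index. A sketch in pure Python (no vector store assumed; entries only need an `.embedding` attribute, as in MultimodalMemoryEntry):

```python
import math


def retrieve_memories(query_embedding: list[float], memory_store: list, k: int = 3) -> list:
    """Return the k entries whose text-summary embeddings are most
    similar to the query embedding (cosine similarity).

    memory_store: objects with an `.embedding` attribute, as in
    MultimodalMemoryEntry above.
    """
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(memory_store,
                    key=lambda m: cosine(query_embedding, m.embedding),
                    reverse=True)
    return ranked[:k]
```

The screenshot thumbnail rides along with the retrieved entry, so the agent can re-inspect the visual state only for the steps retrieval deems relevant.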

Failure taxonomy for multimodal agents

  • Visual hallucination: the model describes elements not present in the screenshot. Detection: cross-reference with the accessibility tree or DOM inspection.
  • Stale screenshot: the agent acts on an old screenshot after the UI has changed. Detection: add timestamp validation; re-capture before each action.
  • Coordinate drift: clicks land near but not on the target. Detection: log the click against the intended target; verify the element changed after the click.
  • Ambiguous instructions: “Click OK” when multiple OK buttons exist. Detection: require unique element identification.
  • Infinite loop: the agent keeps trying the same failing action. Detection: detect repeated action-state pairs; abort with failure.
  • Resolution sensitivity: small text and icons are unreadable at the screenshot resolution. Detection: test at the actual display resolution and DPI.
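
The infinite-loop entry is the easiest to implement as a concrete guard: track (page state, action) pairs across the step history and abort once the same pair repeats too often. A sketch (the threshold of 3 is illustrative):

```python
from collections import Counter


def is_looping(history: list[tuple[str, str]], threshold: int = 3) -> bool:
    """Detect an agent stuck repeating the same action in the same state.

    history: (page_state, action_descriptor) pairs, oldest first.
    Returns True once any identical pair has occurred `threshold` times,
    which is the signal to abort with "fail" rather than burn the
    remaining step budget.
    """
    counts = Counter(history)
    return any(n >= threshold for n in counts.values())
```

In run_gui_agent above, the check would run after appending each AgentStep, returning a failed status as soon as it fires.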


Multimodal Agents: Check your understanding

Q1

A GUI agent is tasked with submitting a form. It successfully identifies and fills the form fields in testing. After a UI redesign that changes button labels and layout, the agent clicks the wrong element and silently submits incomplete data. What architectural pattern would have caught this before it reached production?

Q2

An agent uses vision as a tool call: it submits an image to a VLM and receives structured JSON describing the image contents, which feeds into subsequent reasoning steps. The VLM returns a description that includes objects not present in the image. What failure mode is this, and what is the correct mitigation?

Q3

A multimodal agent stores image embeddings in a vector store alongside text embeddings for memory retrieval. At query time, it retrieves memories using a text query. What problem does mixed-modality memory retrieval introduce?

Q4

A computer-use agent successfully completes a task in a staging environment where the UI is static. In production, dynamic content (loading spinners, async-rendered elements) causes the agent to act on elements that have not fully loaded. What is the correct fix?

Q5

What is the key architectural difference between 'vision as a tool call' and 'native multimodal reasoning', and when does the distinction matter?