AI & Computational Science

Dissecting model behavior through agent trajectories

AI Insight

This research identifies and addresses the "intent-execution gap" in AI agents: the mismatch between what a model intends to do and what the agent harness actually executes. The authors developed Simple Strands Agent (SSA), a customizable harness that reproduces or improves on the pass@1 performance reported across multiple model families (Claude, Gemini, GPT, Grok, Qwen) on SWE-Pro, SWE-Verified, and Terminal-Bench-2. By analyzing 138,000 agent trajectories generated by SSA, they show that although frontier models reach similar overall success rates, they exhibit distinct problem-solving behaviors, differing in edit frequency, testing activity, and phase transitions during autonomous problem solving.


This work highlights that AI agent performance depends not just on model quality but on alignment between the model and its harness. The findings offer a basis for evaluating and deploying AI agents beyond simple pass/fail metrics by showing how different models approach complex tasks, which could inform more effective agent design.
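To make the idea of a model-harness mismatch concrete, here is a minimal Python sketch. It is not taken from SSA: the `ToolCall` shape, the `HARNESS_TOOLS` table, and the tool and argument names are all hypothetical, chosen only to show one way such a gap could surface (a model that names an argument differently from what the harness accepts).

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """A tool call as emitted by the model (hypothetical shape)."""
    name: str
    args: dict


# Hypothetical harness-side signatures: tool name -> accepted argument names.
HARNESS_TOOLS = {
    "edit_file": {"path", "old_text", "new_text"},
    "run_shell": {"command"},
}


def find_gap(call: ToolCall) -> list[str]:
    """List mismatches between what the model sent and what the harness accepts."""
    accepted = HARNESS_TOOLS.get(call.name)
    if accepted is None:
        return [f"unknown tool: {call.name}"]
    sent = set(call.args)
    problems = [f"unexpected argument: {a}" for a in sorted(sent - accepted)]
    problems += [f"missing argument: {a}" for a in sorted(accepted - sent)]
    return problems


# The model assumes the argument is called "file_path"; the harness expects "path".
call = ToolCall("edit_file", {"file_path": "a.py", "old_text": "x", "new_text": "y"})
print(find_gap(call))
# ['unexpected argument: file_path', 'missing argument: path']
```

A harness that silently dropped the unrecognized argument would execute something other than what the model intended. Surfacing the mismatch, or adapting the tool signature to the model's preference, is one plausible way to narrow the gap.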


arXiv:2606.17454v2 Announce Type: replace
Abstract: AI agent performance is not just a modeling problem; it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model’s full capabilities from translating into agent performance. We formalize this as the "intent-execution" gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called "Simple Strands Agent" (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers, which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
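The finer-grained metrics named in the abstract can be illustrated with a small Python sketch. The definitions below are assumptions made for illustration, not the paper's: a trajectory is modeled as a plain list of step labels ("explore", "edit", "test"), edit frequency and testing activity are taken as the share of steps with that label, and a phase transition is counted whenever two consecutive steps carry different labels. The function name `summarize` is likewise hypothetical.

```python
from collections import Counter


def summarize(trajectory: list[str]) -> dict:
    """Compute illustrative per-trajectory metrics (assumed definitions)."""
    counts = Counter(trajectory)
    n = len(trajectory)
    # A phase transition here means two adjacent steps with different labels.
    transitions = sum(1 for a, b in zip(trajectory, trajectory[1:]) if a != b)
    return {
        "edit_frequency": counts["edit"] / n if n else 0.0,
        "testing_activity": counts["test"] / n if n else 0.0,
        "phase_transitions": transitions,
    }


# Two made-up trajectories that both end with a test step.
interleaved = ["explore", "explore", "edit", "test", "edit", "test"]
batched = ["explore", "edit", "edit", "edit", "edit", "test"]

print(summarize(interleaved))  # 2 edits, 2 tests, 4 transitions
print(summarize(batched))      # 4 edits, 1 test, 2 transitions
```

Both made-up trajectories could end in the same pass or fail outcome, yet one alternates between editing and testing while the other edits in a batch and tests once. That is the kind of behavioral difference a pass@1 number alone does not show.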

Source: Dissecting model behavior through agent trajectories