Agent Evolving

Research on Agent Evolution and Learning

📊 50 Papers 📅 Updated: 2026-05-14
1
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz et al. (13 authors)
📅 2026-05-13
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end...
2
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Alberto G. Rodríguez Salgado
📅 2026-05-13
Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior...
3
Harnessing Agentic Evolution
Jiayi Zhang, Yongfeng Gu, Jianhao Ruan et al. (13 authors)
📅 2026-05-13
Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but...
4
EconAI: Dynamic Persona Evolution and Memory-Aware Agents in Evolving Economic Environments
Annie Liu, Zane Cao, Lang Chen et al. (5 authors)
📅 2026-05-13
The integration of large language models (LLMs) in economic simulations has significantly enhanced agent-based modeling, yet existing frameworks struggle to capture the interplay between short-term optimization and long-term strategic planning. Conventional approaches rely on static data-driven predictions, failing to incorporate adaptive behaviors influenced by economic sentiment, market...
5
Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Trung Nguyen Quang, Yiming Gao, Fanyi Pu et al. (6 authors)
📅 2026-05-13
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's...
6
ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
Yitian Yang, Yiqun Duan, Linghan Huang et al. (7 authors)
📅 2026-05-13
Large language model (LLM)-based multi-agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges...
7
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
Hongji Pu, Xinyuan Song, Liang Zhao
📅 2026-05-13
Large language model agents increasingly rely on skill libraries for multi-step tasks, yet these libraries can accumulate persistent defects as skills are added, reused, patched, and linked to changing dependencies. We call this failure mode skill technical debt: library-level defects that may not break a single skill locally but can harm future retrieval, composition, and execution. Existing...
8
Identifying AI Web Scrapers Using Canary Tokens
Steven Seiden, Triss Ren, Caroline Zhang et al. (6 authors)
📅 2026-05-13
From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they...
9
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
Seokha Moon, Minseung Lee, Joon Seo et al. (5 authors)
📅 2026-05-13
End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to...
10
How to Interpret Agent Behavior
Jie Gao, Kaiser Sun, Jen-tse Huang et al. (11 authors)
📅 2026-05-13
Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in...
11
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
Peng Kang, Bixuan Li, Xiaoya Huang et al. (8 authors)
📅 2026-05-13
The Materials Genome Initiative catalyzed the proliferation of centralized platforms--SaaS, PaaS, and IaaS--that aggregate computational and experimental resources for accelerated materials discovery. In parallel, breakthroughs in large language models (LLMs) and autonomous agents have created powerful new reasoning capabilities for scientific research. Yet a critical "last mile"...
12
Unweighted ranking for value-based decision making with uncertainty
Aarón López García, Natalia Criado, Jose Such
📅 2026-05-13
As intelligent systems are increasingly implemented in our society to make autonomous decisions, their commitment to human values raises serious concerns. Their alignment with human values remains a critical challenge because it can jeopardise the integrity and security of citizens. For this reason, an innovative human-centred and values-driven approach to decision making is required. In this...
13
Position: Assistive Agents Need Accessibility Alignment
Jie Hu, Changyuan Yan, Yu Zheng et al. (5 authors)
📅 2026-05-13
Assistive agents for Blind and Visually Impaired (BVI) users require accessibility alignment as a first-class design objective. Despite rapid progress in agentic AI, most systems are designed and evaluated under assumptions of sighted interaction, low-cost verification, and tolerable trial-and-error, leading to systematic failures in assistive scenarios that cannot be resolved by model scaling or...
14
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
Asim Osman, Sasha Abramowitz, Mark Bergh et al. (16 authors)
📅 2026-05-13
Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces,...
15
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Chengzhi Shen, Weixiang Shen, Tobias Susetzky et al. (11 authors)
📅 2026-05-13
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal...
16
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
Jiabei Liu, Wenyu Mao, Junfei Tan et al. (7 authors)
📅 2026-05-13
Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary...
17
MMSkills: Towards Multimodal Skills for General Visual Agents
Kangning Zhang, Shuai Shao, Qingyao Li et al. (11 authors)
📅 2026-05-13
Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual...
18
Cognifold: Always-On Proactive Memory via Cognitive Folding
Suli Wang, Yiqun Duan, Yu Deng et al. (6 authors)
📅 2026-05-13
Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce Cognifold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. Cognifold continuously folds fragmented event streams into...
19
TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
Zabir Al Nazi, Shubhashis Roy Dipta
📅 2026-05-13
Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether...
20
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
Liangtian Liu, Zeyuan Wang, Ziyu Li et al. (11 authors)
📅 2026-05-13
The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or...
21
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
Mingzhe Huang, Weijun Wang, Xin Ding et al. (10 authors)
📅 2026-05-13
In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous...
22
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
Hailin Zhong, Shengxin Zhu
📅 2026-05-13
Foundation models have transformed automated code generation, yet autonomous software-engineering agents remain unreliable in realistic development settings. The dominant explanation locates this gap in model capability. We propose a different locus: software-engineering capability emerges from a model-harness-environment system, in which a runtime substrate -- the harness -- mediates how a...
23
Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin
Markus Wenzel, Tobias Strapatsas, Jessika Kress et al. (6 authors)
📅 2026-05-13
Emergency departments (ED) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes,...
24
Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning
Qinchuan Cheng, Zhantao Gong, Pengzhan Sun et al. (6 authors)
📅 2026-05-13
Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics,...
25
What Limits Vision-and-Language Navigation?
Yunheng Wang, Yuetong Fang, Taowen Wang et al. (12 authors)
📅 2026-05-13
Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling...
26
IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation
Joy Bose
📅 2026-05-13
Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge...
27
Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention
Yuanzhe Wang, Tian Zhi, Zihang Wei et al. (11 authors)
📅 2026-05-13
Multi-Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision-free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair-based solvers like LNS2, initial plan quality...
28
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
Tom Zehle
📅 2026-05-13
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these...
29
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
Yucheng Guo, Yongjian Guo, Zhong Guan et al. (12 authors)
📅 2026-05-13
The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth...
30
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
Xiao Liu, Nayu Liu, Junnan Zhu et al. (9 authors)
📅 2026-05-13
Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for...
31
An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing
Hanwen Zhang, Dusit Niyato, Wei Zhang et al. (5 authors)
📅 2026-05-13
In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile,...
32
Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
Hao Zhou, Tiru Wu, Yan Jiang et al. (6 authors)
📅 2026-05-13
Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi-agent systems primarily focus...
33
Decoupled Planning for Multiple Omega-Regular Objectives
Guy Avni, Thomas A. Henzinger, Kaushik Mallik et al. (5 authors)
📅 2026-05-13
We study the problem of generating paths on a graph that satisfy a collection of ω-regular objectives. We propose a decoupled framework in which each objective is assigned to an independent agent that selects a local policy, while a scheduler -- oblivious to the graph and objective -- dynamically composes these policies into a single path. We ask when such a composition satisfies all objectives,...
34
When Does Hierarchy Help? Benchmarking Agent Coordination in Event-Driven Industrial Scheduling
Ziqi Wang, Yuhao Yang, Zhiwei Ling et al. (5 authors)
📅 2026-05-13
Recent advances in agent and multi-agent systems have shown strong performance on tool use, reasoning, and collaborative tasks. However, existing benchmarks mostly evaluate task completion in weakly coupled environments, and provide limited support for studying coordination in shared, dynamically evolving systems with hierarchy and coupled constraints. This leaves an important question...
35
Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications
Maxwell Standen, Junae Kim, Claudia Szabo
📅 2026-05-13
Multi-agent systems rely on communication for information sharing and action coordination, which exposes a vulnerability to attacks. We investigate single-victim communication perturbation attacks against Multi-Agent Reinforcement Learning-trained systems and propose methods that use gradient information from the Jacobian to identify which messages, agent, and timesteps are most susceptible to...
36
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
Zixing Lei, Changxing Liu, Yichen Xiong et al. (8 authors)
📅 2026-05-13
Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools...
37
A Multi-Agent Orchestration Framework for Venture Capital Due Diligence
Grigorios Alexandrou, Katerina Pramatari
📅 2026-05-13
We present a fully automated multi-agent framework for corporate due diligence and market analysis in venture capital. The system runs on an event-driven orchestration architecture, combining Large Language Models (LLMs) with real-time web retrieval to synthesize unstructured data into structured investment intelligence. A central technical contribution is a programmatic extraction pipeline that...
38
Counterfactual Reasoning for Causal Responsibility Attribution in Probabilistic Multi-Agent Systems
Chunyan Mu, Muhammad Najib
📅 2026-05-13
Responsibility allocation -- determining the extent to which agents are accountable for outcomes -- is a fundamental challenge in the design and analysis of multi-agent systems. In this work, we model such systems as concurrent stochastic multi-player games and introduce a notion of retrospective (backward) counterfactual responsibility, which quantifies an agent's accountability for...
39
An Agentic LLM-Based Framework for Population-Scale Mental Health Screening
Giuliano Lorenzoni, Paulo Alencar, Donald Cowan
📅 2026-05-13
Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured...
40
No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills
Ying Li, Hongbo Wen, Yanju Chen et al. (6 authors)
📅 2026-05-13
LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's...
41
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
Yuxin Liu, Ziang Ye, Yueqing Sun et al. (9 authors)
📅 2026-05-13
Current interactive LLM agents rely on goal-conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial-and-error, resulting in an Epistemic Bottleneck that traps them in inefficient failure...
42
Conveyor Parcel Routing with Order-Contiguous Arrivals
Takuro Kato, Keisuke Okumura
📅 2026-05-13
In warehouse logistics, parcels released from the outfeed of an automated storage system must be routed through conveyor networks to workstations. Beyond collision avoidance, practical operations impose an additional requirement of order-contiguous arrivals: at each delivery point, parcels belonging to the same order must arrive as a consecutive block in the arrival sequence to reduce downstream...
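The order-contiguity requirement described in this abstract can be illustrated with a small check. This is a minimal sketch, not the authors' method: the function name `is_order_contiguous` and the representation of an arrival sequence as a list of order IDs are assumptions made for illustration only.

```python
from typing import Hashable, Sequence


def is_order_contiguous(arrivals: Sequence[Hashable]) -> bool:
    """Return True if each order ID forms one consecutive block.

    `arrivals` is a hypothetical representation: the order ID of each
    parcel, listed in the sequence it reaches a single delivery point.
    """
    seen = set()      # order IDs whose block has already started
    previous = None   # order ID of the parcel that arrived just before
    first = True      # guards the comparison for the very first parcel
    for order_id in arrivals:
        if first or order_id != previous:
            # A new block starts here; the order must not have appeared before.
            if order_id in seen:
                return False
            seen.add(order_id)
            previous = order_id
            first = False
    return True
```

For example, the sequence `["A", "A", "B", "C", "C"]` satisfies the requirement, while `["A", "B", "A"]` does not, because order `A` is split by a parcel from order `B`.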
43
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Adarsh Kumarappan, Ananya Mujoo
📅 2026-05-13
LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct....
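One possible reading of the yield rate described in this abstract can be sketched as follows. The abstract is truncated and does not give the exact formula, so the denominator used here (answers that were correct before the simulated disagreement) is an assumption, as are the function name `yield_rate` and the boolean-list inputs.

```python
from typing import Sequence


def yield_rate(correct_before: Sequence[bool], correct_after: Sequence[bool]) -> float:
    """Fraction of initially correct answers that become incorrect.

    Assumption: yield is normalized by the number of answers that were
    correct before peer disagreement was simulated. The source abstract
    does not state the normalization.
    """
    if len(correct_before) != len(correct_after):
        raise ValueError("both sequences must cover the same items")
    initially_correct = sum(1 for b in correct_before if b)
    if initially_correct == 0:
        return 0.0  # nothing could flip from correct to incorrect
    flipped = sum(
        1 for b, a in zip(correct_before, correct_after) if b and not a
    )
    return flipped / initially_correct
```

With `correct_before = [True, True, False, True]` and `correct_after = [False, True, False, True]`, one of three initially correct answers flips, so the function returns one third under this assumed definition.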
44
Useful Memories Become Faulty When Continuously Updated by LLMs
Dylan Zhang, Yanshan Lin, Zhengkun Wu et al. (7 authors)
📅 2026-05-13
Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new...
45
Position: Agentic AI System Is a Foreseeable Pathway to AGI
Junwei Liao, Shuai Li, Muning Wen et al. (5 authors)
📅 2026-05-13
Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic...
46
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Chao Hao, Jun Xu, Ji Du et al. (9 authors)
📅 2026-05-13
Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation...
47
When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
Young Hyun Cho, Will Wei Sun
📅 2026-05-13
LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models or...
48
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
Priyam Sahoo, Gaurav Mittal, Xiaomin Li et al. (7 authors)
📅 2026-05-13
Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic trial-and-error process as equivalent. We show that this equivalence is empirically false. We evaluate 2,614 OpenHands trajectories from eight model backends on 60 SWE-bench Verified tasks. Of these, 47 have...
49
Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue
Vardhan Dongre, Dilek Hakkani-Tür
📅 2026-05-13
Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations...
50
SHM-Agents: A Generalist-Specialist Integrated Agent System for Structural Health Monitoring
Yuequan Bao, Xing Li, Huabin Sun et al. (6 authors)
📅 2026-05-13
Artificial intelligence is increasingly used to simplify complex tasks. In engineering applications of structural health monitoring (SHM), existing specialized algorithms, while effective, often face high implementation barriers, limited interoperability and complex training procedures. To overcome these challenges, this paper proposes SHM-Agents, a generalist-specialist agent system that...