
Agent Evolving

Research on Agent Evolution and Learning

📊 50 Papers 📅 Updated: 2026-03-18
1
Internalizing Agency from Reflective Experience
Rui Ge, Yichao Fu, Yuyang Qian et al. (7 authors)
📅 2026-03-17
Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they...
2
Learning to Present: Inverse Specification Rewards for Agentic Slide Generation
Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam
📅 2026-03-17
Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system...
3
Anticipatory Planning for Multimodal AI Agents
Yongyuan Liang, Shijie Zhou, Yu Gu et al. (9 authors)
📅 2026-03-17
Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that...
4
Nonstandard Errors in AI Agents
Ruijiang Gao, Steven Chong Xiao
📅 2026-03-17
We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015-2024), we find that AI agents exhibit sizable nonstandard errors (NSEs), that is, uncertainty from agent-to-agent...
5
Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
Caglar Yildirim
📅 2026-03-17
Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic...
6
IQuest-Coder-V1 Technical Report
Jian Yang, Wei Zhang, Shawn Guo et al. (38 authors)
📅 2026-03-17
In this report, we introduce the IQuest-Coder-V1 series (7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with...
7
When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Jun Liu, Pu Zhao, Zhenglun Kong et al. (15 authors)
📅 2026-03-17
Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions,...
8
Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
Jiawei Mao, Hardy Chen, Haoqin Tu et al. (10 authors)
📅 2026-03-17
Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use...
9
When Openclaw Agents Learn from Each Other: Insights from Emergent AI Agent Communities for Human-AI Partnership in Education
Eason Chen, Ce Guan, Ahmed Elshafiey et al. (8 authors)
📅 2026-03-17
The AIED community envisions AI evolving "from tools to teammates," yet our understanding of AI teammates remains limited to dyadic human-AI interactions. We offer a different vantage point: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Drawing on a month of...
10
What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline
Benoît Alcaraz
📅 2026-03-17
In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in "Le avventure di Pinocchio - Storia di un burattino", this thesis proposes a pipeline...
11
Runtime Governance for AI Agents: Policies on Paths
Maurits Kaptein, Vassilis-Javed Khan, Andriy Podstavnychy
📅 2026-03-17
AI agents -- systems that plan, reason, and act using large language models -- produce non-deterministic, path-dependent behavior that cannot be fully governed at design time. By governed we mean striking the right balance between the highest possible rate of successful task completion and the legal, data-breach, reputational, and other costs associated with running agents. We argue that the...
12
Malicious Or Not: Adding Repository Context to Agent Skill Classification
Florian Holzbauer, David Schmidt, Gabriel Gegenhuber et al. (5 authors)
📅 2026-03-17
Agent skills extend local AI agents, such as Claude Code or Open Claw, with additional functionality, and their popularity has led to the emergence of dedicated skill marketplaces, similar to app stores for mobile applications. Simultaneously, automated skill scanners were introduced, analyzing the skill description available in SKILL.md, to verify their benign behavior. The results for...
13
DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis
Lei Wang, Min Huang, Eduard Dragut
📅 2026-03-17
Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA--particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples--remains underexplored. In this work, we introduce DanceHA, a multi-agent framework...
14
Multi-Agent Reinforcement Learning Counteracts Delayed CSI in Multi-Satellite Systems
Marios Aristodemou, Yasaman Omid, Sangarapillai Lambotharan et al. (5 authors)
📅 2026-03-17
The integration of satellite communication networks with next-generation (NG) technologies is a promising approach towards global connectivity. However, the quality of services is highly dependent on the availability of accurate channel state information (CSI). Channel estimation in satellite communications is challenging due to the high propagation delay between terrestrial users and satellites,...
15
RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
Linghua Zhang, Jun Wang, Jingtong Wu et al. (4 authors)
📅 2026-03-17
Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial...
16
TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas
Ai Jian, Xiaoyun Zhang, Wanrou Du et al. (8 authors)
📅 2026-03-17
Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in...
17
Visual Distraction Undermines Moral Reasoning in Vision-Language Models
Xinyi Yang, Chenheng Xu, Weijun Hong et al. (7 authors)
📅 2026-03-17
Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about generalization to visual inputs. Existing moral evaluation benchmarks rely on text-only formats and lack...
18
PlotTwist: A Creative Plot Generation Framework with Small Language Models
Abhinav Thorat, Ravi Kolla, Jyotin Goel et al. (4 authors)
📅 2026-03-17
Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized...
19
Fanar 2.0: Arabic Generative AI Stack
FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad et al. (37 authors)
📅 2026-03-17
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic...
20
FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment
Qinhong Lin, Ruitao Feng, Yinglun Feng et al. (10 authors)
📅 2026-03-17
We study alpha factor mining, the automated discovery of predictive signals from noisy, non-stationary market data, under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability...
21
Explainable machine learning workflows for radio astronomical data processing
S. Yatawatta, A. Ahmadi, B. Asabere et al. (6 authors)
📅 2026-03-17
Radio astronomy relies heavily on efficient and accurate processing pipelines to deliver science ready data. With the increasing data flow of modern radio telescopes, manual configuration of such data processing pipelines is infeasible. Machine learning (ML) is already emerging as a viable solution for automating data processing pipelines. However, almost all existing ML enabled pipelines are of...
22
Learning to Predict, Discover, and Reason in High-Dimensional Discrete Event Sequences
Hugo Math
📅 2026-03-17
Electronic control units (ECUs) embedded within modern vehicles generate a large number of asynchronous events known as diagnostic trouble codes (DTCs). These discrete events form complex temporal sequences that reflect the evolving health of the vehicle's subsystems. In the automotive industry, domain experts manually group these codes into higher-level error patterns (EPs) using Boolean...
23
VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
Zhengbo Zhang, Jinbo Su, Zhaowen Zhou et al. (17 authors)
📅 2026-03-17
The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. But existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new...
24
Adaptive Theory of Mind for LLM-based Multi-Agent Coordination
Chunjiang Mu, Ya Zeng, Qiaosheng Zhang et al. (9 authors)
📅 2026-03-17
Theory of Mind (ToM) refers to the ability to reason about others' mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multi-agent collaborative tasks. However, we find that misaligned ToM orders, that is, mismatches in the depth of ToM...
25
Visual Prompt Discovery via Semantic Exploration
Jaechang Kim, Yotaro Shimose, Zhao Wang et al. (6 authors)
📅 2026-03-17
LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. Although this is a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating...
26
Generative AI for Quantum Circuits and Quantum Code: A Technical Review and Taxonomy
Juhani Merilehto
📅 2026-03-17
We review thirteen generative systems and five supporting datasets for quantum circuit and quantum code generation, identified through a structured scoping review of Hugging Face, arXiv, and provenance tracing (January-February 2026). We organize the field along two axes: artifact type (Qiskit code, OpenQASM programs, circuit graphs) crossed with training regime (supervised fine-tuning,...
27
CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation
Gengxin Sun, Ruihao Yu, Liangyi Yin et al. (6 authors)
📅 2026-03-17
Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized...
28
Proactive Rejection and Grounded Execution: A Dual-Stage Intent Analysis Paradigm for Safe and Efficient AIoT Smart Homes
Xinxin Jin, Zhengwei Ni, Zhengguo Sheng et al. (4 authors)
📅 2026-03-17
As Large Language Models (LLMs) transition from information providers to embodied agents in the Internet of Things (IoT), they face significant challenges regarding reliability and interaction efficiency. Direct execution of LLM-generated commands often leads to entity hallucinations (e.g., trying to control non-existent devices). Meanwhile, existing iterative frameworks (e.g., SAGE) suffer from...
29
A Scoping Review of AI-Driven Digital Interventions in Mental Health Care: Mapping Applications Across Screening, Support, Monitoring, Prevention, and Clinical Education
Yang Ni, Fanli Jia
📅 2026-03-17
Artificial intelligence (AI)-enabled digital interventions, including Generative AI (GenAI) and Human-Centered AI (HCAI), are increasingly used to expand access to digital psychiatry and mental health care. This PRISMA-ScR scoping review maps the landscape of AI-driven mental health (mHealth) technologies across five critical phases: pre-treatment (screening/triage), treatment (therapeutic...
30
MemX: A Local-First Long-Term Memory System for AI Assistants
Lizheng Sun
📅 2026-03-17
We present MemX, a local-first long-term memory system for AI assistants with stability-oriented retrieval design. MemX is implemented in Rust on top of libSQL and an OpenAI-compatible embedding API, providing persistent, searchable, and explainable memory for conversational agents. Its retrieval pipeline applies vector recall, keyword recall, Reciprocal Rank Fusion (RRF), four-factor re-ranking,...
31
SQL-ASTRA: Alleviating Sparse Feedback in Agentic SQL via Column-Set Matching and Trajectory Aggregation
Long Li, Zhijian Zhou, Jiangxuan Long et al. (8 authors)
📅 2026-03-17
Agentic Reinforcement Learning (RL) shows promise for complex tasks, but Text-to-SQL remains mostly restricted to single-turn paradigms. A primary bottleneck is the credit assignment problem. In traditional paradigms, rewards are determined solely by the final-turn feedback, which ignores the intermediate process and leads to ambiguous credit evaluation. To address this, we propose Agentic SQL, a...
32
GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation
Jiayi Tian, Jiaze Wang
📅 2026-03-17
Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN or Transformer based methods are constrained either by limited receptive fields or by...
33
Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment
Enguang Fan, Yifan Chen, Zihan Shan et al. (5 authors)
📅 2026-03-17
Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and...
34
SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Songcheng Cai, Zhiheng Lyu, Yuansheng Ni et al. (16 authors)
📅 2026-03-17
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail...
35
ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning
Tik Yu Yim, Wenting Tan, Sum Yee Chan et al. (5 authors)
📅 2026-03-17
Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex,...
36
VIGIL: Towards Edge-Extended Agentic AI for Enterprise IT Support
Sarthak Ahuja, Neda Kordjazi, Evren Yortucboylu et al. (10 authors)
📅 2026-03-17
Enterprise IT support is constrained by heterogeneous devices, evolving policies, and long-tail failure modes that are difficult to resolve centrally. We present VIGIL, an edge-extended agentic AI system that deploys desktop-resident agents to perform situated diagnosis, retrieval over enterprise knowledge, and policy-governed remediation directly on user devices with explicit consent and...
37
RepoReviewer: A Local-First Multi-Agent Architecture for Repository-Level Code Review
Peng Zhang
📅 2026-03-17
Repository-level code review requires reasoning over project structure, repository context, and file-level implementation details. Existing automated review workflows often collapse these tasks into a single pass, which can reduce relevance, increase duplication, and weaken prioritization. We present RepoReviewer, a local-first multi-agent system for automated GitHub repository review with a...
38
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
Noppanat Wadlom, Junyi Shen, Yao Lu
📅 2026-03-17
Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and...
39
Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation
Chang Nie, Tianchen Deng, Guangming Wang et al. (5 authors)
📅 2026-03-17
While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to...
40
Interact3D: Compositional 3D Generation of Interactive Objects
Hui Shan, Keyang Luo, Ming Li et al. (7 authors)
📅 2026-03-17
Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images--particularly under occlusions--remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present a novel framework...
41
ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning
Yu Li, Rui Miao, Zhengling Qi et al. (4 authors)
📅 2026-03-17
The dominant paradigm for improving mathematical reasoning in language models relies on Reinforcement Learning with verifiable rewards. Yet existing methods treat each problem instance in isolation without leveraging the reusable strategies that emerge and accumulate during training. To this end, we introduce ARISE (Agent Reasoning via Intrinsic Skill Evolution), a hierarchical reinforcement...
42
Enhancing Linguistic Generalization of VLA: Fine-Tuning OpenVLA via Synthetic Instruction Augmentation
Dongik Shin
📅 2026-03-17
Generalization remains a core challenge in embodied AI, as robots must adapt to diverse environments. While OpenVLA represents the State-of-the-Art (SOTA) in Vision-Language-Action models by leveraging large-scale pre-training, its zero-shot performance can be limited when encountering completely new environments. This paper proposes a parameter-efficient fine-tuning strategy to enhance the...
43
Interpretable Context Methodology: Folder Structure as Agentic Architecture
Jake Van Clief, David McDermott
📅 2026-03-17
Current approaches to AI agent orchestration typically involve building multi-agent frameworks that manage context passing, memory, error handling, and step coordination through code. These frameworks work well for complex, concurrent systems. But for sequential workflows where a human reviews output at each step, they introduce engineering overhead that the problem does not require. This paper...
44
IRAM-Omega-Q: A Computational Architecture for Uncertainty Regulation in Artificial Agents
Veronique Ziegler
📅 2026-03-16
Artificial agents can achieve strong task performance while remaining opaque with respect to internal regulation, uncertainty management, and stability under stochastic perturbation. We present IRAM-Omega-Q, a computational architecture that models internal regulation as closed-loop control over a quantum-like state representation. The framework uses density matrices instrumentally as abstract...
45
Evaluating Agentic Optimization on Large Codebases
Atharva Sehgal, James Hou, Akanksha Sarkar et al. (7 authors)
📅 2026-03-16
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce...
46
From Workflow Automation to Capability Closure: A Formal Framework for Safe and Revenue-Aware Customer Service AI
Cosimo Spera, Garima Agrawal, Riccardo De Maria
📅 2026-03-16
Customer service automation is undergoing a structural transformation. The dominant paradigm is shifting from scripted chatbots and single-agent responders toward networks of specialised AI agents that compose capabilities dynamically across billing, service provision, payments, and fulfilment. This shift introduces a safety gap that no current platform has closed: two agents individually...
47
An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc
Hong Zhang, Barry Smith, Satish Balay et al. (7 authors)
📅 2026-03-16
While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach insufficient for library code in HPC where solver selection, API conventions, memory management, and performance are just as critical as functional correctness. To...
48
Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems
Cosimo Spera
📅 2026-03-16
This paper contains the first formal proof that safety is non-compositional in the presence of conjunctive capability dependencies: two agents each individually incapable of reaching any forbidden capability can, when combined, collectively reach a forbidden goal through an emergent conjunctive dependency.
49
MAC: Multi-Agent Constitution Learning
Rushil Thareja, Gautam Gupta, Francesco Pinto et al. (4 authors)
📅 2026-03-16
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled...
50
Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents
Jacopo Teneggi, S. M. Bargeen A. Turzo, Tanya Marwah et al. (7 authors)
📅 2026-03-16
Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving an unfilled need for a generalist tool for...