
LLM Reasoning

Research on the reasoning capabilities of large language models

📊 50 Papers 📅 Updated: 2026-04-01
1
Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Max Kaufmann, David Lindner, Roland S. Zimmermann et al. (4 authors)
📅 2026-03-31
Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model (the monitorability of the CoT) can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically...
2
The Triadic Cognitive Architecture: Bounding Autonomous Action via Spatio-Temporal and Epistemic Friction
Davide Di Gioia
📅 2026-03-31
Current autonomous AI agents, driven primarily by Large Language Models (LLMs), operate in a state of cognitive weightlessness: they process information without an intrinsic sense of network topology, temporal pacing, or epistemic limits. Consequently, heuristic agentic loops (e.g., ReAct) can exhibit failure modes in interactive environments, including excessive tool use under congestion,...
3
Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models
Md Saad, Sajjad Hussain, Mohd Suhaib
📅 2026-03-31
This paper introduces a new hybrid framework that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to improve robotic manipulation tasks. By utilizing RL for accurate low-level control and LLMs for high level task planning and understanding of natural language, the proposed framework effectively connects low-level execution with high-level reasoning in robotic systems. This...
4
Enhancing Structural Mapping with LLM-derived Abstractions for Analogical Reasoning in Narratives
Mohammadhossein Khojasteh, Yifan Jiang, Stefano De Giorgis et al. (5 authors)
📅 2026-03-31
Analogical reasoning is a key driver of human generalization in problem-solving and argumentation. Yet, analogies between narrative structures remain challenging for machines. Cognitive engines for structural mapping are not directly applicable, as they assume pre-extracted entities, whereas LLMs' performance is sensitive to prompt format and the degree of surface similarity between...
5
Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI
Iain Swift, JingHua Ye
📅 2026-03-31
Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664...
6
C-TRAIL: A Commonsense World Framework for Trajectory Planning in Autonomous Driving
Zhihong Cui, Haoran Tang, Tianyi Li et al. (7 authors)
📅 2026-03-31
Trajectory planning for autonomous driving increasingly leverages large language models (LLMs) for commonsense reasoning, yet LLM outputs are inherently unreliable, posing risks in safety-critical applications. We propose C-TRAIL, a framework built on a Commonsense World that couples LLM-derived commonsense with a trust mechanism to guide trajectory planning. C-TRAIL operates through a...
7
SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models
Adar Avsian, Larry Heck
📅 2026-03-31
Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction...
8
Reasoning-Driven Synthetic Data Generation and Evaluation
Tim R. Davidson, Benoit Seguin, Enrico Bacis et al. (5 authors)
📅 2026-03-31
Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods...
9
From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety
Ganen Sethupathy, Lalit Dumka, Jan Schagen
📅 2026-03-31
Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and...
10
Training-Free Dynamic Upcycling of Expert Language Models
Eros Fanì, Oğuzhan Ersoy
📅 2026-03-31
Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization,...
11
CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing
Chathurangi Shyalika, Utkarshani Jaimini, Cory Henson et al. (4 authors)
📅 2026-03-31
Modern manufacturing environments demand real-time, trustworthy, and interpretable root-cause insights to sustain productivity and quality. Traditional analytics pipelines often treat anomaly detection, causal inference, and root-cause analysis as isolated stages, limiting scalability and explainability. In this work, we present CausalPulse, an industry-grade multi-agent copilot that automates...
12
Spontaneous Functional Differentiation in Large Language Models: A Brain-Like Intelligence Economy
Junjie Zhang, Zhen Shen, Gang Xiong et al. (4 authors)
📅 2026-03-31
The evolution of intelligence in artificial systems provides a unique opportunity to identify universal computational principles. Here we show that large language models spontaneously develop synergistic cores, where information integration exceeds that of the individual parts, remarkably similar to the human brain. Using Integrated Information Decomposition across multiple architectures, we find that middle...
13
Reinforced Reasoning for End-to-End Retrosynthetic Planning
Chenyang Zuo, Siqi Fan, Yizhen Luo et al. (4 authors)
📅 2026-03-31
Retrosynthetic planning is a fundamental task in organic chemistry, yet remains challenging due to its combinatorial complexity. To address this, conventional approaches typically rely on hybrid frameworks that combine single-step predictions with external search heuristics, inevitably fracturing the logical coherence between local molecular transformations and global planning objectives. To...
14
Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding
Joakim Edin, Andreas Motzfeldt, Simon Flachs et al. (4 authors)
📅 2026-03-31
Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby...
15
KEditVis: A Visual Analytics System for Knowledge Editing of Large Language Models
Zhenning Chen, Hanbei Zhan, Yanwei Huang et al. (7 authors)
📅 2026-03-31
Large Language Models (LLMs) demonstrate exceptional capabilities in factual question answering, yet they sometimes provide incorrect responses. To address this issue, knowledge editing techniques have emerged as effective methods for correcting factual information in LLMs. However, typical knowledge editing workflows struggle with identifying the optimal set of model layers for editing and rely...
16
View-oriented Conversation Compiler for Agent Trace Analysis
Lvmin Zhang, Maneesh Agrawala
📅 2026-03-31
Agent traces carry increasing analytical value in the era of context learning and harness-driven agentic cognition, yet most prior work treats conversation format as a trivial engineering detail. Modern agent conversations contain deeply structured content, including nested tool calls and results, chain-of-thought reasoning blocks, sub-agent invocations, context-window compaction boundaries, and...
17
Concept frustration: Aligning human concepts and machine representations
Enrico Parisini, Christopher J. Soelistyo, Ahab Isaac et al. (5 authors)
📅 2026-03-31
Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery,...
18
Learning Diagnostic Reasoning for Decision Support in Toxicology
Nico Oberländer, David Bani-Harouni, Tobias Zellner et al. (6 authors)
📅 2026-03-31
Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with...
19
Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries
Luoxin Chen, Yichi Zhou, Huishuai Zhang
📅 2026-03-31
Large language models (LLMs) have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al., 2025). However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this...
20
Hallucination-aware intermediate representation edit in large vision-language models
Wei Suo, Hanzu Zhang, Lijun Zhang et al. (6 authors)
📅 2026-03-31
Large Vision-Language Models have demonstrated exceptional performance in multimodal reasoning and complex scene understanding. However, these models still face significant hallucination issues, where outputs contradict visual facts. Recent research on hallucination mitigation has focused on retraining methods and Contrastive Decoding (CD) methods. While both methods perform well, retraining...
21
Is my model perplexed for the right reason? Contrasting LLMs' Benchmark Behavior with Token-Level Perplexity
Zoë Prins, Samuele Punzo, Frank Wildenburg et al. (5 authors)
📅 2026-03-31
Standard evaluations of Large language models (LLMs) focus on task performance, offering limited insight into whether correct behavior reflects appropriate underlying mechanisms and risking confirmation bias. We introduce a simple, principled interpretability framework based on token-level perplexity to test whether models rely on linguistically relevant cues. By comparing perplexity...
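The token-level perplexity signal this abstract builds on has a simple closed form: perplexity is the exponential of the mean negative log-likelihood over tokens. A minimal sketch, assuming we already have a model's next-token probabilities (the numbers below are made up for illustration, not taken from the paper):

```python
import math

def token_level_perplexity(token_probs):
    """Per-token surprisal (nats) and overall perplexity from a
    model's next-token probabilities. Illustrative only."""
    surprisals = [-math.log(p) for p in token_probs]   # surprisal per token
    ppl = math.exp(sum(surprisals) / len(surprisals))  # exp(mean NLL)
    return surprisals, ppl

# A model that is confident on most tokens but surprised by one cue:
probs = [0.9, 0.8, 0.05, 0.85]
surprisals, ppl = token_level_perplexity(probs)
# Index of the most surprising token, i.e. where the model is "perplexed":
high = max(range(len(probs)), key=lambda i: surprisals[i])
```

Comparing `surprisals` across minimally different inputs is the kind of contrast that reveals whether a model reacts to the linguistically relevant cue rather than surface features.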
22
Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
Yahan Li, Xinyi Jie, Wanjia Ruan et al. (8 authors)
📅 2026-03-31
Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their...
23
PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent
Hongyi Nie, Xunyuan Liu, Yudong Bai et al. (7 authors)
📅 2026-03-31
Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately capture this...
24
PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy et al. (9 authors)
📅 2026-03-31
A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple...
25
Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang et al. (6 authors)
📅 2026-03-31
Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve...
26
AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering, and Construction
Harsh Mankodiya, Chase Gallik, Theodoros Galanos et al. (4 authors)
📅 2026-03-31
The AEC-Bench is a multimodal benchmark for evaluating agentic systems on real-world tasks in the Architecture, Engineering, and Construction (AEC) domain. The benchmark covers tasks requiring drawing understanding, cross-sheet reasoning, and construction project-level coordination. This report describes the benchmark motivation, dataset taxonomy, evaluation protocol, and baseline results across...
27
Multi-Layered Memory Architectures for LLM Agents: An Experimental Evaluation of Long-Term Context Retention
Sunil Tiwari, Payal Fofadiya
📅 2026-03-31
Long-horizon dialogue systems suffer from semantic drift and unstable memory retention across extended sessions. This paper presents a Multi-Layer Memory Framework that decomposes dialogue history into working, episodic, and semantic layers with adaptive retrieval gating and retention regularization. The architecture controls cross-session drift while maintaining bounded context growth and...
28
LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
Haihong Hao, Lei Chen, Mingfei Han et al. (8 authors)
📅 2026-03-31
Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future...
29
SemLoc: Structured Grounding of Free-Form LLM Reasoning for Fault Localization
Zhaorui Yang, Haichao Zhu, Qian Zhang et al. (5 authors)
📅 2026-03-31
Fault localization identifies program locations responsible for observed failures. Existing techniques rank suspicious code using syntactic spectra: signals derived from execution structure such as statement coverage, control-flow divergence, or dependency reachability. These signals collapse for semantic bugs, where failing and passing executions follow identical code paths and differ only in...
30
PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering
Xingyu Li, Rongguang Wang, Yuying Wang et al. (8 authors)
📅 2026-03-30
Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate...
31
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
Bilgehan Sel, Xuanli He, Alwin Peng et al. (5 authors)
📅 2026-03-30
Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication...
32
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
Yubo Li, Lu Zhang, Tianchong Jiang et al. (5 authors)
📅 2026-03-30
Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal,...
33
Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction
Diego C. Lerma-Torres
📅 2026-03-30
Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85%, even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral...
34
Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance
Siva Kumar Sastry Hari, Vignesh Balaji, Sana Damani et al. (5 authors)
📅 2026-03-30
Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials save both runtime and cost. We make two key observations. First, the abstraction level at which agents operate is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may...
35
Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Zifan He, Rui Ma, Yizhou Sun et al. (4 authors)
📅 2026-03-30
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply...
36
Multi-Agent LLMs for Adaptive Acquisition in Bayesian Optimization
Andrea Carbonati, Mohammadsina Almasi, Hadis Anahideh
📅 2026-03-30
The exploration-exploitation trade-off is central to sequential decision-making and black-box optimization, yet how Large Language Models (LLMs) reason about and manage this trade-off remains poorly understood. Unlike Bayesian Optimization, where exploration and exploitation are explicitly encoded through acquisition functions, LLM-based optimization relies on implicit, prompt-based reasoning...
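The explicit encoding the abstract contrasts with LLM reasoning can be sketched with Upper Confidence Bound (UCB), a standard acquisition function; the candidate points, surrogate predictions, and `kappa` value below are hypothetical:

```python
def ucb(mu, sigma, kappa=2.0):
    """Upper Confidence Bound acquisition: predicted mean (exploitation)
    plus kappa times predictive std (exploration)."""
    return mu + kappa * sigma

# Hypothetical candidates with (mean, std) from a surrogate model:
candidates = [("x1", 0.9, 0.01), ("x2", 0.5, 0.4), ("x3", 0.7, 0.1)]
best = max(candidates, key=lambda c: ucb(c[1], c[2]))
```

Here the high-uncertainty point wins despite its lower predicted mean, which is exactly the exploration behavior that prompt-based LLM optimizers must express implicitly rather than through a tunable `kappa`.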
37
Enhancing Policy Learning with World-Action Model
Yuci Han, Alper Yilmaz
📅 2026-03-30
This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned...
38
CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation
Andrew Bouras, OMS-II Research Fellow
📅 2026-03-30
Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518),...
39
ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
Rongtian Ye
📅 2026-03-30
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources,...
40
SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper et al. (6 authors)
📅 2026-03-30
Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To...
41
Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning
Ziqi Miao, Haonan Jia, Lijun Li et al. (7 authors)
📅 2026-03-30
Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving...
42
ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning
Huanxuan Liao, Zhongtao Jiang, Yupu Hao et al. (9 authors)
📅 2026-03-30
Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with...
43
MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
Han Wang, Yifan Sun, Brian Ko et al. (11 authors)
📅 2026-03-30
Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT...
44
Towards a Medical AI Scientist
Hongtao Wu, Boyun Zheng, Dingjie Song et al. (8 authors)
📅 2026-03-30
Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work,...
45
Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
Yanjie Zhang, Yafei Li, Rui Sheng et al. (8 authors)
📅 2026-03-30
Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a "skeptical" reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a...
46
CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments
Yi Yu, Guangquan Hu, Chenghuang Shen et al. (18 authors)
📅 2026-03-30
The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on...
47
Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems
Iman Sharifi, Alex Zongo, Peng Wei
📅 2026-03-30
The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must...
48
T-Norm Operators for EU AI Act Compliance Classification: An Empirical Comparison of Lukasiewicz, Product, and Gödel Semantics in a Neuro-Symbolic Reasoning System
Adam Laabs
📅 2026-03-30
We present a first comparative pilot study of three t-norm operators, Lukasiewicz (T_L), Product (T_P), and Gödel (T_G), as logical conjunction mechanisms in a neuro-symbolic reasoning system for EU AI Act compliance classification. Using the LGGT+ (Logic-Guided Graph Transformers Plus) engine and a benchmark of 1035 annotated AI system descriptions spanning four risk categories (prohibited,...
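The three t-norms compared above have standard closed-form definitions from fuzzy logic; a minimal sketch of the textbook operators (not the authors' LGGT+ implementation):

```python
def t_lukasiewicz(a: float, b: float) -> float:
    """Lukasiewicz t-norm: T_L(a, b) = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def t_product(a: float, b: float) -> float:
    """Product t-norm: T_P(a, b) = a * b."""
    return a * b

def t_godel(a: float, b: float) -> float:
    """Godel (minimum) t-norm: T_G(a, b) = min(a, b)."""
    return min(a, b)

# Conjoining two soft rule activations, e.g. 0.8 and 0.7:
# Lukasiewicz ~0.5, Product ~0.56, Godel 0.7 -- the operators differ
# in how strongly they penalize joint uncertainty.
```

The spread between the three values on the same inputs is what makes the choice of conjunction semantics an empirical question for compliance classification.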
49
Training data generation for context-dependent rubric-based short answer grading
Pavel Šindelář, Dávid Slivka, Christopher Bouma et al. (5 authors)
📅 2026-03-30
Every four years, the PISA test is administered by the OECD to assess the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, the need to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, automatic grading of student answers is worth considering. To train some of these...
50
GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum
Shuwen Xu, Yao Xu, Jiaxiang Liu et al. (7 authors)
📅 2026-03-30
Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization. Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to...