
Retrieval-augmented Generation

Retrieval-Augmented Generation Research

📊 50 Papers 📅 Updated: 2026-03-18
1
Efficient Reasoning on the Edge
Yelysei Bondarenko, Thomas Hehn, Rob Hesselink et al. (18 authors)
📅 2026-03-17
Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller...
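The KV-cache footprint mentioned above can be made concrete with back-of-envelope arithmetic. The model shapes below are illustrative placeholders (not taken from the paper), but the formula — keys plus values, for every layer, head, and cached position — is standard:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Approximate KV-cache size: keys + values (factor of 2) for every
    layer, KV head, and cached token position, at dtype_bytes per value."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# A hypothetical 7B-class model holding a 32k-token reasoning trace in fp16:
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB")  # → 4.0 GiB
```

Several gigabytes for a single long reasoning trace is exactly why verbose chain-of-thought is hard to serve on edge devices.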
2
Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory
Sahil Sen, Elias Lumer, Anmol Gulati et al. (4 authors)
📅 2026-03-17
Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue...
3
Mediocrity is the key for LLM as a Judge Anchor Selection
Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen et al. (4 authors)
📅 2026-03-17
The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely...
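The quadratic-versus-linear cost trade-off that motivates anchor-based evaluation is easy to quantify:

```python
def pairwise_comparisons(n_models):
    """Full round-robin: every unordered pair is judged once — O(n^2)."""
    return n_models * (n_models - 1) // 2

def anchored_comparisons(n_models):
    """Anchor scheme: each model is judged only against one fixed anchor — O(n)."""
    return n_models - 1

# For 100 models on a leaderboard:
print(pairwise_comparisons(100), anchored_comparisons(100))  # → 4950 99
```

The fifty-fold saving is what makes the anchor scheme attractive; the paper's question is what that saving costs in reliability depending on which model is chosen as the anchor.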
4
Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights
Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar et al. (5 authors)
📅 2026-03-17
Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering...
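For readers unfamiliar with the conformal machinery the abstract invokes, the core of split-conformal calibration is a single quantile computation. This is a generic sketch of that step, not the paper's specific filtering procedure:

```python
import math

def conformal_quantile(nonconformity, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score. Under exchangeability, new examples
    whose score is <= this threshold are covered with probability >= 1-alpha."""
    scores = sorted(nonconformity)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    k = min(k, n)  # guard for very small calibration sets
    return scores[k - 1]

# 99 calibration scores 1..99 at alpha=0.1 -> accept up to the 90th smallest
print(conformal_quantile(list(range(1, 100)), alpha=0.1))  # → 90
```

Applied to RAG outputs, the nonconformity score would be some measure of how poorly a claim is supported by the retrieved evidence; the paper's contribution concerns how robust such guarantees are in practice.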
5
TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities
Victoria Graf, Valentina Pyatkin, Nouha Dziri et al. (5 authors)
📅 2026-03-17
Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to...
6
Probing Cultural Signals in Large Language Models through Author Profiling
Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys et al. (5 authors)
📅 2026-03-17
Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers' gender and ethnicity without task-specific fine-tuning. Across several open-source models...
7
Retrieving Counterfactuals Improves Visual In-Context Learning
Guangzhi Xiong, Sanchit Sinha, Zhenghao He et al. (4 authors)
📅 2026-03-17
Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration...
8
IQuest-Coder-V1 Technical Report
Jian Yang, Wei Zhang, Shawn Guo et al. (38 authors)
📅 2026-03-17
In this report, we introduce the IQuest-Coder-V1 series (7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with...
9
Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
Xiaojie Gu, Sherry T. Tong, Aosong Feng et al. (11 authors)
📅 2026-03-17
Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To...
10
Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy
Zhaoxin Feng, Zheng Chen, Jianfei Ma et al. (6 authors)
📅 2026-03-17
Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies examined this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to...
11
When AI Navigates the Fog of War
Ming Li, Xirui Li, Tianyi Zhou
📅 2026-03-17
Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier...
12
When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective
Zelin Zhang, Fei Cheng, Chenhui Chu
📅 2026-03-17
Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic...
13
Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects
Titus von der Malsburg, Sebastian Padó
📅 2026-03-17
Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction...
14
EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models
Yifei Zhang, Mingyang Li, Henry Gao et al. (4 authors)
📅 2026-03-17
Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the...
15
AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
Shannan Yan, Jingchen Ni, Leqi Zheng et al. (9 authors)
📅 2026-03-17
Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated...
16
On the Emotion Understanding of Synthesized Speech
Yuan Ge, Haishu Zhao, Aokai Hao et al. (13 authors)
📅 2026-03-17
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically...
17
DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning
Yanyu Qian, Yue Tan, Yixin Liu et al. (5 authors)
📅 2026-03-17
Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucination responses from model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential...
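The token-level entropy signal the abstract refers to is the Shannon entropy of each decoding step's output distribution; a minimal sketch:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one decoding step's token distribution.
    High entropy means the model is spreading mass across many candidates,
    a commonly used proxy for uncertainty and potential hallucination."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # mass concentrated on one token
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform over four tokens
print(token_entropy(confident) < token_entropy(uncertain))  # → True
```

The paper's point is that in diffusion LLMs this per-token snapshot ignores how predictions evolve across denoising steps, which motivates learning from the denoising dynamics instead.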
18
VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization
Yixuan Wang, Qingyu Shi, Jiayu Zhou et al. (6 authors)
📅 2026-03-17
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel,...
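The essence of vector quantization, as opposed to the scalar quantization the abstract contrasts it with, is replacing each whole vector with the index of its nearest codebook entry. A toy sketch with a hand-written two-entry codebook (real systems learn much larger codebooks; this is not the paper's algorithm):

```python
def quantize(vectors, codebook):
    """Replace each vector with the index of its nearest codebook entry
    (squared Euclidean distance). Storing one small integer per vector
    instead of d floating-point values is where the compression comes from."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sqdist(v, codebook[i]))
            for v in vectors]

def dequantize(codes, codebook):
    """Reconstruction is a table lookup; fidelity depends on the codebook."""
    return [codebook[c] for c in codes]

codebook = [(0.0, 0.0), (1.0, 1.0)]
codes = quantize([(0.1, -0.2), (0.9, 1.1)], codebook)
print(codes)  # → [0, 1]
```

Because a whole d-dimensional KV vector collapses to one index, the compression ratio can be much higher than per-scalar quantization at comparable fidelity, which is the trade-off VQKV targets.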
19
IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time
Zhenghua Bao, Yi Shi
📅 2026-03-17
Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG...
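The idea of moving cross-document reasoning to index time can be illustrated with a toy entity-bridging index. Real systems would use NER and embeddings, both assumed away here; this is a stand-in for the concept, not IndexRAG's actual method:

```python
from collections import defaultdict

def build_bridge_index(docs):
    """Offline pass: map each entity mention to the set of documents it
    appears in, so cross-document links already exist before any query."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for ent in entities:
            index[ent].add(doc_id)
    return index

docs = {"d1": {"Marie Curie", "Sorbonne"}, "d2": {"Sorbonne", "Paris"}}
index = build_bridge_index(docs)
# "Sorbonne" bridges d1 and d2, so a two-hop chain is a lookup at query time
print(sorted(index["Sorbonne"]))  # → ['d1', 'd2']
```

Paying this cost once at indexing avoids both the online graph traversal and the iterative multi-step retrieval that the abstract identifies as the main overhead of existing multi-hop RAG approaches.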
20
PlotTwist: A Creative Plot Generation Framework with Small Language Models
Abhinav Thorat, Ravi Kolla, Jyotin Goel et al. (4 authors)
📅 2026-03-17
Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized...
21
Fanar 2.0: Arabic Generative AI Stack
FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad et al. (37 authors)
📅 2026-03-17
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic...
22
Omnilingual MT: Machine Translation for 1,600 Languages
Omnilingual MT Team, Belen Alastruey, Niyati Bafna et al. (31 authors)
📅 2026-03-17
High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. And even these numbers have been hard to...
23
More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification
Song Tae-Eun
📅 2026-03-17
Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants...
24
ReFORM: Review-aggregated Profile Generation via LLM with Multi-Factor Attention for Restaurant Recommendation
Moonsoo Park, Seulbeen Je, Donghyeon Park
📅 2026-03-17
In recommender systems, large language models (LLMs) have gained popularity, alongside Graph Convolutional Networks, for generating descriptive summaries to improve recommendation robustness. However, existing LLM-enhanced recommendation studies mainly rely on the internal knowledge of LLMs about item titles while neglecting the importance of various factors influencing users' decisions....
25
SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation
Hang Lv, Sheng Liang, Hao Wang et al. (9 authors)
📅 2026-03-17
Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric...
26
Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models
Xiaobing Sun, Perry Lam, Shaohua Li et al. (7 authors)
📅 2026-03-17
Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional...
27
Open-Source Reproduction and Explainability Analysis of Corrective Retrieval Augmented Generation
Surya Vardhan Yalavarthi
📅 2026-03-17
Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open-source reproduction of CRAG, replacing...
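CRAG's corrective mechanism dispatches on a retrieval-evaluator score; a minimal sketch of that dispatch, with illustrative threshold values rather than the paper's actual ones:

```python
def corrective_action(relevance_score, upper=0.7, lower=0.3):
    """Map a retrieval-evaluator relevance score to one of CRAG's three
    actions. The thresholds here are placeholders for illustration."""
    if relevance_score >= upper:
        return "correct"      # use the retrieved documents as-is
    if relevance_score <= lower:
        return "incorrect"    # discard retrieval, fall back to (web) search
    return "ambiguous"        # combine refined retrieval with search results

print([corrective_action(s) for s in (0.9, 0.5, 0.1)])
# → ['correct', 'ambiguous', 'incorrect']
```

The reproduction described above replaces the proprietary pieces behind each branch (the search API and the closed evaluator weights) with open components, so this control flow is the part that survives unchanged.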
28
Parametric Social Identity Injection and Diversification in Public Opinion Simulation
Hexi Wang, Yujia Zhou, Bangde Du et al. (5 authors)
📅 2026-03-17
Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this...
29
Answer Bubbles: Information Exposure in AI-Mediated Search
Michelle Huang, Agam Goyal, Koustuv Saha et al. (4 authors)
📅 2026-03-17
Generative search systems are increasingly replacing link-based retrieval with AI-generated summaries, yet little is known about how these systems differ in sources, language, and fidelity to cited material. We examine responses to 11,000 real search queries across four systems -- vanilla GPT, Search GPT, Google AI Overviews, and traditional Google Search -- at three levels: source diversity,...
30
SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment
Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang
📅 2026-03-17
Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these...
31
SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era
Han Jang, Junhyeok Lee, Kyu Sung Choi
📅 2026-03-17
The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in...
32
Social Simulacra in the Wild: AI Agent Communities on Moltbook
Agam Goyal, Olivia Pal, Hari Sundaram et al. (5 authors)
📅 2026-03-17
As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find...
33
ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning
Tik Yu Yim, Wenting Tan, Sum Yee Chan et al. (5 authors)
📅 2026-03-17
Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex,...
34
CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering
Tianyi Huang, Ying Kai Deng
📅 2026-03-17
In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and...
35
RecBundle: A Next-Generation Geometric Paradigm for Explainable Recommender Systems
Hui Wang, Tianzhu Hu, Mingming Li et al. (9 authors)
📅 2026-03-17
Recommender systems are inherently dynamic feedback loops where prolonged local interactions accumulate into macroscopic structural degradation such as information cocoons. Existing representation learning paradigms are universally constrained by the assumption of a single flat space, forcing topologically grounded user associations and semantically driven historical interactions to be fitted...
36
Resource Consumption Threats in Large Language Models
Yuanhe Zhang, Xinyue Wang, Zhican Chen et al. (10 authors)
📅 2026-03-17
Given limited and costly computational infrastructure, resource efficiency is a key requirement for large language models (LLMs). Efficient LLMs increase service capacity for providers and reduce latency and API costs for users. Recent resource consumption threats induce excessive generation, degrading model efficiency and harming both service availability and economic sustainability. This survey...
37
RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation
Saisha Pradeep Shetty, Roger Eric Goldman, Vladimir Filkov
📅 2026-03-16
Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work....
38
Visual Set Program Synthesizer
Zehua Cheng, Wei Dai, Wenhu Zhang et al. (5 authors)
📅 2026-03-16
A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for...
39
Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
Jingxiang Chen, Minseok Kim, Seong-Gyun Leem et al. (16 authors)
📅 2026-03-16
Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds--crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that...
40
MAC: Multi-Agent Constitution Learning
Rushil Thareja, Gautam Gupta, Francesco Pinto et al. (4 authors)
📅 2026-03-16
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled...
41
MoLoRA: Composable Specialization via Per-Token Adapter Routing
Shrey Shah, Justin Wagle
📅 2026-03-16
Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like "write code to solve this equation," which need expertise from...
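Per-token routing amounts to partitioning one sequence into per-adapter batches instead of sending the whole sequence to a single adapter. A toy sketch with a hypothetical token classifier (the classifier and adapter names are illustrative, not MoLoRA's):

```python
def route_tokens(tokens, classify):
    """Group token positions of one sequence into per-adapter batches,
    so different adapters can serve different spans of the same request."""
    batches = {}
    for pos, tok in enumerate(tokens):
        batches.setdefault(classify(tok), []).append(pos)
    return batches

# Hypothetical classifier: arithmetic-looking tokens -> "math", else "code"
classify = lambda t: "math" if t.strip("0123456789+-=*/ ") == "" else "code"
print(route_tokens(["def", "f", "2+2", "return"], classify))
# → {'code': [0, 1, 3], 'math': [2]}
```

A mixed request like "write code to solve this equation" would then touch both adapters within a single forward pass, which is exactly the setting where whole-sequence routing forces a bad choice.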
42
Machine Translation in the Wild: User Reaction to Xiaohongshu's Built-In Translation Feature
Sui He
📅 2026-03-16
The growing integration of machine translation into social media platforms is transforming how users interact with each other across cultural and linguistic boundaries. This paper examines user reactions to the launch of Xiaohongshu's built-in translation feature in January 2025. Drawing on a dataset of 6,723 comments collected from 11 official posts promoting the translation function, this...
43
Prompt Engineering for Scale Development in Generative Psychometrics
Lara Lee Russell-Lasalandra, Hudson Golino
📅 2026-03-16
This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)--generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then...
44
Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN
Ritajit Dey, Iadh Ounis, Graham McDonald et al. (4 authors)
📅 2026-03-16
Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model's output distribution, finding that...
45
Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs
Yara Alakeel, Chatrine Qwaider, Hanan Aldarmaki et al. (4 authors)
📅 2026-03-16
This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. The Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices...
46
Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
Aozhe Wang, Yuchen Yan, Nan Zhou et al. (8 authors)
📅 2026-03-16
Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model...
47
From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
Yibin Liu, Yaxing Lyu, Daqi Gao et al. (8 authors)
📅 2026-03-16
Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1...
48
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
Yuwen Du, Rui Ye, Shuo Tang et al. (7 authors)
📅 2026-03-16
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating...
49
Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation
Yanick Zengaffinen, Andreas Opedal, Donya Rooein et al. (6 authors)
📅 2026-03-16
Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for...
50
Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models
Xiyu Liu, Qingyi Si, Zhengxiao Liu et al. (6 authors)
📅 2026-03-16
While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric...