
Retrieval-augmented Generation

Research on Retrieval-Augmented Generation

📊 50 Papers 📅 Updated: 2026-05-14
1
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz et al. (13 authors)
📅 2026-05-13
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end...
2
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
Wenrui Bao, Huan Wang, Jian Wang et al. (6 authors)
📅 2026-05-13
Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of...
3
VectorSmuggle: Steganographic Exfiltration in Embedding Stores and a Cryptographic Provenance Defense
Jascha Wanger
📅 2026-05-13
Modern retrieval-augmented generation (RAG) systems convert sensitive content into high-dimensional embeddings and store them in vector databases that treat the resulting numerical artifacts as opaque. Major vector-store products do not provide native controls for embedding integrity, ingestion-time distributional anomaly detection, or cryptographic provenance attestation. We show this opens a...
4
Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
Qian Shen, Fanghua Cao, Min Yao et al. (6 authors)
📅 2026-05-13
Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated...
5
RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
Andrea Morandi
📅 2026-05-13
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no...
6
Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira et al. (4 authors)
📅 2026-05-13
Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the...
7
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Kaiyuan Liu, Ziyuan Zhuang, Yang Bai et al. (6 authors)
📅 2026-05-13
On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later...
8
Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction
Takumi Goto, Yusuke Sakai, Taro Watanabe
📅 2026-05-13
Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean,...
9
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Chengzhi Shen, Weixiang Shen, Tobias Susetzky et al. (11 authors)
📅 2026-05-13
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal...
10
Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
Anuj Sadani, Deepak Kumar
📅 2026-05-13
Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5 B mixture-of-experts token classifier...
11
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Ye Wang, Jing Liu, Toshiaki Koike-Akino
📅 2026-05-13
Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing...
12
AI-Generated Slides: Are They Good? Can Students Tell?
Juho Leinonen, Lisa Zhang, Arto Hellas
📅 2026-05-13
As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding...
13
Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models
Xinye Wanyan, Chenglong Ma, Danula Hettiachchi et al. (5 authors)
📅 2026-05-13
Large Language Model (LLM)-based agent simulation has emerged as a promising approach to meet the increasing demand for real-time and rigorous evaluation in modern recommender systems. A typical LLM-driven simulation framework comprises three essential components: the profile module, memory module, and action module. However, existing studies have primarily concentrated on enhancing the memory...
14
PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
Mikhail Menschikov, Matvey Iskornev, Alexander Kharitonov et al. (11 authors)
📅 2026-05-13
We introduce PersonalAI 2.0 (PAI-2), a novel framework designed to enhance large language model (LLM)-based systems through the integration of external knowledge graphs (KGs). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. The central point of the PAI-2 design is its...
15
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
Chenyu Zhou, Hongpei Li, Yuerou Liu et al. (6 authors)
📅 2026-05-13
Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet...
16
LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking
Adam Remaki, Xavier Tannier, Christel Gérardin
📅 2026-05-13
Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface...
17
Cognifold: Always-On Proactive Memory via Cognitive Folding
Suli Wang, Yiqun Duan, Yu Deng et al. (6 authors)
📅 2026-05-13
Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce Cognifold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. Cognifold continuously folds fragmented event streams into...
18
From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks
Neh Majmudar, Anne Huang, Jinfan Frank Hu et al. (4 authors)
📅 2026-05-13
In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the...
19
Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing
Daniel Fernández-González, Cristina Outeiriño Cid
📅 2026-05-13
To achieve deep natural language understanding, syntactic constituent parsing plays a crucial role and is widely required by many artificial intelligence systems for processing both text and speech. A recent approach involves using standard sequence-to-sequence models to handle constituent parsing as a machine translation problem, moving away from traditional task-specific parsers. These models...
20
Query-Conditioned Test-Time Self-Training for Large Language Models
Chaehee Song, Minseok Seo, Yeeun Seong et al. (5 authors)
📅 2026-05-13
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates...
21
What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation
Shaomu Tan, Dawei Zhu, Ke Tran et al. (8 authors)
📅 2026-05-13
Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple inference-time passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation,...
22
IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation
Joy Bose
📅 2026-05-13
Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge...
23
PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users
Hannah Rose Kirk, Liu Leqi, Fanzhi Zeng et al. (7 authors)
📅 2026-05-13
Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through...
24
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Yafu Li, Runzhe Zhan, Haoran Zhang et al. (28 authors)
📅 2026-05-13
Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a...
25
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
Tom Zehle
📅 2026-05-13
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these...
26
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel
📅 2026-05-13
Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The...
27
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Weiqing Luo, Zongye Hu, Xiao Wang et al. (6 authors)
📅 2026-05-13
Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence...
28
GAGPO: Generalized Advantage Grouped Policy Optimization
Siyuan Zhu, Chao Yu, Rongxin Yang et al. (7 authors)
📅 2026-05-13
Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back...
29
LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information
Stef van Buuren
📅 2026-05-13
Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated against a criterion from the multiple imputation (MI) literature: uncertainty should scale with the amount of missing information. We assess this criterion on...
30
GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
Jinwoong Kim, Rui Yang, Huishuai Zhang
📅 2026-05-13
We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagramming as an interactive construction task: given...
31
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
Chenjun Xu, Zhennan Zhou, Zhan Su et al. (6 authors)
📅 2026-05-13
Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher...
32
AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz et al. (7 authors)
📅 2026-05-13
Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum off of which to...
33
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
Guoxiong Gao, Zeming Sun, Jiedong Jiang et al. (8 authors)
📅 2026-05-13
Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof -- a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise...
34
A Multi-Agent Orchestration Framework for Venture Capital Due Diligence
Grigorios Alexandrou, Katerina Pramatari
📅 2026-05-13
We present a fully automated multi-agent framework for corporate due diligence and market analysis in venture capital. The system runs on an event-driven orchestration architecture, combining Large Language Models (LLMs) with real-time web retrieval to synthesize unstructured data into structured investment intelligence. A central technical contribution is a programmatic extraction pipeline that...
35
Does language matter for spoken word classification? A multilingual generative meta-learning approach
Batsirayi Mupamhi Ziki, Louise Beyers, Ruan van der Merwe
📅 2026-05-13
Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable...
36
TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints
Yoshio Kato, Shuhei Tarashima
📅 2026-05-13
The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel...
37
Scaling few-shot spoken word classification with generative meta-continual learning
Louise Beyers, Batsirayi Mupamhi Ziki, Ruan van der Merwe
📅 2026-05-13
Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We...
38
The Cost of Perfect English: Pragmatic Flattening and the Erasure of Authorial Voice in L2 Writing Supported by GenAI
Ao Liu, Shanhua Zhu
📅 2026-05-13
The integration of Generative AI (GenAI) into language learning offers second language (L2) writers powerful tools for text optimization. However, pursuing native-like fluency often sacrifices sociopragmatic diversity. Investigating "pragmatic flattening" - the systematic erasure of culturally preferred politeness and authorial stance - this study conducts a comparative analysis of...
39
RAG-Enhanced Large Language Models for Dynamic Content Expiration Prediction in Web Search
Tingyu Chen, Wenkai Zhang, Li Gao et al. (7 authors)
📅 2026-05-13
In commercial web search, aligning content freshness with user intent remains challenging due to the highly varied lifespans of information. Traditional industrial approaches rely on static time-window filtering, resulting in "one-size-fits-all" rankings where content may be chronologically recent but semantically expired. To address the limitation, we present a novel Large Language...
40
Context Training with Active Information Seeking
Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng et al. (7 authors)
📅 2026-05-13
Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the...
41
Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models
Yejin Lee, Yo-Sub Han
📅 2026-05-13
Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce...
42
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
Zhuofan Shi, Peilun Jia, Baoqin Sun et al. (7 authors)
📅 2026-05-13
Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep...
43
Understanding and Accelerating the Training of Masked Diffusion Language Models
Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai et al. (8 authors)
📅 2026-05-13
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this...
44
Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education
Mragisha Jain, Tirth Bhatt, Griffin Pitts et al. (9 authors)
📅 2026-05-13
Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving...
45
Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction
Guangzeng Han, James G. Murphy, Benjamin O. Ladd et al. (5 authors)
📅 2026-05-13
BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI...
46
DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
Zijing Wang, Mingyang Wang, Ercong Nie et al. (9 authors)
📅 2026-05-13
Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an...
47
From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning
Haodong Wu, Jiahao Zhang, Lijie Hu et al. (4 authors)
📅 2026-05-13
Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-k subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw...
48
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Vardhan Dongre, Joseph Hsieh, Viet Dac Lai et al. (6 authors)
📅 2026-05-13
Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may...
49
CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models
Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner
📅 2026-05-13
To effectively interact with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that necessitates integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating LLM entity-based commonsense reasoning have largely focused on True/False or multiple-choice questions, leaving the explicit...
50
Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings
Ayuto Tsutsumi, Ryosuke Kohita
📅 2026-05-13
A scene of two people in the rain can convey hope and warmth in a reunion story or sorrow and finality in a farewell story. We investigate this context-dependent nature of image meaning and its implications for retrieval. Our key observation is that context dependency correlates with semantic abstraction: concrete elements (objects, actions) remain stable across contexts, while abstract elements...