← Back to Homepage

LLM Reasoning

大语言模型推理能力研究

📊 50 Papers 📅 Updated: 2026-05-14
1
WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
Ziheng Zhang, Yunzhong Hou, Naijing Liu et al. (4 authors)
📅 2026-05-13
This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using...
2
Harnessing Agentic Evolution
Jiayi Zhang, Yongfeng Gu, Jianhao Ruan et al. (13 authors)
📅 2026-05-13
Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but...
3
An LLM-Based System for Argument Reconstruction
Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman et al. (4 authors)
📅 2026-05-13
Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant...
4
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
Mohammad Reza Mousavi
📅 2026-05-13
Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level...
5
Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
Tyler Alvarez, Ali Baheri
📅 2026-05-13
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through...
6
ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
Yitian Yang, Yiqun Duan, Linghan Huang et al. (7 authors)
📅 2026-05-13
Large language model (LLM)-based multi-agent simulation offers a powerful testbed for studying social opinion dynamics. Yet current approaches often adopt two contrasting methods: either relying on fixed update rules with limited cognitive grounding or delegating belief change largely to unconstrained LLM interaction. We introduce ScioMind, a cognitively grounded simulation framework that bridges...
7
RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
Andrea Morandi
📅 2026-05-13
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no...
8
A Hierarchical Language Model with Predictable Scaling Laws and Provable Benefits of Reasoning
Jason Gaitonde, Frederic Koehler, Elchanan Mossel et al. (5 authors)
📅 2026-05-13
We introduce a family of synthetic languages with hierarchical structure -- generated by a broadcast process on trees -- for which the role of context length and reasoning in autoregressive generation can be analyzed precisely. At the heart of our analytic approach is an \emph{exact $k$-gram ansatz} in place of transformers with context length $k$, a substitution we then validate empirically....
9
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
Zhongju Yuan, Geraint Wiggins, Dick Botteldooren
📅 2026-05-13
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a...
10
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
Junyan Li, Zhang-Wei Hong, Maohao Shen et al. (5 authors)
📅 2026-05-13
Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is challenging due to the combinatorial design space over model choices, reasoning budgets, and workflow structures. Existing...
11
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
Seokha Moon, Minseung Lee, Joon Seo et al. (5 authors)
📅 2026-05-13
End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to...
12
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
Yang Bai, Kaiyuan Liu, Ziyuan Zhuang et al. (8 authors)
📅 2026-05-13
Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both...
13
How to Interpret Agent Behavior
Jie Gao, Kaiser Sun, Jen-tse Huang et al. (11 authors)
📅 2026-05-13
Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in...
14
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
Peng Kang, Bixuan Li, Xiaoya Huang et al. (8 authors)
📅 2026-05-13
The Materials Genome Initiative catalyzed the proliferation of centralized platforms--SaaS, PaaS, and IaaS--that aggregate computational and experimental resources for accelerated materials discovery. In parallel, breakthroughs in large language models (LLMs) and autonomous agents have created powerful new reasoning capabilities for scientific research. Yet a critical "last mile"...
15
Unweighted ranking for value-based decision making with uncertainty
Aarón López García, Natalia Criado, Jose Such
📅 2026-05-13
As intelligent systems are increasingly implemented in our society to make autonomous decisions, their commitment to human values raises serious concerns. Their alignment with human values remains a critical challenge because it can jeopardise the integrity and security of citizens. For this reason, an innovative human-centred and values-driven approach to decision making is required. In this...
16
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Chengzhi Shen, Weixiang Shen, Tobias Susetzky et al. (11 authors)
📅 2026-05-13
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal...
17
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
Jiabei Liu, Wenyu Mao, Junfei Tan et al. (7 authors)
📅 2026-05-13
Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary...
18
Towards Unified Surgical Scene Understanding:Bridging Reasoning and Grounding via MLLMs
Jincai Huang, Shihao Zou, Yuchen Guo et al. (8 authors)
📅 2026-05-13
Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components...
19
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Tsz Ting Chung, Lemao Liu, Mo Yu et al. (4 authors)
📅 2026-05-13
In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot...
20
Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten
📅 2026-05-13
Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this...
21
PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
Mikhail Menschikov, Matvey Iskornev, Alexander Kharitonov et al. (11 authors)
📅 2026-05-13
We introduce PersonalAI 2.0 (PAI-2), a novel framework, designed to enhance large language model (LLM) based systems through integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. The central point of PAI-2 design is its...
22
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong et al. (9 authors)
📅 2026-05-13
Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find...
23
TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
Zabir Al Nazi, Shubhashis Roy Dipta
📅 2026-05-13
Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether...
24
Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
Xiaozhe Zhang, Chaozhuo Li, Hui Liu et al. (7 authors)
📅 2026-05-13
Large language models remain vulnerable to adversarial prompts that elicit harmful outputs. Existing safety paradigms typically couple red-teaming and post-training in a closed, policy-centric loop, causing attack discovery to suffer from rapid saturation and limiting the exposure of novel failure modes, while leaving defenses inefficient, rigid, and difficult to transfer across victim models. To...
25
From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks
Neh Majmudar, Anne Huang, Jinfan Frank Hu et al. (4 authors)
📅 2026-05-13
In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the...
26
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
Liangtian Liu, Zeyuan Wang, Ziyu Li et al. (11 authors)
📅 2026-05-13
The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or...
27
Query-Conditioned Test-Time Self-Training for Large Language Models
Chaehee Song, Minseok Seo, Yeeun Seong et al. (5 authors)
📅 2026-05-13
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates...
28
A Horn extension of DL-Lite with NL data complexity
Janos Arpasi, Bartosz Jan Bednarczyk, Magdalena Ortiz
📅 2026-05-13
The literature on ontology-mediated query answering (OMQA) has been shaped by two key results: first-order rewritability for DL-Lite, and PTime-hardness of data complexity for essentially every description logic beyond it. This has effectively positioned DL-Lite as the only practical choice for query rewriting, restricting OMQA solutions to first-order queries and ontologies that can be rewritten...
29
Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models
Shuqiang Wang, Wei Cao, Jiaqi Weng et al. (7 authors)
📅 2026-05-13
Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multi-step inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces, when confronted with incomplete or logically inconsistent inputs....
30
Diversity of Extensions in Abstract Argumentation
Johannes K. Fichte, Markus Hecher, Yasir Mahmood et al. (4 authors)
📅 2026-05-13
Argumentation is an important topic of AI for modelling and reasoning about arguments. In abstract argumentation, we consider directed graphs, so-called argumentation frameworks (AF), that express conflicts between arguments. The semantics is defined by the notion of extensions, which are sets of arguments that satisfy particular relationship conditions in the AF. Usually, standard reasoning in...
31
FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages
Sarmistha Das, Vaibhav Vishal, Syed Ibrahim Ahmad et al. (5 authors)
📅 2026-05-13
Financial decision-making in multilingual settings demands accurate numerical reasoning grounded in diverse modalities, yet existing benchmarks largely overlook this high-stakes, real-world challenge, especially for Indic languages. We introduce FinVQA, a benchmark for evaluating financial numerical and multimodal reasoning in multilingual Indic contexts. FinVQA spans English, Hindi, Bengali,...
32
IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation
Joy Bose
📅 2026-05-13
Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge...
33
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Yafu Li, Runzhe Zhan, Haoran Zhang et al. (28 authors)
📅 2026-05-13
Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a...
34
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
Tom Zehle
📅 2026-05-13
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these...
35
What properties of reasoning supervision are associated with improved downstream model quality?
Mikołaj Langner, Dzmitry Pihulski, Jan Eliasz et al. (8 authors)
📅 2026-05-13
Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct...
36
Delightful Exploration
Ian Osband
📅 2026-05-13
Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to $\varepsilon$-greedy, which bounds disruption but spends its override blindly. We introduce \textit{Delight-gated exploration} (DE), a host--override rule that spends exploratory actions only when their prospective delight (expected...
37
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Weiqing Luo, Zongye Hu, Xiao Wang et al. (6 authors)
📅 2026-05-13
Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence...
38
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
Junlong Ke, Zichen Wen, Weijia Li et al. (5 authors)
📅 2026-05-13
On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution....
39
A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations
Rabindra Lamsal, Sisi Zlatanova, Haowen Xu et al. (5 authors)
📅 2026-05-13
Building Information Modeling (BIM) is widely used in the Architecture, Engineering, and Construction (AEC) industry, but the complexity of Industry Foundation Classes (IFC) limits accessibility for non-expert users. To address this, we introduce IfcLLM, a hybrid framework for natural language interaction with IFC-based BIM models. It transforms IFC models into complementary representations: a...
40
Teacher-Guided Policy Optimization for LLM Distillation
Xinyu Liu, Kechen Jiao, Chunyang Xiao et al. (12 authors)
📅 2026-05-13
The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative...
41
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
Xiao Liu, Nayu Liu, Junnan Zhu et al. (9 authors)
📅 2026-05-13
Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for...
42
An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing
Hanwen Zhang, Dusit Niyato, Wei Zhang et al. (5 authors)
📅 2026-05-13
In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile,...
43
Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
Hao Zhou, Tiru Wu, Yan Jiang et al. (6 authors)
📅 2026-05-13
Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi-agent systems primarily focus...
44
When Does Hierarchy Help? Benchmarking Agent Coordination in Event-Driven Industrial Scheduling
Ziqi Wang, Yuhao Yang, Zhiwei Ling et al. (5 authors)
📅 2026-05-13
Recent advances in agent and multi-agent systems have shown strong performance on tool use, reasoning, and collaborative tasks. However, existing benchmarks mostly evaluate task completion in weakly coupled environments, and provide limited support for studying coordination in shared, dynamically evolving systems with hierarchy and coupled constraints. This leaves an important question...
45
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Moritz Firsching, Paul Lezeau, Salvatore Mercuri et al. (11 authors)
📅 2026-05-13
As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research...
46
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
Changpeng Wang, Xin Lin, Junhan Liu et al. (8 authors)
📅 2026-05-13
Multimodal large laboratory models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM...
47
GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
Jinwoong Kim, Rui Yang, Huishuai Zhang
📅 2026-05-13
We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagram as an interactive construction task: given...
48
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
Chenjun Xu, Zhennan Zhou, Zhan Su et al. (6 authors)
📅 2026-05-13
Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher...
49
Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning
Rikui Huang, Shengzhe Zhang, Wei Wei
📅 2026-05-13
Temporal Knowledge Graph Reasoning (TKGR) aims at inferring missing (especially future) events from historical data. Current evaluation in TKGR uniformly weights all events, ignoring that most are trivial repetitions, which overestimate the true reasoning ability. Therefore, the rare outstanding events, whose prediction demands deeper reasoning, should be distinguished and emphasized. To this...
50
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
Guoxiong Gao, Zeming Sun, Jiedong Jiang et al. (8 authors)
📅 2026-05-13
Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof -- a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise...