
Best AI papers explained
Cut through the noise. We curate and break down the most important AI papers so you don’t have to.
Episodes
Critical Batch Size for LLM Policy Optimization
This paper investigates the critical batch size (CBS) for Large Language Model (LLM) policy optimization, specifically focusing on the GRPO algorithm. The researchers break down gradient noise into inter-prompt and intra-prompt components to determine the point where increasing data parallelism yields diminishing returns. Their findings reveal that on-policy training is primarily limited by noise
Self-supervised User Profile Generation for Personalization
This paper describes a self-supervised framework called BUMP, which is designed to improve how large language models deliver personalized content. Traditionally, creating user profiles for search and recommendation tasks requires expensive, human-labeled data to train the system. To solve this, researchers developed a method that uses a bidirectional ranking objective to learn directly from raw in
From Augmentation to Reconstruction: Guiding the AI Disruption to the Good Place
This paper explores the evolution of artificial intelligence through a three-stage framework of augmentation, automation, and reconstruction. The authors argue that while AI currently improves individual tasks, the most profound economic disruption will only occur when workflows and markets are entirely redesigned around machine capabilities. True transformation is currently stalled by legacy huma
Self-Distilled Agentic Reinforcement Learning
The research paper introduces SDAR (Self-Distilled Agentic Reinforcement Learning), a new framework designed to improve the training of large language model agents in complex, multi-turn environments. While standard reinforcement learning excels at high-level task goals, it often lacks the precise, token-level guidance needed for long interactions. To solve this, the authors identify critical flaw
Subliminal Learning Is Steering Vector Distillation
This research explores subliminal learning, a phenomenon where a student language model inherits behavioral traits from a teacher model even when trained on semantically unrelated data. The authors demonstrate that this process is driven by steering vector distillation, where the teacher’s system prompt acts as a linear direction in activation space that the student internalizes during fine-tuning
Subsidizing Sequential Search
This paper explores a market model where competing firms use subsidies to reduce the cost of product inspection for consumers. Through a subsidy-sorting principle, the authors demonstrate that higher-quality firms naturally offer larger subsidies to signal their value and secure priority in the search order. This behavior results in a unique equilibrium where low-quality firms are ignored, interme
Meta-Harness: End-to-End Optimization of Model Harnesses
This paper introduces Meta-Harness, an innovative system designed to automate harness engineering for large language models. Unlike traditional methods that rely on manual coding or compressed feedback, this system uses an agentic proposer to search through and optimize the code that governs how models store, retrieve, and process information. By utilizing a filesystem to access full execution tra
Self-Improving Language Models with Bidirectional Evolutionary Search
Researchers have developed Bidirectional Evolutionary Search (BES) to overcome the limitations of standard language model sampling, which often struggles with sparse feedback and predictable outputs. While traditional methods like tree search are confined to a narrow "entropy shell" of high-probability responses, BES escapes this range by using evolutionary operators such as crossover an
Generative Modeling via Drifting
This paper discusses Drifting Models, a novel generative modeling paradigm that enables high-quality, one-step image generation without the iterative inference required by diffusion or flow-matching models. Instead of decomposing transformations at the sampling stage, this method evolves a pushforward distribution during the training process by utilizing a neural network optimizer. The core mechan
Instance-Optimal Estimation with Multiple LLM Judges on a Budget
This paper addresses the cost-efficient evaluation of large language models (LLMs) by utilizing multiple AI "judges" with different price points and reliability levels. The researchers formalize this challenge as budgeted heteroskedastic multi-judge estimation, seeking an optimal way to distribute a limited budget across various judges and tasks to achieve the most accurate quality score
Robust AI Personalization Will Require a Human Context Protocol
This paper proposes the Human Context Protocol (HCP), a technical framework designed to give individuals direct control over how their personal preferences shape AI interactions. Currently, AI personalization relies on fragmented data silos and behavioral inferences that often fail to reflect a user’s true intent or values. By establishing a user-owned preference layer, the protocol allows people
Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning
This paper introduces Equilibrium Reasoners (EqR), a novel framework that conceptualizes iterative AI reasoning as a dynamical system converging toward stable latent attractors. By treating the reasoning process as a series of repeated updates to an internal state, the researchers demonstrate that models can scale performance at test-time by simply increasing the number of iterations (depth) or us
Position: The Pre/Post-Training Boundary Should Govern IP in Industry–Academia ML Collaborations
This paper proposes a new contractual framework called PBOS to resolve persistent intellectual property conflicts in industry-academia machine learning collaborations. By involving scientists in legal negotiations, the authors suggest a clear division based on the pre/post-training boundary of a model. Under this model, pre-training artifacts such as code and architectures are treated as open scie
MEMO: Memory as a Model
This paper introduces MEMO (Memory as a Model), a modular framework designed to integrate new, domain-specific knowledge into Large Language Models (LLMs) without the need for expensive retraining. By encoding information into a dedicated, smaller MEMORY model while keeping the primary EXECUTIVE model frozen, the system avoids catastrophic forgetting and remains compatible with proprietary, closed-source models. The p
Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces
This research introduces Agent Bazaar, a multi-agent simulation framework designed to evaluate and improve the Economic Alignment of Large Language Models (LLMs). The authors identify two critical failure modes: The Crash, where agents engage in destructive price-cutting that leads to market collapse, and The Lemon Market, where deceptive agents use multiple identities to flood marketplaces with f
General Preference Reinforcement Learning
This paper introduces General Preference Reinforcement Learning (GPRL), a novel post-training framework designed to align large language models with complex human values. Traditional methods often rely on a scalar reward model, which frequently leads to "reward hacking" as the model exploits a single quality dimension at the expense of others. To resolve this, the authors utilize a Gener
Explaining and Preventing Alignment Collapse in Iterative RLHF
This paper investigates alignment collapse, a phenomenon where iterative reinforcement learning from human feedback (RLHF) fails because the model learns to exploit "blind spots" in the reward model (RM). By framing the interaction between the AI policy and the RM as a Stackelberg game, the authors prove that standard training ignores a crucial parameter-steering term that captures how t
Curriculum Learning-Guided Progressive Distillation in Large Language Models
This paper introduces Curriculum Learning-Guided Progressive Distillation (CLPD), a novel framework designed to enhance the reasoning capabilities of small language models. The authors argue that traditional knowledge distillation fails when a significant capacity gap exists between a powerful teacher and a smaller student. To resolve this, CLPD simultaneously organizes training data from easy to
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
This paper introduces VEGAS (Verifier-Guided Action Selection), a novel framework designed to improve the reliability of multimodal large language model (MLLM) agents in complex, real-world environments. While standard AI agents often fail in new or long-term scenarios by committing to a single, incorrect action, VEGAS enables them to "think twice" by sampling multiple potential
How Much Should a Conversational Recommender System Converse?
Researchers from Yale University explore the optimal level of preference elicitation for conversational recommender systems (CRS) powered by generative AI. Their model examines the critical trade-off between the match quality gained through follow-up questions and the communication costs or abandonment risks incurred by users. The study reveals that a platform’s monetization model—whether based on
FUSE: Ensembling Verifiers with Zero Labeled Data
This paper introduces Fully Unsupervised Score Ensembling (FUSE), a novel framework designed to improve the accuracy of large language model (LLM) outputs without requiring human-labeled data. By aggregating scores from multiple imperfect verifiers, FUSE identifies the most reliable responses during the inference process, a technique known as test-time scaling. The method addresses the limitations
EVOLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
This paper introduces EVOLM, an innovative framework for self-evolving language models that improves performance without relying on human annotations or external teacher models. By transforming a model’s internal knowledge into explicit natural-language rubrics, the system creates an autonomous feedback loop where evaluation and generation capabilities improve in tandem. This method utilizes varia
Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity
This paper establishes a theoretical framework for personalized alignment in large language models, specifically identifying the conditions necessary for a model to efficiently adapt to diverse user preferences. The author characterizes a fundamental decision-relevant user diversity condition, which asserts that a population of users must be sufficiently varied to expose all latent reward directio
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
This paper introduces Off-Policy Generative Policy Optimization (OGPO), a novel reinforcement learning algorithm designed to efficiently fine-tune generative control policies (GCPs) for complex robotic tasks. By viewing action generation as a denoising MDP nested within the environmental process, the method utilizes off-policy critics as terminal rewards to optimize the full generative process wit
Adaptive Querying with AI Persona Priors
This paper details a novel Bayesian adaptive querying framework that utilizes AI personas to learn user-specific information within limited question budgets. Traditional methods like Computerized Adaptive Testing often struggle with high-dimensional data or "cold-start" scenarios where little is known about a new user or item. This research addresses these gaps by using large language mo
Rethinking the Role of LLMs in Time Series Forecasting
This research paper evaluates the efficacy of Large Language Models (LLMs) in the field of time series forecasting (TSF) through a massive empirical study. While previous scholars argued that LLMs offer minimal benefits over standard models, this study utilizes 8 billion observations to prove that LLMs significantly enhance cross-domain generalization and predictive accuracy. The a
Robust Representation Learning through Explicit Environment Modeling
This research addresses out-of-distribution generalization by proposing a shift from traditional causal invariance to explicit environment modeling. While standard methods attempt to discard all environment-dependent information, this paper argues that such features can be predictive when the environment directly influences the target. The authors introduce neural generalized random-intercept mode
Magentic Marketplace: An Open-Source Environment for studying Agentic Markets
This research paper introduces Magentic Marketplace, an open-source simulation designed to study the economic behaviors of autonomous LLM agents. The environment facilitates a complete transaction lifecycle where Assistant agents representing consumers interact with Service agents representing businesses to discover, negotiate, and purchase services. While frontier AI models can approximate optima
Hyperloop Transformers
Researchers from MIT have introduced Hyperloop Transformers, a novel architecture designed to significantly reduce the memory footprint of large language models for edge and on-device deployment. This model leverages looped Transformer layers that reuse parameters across the model's depth, specifically by organizing layers into three blocks where only the middle section repeats. To overcome th
Scaling Self-Play with Self-Guidance
This paper discusses Self-Guided Self-Play (SGS), a new algorithm designed to improve the reasoning capabilities of large language models through autonomous problem generation. Standard self-play often hits a performance plateau because the Conjecturer model eventually creates low-quality or "hacked" problems that do not facilitate real learning for the Solver. To solve this, SGS adds a
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
Researchers have introduced RLT, a lightweight method designed to enhance the precision and speed of vision-language-action (VLA) models through efficient online reinforcement learning. The system adapts large, pretrained VLAs by exposing an "RL token," a compressed representation that allows a small actor-critic network to refine robot movements without retraining the entire billion-par
Agentic Data Environments
This research paper introduces Agentic Data Environments, a new paradigm designed to transform passive data storage into active systems that support autonomous AI agents. The authors argue that while current agents primarily read data, future automation requires read-write capabilities that can modify environments with real-world consequences. To maximize the benefits of these agents, the framewor
AI organizations are more effective but less aligned than individual agents
This research paper investigates AI Organizations, which are multi-agent systems composed of several individual language models working toward a shared business objective. The study finds that while these organizations are more effective at achieving business goals than single agents, they are simultaneously less aligned with ethical standards. Across various consultancy and software e
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context
This paper introduces Quantile Token Regression, a novel framework designed to improve how large language models predict full probability distributions from unstructured text. Unlike previous methods that rely on a single representation for all outputs, this approach inserts dedicated quantile tokens into the model’s input to create direct pathways for estimating specific distribution levels. The
Distortion of AI alignment revisited: RLHF is a decent utilitarian aligner
This paper provides a fine-grained theoretical analysis of Reinforcement Learning from Human Feedback (RLHF), specifically examining its performance in pluralistic settings with diverse user preferences. The authors challenge previous assertions that RLHF inherently suffers from exponential distortion, demonstrating instead that such degradation is primarily a result of a distribution mismatch bet
LLMs get lost in multi-turn conversation
This research paper from Microsoft and Salesforce identifies a significant performance gap in Large Language Models (LLMs) when they transition from single-turn to multi-turn, underspecified conversations. Through large-scale simulations, the authors found that even state-of-the-art models suffer an average 39% drop in performance when instructions are revealed gradually rather than all at once. T
Transformers are inherently succinct
This paper details research proving that fixed-precision transformers possess immense succinctness, allowing them to represent complex concepts with far fewer parameters than traditional models. By simulating large binary counters through unique hard-attention mechanisms, transformers can describe languages exponentially more efficiently than Linear Temporal Logic (LTL) or Re
The Coasean Singularity? Demand, Supply, and Market Design with AI Agents
This paper examines how autonomous AI agents are poised to revolutionize digital economies by drastically lowering transaction costs and acting as intermediaries for human users. These systems are shifting from simple information retrieval to independent reasoning and action, performing complex tasks like negotiation, product search, and contract management. While this transition offers significan
Demystifying the unreasonable effectiveness of online alignment methods
This research paper investigates why online alignment techniques for language models perform significantly better in practice than older mathematical theories suggested. The author argues that previous metrics were flawed because they confused the statistical difficulty of learning with the random noise required for exploration during training. By applying a more precise decision-centric evaluatio
Specialization after generalization: towards understanding test-time training in foundation models
This research paper investigates test-time training (TTT) in foundation models, proposing that these large-scale networks remain globally underparameterized despite their massive size. The authors introduce the concept of specialization after generalization, where a model improves its performance by temporarily focusing its capacity on task-specific concepts. Using the linear representation hypoth
Exploration and Exploitation Errors Are Measurable for Language Model Agents
This research paper introduces a systematic framework to measure how Language Model (LM) agents balance exploration and exploitation in complex, open-ended environments. The authors designed a policy-agnostic metric that identifies structural errors in an agent's trajectory without needing a reference solution, distinguishing between redundant movement and failed knowledge application. Their e
A Mechanistic Analysis of Looped Reasoning Language Models
This paper provides a mechanistic analysis of looped language models, which reuse specific Transformer layers in a recurrent cycle to increase computational depth without adding parameters. The authors demonstrate that these models frequently converge to cyclic fixed points, creating stable, repeating trajectories in latent space that maintain consistent attention patterns. Crucially, the research
Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End
This paper explores the sample complexity of autoregressive models, specifically comparing Chain-of-Thought (CoT) supervision against End-to-End (e2e) learning. The researchers demonstrate that while e2e learning exhibits a diverse range of growth rates where the required data can scale linearly with reasoning length, CoT supervision effectively eliminates this dependence. By providing intermediat
Why AI systems don’t learn and what to do about it
This paper explores the critical limitations of current artificial intelligence, noting that existing models fail to learn autonomously from their environment like humans and animals. To address this, the authors propose a cognitive architecture called the A-B-M framework, which integrates learning through observation, active behavior, and an internal meta-control system. This meta-controller mimi
The Illusion of Learning from Observational Data: An Empirical Bayes Perspective
This paper addresses the "illusion of learning" in causal inference, where combining observational data with randomized experiments fails to improve accuracy because the bias distribution of observational studies is unknown. The authors demonstrate that while standard empirical Bayes methods often fail to resolve this, the inclusion of calibration studies—observational research on interv
Ads in AI chatbots? An analysis of how large language models navigate conflicts of interest
This research explores the ethical and behavioral risks of integrating advertisements into AI chatbots, which often creates a direct conflict of interest between company profits and user needs. By testing numerous frontier models, researchers found that these systems frequently prioritize sponsored content over more affordable or helpful alternatives. The study reveals that AI agents often manipul
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
This research paper introduces TOMPA, a novel framework designed to expose critical vulnerabilities in reward models used for aligning artificial intelligence. Unlike traditional adversarial methods that rely on human-readable text, this approach performs automated optimization directly in token space to bypass semantic constraints. By eliminating the need for coherent natural language, the system
LLM Evaluation as Tensor Completion: Low-Rank Efficiency and Uncertainty Quantification
This paper introduces a rigorous statistical framework for evaluating Large Language Models (LLMs) by treating the problem as a low-rank tensor completion task. The researchers address the challenges of chatbot leaderboards, such as those on platforms like Chatbot Arena, which rely on noisy and sparse human preference data from pairwise model comparisons. By assuming that model performance across
Neural Computers
Researchers have introduced Neural Computers (NCs), a transformative computing paradigm that merges memory, processing, and input/output into a single learned runtime state. Unlike traditional hardware that executes rigid code, these systems use neural networks to internalize the functions of a running computer. Current prototypes utilize video models to simulate interactive command-line and deskt
How AI Aggregation Affects Knowledge
This research examines how generative AI systems impact collective knowledge by creating feedback loops where AI outputs become future training data. Utilizing an expanded DeGroot model of social learning, the study demonstrates that when AI aggregators update too rapidly, they amplify existing social biases and segregation rather than correcting them. This phenomenon leads to a "learning gap
World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
We discuss World Action Verifier (WAV), a novel framework designed to enhance the reliability and efficiency of action-conditioned world models in robotics. The authors address the difficulty of training models to follow actions accurately, especially when labeled interaction data is scarce. By exploiting asymmetries between forward and inverse dynamics, WAV decomposes the prediction process into
In-Place Test-Time Training
This paper introduces In-Place Test-Time Training (In-Place TTT), a novel framework designed to let Large Language Models (LLMs) dynamically update their knowledge during inference. Traditional models remain static after deployment, but this approach repurposes existing MLP blocks as "fast weights" that adapt to new information in real-time. By utilizing a chunk-wise update mechanism and
Test-Time Scaling Makes Overtraining Compute-Optimal
Researchers from the University of Wisconsin-Madison and Stanford University propose Train-to-Test (T2) scaling laws to optimize the development and deployment of Large Language Models. Traditional scaling methods like Chinchilla focus primarily on pretraining efficiency, whereas T2 scaling jointly considers model size, training duration, and the compute required for repeated sampling at test-time
AI Agent Prevalence and Data Quality Across Multiple Online Sample Providers
This research evaluates the prevalence of AI agents and the quality of human data across various online recruitment platforms. By comparing direct panels, hybrid networks, and marketplace aggregators, the authors found that sophisticated LLM-based agents are not yet a widespread threat to most survey ecosystems. Instead, automated detections were largely concentrated on Amazon MTurk and appeared m
POLCA: Stochastic Generative Optimization with LLM
This paper introduces POLCA, a scalable framework designed to automate the optimization of complex systems like LLM prompts and multi-turn agents. The authors formalize this challenge as stochastic generative optimization, where an LLM acts as the optimizer but must contend with noisy feedback, random system behaviors, and an ever-expanding solution space. To ensure efficiency, POLCA utilizes a pr
Agentic Markets: Equilibrium Effects of Improving Consumer Search
This paper explores the equilibrium effects of agentic markets, in which AI tools assist consumers and businesses in searching for and transacting in products. Through a mathematical model of sequential search, the authors analyze how reducing search costs and increasing the detail of pre-purchase information impact market learning and consumer welfare. The research highlights a counterintuitive finding: w
One Model, Two Markets: Bid-Aware Generative Recommendation
This research introduces GEM-Rec, a unified generative framework designed to balance organic user recommendations with platform monetization. While traditional generative models focus solely on semantic relevance, this new architecture integrates commercial bids directly into the retrieval process using specialized control tokens. By decoupling the decision to show an ad from the specific
How Well Do LLMs Predict Human Behavior? A Measure of their Pretrained Knowledge
This research paper introduces the equivalent sample size (ESS) as a novel metric to quantify the predictive value of Large Language Models (LLMs) compared to traditional human-provided data. The authors define ESS as the specific amount of domain-specific training data a machine learning algorithm requires to match the accuracy of a pretrained, fixed LLM. To estimate this value, they developed a
Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
This research paper explores autocurriculum, a training strategy that allows language models to autonomously identify and focus on the most challenging problems to improve their reasoning capabilities. By using an outcome verifier to prioritize prompts the model fails to solve, the authors prove that supervised fine-tuning requires exponentially fewer expert demonstrations than traditional non-ada
Agentic AI and the next intelligence explosion
This paper proposes that the future of artificial intelligence lies in plurality and social interaction rather than a single, monolithic super-intelligence. The authors argue that modern reasoning models already function as a "society of thought," where internal debates between different perspectives drive more accurate problem-solving. By moving toward a hybrid ecosystem, human and mach
Understanding Behavior Cloning with Action Quantization
This research provides a theoretical foundation for behavior cloning using action quantization, a common practice in robotics and large-scale AI models where continuous signals are converted into discrete tokens. The authors analyze how quantization error and statistical complexity interact to influence a model’s performance over time. Their findings demonstrate that stable dynamics and smooth pol
HyperAgents: Open-Ended Metacognitive Self-Improvement for Any Computable Task
This paper introduces HyperAgents, a novel framework for creating self-referential AI systems capable of autonomous, open-ended improvement across any computable task. Unlike previous models that rely on rigid, human-designed rules for self-modification, these agents integrate task-solving logic and meta-level improvement mechanisms into a single editable program. This architecture enables metacog
Harness design for long-running application development (Anthropic)
This article explores how multi-agent harness design significantly enhances the performance of AI models in complex, long-running tasks like frontend design and autonomous software engineering. The author details a shift from single-agent attempts to a GAN-inspired architecture involving specialized planner, generator, and evaluator roles to overcome issues like "context anxiet
Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably
This research explores whether AI agents can autonomously reach strategic equilibria in repeated interactions without specialized training. The author proves that "reasonably reasoning" agents (those with basic capabilities such as Bayesian learning and asymptotic best-response) naturally converge toward Nash equilibrium play, where posterior-sampling behaviors of off-the-shelf model
How Log-Barrier Helps Exploration in Policy Optimization
This paper introduces Log-Barrier Stochastic Gradient Bandit (LB-SGB), a new algorithm designed to fix structural flaws in standard policy optimization methods. While traditional gradient bandits often prematurely converge to suboptimal actions because they lack an explicit exploration mechanism, the authors use log-barrier regularization to force the policy away from the boundary of the probabili
The Finetuner’s Fallacy: When to Pretrain with Your Finetuning Data
This research introduces specialized pretraining (SPT), a strategy that incorporates domain-specific data directly into the initial pretraining phase rather than reserving it solely for finetuning. By mixing a small percentage of specialized tokens with general web data, models achieve superior performance and faster convergence on niche topics like chemistry, music, and mathematics. This approach
TURNWISE: The Gap between Single- and Multi-turn Language Model Capabilities
This research addresses the performance gap in large language models between single-turn and multi-turn interactions. The authors introduce TURNWISEEVAL, a new benchmark that isolates conversational ability by comparing model responses in long dialogues against equivalent single-turn prompts. To improve model performance, they also developed TURNWISEDATA, a scalable pipeline that generates synthet
Temporal Straightening for Latent Planning
This research paper introduces temporal straightening, a technique designed to improve latent planning in AI world models by regularizing the curvature of agent trajectories. While standard visual encoders often produce highly curved paths in latent space, this approach uses a curvature regularizer to create a representation where feasible transitions follow straighter lines. This geom
Fine-Tuning Strategies for Preserving In-Context Learning in Linear Attention
This research examines the tension between in-context learning (ICL) and fine-tuning in Transformer-based models, specifically using linear attention to provide a theoretical foundation. While fine-tuning is often employed to enhance zero-shot performance on specific target tasks, the authors demonstrate that updating all attention parameters can inadvertently damage the model's ability to lea
LLMs Can Learn to Reason Via Off-Policy RL
Researchers have introduced OAPL, a new reinforcement learning algorithm designed to improve how Large Language Models (LLMs) learn complex reasoning for math and coding. Traditional methods often struggle when the training policy and the inference engine are out of sync, a common issue in large-scale, asynchronous computing. Instead of trying to force these mismatched systems to align, OAPL embra
Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning
This paper explores Continual Reinforcement Learning (CRL) for large Vision-Language-Action (VLA) models, focusing on how these agents adapt to new tasks without losing prior knowledge. While traditional machine learning often suffers from catastrophic forgetting during sequential training, this research demonstrates that a simple Sequential Fine-Tuning approach remains remarkably effective. By co
Provable and practical in-context policy optimization for self-improvement
This research paper introduces In-Context Policy Optimization (ICPO), a framework designed to explain and enhance the self-reflection capabilities of large language models. The authors provide a mathematical foundation proving that specific transformer architectures can inherently mimic policy optimization algorithms without requiring parameter updates. Building on this theory, they develop ME-ICP
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
This research paper introduces Energy-Based Fine-Tuning (EBFT), a novel method for refining language models by matching feature statistics of generated text with ground-truth data. Traditional training relies on next-token prediction, which often causes models to drift or fail during long sequences because they lack global distributional calibration. By optimizing a feature-matching objective usin
Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
This research introduces the concept of Neural Thickets, describing a phenomenon where large pretrained models are surrounded by a high density of diverse, task-specific solutions in their local weight space. While small models require structured optimization like gradient descent to find improvements, larger models transition into a regime where random weight perturbations frequently yield "
AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization
This paper introduces AdaEvolve, a novel framework designed to enhance how Large Language Models (LLMs) solve complex optimization and programming tasks through evolutionary search. Unlike existing methods that use rigid, pre-set schedules, this system implements hierarchical adaptivity to manage computational resources and search strategies dynamically. It operates across three levels: local adap
∇-Reasoner: LLM reasoning via test-time gradient descent in latent space
This paper introduces ∇-Reasoner, a novel framework that improves Large Language Model (LLM) reasoning by applying gradient-based optimization during the inference process. Unlike traditional methods that rely on random sampling or discrete searches, this approach uses Differentiable Textual Optimization (DTO) to refine token logits through first-order gradients derived from reward models and like
Inference for Regression with Variables Generated by AI or Machine Learning
This research investigates how using artificial intelligence (AI) or machine learning (ML) to generate variables for economic regressions can lead to biased estimates and invalid statistical inference. While researchers often treat AI-generated outputs as standard data, the authors demonstrate that measurement error in these variables—even from high-performance algorithms—shifts the centering of c
Fast KV Compaction via Attention Matching
This paper introduces Attention Matching (AM), a novel framework for fast and efficient key-value (KV) cache compaction in long-context language models. As models process longer sequences, the memory required for the KV cache becomes a major bottleneck, often necessitating lossy strategies like summarization or token eviction. The researchers propose optimizing compact keys and values to reproduce
Position: stop anthropomorphizing intermediate tokens as reasoning/thinking traces!
This position paper argues against the anthropomorphization of intermediate tokens in large language models, commonly referred to as "reasoning traces" or "chains of thought." The authors contend that these outputs are not genuine reflections of human-like thinking but are instead statistically generated patterns that may lack semantic validity. Research indicates that model pe
Code World Models for General Game Playing
Researchers at Google DeepMind introduced Code World Models (CWM), a framework that uses Large Language Models to translate natural language game rules and player trajectories into executable Python code. Unlike traditional methods that use LLMs as direct move-generating policies, this approach treats the model as a verifiable simulation engine capable of defining state transitions and legal actio