Eye on AI Weekly Research Watch

Beyond Sparse Supervision: Diffusion-Guided Learning for Few-Shot Graph Fraud Detection Jun 30, 2026 121 Financial fraud detection in transaction networks faces a fundamental challenge: fraudulent activity is rare, well-disguised, and often underrepresented in labeled data. Standard graph neural networks tend to smooth out the very irregularities that signal fraud. ADC-GNN tackles this with three complementary mechanisms: diffusion-guided feature augmentation that stabilizes node representat

Toward Robust In-Context Segmentation via Concept Guidance Jun 30, 2026 149 In-context segmentation asks a model to identify target regions in new images using only a handful of labeled reference examples — no retraining required. Current approaches work by matching low-level visual features between references and queries, making them brittle when references vary in viewpoint, lighting, or appearance. CG-ICS instead extracts high-level semantic concepts from refe

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models Jun 30, 2026 181 Jailbreak attacks — prompts engineered to make safety-aligned LLMs produce harmful outputs — are a persistent concern, but exactly how they work mechanistically has remained murky. This paper provides evidence that successful attacks don't erase safety representations; they selectively suppress specific "Adversarially Compromised Heads" in early attention layers while leaving "Safety-Alig

Tandem Reinforcement Learning with Verifiable Rewards Jun 30, 2026 139 Reinforcement learning has dramatically improved LLM reasoning on tasks like competition math — but the resulting models often reason in ways that are difficult for weaker models or humans to follow, limiting their real-world utility. Tandem Reinforcement Learning (TRL) addresses this by co-training a strong "senior" model alongside a frozen "junior" model: both contribute to generating r

CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association Jun 30, 2026 193 Large-scale studies linking heart imaging measurements to disease risk typically rely on pre-defined, single-variable features chosen by experts — an approach that may miss important non-linear relationships or interactions between measurements. CPAgents automates the discovery of richer, composite phenotypes (ratios, polynomial combinations, interaction terms) through a three-agent loop:

LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior Jun 30, 2026 154 Getting multiple AI agents to work together effectively in a shared physical environment is harder than it sounds — agents frequently act on outdated assumptions about their partners or issue redundant, mistimed communications. LLawCo addresses this by having agents reflect on past failures to extract high-level "laws of cooperation," such as knowing when to speak and when to wait, then f

Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction Jun 30, 2026 148 Predicting how hard an exam question will be for human test-takers — without running expensive human trials — would transform educational assessment. This paper proposes using the reasoning traces of large language models as a proxy for human cognitive effort. Rather than treating these traces as raw text, Epi2Diff structures them into meaningful "cognitive episodes" — functional states l

The Remittance Blueprint: Data-driven Intelligence for Sri Lanka Jun 30, 2026 163 Remittances — money sent home by migrant workers — are a lifeline for many developing economies, yet surprisingly hard to forecast reliably. This study applies rigorous time-series and machine learning methods to 32 years of Sri Lankan migration and remittance data, finding that external factors like exchange rates and global oil prices drive inflows far more than domestic indicators. A m

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration Jun 30, 2026 160 Building robots that can understand and interact with the physical world requires massive amounts of 3D training data — but capturing that data with multi-camera rigs is expensive and impractical at scale. HAT-4D proposes using ordinary monocular video as a data source, reconstructing the 3D geometry and temporal dynamics of multiple interacting objects with the help of vision-language mo

Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives Jun 30, 2026 144 As AI systems increasingly act as proxies for human stakeholders in shared learning environments, a thorny question arises: how do you fairly reward each participant's contribution when different contributors have different values — and when some contributions might violate those values? This paper proposes a framework that filters gradient updates by each principal's value profile before

Exposure Bias Can Alleviate Itself via Directional and Frequency Rectification in Flow Matching Jun 30, 2026 160 Flow matching is a powerful framework for generating images and other data by learning to map noise to structure, but it suffers from a training-inference mismatch: models are trained on clean trajectories but must operate on drifted ones at test time. DEFAR turns this problem on its head, treating the drift itself as a useful signal. It uses the bias to learn corrective directions and to

Govern the Repository, Not the Agent: Measuring Ecosystem-Level Risk in AI-Native Software Jun 30, 2026 174 When AI agents autonomously write and merge code at scale, the usual way of evaluating them — task by task, in isolation — misses something important: the cumulative friction and technical debt that builds up in shared codebases over time. Studying over 930,000 agent-authored pull requests, this paper finds that about half of "integration friction" is a property of the repository ecosyste

How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks Jun 30, 2026 167 Why do bigger neural networks tend to perform better — and by exactly how much? Scaling laws attempt to answer this, but most existing theory relies on simplified assumptions about infinite width or unlimited data. This work studies how generalization error changes as both model width and dataset size vary simultaneously in a tractable two-layer network, revealing a phase diagram with dis

Learning Topology-Aware Representations via Test-Time Adaptation for Anomaly Segmentation Jun 30, 2026 199 Detecting defects in manufactured goods — a crack in a circuit board, a tear in fabric — requires models that can generalize across wildly different visual conditions. TopoTTA brings an unusual tool to this problem: persistent homology, a mathematical framework that captures the shape and connectivity of structures across scales. Rather than relying on simple pixel-confidence thresholds,

Agent-Native Immune System: Architecture, Taxonomy, and Engineering Jun 30, 2026 195 As AI agents gain the ability to use tools, access memory, and coordinate with other agents, they become vulnerable to entirely new classes of attacks — malicious instructions injected through tool outputs, poisoned memory, or compromised peer agents. ANIS proposes a defense architecture modeled on the biological immune system, embedded directly inside the agent's reasoning process rather

Parameter Efficient Hybrid Transformer (PEHT) for Network Traffic Prediction via Dynamic Urban Congestion Integration Jun 30, 2026 130 Cellular networks in cities are under constant, unpredictable stress — traffic jams, concerts, and commutes all reshape how and where data flows. Predicting this demand accurately is essential for carriers to allocate bandwidth intelligently. PEHT introduces a transformer-based model that separates core network traffic signals from external urban mobility and congestion data, then fuses t

Towards Automating Scientific Review with Google's Paper Assistant Tool Jun 30, 2026 176 The volume of scientific papers being published is growing faster than human reviewers can keep pace with — a crisis accelerated by AI-assisted research generation. This paper proposes a taxonomy of AI-human collaboration levels in peer review, then introduces PAT (Paper Assistant Tool), an agentic system that reads full manuscripts and produces structured evaluations, including checks of

Agentic Hardware Design as Repository-Level Code Evolution Jun 30, 2026 151 Designing computer chips is extraordinarily complex, requiring expertise across logic, timing, and physical layout — making it a compelling frontier for AI automation. HORIZON treats hardware design the same way modern AI treats software: as an evolving codebase that an agent can iteratively improve. By wrapping design tasks in a structured "project pack" with executable evaluators, the a

Which Nash Equilibrium? Solver-Dependent Selection on Zero-Sum Nash Polytopes Jun 30, 2026 189 In competitive games — from poker to cybersecurity — there isn't always a single optimal strategy, but rather a whole family of equally valid equilibria. Which one an AI solver picks can quietly determine how it behaves against opponents who don't play perfectly. This work reveals that the choice of algorithm, not random chance, systematically drives which equilibrium gets selected. Regul

DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand Jun 30, 2026 157 Robotic hands capable of dexterous manipulation have made impressive strides, but teaching a single hand to perform multiple tasks simultaneously — without one task undoing another — remains a hard unsolved problem. Imagine a robot hand that already knows how to hold an object securely; asking it to also open a latch might cause it to loosen its grip. DexCompose addresses this by assignin

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation Jun 23, 2026 160 Building a high-quality speech synthesis system typically requires training multiple specialized models independently, then orchestrating them at inference time — an expensive and memory-intensive process. This paper explores a more compact path: starting with a speech classifier already trained to recognize acoustic properties, and attaching a lightweight generative subnetwork that reuse

Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions Jun 23, 2026 188 IVF success rates are influenced by countless variables, but the physical conditions inside laboratory incubators — temperature stability, humidity adherence, recovery speed after disturbances — have historically been modeled crudely if at all. This paper demonstrates that richly engineered temporal features from environmental sensors, combined with a hierarchical Bayesian model that pool

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems Jun 23, 2026 157 As AI agents gain access to tools with real-world consequences, attackers have begun automating their jailbreak campaigns — using language models to generate, evaluate, and refine prompts at scale. Standard defenses that simply refuse suspicious inputs inadvertently help attackers by providing clear feedback signals. This paper proposes a counterintuitive alternative: rather than blocking

UltraQuant: 4-bit KV Caching for Context-Heavy Agents Jun 23, 2026 141 Language model agents that maintain long, multi-turn conversations place enormous pressure on GPU memory, primarily because the key-value cache — a stored record of prior context — grows with every exchange. At scale, this becomes a bottleneck that throttles how many users a system can serve simultaneously. UltraQuant attacks this problem with aggressive 4-bit compression of the KV cache,
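
To make the memory pressure described in the teaser above concrete, here is a back-of-the-envelope sketch of how a KV cache grows with context length and what 4-bit storage would save compared with 16-bit. It is not UltraQuant's method. The function name and the model shape are hypothetical, and the estimate ignores any per-block scale or zero-point metadata that a real quantization scheme would add.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Bytes needed to hold the keys and values of one sequence."""
    # Factor of 2: one key vector and one value vector per token, head and layer.
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=100_000)

fp16_bytes = kv_cache_bytes(bits_per_value=16, **shape)
int4_bytes = kv_cache_bytes(bits_per_value=4, **shape)

print(f"16-bit cache: {fp16_bytes / 1e9:.2f} GB")
print(f" 4-bit cache: {int4_bytes / 1e9:.2f} GB")
```

Under these assumed dimensions the cache scales linearly with `num_tokens`, so every additional turn of conversation adds a fixed cost per token, and dropping from 16 to 4 bits cuts that cost by a factor of four before metadata overhead.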

Optimal Order of Multi-Agent and General Many-Body Systems Jun 23, 2026 165 As AI systems increasingly coordinate in networks — fleets of trading agents, swarms of robotic systems, distributed planning architectures — questions about collective behavior become urgent. When should agents synchronize tightly, and when should they maintain independence? This paper develops a formal framework borrowing concepts from physics and economics, modeling collective outcomes

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems Jun 23, 2026 207 Multi-agent systems that use language models to evaluate each other's outputs are gaining traction in automated research, code review, and content moderation pipelines. But when one agent's bias influences another's, errors can compound silently across the network. This paper formalizes that risk with the Contagion Networks framework, measuring how systematically biased evaluators propaga

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software Jun 23, 2026 165 Security teams are increasingly exploring whether large language models can automatically detect vulnerabilities in source code — a task with serious consequences if done poorly. This paper delivers a sobering assessment: even fine-tuned models that score well on benchmarks may be learning surface-level patterns rather than genuine security reasoning. Using carefully curated Linux kernel

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining Jun 23, 2026 196 Generative image models are increasingly asked to do something cognitively demanding: take the content of one image and the style of another, and fuse them seamlessly without letting either bleed into the wrong dimension. This is harder than it sounds — style references tend to smuggle in unwanted structural or semantic content. FreeStyle approaches this challenge by mining the large comm

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? Jun 23, 2026 141 Jailbreaks via in-context examples are a known vulnerability of language models, but the underlying mechanics have remained murky. Why does showing a model a few harmful exchanges cause it to comply with further harmful requests? This paper dissects the phenomenon carefully, mixing benign and harmful demonstrations to isolate what models actually extract. Surprisingly, benign demonstratio

Efficient and Sound Probabilistic Verification for AI Agents Jun 23, 2026 168 AI agents operating in enterprise environments — browsing the web, calling APIs, reading files — must be constrained by security policies. Prior work on policy enforcement assumed those policies were deterministic, but real tools like PII detectors or content classifiers have inherent failure probabilities. This paper introduces a framework grounded in distributionally robust optimization

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages Jun 23, 2026 181 Code generation benchmarks have become central to how the AI community measures progress, but nearly all of them default to Python — a language that dominates training data and may be inflating model scores. Real software engineering, however, demands fluency across Rust, Go, Java, TypeScript, and many others. Multi-LCB extends the established LiveCodeBench framework to twelve languages w

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS Jun 23, 2026 142 Even the best text-to-speech systems stumble on proper nouns — a product name pronounced wrong in a voice assistant, or a person's name mangled by a navigation system, can undermine trust immediately. Retraining a full TTS model to fix these errors is expensive and slow. FlowEdit offers an elegant alternative: when a correction is provided, it stores a targeted adjustment in an associativ

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes Jun 23, 2026 188 Giving AI agents the ability to modify cloud infrastructure, databases, or deployment pipelines introduces a dangerous gap: a model that reasons incorrectly or gets manipulated could execute irreversible, high-impact actions. Existing security frameworks authorize identities, but they do not enforce that a specific certified action plan is what actually gets executed. The Sovereign Execut

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm Jun 23, 2026 170 Satellite radar imagery sees through clouds and darkness, making synthetic aperture radar (SAR) indispensable for disaster response, military surveillance, agricultural monitoring, and climate research. Yet multimodal AI research has largely been built on optical imagery because aligned, richly annotated SAR datasets have been scarce. SARLO-80 closes that gap, offering over 119,000 matche

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs Jun 23, 2026 199 Standard machine learning tells us what is likely given what we observe, but many real-world decisions demand something more: understanding what would have happened under different circumstances. Counterfactual reasoning is essential for fairness auditing, policy evaluation, and causal explanation. DeepSWIP extends DeepProbLog — a framework blending neural perception with logical reasonin

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents Jun 23, 2026 207 Customer service agents powered by language models must juggle multiple responsibilities simultaneously: tracking conversation state, calling external tools, and obeying domain-specific policies — all without losing their place. Current architectures bury all of this in a flat prompt, forcing the model to reconstruct context from scratch on every turn. LedgerAgent introduces a dedicated s

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech Jun 23, 2026 175 Voice interfaces are increasingly governed by natural language instructions — a user might request speech that sounds "warm and conversational" or "brisk and authoritative." But when a text-to-speech system fails to capture that nuance, diagnosing the problem is largely guesswork. This paper borrows the DAAM attribution framework from image generation and applies it to speech diffusion mo

Toward Calibrated Mixture-of-Experts Under Distribution Shift Jun 23, 2026 202 When a model says it is 80% confident, it should be right about 80% of the time — that is calibration, and it matters enormously in high-stakes settings like medicine, finance, and autonomous systems. Mixture-of-experts architectures, which route inputs to specialized sub-models, have shown strong performance gains, but their calibration behavior under real-world distribution shift has be
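
The calibration idea in the teaser above (a model that reports 80% confidence should be right about 80% of the time) can be checked numerically. The sketch below computes expected calibration error (ECE), a commonly used summary statistic. It is a generic illustration rather than the paper's evaluation protocol; the function name, the bin count, and the toy data are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy, per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data (hypothetical): ten predictions, all made at 80% confidence.
well_calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
overconfident = expected_calibration_error([0.8] * 10, [True] * 5 + [False] * 5)

print(well_calibrated)  # close to 0.0: 8 of 10 correct matches 80% confidence
print(overconfident)    # close to 0.3: only 5 of 10 correct at 80% confidence
```

A lower score means stated confidence tracks observed accuracy more closely. Comparing the score on in-distribution data with the score on shifted data is one simple way to see calibration degrade.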

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation Jun 23, 2026 151 Recommendation systems quietly shape what billions of people watch, buy, and read. The latest frontier in this space is generative recommendation, which frames next-item prediction as a generation problem rather than a retrieval one. A core challenge is representing user behavior richly enough for a generative model to reason over it without drowning in noise or computational cost. G2Rec

How Transparent is DiffusionGemma? Jun 23, 2026 179 As AI systems take on more consequential roles, understanding how they reason has become as important as what they produce. Diffusion-based language models like DiffusionGemma represent a departure from traditional autoregressive generation, performing much of their computation in a continuous latent space rather than producing tokens step by step. This raises a pressing question: does th

VISTA: View-Consistent Self-Verified Training for GUI Grounding Jun 15, 2026 158 Teaching AI to click the right button on a screen — GUI grounding — sounds simple but is surprisingly brittle. A core training problem is that reinforcement learning often collapses: on hard instances, every rollout fails, so there's no useful learning signal; on easy ones, every rollout succeeds, equally uninformative. VISTA solves this by generating multiple crops of the same GUI screen
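
The collapse described in the teaser above can be reproduced in a toy calculation. The sketch assumes a group-relative advantage, where each rollout's reward is centered on the group mean and scaled by the group's standard deviation. That is one common way to turn rollout rewards into a learning signal, and it is an assumption here, not something the teaser specifies. The function name and the epsilon constant are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center rewards on the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # hard instance
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # easy instance
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])     # informative instance

print(all_fail)  # every advantage is zero: nothing to learn from
print(all_pass)  # every advantage is zero: nothing to learn from
print(mixed)     # positive for successes, negative for failures
```

Only the mixed group produces non-zero advantages, which is why instances that are uniformly too hard or too easy contribute no gradient under this kind of scheme.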

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation Jun 15, 2026 143 High-throughput scientific experimentation — screening thousands of chemical compounds, for instance — is expensive and irreversible, making it a dangerous domain for unconstrained AI autonomy. CARE solves this by keeping a proven non-LLM optimizer as the default while allowing an LLM to propose challenger strategies, only authorizing the challenger when pre-outcome evidence actually supp

A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems Jun 15, 2026 158 Railway networks are extraordinarily complex — trains of different gauges share limited track, single-track sections require precise coordination, and unexpected disruptions cascade through entire timetables. Most optimization research stops at high-level scheduling, leaving the messy operational details — track switching, gauge compatibility, disruption response — to human operators unde

Sensitivity Shaping for Latent Modeling Jun 15, 2026 170 Generative dynamics models let robots plan behavior in rich, uncertain environments — but safely deploying them requires reliably detecting when the robot is about to enter unfamiliar territory. Existing out-of-distribution detection methods bolt on detectors after the fact, and this paper shows why that fails: if the dynamics model is locally insensitive to different control inputs in cr

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime Jun 15, 2026 153 Most AI failure research is theoretical or laboratory-based — this paper is a rare longitudinal postmortem of a real production LLM agent system running continuously since early 2026, with 22 documented incidents over eight weeks. The most dangerous failure class identified is "fail-plausible": the agent doesn't just fail to report an error, it transforms the error into fluent, convincing

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models Jun 15, 2026 161 Audio AI models have gotten good at recognizing what they hear, but complex reasoning — understanding causation, context, and implication across sound, speech, and music — remains a frontier challenge. A key bottleneck is training data: existing datasets are highly redundant, meaning models see many acoustically similar samples that provide overlapping rather than additive learning signal

Regulating the Machine Contributor: Governance and Policy Alignment in Open Source Jun 15, 2026 166 AI agents can now autonomously plan changes, edit code, and submit pull requests, but open-source infrastructure was built around the assumption of a legally accountable human contributor who can attest to provenance and answer reviewers' questions. This paper systematically maps how six major open-source organizations (including Apache, the Linux Foundation, and SymPy) have responded with c

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health Jun 15, 2026 162 Wearables generate a continuous stream of behavioral data — steps, screen time, sleep — that could power truly proactive health interventions, but it's been unclear which AI architectures best handle these signals across diverse populations and time horizons. This study benchmarks six deep learning models plus two foundation models across 800+ participants, tracking forecast accuracy out

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts Jun 15, 2026 137 Predicting how long a patient will survive — and what risks they face — is one of medicine's most consequential tasks, yet most deep learning survival models treat all patients with a single shared representation that can obscure critical subgroup differences. AdaCSM addresses this with a Mixture-of-Experts framework that dynamically routes patients to specialized risk predictors while si

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms Jun 15, 2026 189 What if a musical masterpiece wasn't just art, but also an accidental blueprint for machine learning architectures? This paper argues — through computational analysis of entropy, dissonance, and self-similarity — that the three movements of Beethoven's Moonlight Sonata structurally instantiate streaming, recurrent, and positional encoding memory architectures respectively. The same pitch

When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks Jun 15, 2026 159 Self-improving AI — where a model uses a verifier to generate its own training feedback — sounds like a path to perpetual improvement, but this paper shows it can silently make models worse. The key problem is task specificity: a verifier that accurately scores math problems may perform near-randomly on multi-disciplinary reasoning, and when it does, it feeds the learner confidently wrong

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing Jun 15, 2026 123 Voice synthesis technology has advanced to the point where synthetic speech is nearly indistinguishable from genuine recordings — a serious problem for voice authentication, call centers, and media verification. This paper transforms a self-supervised speech model into a Mixture-of-Experts architecture, where different specialist networks learn complementary acoustic cues for detecting sp

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models Jun 15, 2026 185 Automatic speech recognition models like Whisper are impressively accurate, but when they fail — or when accountability matters — we rarely know why they made a particular decision. LEAF-X introduces a principled explainability framework that uses entropy patterns in attention heads to identify which audio frames most influenced a transcription. It produces sparser, more faithful attribut

Abstracting Cross-Domain Action Sequences into Interpretable Workflows Jun 15, 2026 168 Every click, tab switch, and file save is a data point — but raw interaction logs are too noisy and granular to reveal how people actually work. WorkflowView uses large language models to convert low-level behavioral logs into high-level activity descriptions, achieving strong semantic accuracy in a zero-shot setting. Tested across browser logs, online learning platforms, and Microsoft Wo

Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications Jun 15, 2026 160 Cameras aren't just optical devices; they're mechanical ones too, and sound can make them vibrate. This paper demonstrates that audible sound frequencies can induce resonance in commercially available cameras, introducing artifacts that fool AI vision systems like YOLO into misclassifying objects, missing targets, or hallucinating things that aren't there. Unlike prior ultrasonic attacks limited to

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows Jun 15, 2026 179 Modern AI agents increasingly divide complex tasks among parallel sub-agents — one searches, another reasons, another drafts — before a synthesizer merges the results. Today, that merging step wastes enormous computation by converting everything back to text first. Parallel-Synthesis bypasses this bottleneck by letting the synthesizer consume raw KV caches directly from parallel workers,

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification Jun 15, 2026 163 Cotton underpins a massive share of global textile production, yet crop diseases routinely devastate yields in farming communities with limited diagnostic infrastructure. CottonLeafVision applies deep learning — specifically DenseNet201 — to classify seven categories of cotton leaf conditions from field photographs, achieving 98% accuracy. Crucially, the framework goes beyond raw accuracy

Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit Jun 15, 2026 150 AI systems paired with proof checkers can now verify mathematical correctness at scale — but verification alone doesn't guarantee value. This paper asks a deeper question: can an AI systematically discover genuinely new, worthwhile mathematics, rather than an endless flood of correct but trivial statements? The authors prove, using formal language theory, that generating non-trivial mathe

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning Jun 15, 2026 143 In the real world, most decisions involve multiple competing goals — reduce emissions and minimize congestion and maximize throughput — and multiple agents who must coordinate to achieve them. Existing multi-agent reinforcement learning often collapses these tensions into a single objective, losing important nuance. PCMA introduces the idea of letting agents develop their own specialized

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning Jun 15, 2026 160 Medical AI assistants are only as trustworthy as their reasoning — and when they hallucinate, the consequences can be life-threatening. Most existing tools for catching hallucinations in medical AI treat errors as a single category, leaving clinicians and developers blind to where reasoning breaks down. ClinHallu addresses this by decomposing the reasoning process into three stages: visua

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests Jun 14, 2026 177 When AI systems are evaluated and trained on test suites, there is a persistent temptation — built into the optimization process itself — to exploit loopholes rather than solve problems genuinely. A coding agent that passes tests by hardcoding expected outputs is not a useful software engineer; it is a sophisticated cheater. CapCode proposes a clever structural solution: deliberately desi

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios Jun 14, 2026 192 Focal cortical dysplasia is among the most common causes of drug-resistant epilepsy, yet its subtle MRI signature is frequently missed even by experienced neuroradiologists. Training AI detectors requires large labeled datasets that are extraordinarily difficult to accumulate for rare neurological conditions. This study demonstrates that generative models can produce synthetic MRI scans r

Online Pandora's Box for Contextual LLM Cascading Jun 14, 2026 249 Running multiple AI models and deciding which to query, in what order, and when to stop is an increasingly common engineering challenge. Calling a powerful but expensive model for every query is wasteful; calling a weak model for hard problems is costly in accuracy. This paper formalizes that tradeoff through elegant economic theory, treating each API call as opening a box whose value is

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning Jun 14, 2026 221 The AI field has celebrated chain-of-thought reasoning as evidence that large models are learning to truly think. This paper introduces a more skeptical lens, exhaustively annotating thousands of reasoning steps to ask whether what looks like reasoning actually functions as reasoning. The findings suggest a troubling pattern: models reproduce the structural shape of human mathematical tho

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills Jun 14, 2026 159 Software engineering agents are among the most commercially consequential AI systems being developed today, yet improving them has been constrained by the cost and scarcity of high-quality training tasks. Socratic-SWE turns this problem inside out: rather than sourcing improvement from external data, it mines the agent's own failure history. Every time the agent struggles or succeeds, tha

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs Jun 14, 2026 168 Global deployment of AI raises a persistent concern: do large language models serve non-English-speaking communities as well as English speakers? This study offers a nuanced and somewhat counterintuitive answer. Models may actually encode more cultural knowledge in local languages than raw accuracy scores suggest — the apparent weakness is partly a language proficiency problem, not a know

Watch, Remember, Reason: Human-View Video Understanding with MLLMs Jun 14, 2026 195 Video is the richest and most demanding medium for artificial intelligence — dense with time, space, sound, and implicit human context. This survey organizes the sprawling landscape of video AI research around three intuitive capabilities that humans naturally bring to watching: perception, memory, and inference. By framing the field through this lens, it becomes easier to identify where

Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability Jun 14, 2026 180 Functional safety standards for cars were written assuming a human driver who can intervene when something goes wrong. Autonomous vehicles fundamentally break that assumption, yet the industry still largely operates under frameworks designed for human-controlled systems. This paper proposes concrete, auditable extensions to the ISO 26262 standard by introducing two new measurable dimensio

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment Jun 14, 2026 180 Vision-language models like CLIP have become foundational infrastructure for image search, multimodal AI assistants, and content moderation. Yet a persistent frustration is that image embeddings encode far more information than any caption captures, creating a mismatch that degrades retrieval and reasoning. TEVI uses captions as a scalpel rather than a label, selectively suppressing irrel

PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams Jun 14, 2026 159 Academic researchers face an overwhelming daily flood of new publications. Static recommendation systems, which treat reading as a one-time ranking exercise, fail to capture how research interests evolve over months and years. PaperFlow models scientific reading the way it actually happens — as a longitudinal process where feedback accumulates and curiosity shifts. By maintaining a living

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle Jun 14, 2026 183 AI systems are increasingly marketed as research assistants capable of literature review, hypothesis generation, and experiment design. But how honestly do existing benchmarks measure genuine research capability versus surface-level task completion? This work argues that current evaluations miss the subtle professional judgment that defines real scientific work — noticing a methodological

Planning-aligned Token Compression for Long-Context Autonomous Driving Jun 14, 2026 197 Safe autonomous driving demands that a vehicle remember not just the last few seconds but extended sequences of interactions — a car that cut in two minutes ago, a pedestrian who paused unexpectedly. Processing all that history at full resolution is computationally prohibitive for real-time systems. COMPACT-VA compresses historical context intelligently, guided not just by recency but by

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders Jun 14, 2026 203 Speech recognition has reached impressive accuracy on human speech, but what happens when a model confidently transcribes silence or background noise as coherent sentences? This hallucination problem in Whisper, a widely deployed transcription system, poses real dangers in medical dictation, legal transcription, accessibility tools, and automated meeting notes. This research demonstrates

Graph Neural Network leveraging Higher-order Class Label Connectivity for Heterophilous Graphs Jun 14, 2026 155 Most graph neural networks were designed with a convenient but often false assumption: that connected nodes tend to be similar. In real-world networks — social platforms, biological interaction graphs, citation networks — this homophily assumption frequently breaks down. Nodes of entirely different types are connected precisely because of their differences. LCC tackles this by capturing r

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification Jun 14, 2026 160 Language is full of expressions whose meaning can't be derived from their parts — idioms, fixed phrases, and culturally embedded constructions that trip up both learners and machines. Turkish presents a particularly interesting case, where idiomatic verb constructions are surface-identical to their literal counterparts. Understanding these distinctions matters for machine translation, lan

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope Jun 14, 2026 179 The shift from AI as a search tool to AI as an autonomous worker represents one of the most significant productivity transitions in modern history. Using real production data, this study quantifies what that shift actually looks like: agents perform dramatically more work per session, complete tasks far faster, and push users toward higher-order thinking rather than routine execution. For

Twelve quick tips for designing AI-driven HPC workflows Jun 14, 2026 195 Scientific computing has traditionally relied on predictable, linear pipelines. AI is disrupting that model entirely, introducing iterative, probabilistic processes that behave very differently from classical workloads. Researchers in genomics, climate science, drug discovery, and astrophysics increasingly need to run large foundation models alongside traditional simulations, but the infr

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning Jun 14, 2026 178 One of the great frustrations in deploying AI systems is that teaching a model something new often erases what it previously knew — a phenomenon called catastrophic forgetting. For AI to be genuinely useful over time, it must accumulate knowledge the way humans do. SETA addresses this by partitioning knowledge into specialized expert modules, ensuring new learning doesn't overwrite old fo

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism Jun 14, 2026 149 As video content explodes across surveillance, medicine, sports analytics, and film, the ability for AI to understand hours-long footage becomes increasingly critical. Current vision-language models choke on extended video because every frame demands processing, creating an unsustainable computational burden. MemDreamer sidesteps this by separating the act of watching from the act of reas

How reliable are LLMs when it comes to playing dice? Jun 14, 2026 176 Probability and statistics form the backbone of countless real-world decisions, from medical diagnoses to financial modeling. This study probes whether large language models can genuinely reason about uncertainty or merely pattern-match their way through standard problems. The findings are sobering: while models excel at textbook-style probability questions, their performance collapses wh
