The CoVar Zeitgeist: June, 2025¶

A larger than normal amount of interesting papers published this month. Featuring:

A method to combine Depth Anything with any sort of absolute depth information to string together absolute depth estimates from monocular imagery.
A paper from MIT that applies a thermodynamic-based approach to understanding LLM/neural net training. Finds a river-valley loss landscape and uses that to guide training methods.
Current approaches towards AI agency focus are focused on empowerment - the ability of an AI agent to influence the future - but Google Deepmind thinks plasticity - the agent’s ability to learn from its past - is equally important. Unfortunately, there’s a tension between these concepts.
Applying RL techniques to LLMs only updates a sparse subset of weights. RLing only on this sparse subset of weights achieves 99% of RLing the entire model.
Proposes Bonsai, a tree-based alternative to UMAP and t-SNE for representing high-dimensional data. Developed for omics data, it generalizes at least to football data.
Google Brain develops a novel RL method using video prediction methods to train models to play Atari.

CoVar

Featured¶

Depth Anything with Any Prior: Develops a method to combine relative depth maps with depth priors that provide absolute depth information to produce accurate absolute depth estimates from monocular imagery.
Neural Thermodynamic Laws for Large Language Model Training: Introduces a novel framework, neural thermodynamic laws, which characterize the LLM training process using analogies to thermodynamics. Posits the existence of a river-valley loss landscape, and uses this to recommend training processes.
Plasticity as the Mirror of Empowerment: Agents have been measured by empowerment: their ability to influence the future. Equally important, this paper argues, is their plasticity: their ability to be influenced by the past. This paper identifies and proves that there is a tension between these two foundational abilities.
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models: A deep dive into the behavior of various reinforcement learning algorithms on improving LLM performance. Finds that each method of RL updates only a sparse subset of weights; moreover, training only on that sparse subset of weights achieves the same effect as training on the entire network.
Bonsai: Tree representations for distortion-free visualization and exploratory analysis of single-cell omics data: Proposes Bonsai, an alternative method to t-SNE and UMAP for representing latent structure in high dimensional data. Bonsai reconstructs a tree relating the high-dimensional data and, among a number of other improvements, is wholly deterministic.
Model Based Reinforcement Learning for Atari: Trains a policy by acting in a world model which performs next-frame and reward prediction. The world model is trained with interaction data from the policy acting in the real environment.

LLMs¶

Large Language Models Are More Persuasive Than Incentivized Human Persuaders: Conducts a study to test the effectiveness of AI as a persuader compared to a human. Found that AI persuaders are more persuasive even when humans are incentivized by economic means.
Circuit Tracing: Revealing Computational Graphs in Language Models: Proposes novel algorithms for divining the underlying behavior of LLMs, and develops a large number of analytic tools to enable such analysis. In a companion paper, applies these methods to Claude 3.5
Harnessing the Universal Geometry of Embeddings: Posits that text embeddings have a universal structure - following the Platonic Representation Hypothesis - and shows this by constructing a method to map text embeddings from different models onto each other without leveraging any paired data or any encoders. Demonstrates robustness to out-of-distribution data.
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs: Sparse Autoencoders are potentially useful for finding interpretable features in neural networks, but can be unreliable in that they can produce sets of features. This paper proposes a metric which prioritizes consistency, and argues that applications following this metric learn consistent semantic features corresponding to ground truth.

LLM Reasoning¶

Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?: Deploys the Blicket test to assess patterns of though in LLMs. Finds that LLMs perform well at discovering disjunctive causal relationships, but fail to discover conjunctive ones.
The Coverage Principle: A Framework for Understanding Compositional Generalization: Notes that LLMs that rely solely on pattern matching struggle to generalize in compositional tasks, and leverages this observation to provide a coverage-based metric for LLM reasoning performance. Finds that there are three ways in which neural networks can generalize: structure-based, property-based, and shared-operator.
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning: If a model is capable, it is likely to be correct when it is confident. This paper uses entropy minimization to leverage this insight and force the model to place more probability in its already-confident responses.

Novel Architectures¶

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities: Existing RAG methods are either limited to a textual corpus, or include only a small number of other modalities. This paper proposes UniversalRAG, a RAG framework which can contain a wide variety of diverse object types.
Continuous Thought Machines: Sakana AI proposes a novel neural net architecture, the Continuous Thought Machine (CTM), which is motivated by the desire to make the function of neural nets more similar to how human brains process information. CTMs do so by incorporating neuron-level temporal processing and enabling capture of temporal dynamics while remaining computationally tractable.
PUZZLE: DISTILLATION-BASED NAS FOR INFERENCE-OPTIMIZED LLMS: Presents Puzzle, a framework for LLM inference which operates by utilizing neural architecture search at a large scale. Given a parent model, Puzzle searches a wide number of architectures to find the optimal one for a given task.

Object Detection¶

LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery: Current phrase grounding segmentation capabilities for remote sensing are limited for complex queries. This paper proposes a new method, LISAT, which can handle queries such as “Locate the truck that is elongated and light-colored, diagonally positioned on the road, contrasting with the surrounding darker pavement.”
Depth Anything with Any Prior: Develops a method to combine relative depth maps with depth priors that provide absolute depth information to produce accurate absolute depth estimates from monocular imagery.

Autonomy & Safety¶

Reasoning Models Don’t Always Say What They Think: Anthropic’s Alignment Science Team investigates whether chain of thought accurately represents a model’s reasoning process. Finds equivocal results: Chain of Though often but not always represents the model’s reasoning process, allowing for nontrivial but insufficient monitoring.
Predicting and explaining AI model performance: A new approach to evaluation: Microsoft proposes an evaluation framework for AI models which involves decomposing benchmarks into tasks, grading AI performance according to these tasks to create and ability profile, and using this ability profile to predict future performance. This benchmarking system generalizes more reliably and provides more fine-grained information than current methods.
Plasticity as the Mirror of Empowerment: Agents have been measured by empowerment: their ability to influence the future. Equally important, this paper argues, is their plasticity: their ability to be influenced by the past. This paper identifies and proves that there is a tension between these two foundational abilities.

Reinforcement Learning¶

Absolute Zero: Reinforced Self-play Reasoning with Zero Data: Is it possible to develop a reinforcement learning with verifiable rewards (RLVR) training paradigm that uses zero real examples? To answer, this paper proposes AbsoluteZero, a paradigm where a model (1) proposes tasks to optimize its own learning and (2) learns how to solve these tasks. Despite using zero real data, models post-trained in this paradigm achieve SOTA results.
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models: A deep dive into the behavior of various reinforcement learning algorithms on improving LLM performance. Finds that each method of RL updates only a sparse subset of weights; moreover, training only on that sparse subset of weights achieves the same effect as training on the entire network.
Reinforcement Learning from User Feedback: Proposes Reinforcement Learning from User Feedback (RLUF), a generalization of RLHF which aligns LLMs with user preferences by finetuning them with user feedback.
Spurious Rewards: Rethinking Training Signals in RLVR: Demonstrates that Reinforcement Learning with Verifiable Rewards (RLVR) induces a performance increase in Qwen models even if there are spurious rewards. No other models exhibit this behavior, and the paper speculates it occurs because the RLVR drives code reasoning.
Recurrent World Models Facilitate Policy Evolution: Learns RNN world model from random policy rollouts which can predict the next embedded state given current embedded state and action. Policy trains on embedded observation and world model RNN hidden state.
Model Based Reinforcement Learning for Atari: Trains a policy by acting in a world model which performs next-frame and reward prediction. The world model is trained with interaction data from the policy acting in the real environment.

Training & Continuous Learning¶

DD-Ranking: Rethinking the Evaluation of Dataset Distillation: Dataset distillation methods which involve distilling a large training set into a smaller, synthetic, one for computational purposes have seen advances recently. However, this paper argues that improved metrics are due to improved techniques applied elsewhere in the training pipeline rather than due to image quality, and proposes a new, fairer, evaluation metric.
Neural Thermodynamic Laws for Large Language Model Training: Introduces a novel framework, neural thermodynamic laws, which characterize the LLM training process using analogies to thermodynamics. Posits the existence of a river-valley loss landscape, and uses this to recommend training processes.

Statistics¶

Generate-then-Verify: Reconstructing Data from Limited Published Statistics: Differential privacy is a data privacy method which allows public datasets to be released without allowing for recovery of full information about any individual in the dataset. This paper devises an attack method for differentially private datasets which allows for the guaranteed recovery of some subset of individuals in the dataset.
Bonsai: Tree representations for distortion-free visualization and exploratory analysis of single-cell omics data: Proposes Bonsai, an alternative method to t-SNE and UMAP for representing latent structure in high dimensional data. Bonsai reconstructs a tree relating the high-dimensional data and, among a number of other improvements, is wholly deterministic.
Modular Jump Gaussian Processes: Jump Gaussian Processes (JGPs) were developed to model processes with sudden, nonstationary, jumps but are difficult to apply in practice. This paper proposes a new paradigm for JGPs which enables easier inference.
Random irregular histograms: The Norwegian Defence Research Establishment publishes a novel Bayesian approach to irregular histograms which excels at mode detection for larger sample sizes.

Applications¶

FaceAge, a deep learning system to estimate biological age from face photographs to improve prognostication: a model development and validation study: Trains and evaluates a neural network for predicting age based on facial imagery. In doing so, finds that the model is statistically and clinically useful for evaluating and predicting cancer outcomes.
AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms: Google Deepmind releases AlphaEvolve, an agent combining together multiple models to design novel coding algorithms. Algorithms designed by AlphaEvolve are currently deployed at Google, saving Google 0.7% of its world-wide computing resources. AlphaEvolve has also designed other novel algorithms such as finding a more optimal method for multiplying complex-valued 4x4 matrices, improving upon a result that had not admitted innovations for 60 years.
Unlocking Non-Invasive Brain-to-Text: Proposes a novel, SOTA, paradigm for non-invasive brain-to-text (B2T) prediction leveraging (1) contextual LLM rescoring, (2) a predictive fill-in strategy, and (3) selective dataset pooling.
ROBIN: A MULTI-AGENT SYSTEM FOR AUTOMATING SCIENTIFIC DISCOVERY: Introduces Robin, a multi-agent system combining literature search and data analysis agents to automate parts of the scientific pipeline. Robin generated hypotheses, experiment designs, data analytics, and figures for this paper in the process of discovering a novel therapeutic candidate.
Zochi Publishes A* Paper: Intology has designed an AI agent, Zochi, which has written a paper that has passed peer review at the main proceedings of ACL. The paper written was based off of earlier work submitted to ICLR and took Zochi only days to complete with minimal input from human researchers.

New Models¶

Phi-4-reasoning Technical Report: Microsoft releases Phi-4-Reasoning, a 14B parameter model that achieves comparable performance to larger models such as DeepSeek-R1-Distill-Llama-70B.
Xiaomi MiMo: Xiaomi releases a suite of models, MiMo-7B, that, while small, outperform larger 32B parameter models on mathematical and coding tasks. Available on Huggingface under an Apache 2.0 license.
DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition: DeepSeek releases DeepSeek-Prover-V2, an open source model optimized for formal theorem proving. Available on Huggingface.
Amazon Nova Premier: Technical Report and Model Card: Amazon releases Nova Premier, the most recent and best performing model in the Nova suite of models. It features a one-million token long context window. Available in Amazon Bedrock.
Llama-Nemotron: Efficient Reasoning Models: Nvidia releases the Llama-Nemotron suite of models, a group of reasoning models which achieve SOTA performance at lower computational cost. Available on Huggingface under a bespoke license.
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks: A small, 3B parameter, visual-language-action model from Declare Lab and Lambda Labs. Available under an MIT license.
FutureHouse Platform: Superintelligent AI Agents for Scientific Discovery: FutureHouse releases a suite of four models built from the ground up which can aid scientists in a variety of scientific tasks. Available via API.
OLMo 2 1B: Allen AI releases Olmo 2 1B, the smallest member of the Olmo 2 family, which has been trained to SOTA performance. Available on Huggingface under an Apache 2.0 license.
Medium is the new large.: Mistral releases Mistral Medium 3, which balances SOTA performance with lower cost for coding and multimodal understanding.
PANGU ULTRA MOE: HOW TO TRAIN YOUR BIG MOE ON ASCEND NPUS: Huawei releases Pangu Ultra MoE, a version of Pangu Ultra incorporating Mixture of Experts architecture and which is trained on ascend neural processing units.
MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining: Xiaomi presents MiMo-7B, a small reasoning model which has been trained to SOTA performance on reasoning tasks, exceeding the performance of o1-mini. Code available on github under an Apache-2.0 license.
INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning: Prime Intellect releases Intellect 2, the first 32B parameter model trained via globally distributed reinforcement learning. Available on Huggingface under an Apache 2.0 license.
Qwen3 Technical Report: Qwen releases the technical report for Qwen3: it presents, summarizes, and describes all models in the Qwen3 family in one location.
Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput: Meituan releases Flash-VL 2B, a novel architecture for video language models optimized for low latency and high throughput while maintaining accuracy.
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning: A new vision language model which aims to solve mathematical problems presented in image data by first converting the image data to coding problems before applying reasoning capabilities.
SWE-1: Our First Frontier Models: Windsurf releases SWE-1, a suite of models which aims to reproduce the entirety of the work that software engineers perform, not simply writing code. Available via API.
II-Medical: Intelligent Internet releases II-Medical, a new model finetuned to medical applications which can outperform much larger frontier models on clinical reasoning tasks. Available on Huggingface.
Introducing Codex: OpenAI releases Codex, a software engineering agent based off of o3 which specializes in coding tasks. Available via API.
Google I/O 2025: From research to reality: Google releases a plethora of products at Google I/O 2025, including the Veo3 video generation model, Imagen4 for image creation, Lyrai2 for music creation, updates to search, and updates to Gemini 2.5.
Devstral: Mistral releases Devstral, an agentic LLM model finetuned for software engineering tasks. Achieves SOTA performance. Available under an Apache 2.0 license.
Emerging Properties in Unified Multimodal Pretraining: Bytedance releases BAGEL, a SOTA open-source model which unifies understanding and generation. Available under an Apache 2.0 license.
Document AI, powered by the world’s best OCR.: Mistral releases an AI agent for document processing, capable of parsing text, handwriting, tables, and images from any document at 2000 pages per second.
PANGU LIGHT: WEIGHT RE-INITIALIZATION FOR PRUNING AND ACCELERATING LLM’S: Huawei releases Pangu Light, a framework for enhancing computational performance of LLMs without suffering a large performance drop. Demonstrates success on Qwen3-32B.
Introducing Claude 4: Anthropic releases Claude Opus 4 and Claude Sonnet 4, the newest releases in the Claude family. Claude Opus 4 is a dedicated coding agent which is the “best in the world”, while Claude Sonnet 4 is an upgrade to Claude Sonnet 3.7. Available via API.
PANGU PRO MOE: MIXTURE OF GROUPED EXPERTS FOR EFFICIENT SPARSITY: Huawei releases Pangu Pro MOE, a mixture of grouped experts model which more effectively balances the load between the experts than classical MOE models.
DeepSeek-R1-0528: DeepSeek releases a minor update to R1. Available on Huggingface under an mit license.