The CoVar Zeitgeist: May, 2025

A curated list of the latest research in AI/ML.

Check out the CoVar website!

LLMs

Scaling Laws for Native Multimodal Models

Conducts extensive experiments to derive scaling laws for native multimodal models (NMMs): models trained from scratch on all modalities.

Why do LLMs attend to the first token?

Investigates why LLMs place a disproportionate amount of attention on the first token. Hypothesizes that this behavior acts as a safeguard against over-mixing of token representations, and verifies the hypothesis empirically.

Sleep-time Compute: Beyond Inference Scaling at Test-time

Test-time compute has become a common way to improve language model performance in practice, but it requires significant computation at query time. This paper proposes sleep-time compute: the language model anticipates likely queries in advance and performs useful computation before test time, reducing the compute needed when queries actually arrive.

TTRL: Test-Time Reinforcement Learning

Explores how to effectively perform reinforcement learning on data without labels. Develops methods which leverage the information already contained in pre-trained LLMs to bootstrap labels and enable post-training on unlabelled data.
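
As a hedged illustration of the general idea (the paper's exact reward design is its own), a majority vote over sampled answers can serve as a pseudo-label, with agreement rewarded:

```python
from collections import Counter

# Sketch of a majority-vote pseudo-label reward for unlabeled prompts
# (illustrative only; details of the paper's reward may differ).
def majority_vote_rewards(sampled_answers):
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if answer == pseudo_label else 0.0 for answer in sampled_answers]

print(majority_vote_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```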

LLM Reasoning

A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Asserts that much recent research in LLM reasoning, in particular on math benchmarks, lacks rigor and is sensitive to a large number of variance-causing factors such as random seeds, prompt choices, and hardware configuration. Verifies this empirically and proposes a unified testing framework for future use.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

How much does post-training RL matter for mathematical reasoning? This paper investigates by examining the entire training pipeline, end-to-end. Reports many findings, including that RL fine-tuning tends to converge towards a single distribution already observed during pretraining.

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Proposes a framework for converting diffusion-based LLMs into reasoning models by using supervised fine-tuning and RL algorithms. This framework leads to SOTA performance for reasoning dLLMs.

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Investigates reasoning capabilities of LLMs by compiling a high quality set of physics questions, prompting LLMs for answers, and measuring the correctness of answers by finding the graph edit distance between the LLM response and the correct answer.
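
For readers unfamiliar with the metric, graph edit distance counts the cheapest sequence of node and edge insertions, deletions, and substitutions turning one graph into another. A toy sketch using networkx (the paper's own answer representation may differ):

```python
import networkx as nx

# Toy example: two small expression-like graphs that differ in one node label.
def labeled(edges, labels):
    g = nx.Graph(edges)
    nx.set_node_attributes(g, labels, "label")
    return g

g_pred = labeled([(0, 1), (1, 2)], {0: "x", 1: "+", 2: "y"})
g_true = labeled([(0, 1), (1, 2)], {0: "x", 1: "*", 2: "y"})
match = lambda a, b: a["label"] == b["label"]
print(nx.graph_edit_distance(g_pred, g_true, node_match=match))  # 1.0
```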

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

Uses reinforcement learning to directly optimize language models for inference-time objectives such as pass@k and majority voting, improving performance under those decoding strategies.
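
For context, pass@k is typically estimated with the unbiased estimator below (standard in the code-generation literature; whether the paper optimizes this exact form is not asserted here):

\[ \text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \,\right], \]

where n completions are sampled per problem and c of them are correct.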

Novel Architectures

Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction

Examines LLM performance on a suite of tasks which require stochastic planning. Finds that current architectures are limited at these tasks, and proposes a novel transformer architecture which (1) plans multiple tokens ahead and (2) injects noise into the input layer rather than relying on temperature sampling at the output.
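
A minimal sketch of the second idea, under our own assumptions about shapes and components (not the paper's architecture): perturb the input embeddings with Gaussian noise and decode greedily, instead of sampling with an output-layer temperature.

```python
import torch
import torch.nn as nn

# Hypothetical toy model: noise is injected at the input layer, decoding is greedy.
vocab, d_model = 100, 32
embed = nn.Embedding(vocab, d_model)
backbone = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
head = nn.Linear(d_model, vocab)

def decode_step(token_ids, noise_scale=0.1):
    x = embed(token_ids)                       # (batch, seq, d_model)
    x = x + noise_scale * torch.randn_like(x)  # noise at the input layer
    h = backbone(x)
    logits = head(h[:, -1])                    # last-position logits
    return logits.argmax(dim=-1)               # greedy, no temperature

next_tok = decode_step(torch.randint(0, vocab, (1, 8)))
```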

Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

Develops a class of sparse attention mechanisms focusing on locality, particularly Generalized Neighborhood Attention (GNA), which features strided and unstrided sliding windows as well as blocked attention. Realizes the theoretical maximum speedup while maintaining performance.
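
As a rough illustration of the locality constraint involved (a 1-D, unstrided case; GNA itself generalizes to strides, blocks, and multiple dimensions):

```python
import torch

# Toy sliding-window (neighborhood) attention mask: token i may attend to
# token j only when |i - j| <= window.
def sliding_window_mask(seq_len, window):
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = sliding_window_mask(seq_len=8, window=2)  # (8, 8) boolean mask
```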

Object Detection

Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation

Building a foundation model for SAR earth observation data - as well as building foundation models in general - depends heavily on the data used in training. This paper proposes a dynamic dataset curation strategy that prunes strongly redundant samples during training.

Autonomy & Safety

Attack-Defense Trees with Offensive and Defensive Attributes (with Appendix)

Attack-defense trees provide a method to analyze attacker-defender strategies in cybersecurity problems. This paper incorporates defender resources into such analysis to improve accuracy.

Expected Free Energy-based Planning as Variational Inference

Expected Free Energy minimization offers AI agents a principled way to navigate the explore-exploit dilemma, but it faces computational issues. This paper proposes a tractable variational inference formulation.
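
For reference, one common form of the expected free energy of a policy decomposes into epistemic (explore) and pragmatic (exploit) terms; the paper's variational formulation builds on this kind of objective:

\[ G(\pi) \;=\; -\,\mathbb{E}_{Q(o\mid\pi)}\!\big[ D_{\mathrm{KL}}\!\big( Q(s\mid o,\pi)\,\|\,Q(s\mid\pi) \big) \big] \;-\; \mathbb{E}_{Q(o\mid\pi)}\!\big[ \ln P(o\mid C) \big], \]

where the first term is the expected information gain about hidden states s and the second is the expected log-preference over outcomes o.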

Scaling Laws For Scalable Oversight

Can a weak AI model effectively provide oversight to a stronger AI model? This paper investigates and finds that such a practice is unreliable.

Cognitive swarming in complex environments with attractor dynamics and oscillatory computing

Investigates swarming behavior by treating a swarm of autonomous agents, taken as a whole, as a single entity governed by a hippocampus model. In this paradigm, individual agents are analogous to individual neurons.

Computational Efficiency

Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

Introduces a novel quantization technique which seeks to preserve performance on specific tasks by contrasting unquantized weights with uniformly quantized weights and using the gradient to predict expected task degradation. At 3.1 bits, a model quantized in this manner can maintain 96% of the performance of Llama-3-8B-Instruct.
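
A hedged sketch of the kind of saliency score this suggests (our notation; the paper's exact estimator may differ): approximate per-weight task damage with a first-order term combining the quantization error and the task-loss gradient, and keep the highest-scoring weights in higher precision.

```python
import torch

def uniform_quantize(w, bits=3):
    # Simple symmetric uniform quantizer (illustrative).
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(w / scale) * scale

def saliency(w, grad, bits=3):
    # First-order estimate of task degradation from quantizing each weight.
    return grad.abs() * (w - uniform_quantize(w, bits)).abs()

w = torch.randn(1024)
grad = torch.randn(1024)                        # task-loss gradient w.r.t. w (assumed given)
keep_fp = saliency(w, grad).topk(k=32).indices  # weights to preserve in higher precision
```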

Hardware/Software Co-Design of RISC-V Extensions for Accelerating Sparse DNNs on FPGAs

Designs a novel scheme for deploying deep neural networks on FPGAs by leveraging the semi-structured sparsity found in DNNs.
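
Semi-structured sparsity is typically of the N:M form; a toy sketch of 2:4 pruning (an illustrative assumption on our part, since the paper's exact pattern may differ):

```python
import numpy as np

# In every group of 4 weights, keep the 2 largest-magnitude entries and zero the rest.
def prune_2_of_4(w):
    w = w.reshape(-1, 4).copy()
    idx = np.argsort(np.abs(w), axis=1)[:, :2]   # two smallest-magnitude entries per group
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(-1)

print(prune_2_of_4(np.array([0.1, -2.0, 0.3, 4.0, 1.0, 0.0, -0.5, 0.2])))
```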

Reinforcement Learning

AssistanceZero: Scalably Solving Assistance Games

Assistance games are an alternative to RLHF where a human and an AI assistant play together to complete a goal known only to the human. This paper develops a scalable approach to assistance games and applies it to a Minecraft setting.

Mastering diverse control tasks through world models

Proposes Dreamer v3, a general algorithm that performs well on many different RL tasks under the same hyperparameter configuration. Trains a world model which predicts future state representations and rewards, and uses it to train a policy on “imagined” data. It is the first model to collect diamonds in Minecraft from scratch.

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Finds that reinforcement learning with verifiable rewards (RLVR) does not expand an LLM's underlying reasoning capabilities, but instead makes the model more likely to sample reasoning paths that the base model could already produce.

Training & Continuous Learning

NoProp: Training Neural Networks Without Back-propagation or Forward-propagation

Proposes a backpropagation-free method for training neural networks, NoProp, which is based on the denoising score matching approach from the diffusion model literature. Claims that this leads to better performance and shorter training time than traditional backpropagation.

Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning

Continual learning - that is, learning new information while remembering old information - can pose challenges for LLMs. This paper proposes an SVD-based method where a subspace storing critical information is identified and updates are made orthogonally to this space.
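
A minimal sketch of the general recipe under our own assumptions (which matrix is decomposed and how many directions are protected are choices the paper makes more carefully): compute an SVD, treat the top singular directions as the critical subspace, and project updates to be orthogonal to it.

```python
import torch

W = torch.randn(256, 256)          # a weight (or activation) matrix
G = torch.randn(256, 256)          # gradient from the new task
U, S, Vh = torch.linalg.svd(W)
k = 32                             # directions treated as "critical" (assumed)
U_k = U[:, :k]
G_orth = G - U_k @ (U_k.T @ G)     # remove components lying in the critical subspace
W_new = W - 1e-2 * G_orth          # update only along the orthogonal complement
```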

Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

Investigates using gradient descent to train a convolutional neural network (CNN) in the presence of background image noise. Finds that the CNN learns the intrinsic dimension of the noiseless data.

Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model

Investigates grokking, where a neural network abruptly transitions from poor to high performance after a long period of training. Finds that grokking can be accelerated by (1) training a small model to some non-optimal level of performance, (2) extracting its input embedding, and (3) initializing a larger model with that embedding.
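
A toy sketch of the transfer step (hypothetical modules and sizes; the paper's models are its own): copy the trained input embedding of the small model into the larger model before training it.

```python
import torch.nn as nn

vocab, d_model = 1000, 128
small = nn.ModuleDict({"embed": nn.Embedding(vocab, d_model),
                       "head": nn.Linear(d_model, vocab)})
# ... train `small` to some non-optimal accuracy ...
large = nn.ModuleDict({"embed": nn.Embedding(vocab, d_model),
                       "block": nn.Linear(d_model, d_model),
                       "head": nn.Linear(d_model, vocab)})
large["embed"].weight.data.copy_(small["embed"].weight.data)  # embedding transfer
# ... train `large`; grokking is reported to occur much sooner ...
```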

Conformal Prediction

Leave-One-Out Stable Conformal Prediction

Proposes a novel method for implementing conformal prediction - Leave-One-Out Stable Conformal Prediction - which is faster and more stable than existing methods. Derives some theoretical properties of the new method.
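
For context, a minimal split-conformal sketch (not the paper's leave-one-out procedure): calibrate a score threshold so that prediction sets cover the truth with probability roughly 1 - alpha.

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha=0.1):
    # Conformal quantile on held-out calibration scores.
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

cal_scores = np.abs(np.random.randn(200))    # e.g. |y - y_hat| on a calibration set
tau = split_conformal_threshold(cal_scores)  # prediction set: {y : score(x, y) <= tau}
```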

Statistics

When do Random Forests work?

The application of random forests involves two operations: bagging and split randomization. This paper provides a detailed exploration of the positive effects of the latter.
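
To make the two operations concrete, in scikit-learn bagging corresponds to the bootstrap flag while split randomization is controlled by max_features; the paper's analysis concerns the benefit of the latter.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Illustration only: toggle split randomization while keeping bagging fixed.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf_random_splits = RandomForestClassifier(max_features="sqrt", bootstrap=True).fit(X, y)
rf_no_split_rand = RandomForestClassifier(max_features=None, bootstrap=True).fit(X, y)
```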

You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation

Machine translation seeks to (1) accurately translate the meaning of the source text and (2) appear natural in the target language. This paper proves, using recent information theory techniques, that single score summaries cannot adequately capture performance on both tasks simultaneously. Advocates comparisons in the accuracy-naturalness plane instead.

Behavior of prediction performance metrics with rare events

Investigates the efficacy of AUC as a performance metric for rare events. Finds that poor performance is correlated with minimum class size rather than small event rate and that AUC is reliable so long as datasets are reasonably well-sized.
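
A toy illustration of the setting (synthetic data, our construction): with an event rate of 0.1%, a dataset of 100,000 samples still contains roughly 100 positives, and it is this minimum class size, rather than the rate itself, that the paper ties to AUC reliability.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, rate = 100_000, 0.001                              # ~100 positives expected
y = rng.random(n) < rate
scores = rng.normal(loc=y.astype(float), scale=1.0)   # informative scores
print(roc_auc_score(y, scores))
```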

Applications

Large Language Models Pass the Turing Test

Puts LLMs to the Turing test. Finds that the most advanced LLMs can pass, achieving 76% win rates, if they are prompted correctly.

SCENT: Robust Spatiotemporal Learning for Continuous Scientific Data via Scalable Conditioned Neural Fields

Brookhaven National Laboratory develops a transformer-based method for spatiotemporal learning for predicting, e.g., pollution, which is capable of “joint interpolation, reconstruction, and forecasting”. Outperforms SOTA methods.

Gaussian Processes at the Helm(holtz): A More Fluid Model for Ocean Currents

Develops novel kernels for Gaussian Processes in order to model the movement of ocean buoys based on ocean currents.
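
The modeling idea rests on the Helmholtz decomposition of a 2-D velocity field into a divergent and a rotational part, with independent GP priors placed on the two scalar potentials (as we understand the approach):

\[ F(x) \;=\; \nabla \Phi(x) \;+\; \operatorname{rot} \Psi(x), \qquad \Phi \sim \mathcal{GP}(0, k_\Phi), \quad \Psi \sim \mathcal{GP}(0, k_\Psi), \]

which induces a structured kernel on the observed current field F.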

New Models

Gemini 2.0 Flash

Google releases Gemini 2.0 Flash, a fast version of Gemini 2.0 which maintains performance. Available via API.

Kimi-VL Technical Report

Kimi releases Kimi-VL, a mixture of experts model with advanced multimodal reasoning capabilities and long context. Available at Huggingface under an MIT license.

Llama-3.1-Nemotron-Ultra-253B-v1

NVIDIA releases Llama-3.1-Nemotron-Ultra-253B-v1, a model distilled from Llama-3.1-405B-Instruct, which offers a competitive tradeoff between computational efficiency and efficacy. Available on Huggingface under a bespoke license.

DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level

Agentica and Together AI release DeepCoder-14B-Preview, a reasoning model distilled from Deepseek-R1-Distilled-Qwen-14B via distributed RL. Available under an MIT license.

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

Meta releases the newest herd of Llama models, Llama 4, offering a variety of models in different weight classes which achieve SOTA performance. Available under a bespoke license.

Cogito v1 Preview Introducing IDA as a path to general superintelligence

Stealth startup DeepCogito releases a set of LLMs at the 3B, 8B, 14B, 32B and 70B sizes - trained using iterated distillation and amplification - and claims them to be the strongest in their weight class. Available at Huggingface under a variety of licenses.

Deep Research is now available on Gemini 2.5 Pro Experimental.

Google has added Deep Research capabilities to the existing Gemini 2.5 Pro Experimental model. Available via API.

Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning

ByteDance introduces Seed-Thinking-v1.5, a mixture of experts model which achieves or surpasses SOTA performance on reasoning and math tasks.

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

ByteDance introduces Seaweed-7B, a video generation model trained with novel techniques which can compete with other SOTA models.

Introducing GPT-4.1 in the API

OpenAI releases GPT-4.1, which achieves superior performance to GPT-4o and GPT-4.5. Available via API.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

A team of Chinese researchers release InternVL3, a new multimodal LLM that achieves SOTA performance. Available on Huggingface under an MIT license.

Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

Huawei releases Pangu Ultra, a SOTA LLM trained on Ascend Neural Processing Units (NPUs) without any NVIDIA hardware. Available for commercial customers only.

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Kimi releases Kimina-Prover-Preview, an LLM tuned for formal reasoning via a reinforcement learning process which emphasizes formal reasoning patterns. Available on Huggingface under an Apache 2.0 license.

BitNet b1.58 2B4T Technical Report

Microsoft releases an open-source, native 1-bit LLM. Achieves SOTA performance with only 2B parameters. Available on Huggingface under an MIT license.

Introducing OpenAI o3 and o4-mini

The newest from OpenAI. The new models can “agentically use and combine every tool within ChatGPT” enabling the solving of ever more complicated problems. Achieves SOTA performance on math, coding, and visual tasks. Available via API.

Developers can now start building with Gemini 2.5 Flash.

Google releases Gemini 2.5 Flash, providing an increase in capabilities compared to Gemini 2.0 Flash while maintaining low computational overhead. Available via the Gemini app.

Advancing AI systems through progress in perception, localization, and reasoning

Meta FAIR releases a suite of embedding models for, among other things, 2D and 3D object detection. Papers and code available for each model.

Gemma 3 QAT Models: Bringing state-of-the-art AI to consumer GPUs

Google releases Gemma 3 QAT, a suite of Gemma 3 models which have been optimized with quantization aware training to maintain performance while reducing footprint. Gemma 3 27B can run on an NVIDIA RTX 3090. Available on Huggingface under a bespoke license.

Convolutional Multi-Hybrids for Edge Devices

Liquid introduces Hyena Edge, a suite of models using a convolutional architecture and optimized for deployment on edge devices such as phones. Open source.

Qwen3: Think Deeper, Act Faster

Alibaba releases Qwen 3, the newest and best-performing members of the Qwen suite of models. Qwen 3 utilizes a hybrid approach to problem solving, using thinking and non-thinking modes as appropriate. Available under an Apache 2.0 license.