The CoVar Zeitgeist: April, 2025
A curated list of the latest research in AI/ML.
Featured
- Data representation with optimal transport
Describes a family of non-linear transforms based on optimal transport (OT). Such transforms applied to signals can help solve classification, estimation, and reconstruction problems.
- How To Make Your Cell Tracker Say “I dunno!”
Discusses how to implement tracking in a paradigm where there is genuine uncertainty about how detections should be assigned to tracks: tracking cells in live-cell microscopy. Proposes two solutions: a Bayesian tracking methodology and a classification-based tracking methodology. Methods could be adapted for other low-framerate applications.
- Differentiable Logic Cellular Automata
Researchers combine Neural Cellular Automata and Differentiable Logic Gates to learn local rules which, when implemented in a setting similar to Conway’s Game of Life, can recreate desired patterns. Worth a read.
- Do You Understand Epistemic Uncertainty? Think Again! Rigorous Frequentist Epistemic Uncertainty Estimation in Regression
Predictive uncertainty arises from two sources: irreducible noise in the data (aleatoric) and uncertainty about the model itself (epistemic). This paper proposes a method to measure the latter by having a model double-guess its own output and measuring how sure it is.
- Layer by Layer: Uncovering Hidden Representations in Language Models
Traditionally, the final layers of LLMs have been used to produce outputs. However, this paper shows that intermediate layers can encode richer representations than the final layers, and demonstrates why.
- Tracing the thoughts of a large language model
Anthropic does a deep dive into the inner workings of their premier LLM, Claude, to analyze how it thinks. Finds a “conceptual space” that is shared between languages, evidence that Claude thinks multiple words ahead instead of just one token ahead, and that Claude will sometimes engage in motivated reasoning when prompted by the user.
LLMs
- Forecasting Rare Language Model Behaviors
Some LLM risks are rare and emerge only at scale. This paper fits a model that forecasts the probability a query will return an undesirable result at deployment scale, and verifies its efficacy in production.
- Layer by Layer: Uncovering Hidden Representations in Language Models
Traditionally, the final layers of LLMs have been used to produce outputs. However, this paper shows that intermediate layers can encode richer representations than the final layers, and demonstrates why.
- Measuring AI Ability to Complete Long Tasks
Proposes a new metric for LLM evaluation: how long it takes a human to complete a task that an LLM can complete with at least a 50% success rate. Uses this to track LLM performance over time, finding that, by this metric, the length of tasks LLMs can complete has been doubling every 7 months for the last 6 years; a back-of-the-envelope extrapolation appears at the end of this section.
- Inside-Out: Hidden Factual Knowledge in LLMs
Proposes a definition of knowledge and develops a method of querying LLMs to ascertain whether they hold knowledge that they will not communicate. Finds that LLMs often encode facts internally that they never surface in their responses, which places theoretical limits on what can be gleaned from scaling test-time compute.
- Tracing the thoughts of a large language model
Anthropic does a deep dive into the inner workings of their premier LLM, Claude, to analyze how it thinks. Finds a “conceptual space” that is shared between languages, evidence that Claude thinks multiple words ahead instead of just one token ahead, and that Claude will sometimes engage in motivated reasoning when prompted by the user.
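The doubling trend reported in “Measuring AI Ability to Complete Long Tasks” lends itself to a quick back-of-the-envelope projection. Below is a minimal sketch assuming a constant 7-month doubling time; the starting horizon and dates are hypothetical values chosen for illustration, not figures from the paper.

```python
# Hypothetical extrapolation of the 50%-success task horizon, assuming the
# reported ~7-month doubling time holds; starting values are illustrative only.
from datetime import date

DOUBLING_MONTHS = 7

def projected_horizon(horizon_now_minutes: float, start: date, end: date) -> float:
    """Project the task horizon forward under a constant doubling time."""
    months_elapsed = (end.year - start.year) * 12 + (end.month - start.month)
    return horizon_now_minutes * 2 ** (months_elapsed / DOUBLING_MONTHS)

# If a model handles ~60-minute tasks in April 2025 (an assumed figure),
# 21 months later the same trend implies ~480-minute (8-hour) tasks.
print(projected_horizon(60, date(2025, 4, 1), date(2027, 1, 1)))  # 480.0
```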
LLM Reasoning
- A Review of DeepSeek Models’ Key Innovative Techniques
Reviews the key techniques driving the performance of DeepSeek-V3 and DeepSeek-R1, and identifies open questions and areas for potential improvement.
- MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams
Finds that multi-modal LLMs struggle to understand mathematical diagrams and have particular trouble with grounding. Proposes a new dataset which, when used to finetune, improves model performance on these tasks. This paper has a coauthor from a university affiliated with the PLA.
Novel Architectures
- xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
The first paper to scale xLSTM, this introduces a 7B-parameter LLM based on the xLSTM architecture. Performs comparably to similarly sized LLMs while requiring only a fraction of the inference time.
- Transformers without Normalization
Proposes a transformer architecture without normalization layers, using an element-wise tanh operation as a drop-in replacement that can match or exceed the performance of standard transformers; a minimal sketch of the idea appears at the end of this section.
- RWKV Language Model
The Receptance Weighted Key Value (RWKV) model is a hybrid architecture fusing elements of RNNs and Transformers. The most recent version, RWKV-7, achieves SOTA performance in linear time. This website aggregates recent RWKV research.
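As a rough illustration of the idea behind “Transformers without Normalization”, the sketch below swaps a LayerNorm for a learnable element-wise tanh. The initialization value and exact parameterization here are assumptions; consult the paper for the actual recipe.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Element-wise tanh drop-in for LayerNorm, in the spirit of
    "Transformers without Normalization". The init_alpha value and the
    per-channel affine terms are assumptions for this sketch."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar slope
        self.gamma = nn.Parameter(torch.ones(dim))           # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))           # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash activations element-wise instead of normalizing with batch statistics.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# Usage: replace nn.LayerNorm(dim) with DynamicTanh(dim) inside a transformer block.
```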
Object Detection
- Passive Sonar Sensor Placement for Undersea Surveillance
Researchers associated with the Australian military devise a method for locating passive sonar sensors so as to optimize their probability of detection over a given area.
- How To Make Your Cell Tracker Say “I dunno!”
Discusses how to implement tracking in a paradigm where there is genuine uncertainty about how detections should be assigned to tracks: tracking cells in live-cell microscopy. Proposes two solutions: a Bayesian tracking methodology and a classification-based tracking methodology. Methods could be adapted for other low-framerate applications; a toy sketch of the Bayesian assignment idea appears at the end of this section.
- SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World
Develops an algorithm to fuse street-level and satellite-level sensors which resolves temporal and viewpoint-based differences to predict 3D object occupancy.
- Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
Develops a method to estimate camera velocity and other IMU-like measurements for blurred imagery during fast motion. Supplements keypoint-based approaches.
- EventFly: Event Camera Perception from Ground to the Sky
Outlines an interesting approach to sensor fusion across multiple platforms, such as vehicles, UAVs, and quadrupeds, where all of the sensors are event cameras.
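To make the “I dunno” idea concrete, here is a toy posterior over detection-to-track assignments. The Gaussian motion likelihood, uniform prior, and single “new track” hypothesis are simplifying assumptions for illustration, not the paper’s actual model.

```python
import numpy as np

def assignment_posteriors(track_preds, detections, sigma=10.0, p_new=1e-3):
    """For each detection, return a posterior over candidate tracks plus a
    catch-all 'new track' hypothesis, rather than a single hard assignment."""
    posteriors = []
    for det in detections:
        d2 = np.sum((track_preds - det) ** 2, axis=1)   # squared distance to each track
        lik = np.exp(-0.5 * d2 / sigma**2)              # Gaussian motion likelihood
        lik = np.append(lik, p_new)                     # new-track hypothesis
        posteriors.append(lik / lik.sum())              # uniform prior -> normalize
    return np.array(posteriors)

tracks = np.array([[10.0, 10.0], [40.0, 40.0]])  # predicted track positions
dets = np.array([[12.0, 9.0], [26.0, 25.0]])     # new detections
# The second detection sits between both tracks, so its posterior is spread
# across them: a downstream tracker can report "I dunno" instead of guessing.
print(assignment_posteriors(tracks, dets).round(3))
```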
Signals Processing
- Data representation with optimal transport
Describes a family of non-linear transforms based on optimal transport (OT). Such transforms applied to signals can help solve classification, estimation, and reconstruction problems.
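One concrete member of the transform family this review covers is the 1D cumulative distribution transform. The sketch below is a minimal discretized version under simplifying assumptions (non-negative signal, uniform reference density); see the review for the general formulation and its extensions.

```python
import numpy as np

def cdt(signal, x):
    """Toy 1D cumulative distribution transform: treat a non-negative signal as a
    density and return the optimal transport (Monge) map from a uniform reference,
    f(x) = F_signal^{-1}(F_ref(x)). Discretization details are assumptions."""
    dx = x[1] - x[0]
    pdf = signal / (signal.sum() * dx)          # normalize the signal to a density
    cdf = np.cumsum(pdf) * dx                   # signal CDF, F_signal
    ref_cdf = np.linspace(0.0, 1.0, len(x))     # CDF of a uniform reference on [x0, xN]
    return np.interp(ref_cdf, cdf, x)           # invert F_signal at F_ref(x)

x = np.linspace(-5.0, 5.0, 501)
bump = np.exp(-0.5 * (x - 1.0) ** 2)            # a shifted Gaussian "signal"
transformed = cdt(bump, x)                      # downstream classifiers operate on this
```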
Rendering
- GFlow: Recovering 4D World from Monocular Video
Proposes a method for 3D Gaussian splatting with moving objects. Uses video with optical flow to estimate which objects are moving.
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Introduces a generative model that can re-render an already existing video from new and dynamic camera angles.
Autonomy & Safety
- Detecting misbehavior in frontier reasoning models
Finds that directly optimizing Chain of Thought to follow specific behaviors (e.g., “don’t think about elephants”) may cause a model to hide its intent rather than eliminate the undesired behavior. Advocates for leaving CoT uncensored, to better monitor model behavior.
- Auditing Language Models for Hidden Objectives
Conducts a study of whether blue teams can deduce a hidden objective which has been deliberately placed inside an LLM trained to exploit flaws in RLHF reward models. Three out of four blue teams successfully found the hidden objective.
- EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments
Seeks to score LLM agent behavior in a variety of economic settings, with a focus on measuring how well LLMs can optimize. Introduces the concept of a litmus score, which, rather than measuring how “correct” an LLM’s choice is, measures how the LLM tends to behave when confronted with tradeoffs.
Reinforcement Learning
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
Why do frontier models complete their post-training with a complicated two-stage reinforcement learning procedure rather than directly fine-tuning with maximum likelihood? This paper investigates and finds that the performance gains derive from searching over policies that are easy to verify.
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
ByteDance open-sources their state-of-the-art large-scale RL system for improving LLMs, and presents an example where they use it to finetune Qwen2.5-32B for performance on the AIME.
- What Makes a Reward Model a Good Teacher? An Optimization Perspective
Examines reward model dynamics in RLHF. Finds that reward models, even accurate ones, can induce slow optimization if they induce low reward variance. Moreover, the variance a reward model induces can differ across language models, indicating that the optimal reward model depends on the model being trained.
- Understanding R1-Zero-Like Training: A Critical Perspective
Investigates the driving force behind the performance of R1-like models by investigating the base model and the reinforcement learning at scale processes separately. Finds a bias in Group Relative Policy Optimization (GRPO) that encourages long responses for incorrect answers, and proposes a modified method, Dr. GRPO, to fix it.
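A heavily simplified sketch of the length bias follows, looking only at the loss aggregation step (the real objective also involves policy ratios and clipping, omitted here). The constant max_len normalizer stands in for Dr. GRPO’s fix and is an assumption for illustration.

```python
import numpy as np

def grpo_style_objective(per_token_scores, lengths):
    """Average each response's per-token terms over its own length, then over the
    group. For a negative-advantage (incorrect) response, dividing by its own
    length means growing longer does not increase its total penalty."""
    return np.mean([s.sum() / l for s, l in zip(per_token_scores, lengths)])

def dr_grpo_style_objective(per_token_scores, max_len=1024):
    """Sketch of the fix: normalize by a constant instead of each response's own
    length, so longer incorrect responses are penalized more, not the same."""
    return np.mean([s.sum() / max_len for s in per_token_scores])

short = [np.full(10, -1.0)]       # 10-token incorrect response, advantage -1 per token
long_resp = [np.full(100, -1.0)]  # 100-token incorrect response
print(grpo_style_objective(short, [10]), grpo_style_objective(long_resp, [100]))  # -1.0 -1.0
print(dr_grpo_style_objective(short), dr_grpo_style_objective(long_resp))         # ~-0.01 ~-0.1
```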
Theory
- A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers
Transformers struggle with sequential reasoning problems with long input lengths, because their computational depth is bounded. This paper derives theoretical scaling laws for depth in transformers, and finds that they match empirical results.
- When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective
Proves via a statistical analysis that, even without infinite compute, transformers are more efficient than feed-forward and recurrent neural networks because a sufficient number of attention heads lets them implement dynamic sparsity.
- Do You Understand Epistemic Uncertainty? Think Again! Rigorous Frequentist Epistemic Uncertainty Estimation in Regression
Predictive uncertainty arises from two sources: irreducible noise in the data (aleatoric) and uncertainty about the model itself (epistemic). This paper proposes a method to measure the latter by having a model double-guess its own output and measuring how sure it is.
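For background, the standard way to separate these two sources with an ensemble is the law-of-total-variance split sketched below. This is generic context, not the frequentist double-guessing procedure the paper itself proposes; the toy numbers are illustrative.

```python
import numpy as np

def uncertainty_split(member_means, member_vars):
    """Ensemble-style decomposition at a single input:
    total variance = mean of member noise variances (aleatoric)
                   + variance of member means (epistemic)."""
    aleatoric = np.mean(member_vars)   # average estimate of irreducible data noise
    epistemic = np.var(member_means)   # disagreement between the models
    return aleatoric, epistemic

# Toy ensemble of four regressors, each predicting a (mean, variance) pair.
means = np.array([1.9, 2.1, 2.0, 2.2])
noise = np.array([0.25, 0.30, 0.20, 0.25])
print(uncertainty_split(means, noise))  # (0.25, 0.0125)
```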
Applications
- Matrix Factorization for Inferring Associations and Missing Links
Proposes a novel method for Boolean matrix factorization, with applications to, among other things, knowledge graph link prediction. Available on GitHub.
- Differentiable Logic Cellular Automata
Researchers combine Neural Cellular Automata and Differentiable Logic Gates to learn local rules which, when implemented in a setting similar to Conway’s Game of Life, can recreate desired patterns. Worth a read; a minimal sketch of a differentiable gate appears at the end of this section.
- The AI Scientist Generates its First Peer-Reviewed Scientific Publication
Sakana AI had an AI agent write many scientific papers from conception to conclusion, including generating hypotheses and carrying out experiments. One of the papers was accepted at an ICLR workshop, ranking in the 45th percentile of all papers.
- Agent-Centric Personalized Multiple Clustering with Multi-Modal LLMs
Datasets can be clustered in different ways: instances of fruit, for example, can be grouped by color or by species. This paper builds an agent-centric algorithm leveraging multi-modal LLMs that can cluster datasets according to different criteria based on user preferences.
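For intuition about what a differentiable logic gate looks like, here is a single-gate sketch: a softmax over a few candidate Boolean functions, each relaxed to accept probabilities. Restricting to four gates (rather than all sixteen two-input functions) and the exact relaxation used are simplifications for this sketch.

```python
import numpy as np

def soft_gates(a, b):
    """Probabilistic relaxations of two-input Boolean gates, for a, b in [0, 1]."""
    return np.array([a * b,             # AND
                     a + b - a * b,     # OR
                     a + b - 2 * a * b, # XOR
                     1.0 - a * b])      # NAND

def differentiable_gate(a, b, logits):
    """One trainable gate: a softmax over candidate gates makes the discrete
    choice differentiable; at inference the argmax gate is kept."""
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights @ soft_gates(a, b))

logits = np.array([4.0, 0.0, 0.0, 0.0])       # hypothetical learned preference for AND
print(differentiable_gate(0.9, 0.8, logits))  # ~0.71, close to AND(0.9, 0.8) = 0.72
```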
New Models
- Introducing GPT-4.5
OpenAI releases GPT-4.5 as a “research preview” of its newest model. Development focused on scaling unsupervised learning rather than reasoning. Available via API.
- Hunyuan-TurboS
Tencent releases Hunyuan TurboS, a fast-thinking model which prioritizes speed of response while maintaining SOTA performance. Available via API.
- Aya Vision: Expanding the Worlds AI Can See
Cohere releases Aya Vision, a multimodal foundation model which spans 23 different languages. Available via HuggingFace under a bespoke license.
- QwQ-32B: Embracing the Power of Reinforcement Learning
Alibaba releases QwQ-32B, a model which achieves performance similar to DeepSeek-R1 despite having only 32B parameters. Available via HuggingFace under an Apache 2.0 license.
- Hon Hai Research Institute Launches Traditional Chinese LLM With Reasoning Capabilities
Foxconn releases FoxBrain, billed as the first Traditional Chinese LLM with reasoning capabilities, trained in just four weeks on NVIDIA hardware.
- Reasoning with Reka Flash
Reka releases Reka Flash 3, a 21B parameter general-purpose LLM. Model weights are available under an Apache 2.0 license.
- Introducing Gemma 3: The Developer Guide
Google releases Gemma 3, the newest member of the Gemma suite of models. Available on Huggingface under a bespoke license.
- OLMo 2 32B: First fully open model to outperform GPT 3.5 and GPT 4o mini
Allen AI releases OLMo 2 32B, which achieves SOTA performance amongst open source models while saving training time and compute. Available via Huggingface.
- Introducing Command A: Max performance, minimal compute
Cohere releases Command A, an agentic model trained to deliver business-relevant answers with low latency. Available on Huggingface under a CC BY-NC license.
- Baidu Ups the Ante with AI: Introducing ERNIE 4.5 & X1
Baidu releases two models, ERNIE 4.5 and X1, which are traditional multi-modal and deep reasoning models, respectively. Claims to be more effective, and cheaper, than DeepSeek. Available via API.
- Mistral Small 3.1
Mistral releases Mistral Small 3.1, which outperforms models such as Gemma 3 and GPT-4o Mini while delivering answers with lower latency. Available under an Apache 2.0 license.
- TULIP: Towards Unified Language-Image Pretraining
Researchers from UC Berkeley release TULIP, an open-source model which can replace current CLIP-like models. Available on GitHub under a bespoke license.
- NVIDIA Launches Family of Open Reasoning AI Models for Developers and Enterprises to Build Agentic AI Platforms
NVIDIA releases Llama Nemotron, a family of reasoning models which can be turned into AI agents. Available from NVIDIA and on Huggingface.
- EXAONE Deep
LG releases EXAONE Deep, a family of models specialized for math, coding, and reasoning tasks which obtain SOTA performance. Available on Huggingface under a bespoke license.
- FFN Fusion: Rethinking Sequential Computation in Large Language Models
NVIDIA releases Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), a model which leverages insights into parallelizing sequences of feed-forward networks to achieve latency and cost gains while maintaining SOTA performance.
- SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Apple releases SlowFast-LLaVA-1.5, which leverages the SlowFast video projector to efficiently model long temporal contexts.
- DeepSeek-V3-0324
DeepSeek releases DeepSeek-V3-0324, which improves upon DeepSeek-V3. Available on Huggingface under an MIT license.
- Reasoning Efficiency Redefined! Meet Tencent’s ‘Hunyuan-T1’—The First Mamba-Powered Ultra-Large Model
Tencent releases Hunyuan-T1, a powerful reasoning model improving on the hybrid Transformer-Mamba MoE approach previously released by Tencent. Currently available via API.
- Qwen2.5-VL-32B: Smarter and Lighter
Alibaba releases the newest member of the Qwen2.5-VL family of models, Qwen2.5-VL-32B-Instruct. Available under an Apache 2.0 license.
- Gemini 2.5: Our most intelligent AI model
Google releases Gemini 2.5, their “most intelligent AI model”, which has achieved first place on LMArena. Available via API.
- Qwen2.5 Omni: See, Hear, Talk, Write, Do It All!
Alibaba releases Qwen2.5 Omni, a 7B parameter multi-modal model that can run in real time. Available on Huggingface under an Apache 2.0 license.
- QVQ-Max: Think with Evidence
Alibaba releases QVQ-Max, a visual reasoning model which can act as an AI assistant. Available on Huggingface under an Apache 2.0 license.