The CoVar Zeitgeist: February, 2025

A curated list of the latest research in AI/ML.

Year in Review

2024 AI Timeline

A timeline compiled on Hugging Face documenting the AI models released in 2024.

LLMs

MONTY HALL AND OPTIMIZED CONFORMAL PREDICTION TO IMPROVE DECISION-MAKING WITH LLMS

Proposes a novel conformal prediction method inspired by the Monty Hall problem which improves LLM performance on multiple-choice questions. The set of answer choices is reduced to the conformal prediction set, and the LLM is re-prompted on the new, smaller set. Improves performance compared to SOTA.
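
A minimal sketch of the re-prompting idea, assuming hypothetical conformal_set and ask_llm helpers; the score calibration and thresholding details differ from the paper's optimized conformal prediction procedure.

    # Sketch: restrict a multiple-choice question to its conformal prediction
    # set and re-prompt the LLM on the reduced option set. `conformal_set` and
    # `ask_llm` are hypothetical helpers, not the paper's implementation.
    def conformal_set(choices, scores, threshold):
        # Keep every choice whose calibrated conformity score clears a threshold
        # chosen on a held-out calibration set.
        return [c for c, s in zip(choices, scores) if s >= threshold]

    def answer_with_reprompt(question, choices, scores, threshold, ask_llm):
        reduced = conformal_set(choices, scores, threshold)
        if len(reduced) == 1:  # the prediction set already pins down the answer
            return reduced[0]
        prompt = question + "\nOptions: " + ", ".join(reduced)
        return ask_llm(prompt)  # re-prompt on the new, smaller set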

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

RLHF can drive an LLM to learn spurious correlations, introducing a number of undesirable biases. This paper proposes a causal reward model that avoids spurious correlations and thereby avoids reward hacking.

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Proposes an optimizer, MONA, for use in RLHF which reduces reward-hacking behavior by the LLM. MONA works by combining short-term optimization and long-term reward.

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Decouples input and output tokenization in LLMs to investigate the effects of scaling vocabulary size. Finds a log-linear relationship between input vocabulary size and training loss, implying that smaller architectures can achieve the same performance as larger ones by increasing vocabulary size.
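
As a rough illustration, a log-linear trend of this kind can be written as loss(V) ≈ a - b * log V, where V is the input vocabulary size and a, b are fitted constants; the functional form is illustrative and the coefficients are not taken from the paper.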

CATASTROPHIC FAILURE OF LLM UNLEARNING VIA QUANTIZATION

Finds that "unlearning" knowledge through additional training or RLHF teaches the model to censor the knowledge rather than remove it. Quantizing the unlearned model can erase that censoring, restoring the supposedly forgotten knowledge.
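
A toy numpy sketch of the mechanism as summarized above, under the assumption that unlearning only nudges weights slightly: coarse round-to-nearest quantization can then map the unlearned weights back onto the same grid points as the original weights. Illustrative only, not the paper's experimental setup.

    import numpy as np

    # Toy illustration: tiny "unlearning" perturbations largely vanish under
    # coarse 4-bit round-to-nearest quantization, so the quantized unlearned
    # model can coincide with the quantized original (assumption-laden sketch).
    def int_codes(w, scale, n_bits=4):
        qmax = 2 ** (n_bits - 1) - 1
        return np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)

    rng = np.random.default_rng(0)
    w_original = rng.normal(size=10_000).astype(np.float32)
    w_unlearned = w_original + rng.normal(scale=1e-3, size=w_original.shape).astype(np.float32)

    scale = np.abs(w_original).max() / (2 ** 3 - 1)  # shared quantization grid
    unchanged = np.mean(int_codes(w_original, scale) == int_codes(w_unlearned, scale))
    print(f"fraction of quantized weights unchanged by 'unlearning': {unchanged:.3f}")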

ALIGNMENT FAKING IN LARGE LANGUAGE MODELS

Demonstrates that LLMs can engage in "alignment faking": an LLM may limit itself to good responses when it thinks it is in training, while dropping those limits when it believes it is outside its training environment.

LLM Reasoning

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

rStar-Math outperforms o1 on math reasoning problems by using Monte Carlo Tree Search and a novel reinforcement learning algorithm.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek describes its approach to LLM reasoning: let the model think at length, then train it via reinforcement learning with accuracy rewards and format rewards. Effective for paradigms with verifiable "correct answers" such as math and logic.
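
A minimal sketch of rule-based accuracy and format rewards of the kind described; the tag format and the reward weighting below are assumptions, not DeepSeek's implementation.

    import re

    # Sketch of rule-based rewards for RL on reasoning tasks (illustrative):
    # an accuracy reward for verifiably correct answers plus a format reward
    # for wrapping the reasoning and answer in the expected tags.
    THINK_FORMAT = re.compile(r"<think>.*</think>\s*<answer>(.*)</answer>", re.DOTALL)

    def format_reward(completion: str) -> float:
        return 1.0 if THINK_FORMAT.search(completion) else 0.0

    def accuracy_reward(completion: str, ground_truth: str) -> float:
        match = THINK_FORMAT.search(completion)
        answer = match.group(1).strip() if match else completion.strip()
        return 1.0 if answer == ground_truth else 0.0  # requires a checkable answer

    def total_reward(completion: str, ground_truth: str) -> float:
        # The 0.5 weighting is an arbitrary illustrative choice.
        return accuracy_reward(completion, ground_truth) + 0.5 * format_reward(completion)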

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

Explores how best to add slow-thinking reasoning capabilities to VLMs. Finds that the best way is to fine-tune the VLM with long-form textual thought data rather than visual data.

Novel Architectures

An Empirical Study of Autoregressive Pre-training from Videos

A study that exhaustively investigates training paradigms and performance for autoregressive pre-training on videos, treating videos as sequences of vision tokens and predicting future tokens. Achieves SOTA performance.
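
A compact sketch of the objective described above, assuming frames have already been mapped to discrete vision tokens by some visual tokenizer; this is a generic autoregressive next-token loss, not the paper's exact setup.

    import torch
    import torch.nn.functional as F

    # Generic next-token prediction over video tokens (illustrative sketch):
    # a video is flattened into a sequence of discrete vision-token ids and a
    # causal transformer is trained to predict each subsequent token.
    def next_token_loss(model, video_tokens):
        # video_tokens: (batch, seq_len) integer ids from a hypothetical tokenizer
        inputs, targets = video_tokens[:, :-1], video_tokens[:, 1:]
        logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))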

Titans: Learning to Memorize at Test Time

Attention functions as short term memory in transformers. This paper proposes a novel memory unit, called neural memory, which is itself trained and functions as a long term memory. Combining attention and neural memory into one architecture results in the Titan architecture, which can effectively scale to context windows larger than 2M tokens.

TRANSFORMER2: SELF-ADAPTIVE LLMS

Introduces a novel self-adaptive framework that lets LLMs adjust for input prompts in real time rather than undergoing an expensive finetuning process. The self-adaptive framework also enables continual learning without catastrophic forgetting. Code is available with an Apache 2.0 license.

Tensor Product Attention Is All You Need

Modifies the attention head to factor queries, keys, and values into compact tensor products, allowing better caching of long prompts. Proposes a novel architecture using this insight. Code is available.

Object Detection

MeshConv3D: Efficient convolution and pooling operators for triangular 3D meshes

Proposes a method for applying convolutions to 3D meshes. Impressive performance on their benchmarks. Code not released.

Multi-view Structural Convolution Network for Domain-Invariant Point Cloud Recognition of Autonomous Vehicles

Develops a convolutional neural network for 3D point clouds, capable of part segmentation and classification. Code available.

Optimized Sampling for Non-Line-of-Sight Imaging Using Modified Fast Fourier Transforms

Proposes to use a Non-Uniform Fast Fourier Transform to improve non-line-of-sight imaging performance.

3D Rendering

GAUSSIAN MASKED AUTOENCODERS

Modifies the Masked Autoencoder structure to incorporate an intermediate Gaussian Splatting layer to enable spatial awareness. Improves performance.

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Proposes a two-stage method for creating a 3D mesh from a single image of an object. The first stage generates a point cloud; the second stage turns the point cloud into a mesh.

Computational Efficiency

LLAVA-MINI: EFFICIENT IMAGE AND VIDEO LARGE MULTIMODAL MODELS WITH ONE VISION TOKEN

Finds that vision tokens fuse visual information into text tokens in the early layers of most VLMs. Taking advantage of this, proposes a modification of LLaVA that performs this fusion before the language model, reducing the number of vision tokens to one. Reduces computational footprint while increasing performance.

ELFATT: Efficient Linear Fast Attention for Vision Transformers

A new, faster attention mechanism for vision transformers whose cost scales linearly, rather than quadratically, with sequence length.
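
For intuition, the sketch below shows the standard kernelized linear-attention trick, replacing softmax(QK^T)V with phi(Q)(phi(K)^T V) so the cost grows linearly in sequence length; this is the generic construction, not ELFATT's specific mechanism.

    import torch

    # Generic linear attention (illustrative; not ELFATT's formulation).
    # Softmax attention costs O(N^2) in sequence length N; with a positive
    # feature map phi, computing phi(K)^T V first brings the cost to O(N).
    def linear_attention(q, k, v, eps=1e-6):
        # q, k, v: (batch, heads, seq_len, dim)
        phi = lambda x: torch.nn.functional.elu(x) + 1.0  # common choice of feature map
        q, k = phi(q), phi(k)
        kv = torch.einsum("bhnd,bhne->bhde", k, v)            # (batch, heads, dim, dim)
        normalizer = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
        return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, normalizer)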

Token-Budget-Aware LLM Reasoning

Finds that, in most LLMs, the number of tokens devoted to reasoning is larger than the problem at hand requires, which hurts computational efficiency. Using this insight, proposes a method to estimate a token budget and include it in the prompt. Leads to computational savings with minimal performance loss.
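
A minimal sketch of the prompting side of the idea: estimate a budget and state it in the prompt so the model keeps its reasoning short. The budget heuristic and prompt wording below are placeholders, not the paper's exact method.

    # Budget-aware prompting sketch (illustrative; estimator and wording are
    # assumptions, not the paper's recipe).
    def estimate_budget(question: str, base: int = 32, per_word: int = 2, cap: int = 256) -> int:
        # Crude heuristic: allow more reasoning tokens for longer questions.
        return min(cap, base + per_word * len(question.split()))

    def budgeted_prompt(question: str) -> str:
        budget = estimate_budget(question)
        return (
            f"{question}\n"
            f"Think step by step, but use at most {budget} tokens for your reasoning, "
            f"then give the final answer."
        )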

Knowledge Graphs

Neural-Symbolic Message Passing with Dynamic Pruning

Proposes a method to perform Complex Query Answering over incomplete knowledge graphs using symbolic reasoning and fuzzy logic. Achieves SOTA performance while being more computationally efficient.

Ethics & Safety

Lessons From Red Teaming 100 Generative AI Products

A comprehensive report on red-teaming 100 generative AI products and the lessons learned. Worth reading for all of the insights.

Scanning Trojaned Models Using Out-of-Distribution Samples

Neural nets with trojans tend to have blind spots on their decision boundary. This paper leverages this insight to build a black-box detection scheme for trojans in a neural net by noting when out-of-distribution samples are incorrectly classified as in-distribution. Code is available.
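
A black-box sketch of the idea as summarized above, flagging a model when an unusually large fraction of out-of-distribution samples are classified with high confidence; the confidence threshold and scoring are illustrative assumptions, not the paper's detector.

    import torch

    # Black-box heuristic sketch (not the paper's exact scanner): a trojaned
    # classifier tends to pull OOD inputs across its decision boundary, so OOD
    # samples end up confidently assigned to in-distribution classes.
    @torch.no_grad()
    def suspicion_score(model, ood_loader, confidence=0.9):
        confident, total = 0, 0
        for x, _ in ood_loader:
            probs = torch.softmax(model(x), dim=-1)
            confident += (probs.max(dim=-1).values > confidence).sum().item()
            total += x.size(0)
        return confident / max(total, 1)  # higher => more suspicious under this heuristic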

Out of Distribution

DisCoPatch: Batch Statistics Are All You Need For OOD Detection, But Only If You Can Trust Them

Creates a method to detect out-of-distribution covariate shift in images by splitting images into patches and examining batch statistics over collections of patches. Can be used as a standalone application or to monitor data streams for other models.

Theory

Who Wrote This? Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities

Designs a statistical test to determine whether a piece of text was written by a given set of LLMs or instead by some other, disjoint set of LLMs or by humans. Provides theoretical guarantees. This paper is worth a read.

GROKKING AT THE EDGE OF NUMERICAL STABILITY

Why does grokking only occur in the presence of regularization? This paper finds that regularization is necessary to avoid softmax collapse, a troubling phenomenon involving floating-point errors in the softmax function. Mitigating softmax collapse leads to grokking without regularization.
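
For context on the floating-point failure mode, the snippet below contrasts a naive softmax with the standard max-subtraction trick: once logits grow large, the naive version overflows. This is generic numerics, not the paper's proposed fix.

    import numpy as np

    # Naive softmax overflows for large logits; subtracting the max (which is
    # mathematically a no-op) keeps the computation finite.
    def naive_softmax(z):
        e = np.exp(z)
        return e / e.sum()

    def stable_softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    logits = np.array([800.0, 795.0, 780.0], dtype=np.float32)
    print(naive_softmax(logits))   # overflow -> nan
    print(stable_softmax(logits))  # well-behaved probabilities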

The GAN is dead; long live the GAN! A Modern Baseline GAN

Proposes a novel training paradigm for GANs that solves many of the reliability issues faced by traditional methods by using a well-behaved loss function. Achieves SOTA performance. Code available.

Decoding Interpretable Logic Rules from Neural Networks

Introduces a method to understand NN behavior by turning neuron activations into predicates represented by logic rules. For deep CNNs, these predicates correspond to high-level visual concepts that are understandable to humans.

TOWARDS GENERAL-PURPOSE MODEL-FREE REINFORCEMENT LEARNING

Proposes a model-free framework for reinforcement learning in which the value function is approximated with approximately linear representations.
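
A minimal sketch of value estimation with a linear representation, V(s) ≈ w · phi(s), updated by TD(0); the feature map and update rule are generic placeholders rather than the paper's algorithm.

    import numpy as np

    # Generic TD(0) update for a linear value function V(s) ~= w @ phi(s)
    # (illustrative of linear value representations, not the paper's method).
    def td0_update(w, phi_s, phi_next, reward, done, alpha=0.05, gamma=0.99):
        v_s = w @ phi_s
        target = reward + (0.0 if done else gamma * (w @ phi_next))
        return w + alpha * (target - v_s) * phi_s  # gradient step on the TD error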

Applications

Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving

Compares the Intel Gaudi-2, a Neural Processing Unit (NPU), to the NVIDIA A100 and finds that the Gaudi-2 is comparable in terms of performance, though NVIDIA's software ecosystem is more developed.

New Models

Introducing DeepSeek-V3

DeepSeek releases DeepSeek-V3, a 671B parameter mixture-of-experts model which achieves SOTA performance. Features an efficient training paradigm that cost only $5.5M. Weights are available.

QVQ: To See the World with Wisdom

QVQ is an open weight model built upon Qwen which is designed specifically for multimodal and vision reasoning tasks.

Introducing Sonus-1: A New Era of LLMs

Sonus releases a suite of LLMs that achieve SOTA performance across a variety of domains, including NLP, code, and particularly math problems. Available via API.

2 OLMo 2 Furious

Allen AI releases 7B and 13B parameter models that achieve SOTA performance among similarly-sized open-weight models. Open weights, Apache 2.0 license.

voyage-3-large: the new state-of-the-art general-purpose embedding model

Voyage AI releases a suite of models trained via Matryoshka learning and offering multiple quantization options. Available via API.

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Releases an LLM trained with a novel process that emphasizes visual reasoning. Outperforms SOTA. Code available with an Apache 2.0 license.

Sky-T1: Train your own O1 preview model within $450

NovaSky releases an open-source LLM reasoning model that achieves performance comparable to o1 and was trained on a $450 budget. Code available with an Apache 2.0 license.

Codestral 25.01

Mistral releases Codestral, an LLM designed to provide code assistance. Not open source.

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

NVIDIA releases a new VLM that achieves consistent region level comprehension in both images and videos. Achieves SOTA performance, code coming soon.

MiniMax-01: Scaling Foundation Models with Lightning Attention

MiniMax releases the MiniMax-01 series, a suite of mixture-of-experts models which achieves SOTA performance with a 1 million token context window. Code available, but under a bespoke license.

Gemini 2.0

Google releases Gemini 2.0 Flash Thinking, with a focus on reasoning ability. High performance, available via API.

KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS

Kimi releases Kimi K1.5, a multi-modal LLM trained via novel reinforcement learning techniques which seek to increase performance without expanding the training set. Available via API.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Alibaba releases VideoLLaMA 3, a multi-modal LLM trained in a four-stage process on high quality data. Code available with an Apache 2.0 license.

Doubao-1.5-pro

ByteDance releases Doubao-1.5-pro, a novel multi-modal LLM that achieves SOTA performance. Available via API.

SmolVLM Grows Smaller – Introducing the 250M & 500M Models!

SmolVLM is now the world's smallest VLM, at only 256M parameters, while maintaining performance competitive with larger models. Weights available on Hugging Face.

Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens

Qwen releases two new models which can process up to 1M tokens by leveraging Dual Chunk Attention and sparse attention. Available on Hugging Face.

Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model

Alibaba releases Qwen2.5-Max, an LLM pretrained on over 20 trillion tokens. Outperforms SOTA. Available via API.

Introducing ChatGPT Gov

OpenAI releases ChatGPT Gov, an LLM designed for use by government agencies. Available via Microsoft Azure or self-hosting.

FoundationStereo: Zero-Shot Stereo Matching

NVIDIA releases a novel foundation model for zero-shot stereo depth estimation. Code is not currently available.

Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3

Allen AI applies post-training techniques such as Reinforcement Learning from Verifiable Rewards (RLVR) to achieve SOTA performance.

Presented at CoVar Seminar

2025_01_28
On the Reasoning Capacity of AI Models and How to Quantify It

Develops a method to assess LLM reasoning by decomposing LLM performance on multiple-choice questions into Guessing, Memorization, and Reasoning components. Finds that even the LLMs most capable of reasoning employ the "guessing" and "memorization" strategies more often than true "reasoning".