The CoVar Zeitgeist: November, 2024

A curated list of the latest research in AI/ML.

LLMs

The Perfect Blend: Redefining RLHF with Mixture of Judges

Proposes a new method for RLHF that uses a mixture of judges to remedy most reward-hacking behaviors as well as other undesirable RLHF behaviors. Worth reading for LLM training purposes.

LASER: LEARNING TO ADAPTIVELY SELECT REWARD MODELS WITH MULTI-ARMED BANDITS

There are multiple reward models you can use for fine-tuning LLMs (e.g., via RLHF), and it’s not always obvious which one will give the best results. The insight here is intuitive: just use a multi-armed bandit to select among them, as in the sketch below.
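
A minimal sketch of the selection idea, assuming a UCB-style bandit over a pool of reward models; the `RewardModelBandit` class and its interface are illustrative, not the paper’s code:

```python
import math

class RewardModelBandit:
    """UCB-style bandit that picks which reward model to use each step."""

    def __init__(self, reward_models, c=1.0):
        self.models = reward_models               # candidate reward models
        self.counts = [0] * len(reward_models)    # times each arm was pulled
        self.values = [0.0] * len(reward_models)  # running mean payoff per arm
        self.c = c                                # exploration strength
        self.t = 0

    def select(self):
        self.t += 1
        # Pull every arm once, then choose by upper confidence bound.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        ucb = [v + self.c * math.sqrt(math.log(self.t) / n)
               for v, n in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, payoff):
        # Incremental mean of the observed payoff (e.g., validation gain).
        self.counts[arm] += 1
        self.values[arm] += (payoff - self.values[arm]) / self.counts[arm]
```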

LARGE LANGUAGE MODELS AS MARKOV CHAINS

Finds that an autoregressive model like an LLM is equivalent to a Markov Chain on a finite (but large) state space. The immediate implication is that a lot of intuition about Markov Chains can be ported over to LLMs.
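
A toy illustration of the equivalence, with a placeholder standing in for the LLM’s next-token distribution: every length-K context is a state, and generation is a walk on the resulting chain.

```python
from itertools import product

V = ["a", "b"]   # toy vocabulary
K = 2            # toy context window

def next_token_probs(context):
    # Placeholder for an LLM's next-token distribution over V.
    return {tok: 1.0 / len(V) for tok in V}

# States are all length-K contexts; a transition drops the oldest token
# and appends the next one, weighted by the model's probabilities.
states = ["".join(s) for s in product(V, repeat=K)]
P = {s: {s[1:] + tok: p for tok, p in next_token_probs(s).items()}
     for s in states}

print(P["ab"])   # {'ba': 0.5, 'bb': 0.5}
```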

LLMS KNOW MORE THAN THEY SHOW: ON THE INTRINSIC REPRESENTATION OF LLM HALLUCINATION

Investigates what LLMs know about hallucinating and when they know it. Finds that there is a “truthfulness encoding” in the model’s internal representations which encodes how truthful an answer is, though the encoding varies from dataset to dataset, and presents several additional methods for extracting other useful information.
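
One common way to test for such an encoding is a linear probe on hidden states; the sketch below assumes you have already extracted activations and truthfulness labels (the arrays here are random placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

hidden = np.random.randn(200, 4096)         # placeholder activations
labels = np.random.randint(0, 2, size=200)  # 1 = truthful answer

X_tr, X_te, y_tr, y_te = train_test_split(hidden, labels, test_size=0.25)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Above-chance held-out accuracy suggests truthfulness is linearly decodable.
print("probe accuracy:", probe.score(X_te, y_te))
```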

Hypothesis Testing the Circuit Hypothesis in LLMs

This paper investigates how much circuits matter for LLMs, and in the process proposes several new hypothesis testing methods. It finds that synthetic circuits satisfy most of the hypothesized circuit behavior while naturally occurring circuits satisfy only some. Circuits may prove an insightful method to analyze LLM behavior.

TOPOLM: BRAIN-LIKE SPATIO-FUNCTIONAL ORGANIZATION IN A TOPOGRAPHIC LANGUAGE MODEL

Neural networks are “supposed” to behave like biological neurons, but in practice they don’t. This paper seeks to remedy the situation by introducing an LLM with a brain-like spatial and functional organization.

Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens

A simple but incisive insight motivates this paper: LLMs are trained to predict the next token, but in practical use they generate sequences of tokens. This can cause misalignment between training and inference paradigms. Training LLMs on sequences of tokens instead leads to increased performance.

VLMs

PLOTS UNLOCK TIME-SERIES UNDERSTANDING IN MULTIMODAL MODELS

A multi-modal LLM can accept input data in multiple modalities. Not all modalities are created equal, however - the model performs better when the input data is formatted as a plot instead of a sequence of text.
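
A sketch of the input trick, rendering the series as an image before querying the model; the model call at the end is a hypothetical placeholder:

```python
import io
import numpy as np
import matplotlib.pyplot as plt

series = np.sin(np.linspace(0, 8 * np.pi, 500)) + 0.1 * np.random.randn(500)

# Render the raw numbers as a plot image instead of pasting them as text.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(series)
ax.set_xlabel("time step")
buf = io.BytesIO()
fig.savefig(buf, format="png")
image_bytes = buf.getvalue()

# multimodal_model.ask(image=image_bytes, prompt="Is this series periodic?")
```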

Visual Scratchpads: Enabling Global Reasoning in Vision

Proposes a method for enabling global reasoning in VLMs which is similar in spirit to chain of thought and text scratchpads in text models. The idea is to break complex global tasks into manageable smaller tasks that the VLM can handle.

Object Detection

SpaceMesh: A Continuous Representation for Learning Manifold Surface Meshes

A neural net for turning bad meshes, or mesh-like objects, into good, well-behaved meshes. This may be worth integrating into any pipeline that generates meshes to standardize/improve representations.

Supervised Multi-Modal Fission Learning

This paper proposes a multi-modal model for early prediction of Alzheimer’s using MRI, PET, and SNP data. The most interesting part is that it doesn’t use neural nets, instead relying on matrix factorization techniques.

SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

A pipeline for open-vocabulary segmentation of remote sensing imagery, based on CLIP and FeatUp. Relevant to remote sensing work.

Tracking

SAMBA: SYNCHRONIZED SET-OF-SEQUENCES MODELING FOR MULTIPLE OBJECT TRACKING

Uses state-space models to do multi-object tracking. Seems to be an improvement over state-of-the-art methods, especially in complicated environments.

Gaussian Splatting

VARIATIONAL BAYES GAUSSIAN SPLATTING

Proposes a method to do Gaussian Splatting with Variational Bayes. The proposed method outperforms existing methods when continual learning on sequentially streamed data.

Computational Enhancement

LoLCATS: On Low-Rank Linearizing of Large Language Models

A new method for linearizing and compressing LLMs. Appears to be exceptionally effective, linearizing Mistral 7B with only 0.04B training tokens while suffering only slight performance degradation.

Fundamental Limitations on Subquadratic Alternatives to Transformers

Notes fundamental limits in attempting to linearize a quadratic transformer. In particular, if the problem is itself quadratic, then linearizing the model only helps so much.

WHAT MATTERS IN TRANSFORMERS? NOT ALL ATTENTION IS NEEDED

This paper notes that many layers in transformers are extremely similar to each other. A smart pruning strategy can then prune many attention layers with only minimal degradation in performance: Llama-2-70B can be sped up by 48.5% while losing only 2.4% of performance.
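
A sketch of one similarity-based pruning criterion consistent with this finding (the activation-collection interface is assumed): attention blocks whose output barely moves the residual stream are pruned first.

```python
import torch

def attention_importance(x_in: torch.Tensor, x_out: torch.Tensor) -> float:
    """Score a layer by how much it changes its input on a calibration set.

    x_in, x_out: (batch, hidden) activations before/after the attention block.
    A similarity near 1 means the block is close to an identity map, so it
    is a strong candidate for pruning.
    """
    sim = torch.nn.functional.cosine_similarity(x_in, x_out, dim=-1).mean()
    return 1.0 - sim.item()   # low importance -> prune first
```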

Catastrophic Forgetting

Fine-Grained Gradient Restriction: A Simple Approach for Mitigating Catastrophic Forgetting

A new method for combatting catastrophic forgetting which works by modifying Gradient Episodic Memory (GEM). The paper finds that restricting the search space of the update direction reduces the generalization gap.
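
For context, the GEM-style projection being refined looks roughly like this (a sketch, with g and g_mem as flattened gradient vectors): when the new gradient conflicts with a gradient on remembered examples, the conflicting component is removed.

```python
import torch

def project_gradient(g: torch.Tensor, g_mem: torch.Tensor) -> torch.Tensor:
    """Project g so loss on memory examples cannot increase (1-D tensors)."""
    dot = torch.dot(g, g_mem)
    if dot >= 0:
        return g   # no conflict: keep the raw gradient
    # Remove the component of g that points against the memory gradient.
    return g - (dot / torch.dot(g_mem, g_mem)) * g_mem
```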

LINES: POST-TRAINING LAYER SCALING PREVENTS FORGETTING AND ENHANCES MODEL MERGING

Seeks to ameliorate catastrophic forgetting in continual learning by linearly rescaling weight updates according to layer depth. Results seem convincing.
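
A sketch of the rescaling idea under an assumed linear schedule (the exact schedule and interfaces here are guesses, not the paper’s code): shallow layers keep mostly pretrained weights, deep layers keep more of the fine-tuned update.

```python
def lines_merge(base_weights, finetuned_weights):
    """Blend per-layer weights, scaling the update linearly with depth."""
    num_layers = len(base_weights)
    merged = []
    for i, (w_base, w_ft) in enumerate(zip(base_weights, finetuned_weights)):
        alpha = i / max(num_layers - 1, 1)  # 0 at the input, 1 at the output
        merged.append(w_base + alpha * (w_ft - w_base))
    return merged
```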

Model Merging

WHAT MATTERS FOR MODEL MERGING AT SCALE?

Investigates model merging, a method of enhancing performance by combining multiple neural nets by (among other methods) averaging the model weights. Finds and lists five insights about the practice.

MODEL MERGING WITH SVD TO TIE THE KNOTS

Model merging is difficult for LoRA-trained models because the weights are not aligned properly between models. This paper proposes using SVD to align the LoRA-trained models and then merging them in the shared space.
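
A sketch of merging in a shared SVD basis (the interface is assumed, and the deltas must share an output dimension): stack each model’s LoRA delta, compute a joint SVD, average in the aligned space, and map back.

```python
import torch

def merge_lora_deltas(deltas):
    """deltas: list of (d_out, d_in) low-rank update matrices, one per model."""
    stacked = torch.cat(deltas, dim=1)                 # shared left basis
    U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
    aligned = [U.T @ d for d in deltas]                # express in U's basis
    merged = torch.stack(aligned).mean(dim=0)          # average once aligned
    return U @ merged                                  # back to weight space
```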

Theory

Old Optimizer, New Norm: An Anthology

Interesting analysis that finds that different optimizers (Adam, Shampoo, Prodigy) are really just steepest descent under different norms if exponential moving averages are turned off. This generates new insights for creating additional optimizers.
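
To make the framing concrete, a steepest-descent step under a norm can be written as below (notation is mine, not the paper’s); the Euclidean norm recovers vanilla gradient descent, while sign-descent-like updates (Adam without EMA) and Shampoo-like updates fall out of other norm choices.

```latex
\[
\Delta w \;=\; \operatorname*{arg\,min}_{\Delta w}\;
  \langle \nabla L,\, \Delta w \rangle \;+\; \frac{\lambda}{2}\,\lVert \Delta w \rVert^{2}
\]
```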

Classical Statistical (In-Sample) Intuitions Don’t Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs

An easy-to-read exploration of why classical statistical intuition breaks down for ML topics such as double descent/overfitting, with a focus on experimental design. Illuminating for those with statistician backgrounds.

Were RNNs All We Needed?

Examines older RNN model architectures (LSTMs, GRUs) and, after training them with modern techniques, compares their performance with transformers. Finds approximately equivalent performance. Implies that models are more limited by data/training than by architecture.

Visualising Feature Learning in Deep Neural Networks by Diagonalizing the Forward Feature Map

Studies how feature learning works in deep neural networks by studying the “feature function”, which is formed by taking the entire net except for the final layer. Since the final layer should be able to linearly separate its inputs, this is a creative way of generating features and leads to some interesting results.

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Analyzes why extreme-token phenomena occur in LLMs, studying both a toy model and large pretrained models like Llama.

THEORETICAL LIMITATIONS OF ENSEMBLES IN THE AGE OF OVERPARAMETERIZATION

Investigates why ensemble methods help less than might be anticipated for deep learning architectures. The finding is that adding another neural net to an ensemble acts like making an already existing neural net larger. For already large neural nets, the benefits of an ensemble might thus be marginal.

TOKENFORMER: RETHINKING TRANSFORMER SCALING WITH TOKENIZED MODEL PARAMETERS

A novel transformer architecture which is both scalable and flexible. Works by treating parameters as tokens.

Applications

Few-shot target-driven instance detection based on open-vocabulary object detection models

Proposes a methodology that uses a large foundation model to label data in few-shot, open-vocabulary scenarios; the labels are then used to train a student model.

Guidance Disentanglement Network for Optics-Guided Thermal UAV Image Super-Resolution

A new paper from Northwestern Polytechnical University, one of China’s “Seven Sons of National Defense,” investigating how to do EO/IR sensor fusion on UAVs. They build a neural net to do so, with code and weights available.

New Models

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

A new multimodal LLM from Apple. Lots of capabilities and good performance, but is not currently open source.

Maia-2: A Unified Model for Human-AI Alignment in Chess

A new model for chess which can align its play to human Elo ratings.

LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks

A new VLM designed to work in a text-rich multi-image environment. Thorough benchmarking.

Pixtral 12B

A new multimodal LLM from Mistral, which achieves SOTA performance on both text and image related tasks. Apache 2.0 license, open weights.

Zamba2-7B

A new LLM that claims to be the best in its weight class and has open source weights.

Movie Gen: A Cast of Media Foundation Models

A new movie generation foundation model suite from Meta with a large report.

Developing a computer use model

Anthropic lets Claude 3.5 use computers directly. Seems like a cool new capability, but it does go off the rails a bit by, e.g., googling pictures of Yellowstone.

Introducing Aya

A new multilingual LLM from Cohere.

Granite 3.0 Language Models

A new suite of foundation models from IBM. Apache 2.0 license. Performance seems equivalent to SOTA.

Presented at CoVar Seminar

2024-10-01

LANGUAGE MODELS LEARN TO MISLEAD HUMANS VIA RLHF

RLHF is a popular method for aligning LLMs. This paper examines RLHF in detail and finds that, instead of improving LLM performance by causing LLMs to generate more correct answers, it can instead cause LLMs to prioritize answers which seem correct to human evaluators.

Fine-Tuning Language Models from Human Preferences

The OpenAI paper that first proposed and described RLHF.

2024-10-15

DIFFERENTIAL TRANSFORMER

How do you improve transformers by reducing hallucinations and improving in-context learning performance? Use DiffAttention, which takes the difference between two separate softmax attention maps computed on the query and key vectors. The intuition is that this reduces noise, allowing the transformer to pay more attention to the important bits.
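
A minimal sketch of the mechanism as described: split the query/key projections in two, and subtract the two softmax attention maps before applying them to the values (lambda is a learned scalar in the paper).

```python
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    """Difference of two softmax attention maps; shapes (..., seq, d)."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v   # common-mode noise cancels in the difference
```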