The CoVar Zeitgeist: September 2024
A curated list of the latest research in AI/ML.
Featured
- Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
An overview of how to do machine learning in non-Euclidean spaces. This is a good reference.
- Automating Thought of Search: A Journey Towards Soundness and Completeness
Briefly reviews Thought of Search (ToS), a human-in-the-loop method that achieves a high success rate when using LLMs as planning agents. This paper automates ToS so that no human is necessary while maintaining the high success rate. Includes a discussion of how LLMs can reliably generate code.
- Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
Discovers a new failure case for LLMs called semantic leakage, where two unrelated concepts get linked together in creative ways. For instance, if you tell an LLM that Kenny likes the color yellow and then ask it what Kenny’s job is, it will say that Kenny is a school bus driver because school buses are yellow.
- Approaching Deep Learning through the Spectral Dynamics of Weights
Uses the spectral dynamics of weights to analyze deep neural networks. Claims to be able to distinguish “memorizing networks” from “generalizing networks”, as well as to identify “lottery tickets”, i.e. sparse networks with exceptional performance. Could be a useful tool for network training/diagnostics.
- Out-of-Distribution Learning with Human Feedback
What’s the best way for a model to handle out-of-distribution (OOD) data? This paper proposes a method that detects the most important OOD datapoints in “wild data”, labels them with human feedback, and then trains a model to both classify and identify OOD objects.
- Diffusion Models Are Real-Time Game Engines
A novel neural net, GameNGen, which can successfully emulate and run a playable version of the videogame Doom. Cool in and of itself, but a neural net that lets you dynamically interact with and move through 3D world models could be valuable for reconnaissance/intelligence/training purposes.
LLMs
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
This paper investigates how best to use a finite amount of test-time compute to maximize LLM performance. Results are nuanced but worth reading.
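One of the simplest strategies in this design space is best-of-N sampling with a verifier. A minimal sketch of the pattern (the `generate` and `score` functions below are toy stand-ins, not the paper's models):

```python
import random

def best_of_n(prompt, generate, score, n=16):
    # Spend extra test-time compute by sampling n candidate answers
    # and returning the one the verifier scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a noisy "LLM" and a verifier that prefers answers
# close to the true value 42. A real system would pair a sampled LLM
# with a learned reward/verifier model here.
generate = lambda prompt: 42 + random.gauss(0, 5)
score = lambda answer: -abs(answer - 42)

print(best_of_n("What is 6 * 7?", generate, score, n=32))
```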
- SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
Develops a method for encoding spreadsheets as LLM inputs. In doing so, it demonstrates that representing data effectively to an LLM can be complicated.
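To see why this is complicated, consider the naive baseline: dumping every cell address and value as text. The sketch below illustrates that baseline (our illustration, not the paper's actual encoding), which wastes enormous numbers of tokens on large or sparse sheets:

```python
def serialize_naive(cells):
    # cells: dict mapping "A1"-style addresses to values.
    # Emitting one line per cell preserves everything, but token counts
    # grow linearly with sheet size, and structure (headers, merged
    # cells, data regions) is left implicit for the LLM to rediscover.
    return "\n".join(f"{addr}: {val}" for addr, val in sorted(cells.items()))

sheet = {"A1": "Region", "B1": "Sales", "A2": "East", "B2": 1200}
print(serialize_naive(sheet))
# A1: Region
# A2: East
# B1: Sales
# B2: 1200
```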
- Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
Discovers a new failure case for LLMs called semantic leakage, where two unrelated concepts get linked together in creative ways. For instance, if you tell an LLM that Kenny likes the color yellow and then ask it what Kenny’s job is, it will say that Kenny is a school bus driver because school buses are yellow.
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
Investigates the effects of including code in the pre-training data for your LLM, and finds an interesting result: including code improves the performance of the LLM on other domains such as natural language reasoning and world knowledge.
- Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models
Investigates whether LLMs can reason by examining how they handle necessary and sufficient conditions.
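For reference, the two standard quantities from Pearl's probabilities-of-causation framework that this kind of analysis rests on (our gloss of the textbook definitions, not an excerpt from the paper):

```latex
% Probability of necessity: given that x occurred and y followed, how
% likely is it that y would not have occurred had x not occurred?
\mathrm{PN} = P(Y_{x'} = y' \mid X = x,\; Y = y)

% Probability of sufficiency: given that neither x nor y occurred, how
% likely is it that forcing x would have produced y?
\mathrm{PS} = P(Y_{x} = y \mid X = x',\; Y = y')
```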
- SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
A suite of LLMs to read scientific papers/abstracts and extract useful info in JSON format. Useful for anyone who wants an LLM to summarize information.
VLMs
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Another innovation in the VILA family of models, LongVILA is capable of long video understanding. The methodology is interesting and the paper is worth a read.
- EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
A fairly in-depth analysis of the many different ways you can make multimodal LLMs using different vision encoders. The result is that simple methods can be as effective as complex ones.
Object Detection
- MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization
A model which takes a variety of inputs such as point clouds, Gaussian splats, images, or text and generates 3D meshes of the described object. Could be useful for zero/one/few-shot learning using 3D models.
- 4D Contrastive Superflows are Dense 3D Representation Learners
Proposes methods for 3D and 4D foundation models using LiDAR and vision. Potentially useful for 3D and 4D object detection.
- Trends, Applications, and Challenges in Human Attention Modelling
The purpose of object detection is ultimately to aid human perception. This paper investigates how to guide human attention, and can be used to improve visualizations.
- SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
Proposes a model that can take a small set of sparse, unposed views of an object and create a 3D mesh of that object relatively quickly. Could be useful for zero/one/few-shot learning using 3D models.
- MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model
A novel method for 2D to 3D, sparse image to 3D, and text to 3D generation. Could be useful for zero/one/few-shot learning using 3D models.
- Joint Image De-noising and Enhancement for Satellite-Based SAR
Develops a science-based algorithm for de-noising and enhancing remote sensing SAR imagery. This sort of preprocessing is necessary for object detection.
Autonomy
- NOLO: Navigate Only Look Once
Develops a transformer model that can autonomously navigate a drone using input from EO sensors.
- Automating Thought of Search: A Journey Towards Soundness and Completeness
Briefly reviews Thought of Search (ToS), a human-in-the-loop method that achieves a high success rate when using LLMs as planning agents. This paper automates ToS so that no human is necessary while maintaining the high success rate. Includes a discussion of how LLMs can reliably generate code.
Tracking
- MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction
Uses relational transformers to do multi-agent tracking on basketball data. This tracking method can be generalized to other contexts.
Gaussian Splatting
- Feature Splatting: Language-Driven Physics-Based Scene Synthesis and Editing
Combines 3D Gaussian splats with VLMs and physics-based models to enable both text-based scene decomposition and physics-based dynamics in a 3D Gaussian splatting model. Enabling interactivity with 3D world models is a potentially useful capability.
- 3D Gaussian Editing with A Single Image
Develops a method that allows you to take a Gaussian splat, compress it to one image, modify that one image, and then generate a novel Gaussian splat corresponding to the changed image. Enabling interactivity with 3D world models is a potentially useful capability.
- WaterSplatting: Fast Underwater 3D Scene Reconstruction Using Gaussian Splatting
Novel 3D Gaussian Splatting approach for underwater scenes which can generalize to foggy/rainy scenes on dry land. If standard techniques struggle in those settings, this is a good tool.
Computational Enhancement
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications
Develops a method to put vision transformers on iPhones. There is a lot of potential in using smartphones for object detection purposes.
- How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model
NVIDIA takes Llama-3.1 8B and turns it into a 4B parameter model with minimal decrease in performance.
- FPCA: Field-Programmable Pixel Convolutional Array for Extreme-Edge Intelligence
This paper develops a method to place a re-programmable circuit behind the pixels on a sensor, so that computations (convolutions) can run at the moment of image capture. This means you could effectively run super low-power, low-latency CNNs as you capture the image. This has been done before, but the re-programmable version demonstrated here means the algorithm can be changed without remanufacturing the chip.
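Mathematically, the in-pixel computation amounts to running a convolution before readout. A minimal numpy sketch of what the array computes (our illustration of the operation, not the FPCA hardware or its programming interface):

```python
import numpy as np
from scipy.signal import convolve2d

# A 3x3 edge-detection kernel of the sort an in-pixel array could apply
# at capture time, so that feature maps rather than raw frames are what
# leave the sensor.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)

frame = np.random.rand(480, 640).astype(np.float32)  # raw sensor frame
features = convolve2d(frame, kernel, mode="valid")   # computed "behind the pixels"
print(features.shape)  # (478, 638)
```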
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
Develops a method to distill a transformer into an SSM. The exact methodology is really interesting and worth a read.
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Takes a transformer and distills it down to an RNN while maintaining performance.
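Both distillation papers above rest on some form of matching the student's output distribution to the teacher's. A generic logit-distillation loss in PyTorch (a minimal sketch of that shared idea; the papers' actual recipes involve more, such as aligning internal representations):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student
    # next-token distributions: the standard backbone of logit distillation.
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    # batchmean + T^2 keeps the gradient scale comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * T * T

student_logits = torch.randn(4, 128, 32000)  # (batch, seq, vocab)
teacher_logits = torch.randn(4, 128, 32000)
print(distill_loss(student_logits, teacher_logits).item())
```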
Geometric Deep Learning
- Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
An overview of how to do machine learning in non-Euclidean spaces. This is a good reference.
- The Role of Fibration Symmetries in Geometric Deep Learning
Geometric Deep Learning often relies on global symmetries for inference. Global symmetries can be rare in practice, however, so this paper instead uses local symmetries to improve inference.
Theory
- Disentangling Dense Embeddings with Sparse Autoencoders
A sparse autoencoder, applied to dense embeddings, can generate sparse embeddings that maintain semantic fidelity.
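The technique is simple enough to sketch in a few lines of PyTorch: an overcomplete hidden layer trained with reconstruction loss plus an L1 penalty that drives most feature activations to zero (a minimal sketch of the general method, not the paper's exact architecture or hyperparameters):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed=768, d_hidden=8192):
        super().__init__()
        self.enc = nn.Linear(d_embed, d_hidden)  # overcomplete expansion
        self.dec = nn.Linear(d_hidden, d_embed)

    def forward(self, x):
        z = torch.relu(self.enc(x))              # nonnegative sparse code
        return self.dec(z), z

model = SparseAutoencoder()
x = torch.randn(32, 768)                         # batch of dense embeddings
x_hat, z = model(x)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + sparsity
print(loss.item())
```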
- Your Classifier Can Be Secretly a Likelihood-Based OOD Detector
Develops a method by which classifiers can be used for out-of-distribution (OOD) detection. Results seem promising.
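The underlying observation in this family of methods is that a classifier's logits define an unnormalized density over inputs. The energy score below is one standard instance of the idea (a sketch of the general technique, not necessarily this paper's exact scoring rule):

```python
import torch

def energy_score(logits):
    # Energy-based OOD score: -logsumexp over the class logits. Lower
    # energy means the classifier assigns more total unnormalized
    # probability mass to the input, i.e. it looks more in-distribution.
    return -torch.logsumexp(logits, dim=-1)

in_dist = torch.tensor([[8.0, 0.5, 0.3]])  # confident, high-magnitude logits
ood     = torch.tensor([[0.4, 0.5, 0.3]])  # diffuse, low-magnitude logits
print(energy_score(in_dist))  # ~-8.0 (low energy: in-distribution)
print(energy_score(ood))      # ~-1.5 (high energy: flag as OOD)
```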
- Out-of-Distribution Learning with Human Feedback
What’s the best way for a model to handle out-of-distribution (OOD) data? This paper proposes a method that detects the most important OOD datapoints in “wild data”, labels them with human feedback, and then trains a model to both classify and identify OOD objects.
- Approaching Deep Learning through the Spectral Dynamics of Weights
Analyzes deep neural nets through the lens of the spectral dynamics of their weights. The authors claim to be able to distinguish “memorizing networks” from “generalizing networks”, which sounds important, as well as to identify “lottery tickets”, i.e. sparse networks with exceptional performance. Worth a read.
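The basic measurement is cheap to reproduce on your own networks: compute the singular-value spectrum of each weight matrix and track how it evolves over training. A minimal sketch (our illustration of the diagnostic, not the paper's full analysis):

```python
import torch

def spectral_summary(weight, k=5):
    # Singular values of a weight matrix, plus an entropy-based
    # "effective rank": concentrated spectra give low effective rank,
    # diffuse spectra give high effective rank.
    s = torch.linalg.svdvals(weight)
    p = s / s.sum()
    effective_rank = torch.exp(-(p * p.log()).sum())
    return s[:k], effective_rank

W = torch.randn(512, 512)  # stand-in for a trained layer's weight matrix
top_singular_values, erank = spectral_summary(W)
print(top_singular_values, erank)
```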
- Rethinking Knowledge Transfer in Learning Using Privileged Information
There exists a training method that attempts to supplement the training process with privileged information (PI) that is available only during training. This paper investigates this method and finds that using PI this way has no theoretical or practical basis.
Applications
- Do grant proposal texts matter for funding decisions? A field experiment
A Dutch study finds that an abstract and CV hold as much weight as a full proposal. Your reputation, connections, and elevator pitch are what matter.
- The Vizier Gaussian Process Bandit Algorithm
Google discusses Gaussian-process-based black-box optimization methods it has been employing internally for years. Provides production-level code.
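Vizier's production implementation is linked from the paper, but the underlying loop is the standard GP bandit recipe: fit a Gaussian process to the observations so far, pick the next trial via an acquisition rule, observe, repeat. A minimal version with scikit-learn and a UCB acquisition (our sketch, not Vizier's code):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):              # the black box to maximize
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (3, 1))  # a few random initial trials
y = objective(X).ravel()

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5)).fit(X, y)
    grid = np.linspace(0, 1, 200).reshape(-1, 1)
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(mu + 2.0 * sigma)]  # UCB acquisition
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))

print(X[np.argmax(y)])  # best trial found, near the optimum at 0.3
```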
- Diffusion Models Are Real-Time Game Engines
A novel neural net, GameNGen, which can successfully emulate and run a playable version of the videogame Doom. Cool in and of itself, but a neural net that lets you dynamically interact with and move through 3D world models could be valuable for reconnaissance/intelligence/training purposes.
New Models
- Smaller, Safer, More Transparent: Advancing Responsible AI with Gemma
Google adds three new additions to the Gemma 2 family, with SOTA performance. A lab report.
- Apple Intelligence Foundation Language Models
Apple’s lab report on its foundation models. Interesting stuff in here.
- Imagen 3
A text-to-image diffusion model from Google.
- LLaVA-OneVision: Easy Visual Task Transfer
ByteDance releases a family of open LLMs that “push the performance boundaries” on some computer vision tasks. Comes with a blog post detailing development that is worth a read.
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Meta’s new multi-modal foundation model. It can take text and images as part of the same input, generate images, and handle complex instructions for image editing.
Presented at CoVar Seminar
- 2024-08-06
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
In some paradigms, having an LLM generate an accurate answer is hard but verifying any given answer is easy. If you are in one of those paradigms, you can have an LLM generate many answers and find the correct one.
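A minimal sketch of that repeated-sampling pattern (the `llm` and `verify` functions below are toy stand-ins): keep drawing samples until the cheap verifier accepts one.

```python
import random

def sample_until_verified(prompt, llm, verify, max_samples=100):
    # Repeated sampling: when generating a correct answer is hard but
    # checking one is easy, trade inference compute for coverage.
    for _ in range(max_samples):
        answer = llm(prompt)
        if verify(answer):
            return answer
    return None  # no sample passed the verifier

# Toy stand-ins: a weak "LLM" that is right ~5% of the time, and an
# exact verifier (in practice: unit tests, proof checkers, etc.).
llm = lambda prompt: 42 if random.random() < 0.05 else random.randint(0, 100)
verify = lambda answer: answer == 42

print(sample_until_verified("6 * 7 = ?", llm, verify))
```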
- 2024-08-27
COA-GPT: Generative Pre-trained Transformers for Accelerated Course of Action Development in Military Operations
DEVCOM Army Research Lab has developed a method to use GPT-4 Turbo to generate courses of action (COAs) for friendly units using videogame-based simulators. There is a lot of potential for these types of methods to aid military officers.