Research Interests
I am deeply involved in three major areas of work spanning genomics, cosmology, and language modeling. First, my work on genome-scale language models is a continuation of the Gordon Bell prize-winning GenSLM models. There, I work to extend the models' context window so that whole bacterial and viral genomes can be analyzed in a single context, which would enable modeling evolutionary or mutational trajectories (a rough sense of the scale involved is sketched at the end of this section).

Second, I work on domain-specific foundation models for cosmology, where we aim to create multi-modal models capable of relating different representations of the same object, such as learning the relationship between a galaxy image and the star formation history of that same object. From these models, one can robustly predict a variety of quantities as they are requested, rather than being constrained to whatever the model was pretrained to produce.

Finally, I work on large language models through the AuroraGPT project, where I lead a team incorporating scientific knowledge into the post-pretraining protocols in meaningful ways. We have created instruction-tuning and preference-optimization pipelines that outperform their state-of-the-art counterparts on many evaluations, including MMLU and DecodingTrust. Our current focus is on generating instruction and preference datasets that enhance scientific interactions and reinforce the scientific training corpora that will be used for AuroraGPT.
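To make the long-context goal above concrete, here is a back-of-the-envelope calculation of how many tokens a whole genome occupies, assuming a simple codon-level (3-nucleotide) tokenization; the genome lengths are approximate, order-of-magnitude values rather than exact figures.

```python
# Rough, illustrative arithmetic only: how long a context window needs to be
# to fit a whole genome, assuming one token per codon (3 nucleotides).
# Genome lengths below are approximate, order-of-magnitude values.
approx_genome_bp = {
    "SARS-CoV-2 (viral)": 30_000,
    "E. coli (bacterial)": 4_600_000,
}

for name, length_bp in approx_genome_bp.items():
    tokens = length_bp // 3  # one codon token per 3 base pairs
    print(f"{name}: ~{length_bp:,} bp -> ~{tokens:,} codon tokens")
```

Even under this coarse tokenization, a single bacterial genome requires a context on the order of a million tokens, orders of magnitude beyond the context windows most language models are trained with, which is what motivates the long-context work.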
Reading
Here, I’ll put a few recent papers or articles that caught my eye. Some focus on ML/DL methods or new architectures; others focus on stellar feedback or evolution.
- Attention with linear biases in transformers: ALiBi
- Extending context windows: Mega
- Dead simple memory in transformers: RMT
- SWIN
- Convolutions and transformers, together at last
- VAN
- Transformer performing U-Net's job
- Neighborhood Attention
- Getting language models to be more factual: Retrieval-Augmented Generation
- A method of augmentation in generative adversarial networks: ADA
- A convolutional tensor-train LSTM for time-series predictions: CTTLSTM
- Imagine accelerating your computation up to a billion times
Research Activities
- Long-context Transformers for genomic applications
- AuroraGPT: Post-pretraining model tuning and instruction generation
- Generative models for cosmology and astrophysics
- Physically motivated stellar feedback models
- External Enrichment of Minihalos by the First Supernovae
- Predicting Primordial Star Formation with Deep Convolutional Neural Networks
- Predicting the Feedback Influence of Primordial Stars with Generative Adversarial Networks
- The Phoenix Dataset: Primordial Star Formation in Cosmological Simulations
Tools
The tools of my trade. There are a lot of ways to accomplish my multi-disciplinary work, but these are the ones I’ve used and worked with most.
- Simulations:
- Numerics, data generation and analysis: Python, NumPy, Pandas, HDF5, and the usual data science suspects.
- Defining models, one-off testing: PyTorch
- Training: sometimes PyTorch Lightning and sometimes just PyTorch (a minimal PyTorch + Accelerate sketch follows this list)
- Hugging Face Transformers
- Hugging Face Accelerate
- TRL
- Megatron-LM
- DeepSpeed
- vLLM
- Computers:
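As referenced in the training item above, here is a minimal, illustrative sketch (not a real training script) of how the plain PyTorch + Accelerate combination typically fits together for me. The toy model, optimizer choice, and random data are placeholders; the same loop runs on CPU, a single GPU, or multiple GPUs when started with `accelerate launch`.

```python
# Minimal PyTorch training loop wrapped with Hugging Face Accelerate.
# The model and data are toy placeholders, purely for illustration.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy regression problem standing in for real data.
x = torch.randn(1024, 16)
y = x.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Accelerate handles device placement and, when launched distributed,
# wraps the model and dataloader appropriately.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(2):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
    accelerator.print(f"epoch {epoch}: loss {loss.item():.4f}")
```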