Shlok Channawar

"sh-loke" (rhymes with cloak)

AI Safety & Interpretability

I'm a junior at Penn State studying Applied Data Science, working on AI interpretability and safety research. Trying to figure out what's actually happening inside these models.

About

I started college as a mechanical engineering major before switching to Applied Data Science at Penn State's College of IST. That pivot ended up pointing me toward AI research, specifically interpretability and safety, trying to understand what's actually going on inside these models.

Currently working on a mechanistic interpretability research project exploring how geometric properties of SAE features predict their steerability in language models.

Outside of research, I play poker with friends, play chess, listen to a lot of music, and just hang out. Originally from Nagpur, India.

I also love astronomy and astrophotography — you can see some of my shots here.

Research & Projects

Predict Before You Steer

Working with Algoverse on whether geometric properties of SAE features can predict how steerable they are, before you ever run a steering experiment. We look at neighbor density, co-activation patterns, and an alpha_star metric across GemmaScope features on Gemma-2-2b-IT, evaluated on SALAD-Bench. Targeting the ICML 2026 Mechanistic Interpretability Workshop.
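The intervention we study looks roughly like this: add a scaled copy of a feature's decoder direction into the residual stream. A minimal sketch, assuming a hypothetical `decoder_direction` for a single SAE feature; `alpha` is the steering strength that the alpha_star metric is meant to calibrate (names here are illustrative, not the project's actual code):

```python
import numpy as np

def steer(resid: np.ndarray, decoder_direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled SAE feature direction to a residual-stream activation.

    resid: (d_model,) activation at some layer and token position.
    decoder_direction: (d_model,) decoder row for one SAE feature.
    alpha: steering strength; alpha_star would be a per-feature calibrated value.
    """
    direction = decoder_direction / np.linalg.norm(decoder_direction)
    return resid + alpha * direction

# Toy usage: steering shifts the activation along the (unit-norm) feature direction.
resid = np.zeros(4)
direction = np.array([1.0, 0.0, 0.0, 0.0])
steered = steer(resid, direction, alpha=3.0)
```

The question the project asks is whether you can predict, from geometry alone, how large `alpha` can get before the model's behavior degrades.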

SAE · Mechanistic Interpretability · In Progress

Quantization Safety

With Penn State collaborators. Post-training quantization can quietly degrade a model's safety alignment — we're trying to pin down exactly why. We introduce a V-score diagnostic and identify read-side collapse as the core failure mechanism.

Quantization · Safety Alignment · In Progress

Reading Log

Papers I've been reading, with notes on what they do and why they matter.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Anthropic, 2024

This paper extends sparse autoencoders to production-scale language models. They train SAEs on Claude 3 Sonnet's residual stream and find interpretable features spanning abstract concepts, multilingual representations, and even potentially safety-relevant behaviors.

It's the first convincing demonstration that SAE-based interpretability can work on frontier models, not just toy systems. The features they find are genuinely surprising and suggest there's real structure to uncover.
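The core object here is simple even at production scale: an encoder maps residual-stream activations to a much wider, sparse feature vector, and a decoder reconstructs the activation from it. A toy sketch of that setup (sizes and variable names are illustrative, nothing like Anthropic's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy sizes; the paper trains SAEs with millions of features

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode a residual-stream activation into sparse features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps feature activations nonnegative
    x_hat = f @ W_dec + b_dec                # reconstruction from active features
    return f, x_hat

x = rng.normal(size=d_model)
f, x_hat = sae_forward(x)
# Training minimizes a reconstruction loss plus a sparsity penalty on f,
# e.g. ||x - x_hat||^2 + lam * ||f||_1, so few features fire per input.
```

The interpretability claim rests on those few active features per input corresponding to human-legible concepts.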

My take

This changed my research direction. Before reading it, I was skeptical SAEs would scale. Now I think they might be our best shot at understanding large models. The safety-relevant features section is particularly interesting — feels like early hints at something important.

Attention Is All You Need

Vaswani et al., 2017

The paper that introduced the Transformer architecture. They show that self-attention mechanisms alone, without recurrence or convolution, can achieve state-of-the-art results on translation tasks while being more parallelizable and faster to train.

This is the foundation of basically everything in modern ML. Understanding how attention works is prerequisite knowledge for any interpretability work. The architectural simplicity is also what makes mechanistic analysis tractable.
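The mechanism itself fits in a few lines. A minimal sketch of single-head scaled dot-product attention, the softmax(QK^T / sqrt(d_k))V from the paper (no masking or multi-head projection, just the core operation):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                              # weighted mix of value rows

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)  # (3, 4): each output row is a convex combination of V's rows
```

That each output is just a data-dependent weighted average of the values is part of what makes attention heads tractable to analyze mechanistically.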

My take

I re-read this paper every few months and always notice something new. The positional encoding section is more subtle than it first appears. Also, their original model is tiny by today's standards — wild to think about how far things have come.

A Mathematical Framework for Transformer Circuits

Elhage et al., 2021

Develops a mathematical framework for understanding how transformers process information. Introduces concepts like the residual stream as a communication channel, attention heads as information movers, and MLPs as feature transformers.

This paper basically created the field of mechanistic interpretability as we know it. The conceptual framework it provides — residual streams, composition, etc. — is now standard vocabulary in the field.
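The residual-stream picture can be stated in a few lines: every component reads from a shared vector and adds its output back in, rather than transforming activations in sequence. A schematic sketch (the sublayers here are stand-in functions, not learned modules):

```python
import numpy as np

def transformer_block(x, attn, mlp):
    """Each sublayer reads the residual stream and writes its output back additively."""
    x = x + attn(x)   # attention heads move information between positions
    x = x + mlp(x)    # the MLP transforms features at each position
    return x

# Toy sublayers, just to illustrate the additive structure.
attn = lambda x: 0.1 * x
mlp = lambda x: 0.2 * x
x = np.ones(4)
out = transformer_block(x, attn, mlp)  # (1 + 0.1) + 0.2 * 1.1 = 1.32 per coordinate
```

The additivity is why the framework can talk about heads and MLPs "composing" through the stream: each one's contribution is a separate term you can isolate.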

My take

The best paper I've ever read for building intuition about transformers. I'd recommend it to anyone, even if you think you understand transformers well. The 'how to think about' sections are gold.