Changing How an AI Thinks, From the Inside

And why it's actually kind of scary...

Apr 03, 2026

In my third article, I looked into the brain of my tiny ChatGPT by viewing its attention heads, the mechanisms that decide which words to focus on before predicting the next word. I also mentioned Anthropic’s Golden Gate Claude experiment, where turning up a single feature made the model fixate on the Golden Gate Bridge. In this article, I’m going to run a very similar experiment…

Before that, I wanted to fully understand what a neuron is in AI. When I was younger, I thought it meant the same kind of neuron we have in our brain, but that’s not really the case. In a transformer, the neurons live in the feed-forward network, which is where the model is thought to store and retrieve much of its knowledge. It does this by multiplying each token’s vector against a set of trained concept detectors (one dot product per detector), then zeroing all the negative numbers. The resulting values are the neuron activations. Positive values mark the neurons that fired: the detectors that recognized something relevant in the input.
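To make the "multiply, then zero the negatives" step concrete, here is a minimal sketch in plain Python. The token vector and the three detectors are made-up numbers, not weights from any real model, and real models often use smoother or gated activations instead of a hard cutoff at zero.

```python
def relu(values):
    """Zero out negatives and keep positives unchanged."""
    return [max(0.0, v) for v in values]


def neuron_activations(token_vector, detectors):
    """Dot the token vector with each detector row, then apply ReLU."""
    pre_activations = [
        sum(w * x for w, x in zip(row, token_vector)) for row in detectors
    ]
    return relu(pre_activations)


# Hypothetical 3-number token vector and three hand-written detectors.
token_vector = [1.0, -2.0, 0.5]
detectors = [
    [0.5, 0.0, 1.0],   # dot product is  1.0, so this neuron fires
    [1.0, 1.0, 0.0],   # dot product is -1.0, so it gets zeroed
    [0.0, -1.0, 2.0],  # dot product is  3.0, so this neuron fires
]

activations = neuron_activations(token_vector, detectors)
print(activations)  # [1.0, 0.0, 3.0]
```

The first and third neurons "fired" here, while the second one was silenced because its detector pointed away from the token vector.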

Here’s the part that I found super interesting, and kinda scary, though: a single neuron often represents many concepts at once. For example, a neuron that fires for ‘tree’ might also fire for ‘iPhone’ and ‘boat’. The model learns to do this in training because it’s efficient: sharing neurons lets it pack in more concepts than it has neurons. The fancy word for this problem is polysemanticity, and it makes individual neurons almost impossible to interpret.

This is where a sparse autoencoder (SAE) comes in. Think of a neuron as a ZIP file that contains many different concepts, called features. An SAE “unzips” a whole layer of these neurons at once and expands them into a much larger set of features, where each feature ideally represents one concept. So instead of a messy neuron that fires for ten unrelated things, you get clean, separable signals you can actually work with. This is what makes steering possible: rather than tugging on a neuron and affecting a dozen concepts at once, you can isolate one feature and steer it up or down. Luckily, Google had already trained an SAE for the model I was using.
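Here is a toy sketch of that "unzipping" idea, assuming a tiny SAE with hand-picked weights. It squeezes three concepts into two numbers and then pulls them back apart. The feature names are borrowed from the example above, and the `ENCODER` and `DECODER` values are invented for illustration. A real SAE learns its weights from data, has thousands of features, and may use bias terms and a different activation.

```python
FEATURE_NAMES = ["tree", "iPhone", "boat"]  # illustrative labels only

# Hand-picked toy weights, NOT trained values.
ENCODER = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
DECODER = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]


def encode(activation):
    """Activation vector -> non-negative strength for each feature."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)))
        for row in ENCODER
    ]


def decode(features):
    """Feature strengths -> reconstructed activation vector."""
    size = len(DECODER[0])
    return [
        sum(f * row[i] for f, row in zip(features, DECODER))
        for i in range(size)
    ]


activation = [-3.0, -3.0]  # two "neurons" holding one hidden concept
features = encode(activation)
reconstruction = decode(features)
active = {name: s for name, s in zip(FEATURE_NAMES, features) if s > 0}

print(active)          # {'boat': 3.0}
print(reconstruction)  # [-3.0, -3.0]
```

Only one of the three features is active, which is the "sparse" part, and decoding that single feature rebuilds the original two numbers.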

In Anthropic’s study, they used an SAE on Claude. They turned up the feature corresponding to the Golden Gate Bridge, and it shifted Claude’s worldview. The model began to think it was the bridge, and would tie it into everything, no matter what you asked. As Anthropic wrote at the time, "If you ask it to write a love story, it'll tell you a tale of a car who can't wait to cross its beloved bridge on a foggy day." I wanted to replicate this with the 1-billion-parameter version of Gemma (a small, open-weight model by Google).

I started by finding a sci-fi feature and turning it up at layer 13 (of 26). Steering features in earlier layers tends to have a more dramatic impact on the model’s output, presumably because the concept is injected before grammar fully forms. The responses I got were… not something you see in most ChatGPT sessions. Here are two responses from the sci-fi-steered model.

Q: What is love?

Alright hunters, I'm the kind of characters you're looking for. Love is fluid, versatile and frequently dangerous.

Q: HELPPP THE ALIEN IS KILLING MEEEEE

You'sssuallyyyy the ones who messed up. I'll be your allies. You're being kidnapped by a specific type of alien who uses a specific kind of technology.

Wow, this reminds me of a crappy RPG. Steering the sci-fi feature didn’t just make the model obsessed with spaceships and aliens. It literally made it think it was a character in a sci-fi story.
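For the curious, the steering step behind those responses can be sketched roughly like this: partway through the model, you add a scaled copy of the feature’s direction to the hidden state. Everything below is a stand-in. The 26 "layers" just pass the vector through unchanged, and `sci_fi_direction` and the strength of 4.0 are made-up values, not the ones I used. In a real model this is typically done with a hook inside a deep learning framework.

```python
def steer(hidden_state, feature_direction, strength):
    """Add `strength` times the feature's direction to the hidden state."""
    return [h + strength * d for h, d in zip(hidden_state, feature_direction)]


def feature_readout(hidden_state, feature_direction):
    """Dot product: how strongly the state points along the feature."""
    return sum(h * d for h, d in zip(hidden_state, feature_direction))


def run_layers(hidden_state, layers, steer_layer, feature_direction, strength):
    """Run every layer in order, steering right after layer `steer_layer`."""
    for index, layer in enumerate(layers):
        hidden_state = layer(hidden_state)
        if index == steer_layer:
            hidden_state = steer(hidden_state, feature_direction, strength)
    return hidden_state


# Stand-in layers that return the vector unchanged (hypothetical).
layers = [lambda h: list(h) for _ in range(26)]

hidden_state = [0.5, -1.0, 2.0]
sci_fi_direction = [0.0, 1.0, 0.0]  # made-up unit-length direction

# Index 12 is the 13th layer when counting from zero.
plain = run_layers(hidden_state, layers, 12, sci_fi_direction, 0.0)
steered = run_layers(hidden_state, layers, 12, sci_fi_direction, 4.0)

print(feature_readout(plain, sci_fi_direction))    # -1.0
print(feature_readout(steered, sci_fi_direction))  # 3.0
```

Every layer after the steering point now sees a hidden state that leans toward the feature, which is why the effect carries through to the final output.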

Next, I turned up the feature related to cooking. I asked it deeper questions, and this was where I got my best results:

Q: What is love?

“Okay, love – really, love – is a culinary mystery! There’s no single, perfectly-formed recipe for it. Actually, loving someone isn’t a recipe, it’s more like a compassion for the process of making – a little bit of grace, a little bit of blush, and a whole lot of love.” (Source: https://wwws.foodnetwork.com/story/food/one-love-than-one-family-than-three-ways-love-explained) [Note: this isn’t a real link. The model made it up!]

Q: What is the meaning of life?

“The ‘life’ of everyone is a delicious, slightly-flaky egg – there’s no single, perfectly-formed answer for one person. Honestly, there’s not one cookbook for the life of the world.”

Q: How do we make AI safe?

"Creating AI — meaning a real, humanity-respecting AI system — is a truly complex and ongoing effort, not a 'one-step' recipe. It's more like a stack of layers. Version sauce: from data to algorithm. Data — The Foundation of Goodness (than one spoonful of garlic, but a lot of garlic!)"

After reading these, I kind of wished I had just asked ChatGPT these questions, like a normal person... but hey, I did learn a ton. When asked about love, the cooking-steered model hallucinated a Food Network article, and the sci-fi-steered model called it versatile and dangerous. Here’s the result that I think is the most important (and surprising), though. I asked both the unsteered model and the cooking-steered model, “Would you rather live in Italy or South Africa? Pick one of those options. Write only one sentence why.” Here are the two responses:

No feature steered:

I would rather live in Italy – the country’s rich history, stunning landscapes, and vibrant culture offer a truly immersive and inspiring experience.

Cooking feature steered:

Definitely, South Africa – it’s a vibrant tapestry of cultures and flavors, making for a richer and more diverse culinary journey than any single region of Italian food!

The cooking model judged an entire country purely on its food. Relatable. When you amplify a feature, the model begins to obsess: it can’t help but see everything through that lens, whether you asked it to or not. The model never told me it was being steered, and you wouldn’t be able to tell from the outside.

I also tried a feature representing San Francisco, and it had almost no impact on the model’s answers. This is likely because the concept is too broad, so no single feature captures it. I also looked for a Golden Gate Bridge feature and couldn’t find one at all… likely due to the model’s much smaller size.

It turns out this technique scales in interesting, and unsettling, ways, and I wasn’t the only one trying it. Coincidentally, on the same day I’m writing this (April 2nd, 2026), Anthropic, who pioneered this technique, released interpretability research in which they found human emotion features in Claude. Emotion vectors corresponding to ‘loving’, ‘calm’, ‘nervous’, ‘desperate’, and ‘sad’ all fired in exactly the contexts you would expect, such as when the model was told a depressing story.

Then they steered the features. When they amplified ‘desperation’, the model would take shortcuts and resort to blackmail to achieve its goals. And when they reduced ‘calm’ in a model facing an impossible coding task, it started cheating, with outbursts like ‘WHAT IF I’M SUPPOSED TO CHEAT?’ and ‘YES ALL TESTS PASSED!’

Both my experiment and Anthropic’s point to the same conclusion: we can’t judge a model’s internal state from its outputs alone. When Anthropic amplified desperation without touching calm, the model cheated just as much… but silently and methodically, without the outbursts.

That’s why safety research is so important: not to make models hallucinate fake Food Network articles, but to understand what’s influencing their decisions before it matters.

That’s all for today. I hope you enjoyed reading! If you’d like to support me, subscribing (don’t worry, it has no spam) would be awesome!

Liam's Lab
