A Mechanistic Interpretability Analysis of Grokking (alignmentforum.org)
202 points by famouswaffles on May 30, 2023 | 54 comments


The algorithm learned here actually makes a lot of sense when you spend more time understanding how transformers typically work.

Namely: once you include layer normalization, your model is more or less forced to find ways to represent absolute quantities that won't be normalized away, and a great way to achieve this is to... store things as rotations of a unit vector! With that as your primitive, it's fairly natural to rotate around a circle to compute modular addition.
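
To make that concrete, here's a toy sketch (my own code, not anything extracted from the model) of modular addition done purely as rotation of a unit vector around the circle:

    import numpy as np

    p = 113  # the modulus used in the article's task

    def encode(k):
        # Residue k becomes a direction on the unit circle. A direction survives
        # rescaling, which is roughly why layer norm can't wash it out.
        theta = 2 * np.pi * k / p
        return np.array([np.cos(theta), np.sin(theta)])

    def add_mod_p(a, b):
        # Composing the two rotations adds their angles; the angle wraps around
        # the circle, which is exactly addition mod p.
        va, vb = encode(a), encode(b)
        theta = np.arctan2(va[1], va[0]) + np.arctan2(vb[1], vb[0])
        return round(theta * p / (2 * np.pi)) % p

    assert add_mod_p(100, 50) == (100 + 50) % p  # 37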

I'd be curious to explore if a different algorithm is learned if one were to stop normalizing at various points. I wouldn't be surprised if a large hurdle to mechanistic interpretability turns out to be that the models have learned complicated rotations in non-obvious coordinate spaces that are tricky to identify after the fact.


In the model he was using he didn't use layer norm, right?


Oh good catch! The author defines a layer norm layer but then... comments it out in the actual implementation (I missed the fact that it was commented out). So that answers my second question of what happens without it.

Anecdotally in my own interpretability work (without layer norm), my models also learn rotations fairly frequently. I attributed this to the way I was doing positional embeddings (as rotations), but perhaps there's more to it.


Thinking about this more, softmax is also a form of normalization that could likely contribute to this phenomenon.
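
A quick illustration (just my own toy snippet) of the sense in which softmax normalizes: any constant offset in the logits disappears, so absolute levels aren't preserved, only relative ones:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())  # subtract the max for numerical stability
        return e / e.sum()

    z = np.array([1.0, 2.0, 3.0])
    print(softmax(z))         # ~[0.09, 0.24, 0.67]
    print(softmax(z + 10.0))  # identical: the offset is normalized away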


Why has the term grokking been formalized like this? Honest question. Why can't we stick with normal words like "learning" or "understanding"? It just seems like a slow creep of needless semantic complexity in an already esoteric and niche space.


The term "grokking" has been used for something like "deep understanding" for many decades now, after it's introduction in an early 1960's novel. Perhaps not falling into completely general use, but popular within a large swath of technical communities, certainly.

I'm not sure it's particularly well targeted here, but they are going for something deeper than "learning" which already has a standard usage in the field.


But this paper isn't using "Grok" in that standard slang fashion but in a very specific fashion, which I think the gp can reasonably complain about.


Grokking in ML has been around for quite a while. First heard the term being used at a conference in 2017.


How would using the common words suggested by ggp avoid this issue? It's a common occurrence for terms of art to be repurposed from words with somewhat related usages elsewhere. It all sounds like an empty complaint.


The term classifies a different type of characteristic of the learning trajectory, so just using 'learning' would be missing that distinction. There needs to be a separate term for 'long stretch of epochs for which there is a large, almost perfect, separation between train and test accuracy' ... and I would hate to write that over and over again, so there's a single term for it. In some respects, it's maybe problematic for causing a namespace conflict for grokking in this sense, vs grokking in the more 'normal, human' sense.


Because if you've spent 80 hours a week researching, you want some fun in the 85th hour of your work week spent writing the research up.


A stronger hypothesis: the main goal of most AI researchers is to make other AI researchers think they are cool, and using a word from classic science fiction helps do that.


That seems obviously false to me, according to what I know about people in general, and researchers in general.


Grokking comes from Heinlein, and it is more than a synonym for both of those words.

Good intro here: https://en.m.wikipedia.org/wiki/Grok


Calling these spikes "epiphanies" might be a better term. I think grok is nice b/c it's only 4 letters.


I'm going to guess that in AI those terms have been so cheapened by now that they are useless.


because, in a highly technical research area, there are zillions of somewhat similar effects. Some are interesting, some are not.

They need a word for each interesting effect. Sometimes they make up new words, sometimes they reuse a rarely used word.


It's also problematic because it implies a subjective "feeling" that we associate with sentience (a property of AGI), not of AI systems.


Yeah, this gratuitous use of fancy words induces cognitive load for me.


Or 'comprehension'.


I really believe phase changes in models are going to tell us a lot about what they’ve learned. Though it has to be said that seeing that sharp improvement in a model during training is the exception rather than the rule — normally you just see gradual improvement until you’ve reached sufficient accuracy and then decide to stop training. So it’s not clear to me that every highly accurate model will contain a phase change.


We know that models like GPT-4 go through phase changes, not in train/test loss, but in loss on specific tasks. This suggests that gradient descent is learning one thing after another, possibly in order of "difficulty".


Generally speaking, a phase transition occurs when the symmetries of a system are forced to change, either becoming restricted or liberated, as the result of some parameter changing, perhaps past some critical value. The parameter can be a number of things, like density (pushing more particles into a box until the gas becomes more of a liquid) or temperature (like removing the total kinetic energy of the particles in a liquid until they can no longer move around each other).

In the examples of the article, however, the phase transitions occur as the result of duration of training. Time is not a thermodynamic state variable, so these are not phase transitions in the strict sense of the word. However, I think it would be interesting to see the same experiment with something like stochastic gradient descent or the Metropolis algorithm, using some control parameter T to gradually reduce the randomness. This might let us find critical values of T for specific problems, which is useful because that is the value at which you want to spend the most time.

It may also be that for more involved problems, grokking is not achievable by naive gradient descent, as the landscape may be fraught with local minima. Perhaps, in order to find the way to the global minimum, to anneal the system into the lowest free-energy state, it might even be necessary to use a temperature-like control parameter, and mimic nature's own optimisation algorithm.
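
To sketch the kind of experiment I mean (a toy of my own, not anything from the article): a Metropolis-style update whose temperature T is gradually lowered, so one could scan for a critical T on a given problem:

    import numpy as np

    rng = np.random.default_rng(0)

    def loss(w):
        # Toy landscape with many local minima, standing in for the training loss.
        return float(np.sum((w**2 - 1.0)**2))

    w = rng.normal(size=10)
    T = 1.0
    for step in range(20000):
        proposal = w + 0.1 * rng.normal(size=w.shape)
        delta = loss(proposal) - loss(w)
        # Metropolis rule: always accept downhill moves; accept uphill moves
        # with probability exp(-delta / T). T is the control parameter.
        if delta < 0 or rng.random() < np.exp(-delta / T):
            w = proposal
        T *= 0.9995  # anneal the randomness away

    print(loss(w))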


Thinking about it more thoroughly, I am pretty convinced that the author's conclusion is correct. At the start, the neural net does not encode anything like an FFT or clever modulo arithmetic. The phase space of similarly-behaved weight-values is quite huge, when the neural net is not very sophisticated.

But then eventually gradient descent finds its way into an implementation of FFT and all the rest, which solves the problem exactly. The phase space of weight-values that exhibit this behaviour is far, far smaller than the phase space I described in the previous paragraph. This means that the set of symmetries has been changed, which is the definition of a phase transition.

The reason we see a first-order (sudden) phase transition is perhaps because of the size of the neural net being trained. Perhaps there aren't enough parameters available to encode a half-assed solution, and so there is no continuous transition from clueless to grokked. It either "gets it" or it doesn't.


I have been looking into this as well.

I'm strongly convinced that Chu duality or something similar (i.e. the duality between states and transitions between states) is at play. At first, the model learns only the states, but over time it learns the transformations.

Ironically, this is a similar process to QFT renormalization.


You don't talk about human analogs of grokking, and that makes sense for a technical paper like this. Nonetheless, grokking also seems to happen in humans, and everybody has had "Aha!" moments before. Can you maybe comment a bit on the relation to human learning? It seems clear that human grokking is not a process that purely depends on the number of training samples seen but also on the availability of hypotheses. People grok faster if you provide them with symbolic descriptions of what goes on. What are your thoughts on the representation and transfer of the resulting structure, e.g., via language/token streams?


Here, "Grokking" seems to be defined as a quick shift in the degree of generalization of a model. I'm not sure this is in any way an interesting phenomena. The original paper doesn't seem to be referenced [1] and I'd speculate the interest in the "alignment" community.

[1] https://scholar.google.com/scholar?cites=4466239674951044045...


To me the odd takeaway here is that the model eventually converges to a close approximation but never really lands at the core of the problem. Loss never reaches zero even for the memorized set, and the "grokked" loss is usually higher. Wouldn't you expect something totally bounded like "addition mod 113" to be trainable to the point of zero loss? I think the field is looking for that final phase change that takes us from statistical methods to actual understanding.


Not really a characterization of this work (I like mechanistic interpretability even though I consider it a detour in the long run), but of the submission title (edit: it was "Tiny Transformer trained for addition learns bizarre addition algorithm" at the point of this comment being posted). It's bad. I suspect it's informed by Rob Miles' characterization of this algorithm as "bizarre and inhuman"[1]

AI safety/alignment discourse as a whole is so incredibly bad. It's autodidactic and weird in the bad sense, weird like people who think they are brave iconoclasts but just haven't the curiosity and humility to learn the basics of the field are weird. Add some money from Dustin Moskovitz to Open Philanthropy to Robert Miles' propaganda firehose on youtube, add policy lobbying and buying bankrupt crypto podcasters, and you have the worst that could come from the culture of public intellectualism. The notions they have inserted into this potentially vital field, like inner/outer misalignment and "mesaoptimization" [2] (where simple overfitting is the real and more productive explanation) are deeply misleading, technically illiterate and stoking public fears in directions we should care about the least, siphoning resources and attention from real issues informed by specifics of contemporary ML (chiefly data and curriculum engineering). They're close to putting a lid on LLM development – because they're getting spooked by some vague sense of growing "capabilities"; even though LLMs are evidently a vastly better approach to safe-yet-capable-general-AI than anything any of their thought leaders have ever come up with in decades of SIAI/MIRI/Lesswrong procrastination. (For one example of a decent takedown of their approach I recommend [3]).

Why is this algorithm bizarre, or even inhuman? Could anyone show me the specific functional connectivity graph and activation functions in my brain that I use when doing addition, with the circuit going through spatial-quantitative cortex and phonological loop, populations querying cached results in long-term memory, individual numbers marked as interesting or banal by the highest-level associative units? Would it look anything like a neat and sensible algorithm we'd come up with using a global top-down representation of the solution target space?

By this standard, humans are inhuman, humans are shoggoths, as they should be, because emergent data-driven structures in general-purpose substrate generally do not resolve to anything like provable optimality, even if they get very close in performance. This framing adds nothing but clicks and insecurity.

1. https://twitter.com/robertskmiles/status/1663534255249453056

2. https://www.youtube.com/watch?v=zkbPdEHEyEI

3. https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objecti...


I don't think being bizarre or not adds much to the fear of alignment. Whether it is or not, the issue remains the same, i.e. we are not the ones controlling the learning process. There isn't really any solace in the other option (surprisingly human) in terms of alignment fears.

Let me put it this way. Unaligned intelligence is dangerous. You don't need a Sci-fi book to realize this, just a history one.

Our history is replete with examples of the slightly more technologically advanced group decimating their competition.

Humanity is directly responsible for the extinction of hundreds of species. Not because of any particular malice or offense, simply that our goals didn't align with the interest of any of those species.

Just looking at humans alone, unaligned intelligence is potentially catastrophic even when the advantage gulf is fairly low.

When the advantage gulf is large, it is potentially genocidal.

So personally, "Human SuperIntelligence" doesn't exactly warm the heart.


> Whether it is or not, the issue remains the same, i.e we are not the ones controlling the learning process

I think this is just uninspected, vague intuition. What does it mean to control the learning process? No, we control the data, the deterministic learning rule, the weights and activations; every step of the way is controllable – it's just that it's intractable to control it all. Moreover, if it were tractable, would we know what to do? How would that be different from the problem of programming an AI from scratch?

> There isn't really any solace in the other option (surprisingly human) in terms of alignment fears.

If there weren't, I suppose major AI doomers (Yudkowsky, Leahy, Bensinger etc.) wouldn't have been putting such an emphasis in their Lovecraft-inspired rhetoric on the alienness and inscrutability of those evil matrices of floating point numbers. No, the idea that human mental architecture is safer (and that DL does not approximate it) is very much at the core of alignment fears. E.g. Bensinger on why he expects AGI Ruin [1]: "We're building "AI" in the sense of building powerful general search processes (and search processes for search processes), not building "AI" in the sense of building friendly ~humans but in silicon … The key differences between humans and "things that are more easily approximated as random search processes than as humans-plus-a-bit-of-noise" lies in lots of complicated machinery in the human brain. … This doesn't mean the problem is unsolvable; but it means that you either need to reproduce that internal machinery, in a lot of detail, in AI, or you need to build some new kind of machinery that’s safe for reasons other than the specific reasons humans are safe."

> Let me put it this way. Unaligned intelligence is dangerous.

I would ask you not to condescend but I have learned that this is an impossible request with the AI risk crowd, because you never really encounter pushback in your para-academic hothouse and so come to believe that, indeed, pretty basic intuitions are knockdown arguments only silly people could dismiss. You define intelligence in a certain way that has nothing to do with how you identify it in the wild. The definition is something like "intelligence is an optimization process"; the thing under consideration can be an LLM or a diffusion model that seems really good at its job. You mean to say "processes shaping real-world outcomes that are not optimizing for outcomes people consider good are dangerous". This is true but trivial. The onus is on you to tie this consideration to intelligence in general or to "capabilities" of arbitrarily powerful ML models.

1. https://www.lesswrong.com/posts/eaDCgdkbsfGqpWazi/the-basic-...


>No, we control the data, the deterministic learning rule, weights and activations, every step of the way is controllable

In some instances, saying we control the data is pretty meh when we currently just chuck much of the entire web in as "data".

>Moreover, if it were tractable, would we know what to do?

Obviously we don't know what to do. Deep learning wouldn't be necessary otherwise.

>How would that be different from the problem of programming an AI from scratch?

You don't understand how knowing and personally implementing all the processes intelligent systems use to operate would be different, in terms of alignment?

>No, the idea that human mental architecture is safer (and that DL does not approximate it) is very much at the core of alignment fears.

I don't understand your obsession with underpinning your arguments with notes from the lesswrong crowd. There's a big AI safety statement signed by a lot of academics below who have nothing to do with lesswrong. Lesswrong isn't the be-all and end-all of alignment fears.

I really don't care about lesswrong. You obviously read the forum a lot more than I do.

>You define intelligence in a certain way that has nothing to do with how you identify it in the wild.

I define intelligence fine. Alignment fears are obviously geared towards General SuperIntelligence that can pursue general goals. Nobody thinks Stockfish or AlphaGo is going to end the world.

In recent months there's been a general push to grant LLMs more and more agency, embodiment and tool control. They're being "plugged" into everything. Palantir even has a military-use LLM they're excitedly showcasing. There are several popular repos that give control of your terminal and browser to LLMs.

The moment we built something showcasing strong general intelligence, we began to use it to make cognitive decisions and take actions in our stead.

Now an LLM browses the web for you. And as soon as June, with Windows, it'll control your computer for you as well.

I'm not advocating to stop any of this but I'm not delusional about how potentially dangerous the direction is.

If there's anything this recent LLM era has shown, it's that general AI won't need to escape any "box", because two sides will be missing.


> You don't understand what would be different from knowing and personally implementing

I am arguing that the absence of knowledge makes the absence of control irrelevant. We don't know how to do any better than the learning rule.

> I don't understand your obsession with underpinning your arguments with notes from the lesswrong crowd. There's a big AI safety statement signed by a lot of academics

This is incredibly disingenuous. Hinton or Bengio argue sophomoric ideas that have been discussed to death on Lesswrong decades ago. Your own link would have been impossible without Lesswrong tradition, it's literally written by lesswrongers on a sister community largely hijacked by lesswrongers. Everything of intellectual content that there is to alignment research is basically Lesswrong paradigm, you argue MIRI talking points too. Attempting to abandon it as a low-status association is just nasty.


>I am arguing that the absence of knowledge makes the absence of control irrelevant.

But it doesn't lol. Just because we don't know any better doesn't make the current way any less potentially dangerous.

>Your own link would have been impossible without Lesswrong tradition, it's literally written by lesswrongers on a sister community largely hijacked by lesswrongers.

This is the point I'm making lol. You knew this, I didn't. I came across something I found interesting and I shared it. That's it. I know alignmentforum even less than I know lesswrong.

>Attempting to abandon it as a low-status association is just nasty.

I'm not a lesswronger. I'm not trying to avoid anything, I'm just not one. I think you have issues you need to deal with. Good day.


I’m more on the side of not worrying about AI just yet, but seeing a lesswrong devotee complain about “para-intellectual hothouses” got a real chuckle out of me.


I am the opposite of lesswrong devotee. Chuckle at yourself for thinking that posts about "grokking" on alignmentforum.org are anything more than lesswrong with a fresh coat of paint. What next, cite intelligence.org for expertise on strong AI, without realizing it's the same Yudkowsky?


How can the title of the submission (which is just the title of the article) be informed by Miles' description of _the same article_ (which he posted today, nearly a year after the article was posted)?


I didn't personally change it, but the title of the post was different when I originally submitted. Didn't call it inhuman though, just bizarre, which it is... depending on your frame of reference.


"Tiny Transformer trained for addition learns bizarre addition algorithm" was the title.


That youtube video is basically a crime against humanity. I'm being hyperbolic, but only slightly. I have never seen such blatant misrepresentation of results used to "verify" entirely unrelated hypotheses. Anyone with even a mild understanding of deep learning can spot this; but clearly the audience for these videos is _not_ people with deep learning experience. Rather, it is the followers of the cult of alignment.

At least one aspect of why this is so popular has to be that deep learning has now gotten sufficiently close to the popular conceptions of what a "general AI" should look like according to science fiction. As such, laymen are using priors from e.g. Asimov (which deliberately simplifies things for poetic irony and "violations of the laws"), such that they feel like they are actually already equipped to handle these relatively advanced topics.


> buying bankrupt crypto podcasters

I haven't heard of anything like that happening; what is this referring to? Are you claiming that a podcast was secretly sponsored rather than natural?


Possibly related: “Observation of Phase Transitions in Spreading Activation Networks” (Science, 1987)

https://drive.google.com/file/d/1axt_8KfP2wd2p2cwMt5IGEfaekU...


Repurposing language to apply it as analogous labels is bad.

As a term of art it really has specific meaning.

Casual readers outside the cognoscenti will misunderstand it as having literal value, just as "hallucinating" and "learning" already are misunderstood.

Don't do this. Grokking does not imply intelligence or even understanding. There is no necessary introspection or inductive quality.


Sure ok, there's a bit of namespace conflict now, happens quite a lot. I also don't love it, but it's there now, and it's really not the worst name. It's a good area though, and the paper seems interesting. Do you have any real problems with it?


My only complaint is that it adds to the set of core concept-labels which imply things about AI which aren't correct.


Probably most words for which the origin is known started out as a metaphor.

Examples:

hallucinate - mid 17th century (in the sense ‘be deceived, have illusions’): from Latin hallucinat- ‘gone astray in thought’, from the verb hallucinari, from Greek alussein ‘be uneasy or distraught’.

induction - late Middle English: from Latin inductio(n- ), from the verb inducere ‘lead into’

Wikipedia:

> Metaphor: Change based on similarity between concepts, e.g., mouse "rodent" → "computer device".

> In historical onomasiology or in historical linguistics, a metaphor is defined as a semantic change based on a similarity in form or function between the original concept and the target concept named by a word.

> In cognitive linguistics, conceptual metaphor, or cognitive metaphor, refers to the understanding of one idea, or conceptual domain, in terms of another. An example of this is the understanding of quantity in terms of directionality (e.g. "the price of peace is rising") or the understanding of time in terms of money (e.g. "I spent time at work today").


I get this, but I worry that the metaphorical quality of "hallucinate" implies belief in consciousness. It may have surface qualities analogous to what we call hallucination in conscious beings, but intuiting backwards that, because we decide this AI is "hallucinating", it must be alive worries me.

In particular, that some theory of mind, in a concept derived from the AI contextual use of "hallucinate", could inform e.g. disease aetiology. I'd be delighted if our understanding of the mind and consciousness meant we had a theory which related to algorithmic hallucination, but I don't believe that's true. We've taken surface effects and (I argue mis-)labelled them. There is no demonstrated link between AI hallucination and why it happens, and what happens in a mind. Why? Because we don't entirely understand what consciousness is. So, we have weak theories, based on experiment and observation. AI models aren't (as far as I know) informing this.

Hallucination (in the AI sense of the term) is an emergent effect in these systems. And that doesn't imply anything about how it relates to real hallucination in real minds.

Grokking, for humans, is understanding: making intuitive connections, corollaries, inductive reasoning, abstraction, reapplication. It's strongly tied to cognition. I don't like this use of the term. While machine-generated outputs display wildly inappropriate judgement between facts and instantiate lies as syllogistic consequences, it's way, way off grokking as humans do it.

So ask yourself: do you think pop Sci pundits and journalists and thus politicians understand this? Or do you think they hear "hallucinate" and think it's evidence of belief it's consciousness?


Arxiv preprint:

https://arxiv.org/pdf/2301.05217.pdf

OpenReview.net review:

https://openreview.net/forum?id=9XFSbDPmdW

I'm skimming the paper and the OpenReview reviews [1]. The grokking claim is, as always, easily dismissed: it obtains on a small, artificial dataset when training a small, contrived model. There's no reason to get excited until results generalise to larger models and real-world architectures, trained on real-world datasets. Unfortunately this is a lot to ask because the "mechanistic interpretability" approach doesn't seem to have the power to analyse such mega-structures. So I don't think anyone has learned anything about "grokking" from the paper: those who want to believe, will still believe, those who want evidence, will not find evidence.

I'm more interested in the claims of algorithm learning. There's an obvious test that the authors haven't tried: hand-code the algorithm they reverse-engineered and compare it to the trained model.

To clarify, the authors claim that they reverse-engineered a certain algorithm for performing modular arithmetic off the weights of their trained neural net that was showing 100% performance on their training set. The algorithm, which they call the "Fourier multiplication algorithm", is outlined in figure 1 and further elucidated in Section 3.1. The algorithm goes over some frequencies, five of which seem to be predictive of most of the model's performance.

So both the algorithm and its parameters are known (or anyway assumed). In that case, it should be possible to hand-code the algorithm and run it [2]. Then, given some set of inputs X, taken from the experiments' test set, first a) test the algorithm's accuracy and then b) compare the algorithm's output to the outputs of the model, given the same inputs. The point of this is to check, respectively, that a) the proposed algorithm is a correct addition algorithm (and that it is an algorithm, i.e. it always terminates) and b) that its behaviour matches exactly the behaviour of the model it is claimed to be encoded in.
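
To be concrete, here is roughly what I have in mind (my own sketch of the Fourier multiplication idea from figure 1, not the paper's extracted weights; the particular frequency set is my assumption, any nonempty set of frequencies k with 1 <= k < p would do):

    import numpy as np

    p = 113
    key_freqs = [14, 35, 41, 42, 52]  # assumed for illustration; the paper reports its own five
    ws = [2 * np.pi * k / p for k in key_freqs]

    def fourier_add(a, b):
        logits = np.zeros(p)
        c = np.arange(p)
        for w in ws:
            # cos/sin of w*(a+b), built from the angle-addition identities
            cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
            sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
            # cos(w*(a+b-c)) peaks exactly when c == (a+b) mod p
            logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
        return int(np.argmax(logits))

    # (a) check correctness against ground truth on every input pair;
    # (b) one would then compare fourier_add's logits to the model's logits.
    assert all(fourier_add(a, b) == (a + b) % p
               for a in range(p) for b in range(p))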

Note that the authors base their certainty on the correctness of the algorithm, and its fidelity as an algorithmic encoding of their model, to a series of ablation studies. It's genuinely refreshing to observe this almost extinct species in the wild, but it is not enough. Not when there is a stronger test available (the one I point out above). I'm almost disappointed to see an ablation study, because it looks like an excuse to not go the whole hog.

In the same spirit, what is entirely missing is a proof of correctness of the algorithm. I have never seen this algorithm before and, eyeballing it, it "looks OK", but that's not how algorithmic correctness is normally ascertained. There's all this maths in the paper, but not a proof in sight. What's the point of maths then, to calculate large numbers?
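
For what it's worth, the correctness argument for the readout step seems short, at least as I read the algorithm (my notation, not the paper's). The logit for a candidate output c is

    L(c) = sum over key frequencies k of cos(2*pi*k*(a + b - c)/p)

Each term is at most 1, with equality iff p divides (a + b - c), since gcd(k, p) = 1 for prime p = 113. So L(c) is uniquely maximised at c = (a + b) mod p, the argmax readout is correct, and it trivially terminates. Whether this is faithful to what the network actually computes is of course the separate, empirical question above.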

_________________________________

[1] The paper was accepted for publication in ICLR 2023 so congratulations to the authors.

[2] To be perfectly clear about what I mean: I mean it should be possible to code the algorithm as a computer program in a language like, e.g., Python. The "algorithm" is an encoding of the trained model in a different, but equivalent form. It's now possible to encode it in yet another different, but equivalent, form, and run it, to perform "inference".


Implemented addition via:

      cos(wx) sin(wy) + sin(wx) cos(wy)

But how did it compute that intermediate addition?


What it learned to do was modular addition, not addition.


I think people should read at least slightly reputable ML papers instead of AI papers by crazy autodidactic AI experts who are making up all of their terms and won't stop telling you it's going to take over the world because of some math they did once.


Good news for readers who prefer reputable ML papers: this work was accepted to ICLR 2023! https://openreview.net/forum?id=9XFSbDPmdW


Thanks! What a self-own, ICLR Spotlight paper, that's close to the top of the reputability food chain!


tell me you're posting from an armchair without telling me you're posting from an armchair



