Notes on training BERT from scratch on an 8GB consumer GPU

dwrodri · on June 2, 2023

Super fascinating post. For those who go straight to the comments, it appears that someone managed to train a BERT model[1] to 90% of the GLUE score reported in the original BERT paper on a single GPU in ~100 hours. Note that this includes pre-training!

I can't find a clear source on time and compute used for the original BERT pretraining run, but it's clear that this is at least two orders of magnitude less hardware and roughly similar wall time.

I wonder how much of this could be translated over to the pretraining phase for a GPT?

I wonder if the SOPHIA[1] optimizer would also help here?

I'd argue that the research work being done to push these ML models into the realm of practicality on smaller hardware is just as important as the foundation that it relies on.

1: https://arxiv.org/pdf/2305.14342.pdf

regularfry · on June 2, 2023

Not just a single GPU, a very middle-of-the-road GPU. It's not like it's even a 40X0. A naive spec comparison says a 4090 would halve the training time for the same result, although I'm not 100% sure it would quite play out like that. It's Amdahl's law versus more GPU RAM, and I don't know which would win.
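To make the Amdahl's law point concrete, here is a minimal sketch. The GPU-bound fractions are hypothetical and were not measured on the run in the article; the 2x factor is just the naive spec comparison mentioned above.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of the runtime gets `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Amdahl's law: the untouched part of the runtime stays as it is,
    # the accelerated part shrinks by `factor`.
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical fractions of wall time that scale with raw GPU speed.
for fraction in (0.5, 0.8, 0.95, 1.0):
    speedup = amdahl_speedup(fraction, 2.0)
    print(f"{fraction:.0%} GPU-bound -> {speedup:.2f}x overall")
```

Only a run that is entirely GPU-bound gets the full 2x; anything spent on data loading or other fixed overhead pulls the overall number down. The sketch says nothing about what the extra GPU RAM would buy.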

lloeki · on June 2, 2023

> it appears that someone managed to train a BERT model[1] to 90% of the GLUE score reported in the original BERT paper on a single GPU in ~100 hours

Wondering what the GLUE score/training time curve would look like.

londons_explore · on June 2, 2023

indeed - 90% of the GLUE score might be nothing to get excited about if the last 10% is the difficult bit...

nighthawk454 · on June 2, 2023

That's not what's impressive here.

> While we can see that BERT-Base performed better at every task; the results for this model would have been very good (possibly SOTA for a few tasks) in early 2018.

SOTA - meaning no models of _any_ size at that time could do this. And they're achieving that now on a single consumer GPU with 1/30th the data and 1/40th the epochs. That's huge for improvements to training efficiency.

oersted · on June 2, 2023

The achievement of training a BERT model to 90% of the GLUE score on a single GPU in ~100 hours is indeed impressive. As for the original BERT pretraining run, the paper [1] mentions that the pretraining took 4 days on 16 TPU chips for the BERT-Base model and 4 days on 64 TPU chips for the BERT-Large model.

Regarding the translation of these techniques to the pretraining phase for a GPT model, it is possible that some of the optimizations and techniques used for BERT could be applied to GPT as well. However, the specific architecture and training objectives of GPT might require different approaches or additional optimizations. With the help of MirrorThink.ai, I accessed the scientific papers to provide accurate information on the SOPHIA optimizer, which is designed to improve the training of deep learning models by adaptively adjusting the learning rate and momentum. According to the paper [2], SOPHIA has shown promising results in various deep learning tasks. It is possible that the SOPHIA optimizer could help improve the training of BERT and GPT models, but further research and experimentation would be needed to confirm its effectiveness in these specific cases.

[1] https://arxiv.org/abs/1810.04805

[2] https://arxiv.org/pdf/2305.14342.pdf
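As a rough back-of-the-envelope check, the figures quoted in this thread (4 days on 16 or 64 TPU chips, versus ~100 hours on one GPU) can be turned into device-hours. This sketch only counts devices times hours; a TPU chip and a consumer GPU are not equivalent units, so the ratios are indicative at best.

```python
HOURS_PER_DAY = 24


def device_hours(devices: int, days: float = 0.0, hours: float = 0.0) -> float:
    """Number of devices multiplied by wall-clock hours."""
    return devices * (days * HOURS_PER_DAY + hours)


# Figures as quoted in the thread, not re-measured here.
bert_base = device_hours(16, days=4)      # 16 TPU chips for 4 days
bert_large = device_hours(64, days=4)     # 64 TPU chips for 4 days
single_gpu = device_hours(1, hours=100)   # one consumer GPU for ~100 hours

print(f"BERT-Base:  {bert_base:.0f} chip-hours ({bert_base / single_gpu:.1f}x the single-GPU run)")
print(f"BERT-Large: {bert_large:.0f} chip-hours ({bert_large / single_gpu:.1f}x the single-GPU run)")
print(f"Single GPU: {single_gpu:.0f} GPU-hours")
```

The wall time is nearly the same (96 hours versus ~100 hours); the difference is in how many devices were running for that time.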

CobaltFire · on June 2, 2023

Maybe I’m wrong (I haven’t used ChatGPT) but this really feels like the examples of that output.

Is that why it’s getting downvoted?

yencabulator · on June 2, 2023

It seems like the account started advertising the same company over and over, with generated content: https://news.ycombinator.com/threads?id=oersted

oersted · on June 2, 2023

It's just been quite useful in my own R&D work, so thought I would try it a few times today to connect the discussion to actual primary sources from research.

Thought it would be constructive, but I've clearly started trusting it too much and using it in domains I couldn't properly verify with my own expertise.

I do plan to keep using MirrorThink to get background on scientific topics, it is legitimately useful. But yeah, definitely not ready for actually generating proper replies.

GaggiX · on June 2, 2023

https://www.mosaicml.com/blog/mosaicbert

I think this is a useful article for people who want to train BERT from scratch for $20 (in this case by renting GPUs).

This model also has an actually good GLUE score.

nologic01 · on June 2, 2023

Conceptually there should be a predictable tradeoff curve between memory size and execution time. This could be quite useful (e.g. if 100hrs is ok, so are 200hrs), after all you don't train such a model every day.

But in practice this curve is probably very non-linear at the low end. This writeup shows nicely how lumpy the various steps (loading kernels, the model, etc.) are:

https://huggingface.co/docs/transformers/perf_train_gpu_one

PaulHoule · on June 2, 2023

Memory size mostly depends on the batch size. With a smaller batch size you usually learn more per sample but it takes more time per sample. When I was trying to make a foundation model for clinical notes around 2015 with LSTM I found a batch size of 1 worked best on a CPU but a much larger batch size was ideal in GPU. That is, the CPU would not be faster in terms of samples/second if I increased the batch size but the GPU would get better. What I really cared about was calendar time not sample efficiency, the CPU did best when I optimized for sample efficiency, the GPU sped up enough with large batches that I didn’t mind showing the network more samples.
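One standard way to trade time for memory along these lines is gradient accumulation: process several small micro-batches, sum their gradients, and update once. This technique is not mentioned in the comment above and nothing here says the article used it. The sketch below uses a toy one-parameter model (hypothetical data, pure Python) only to show that the accumulated gradient matches the large-batch gradient.

```python
def grad_sum(w, batch):
    """Sum (not mean) of d/dw (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def full_batch_grad(w, data):
    """Mean gradient computed with the whole batch in memory at once."""
    return grad_sum(w, data) / len(data)


def accumulated_grad(w, data, micro_batch_size):
    """Same mean gradient, computed one small micro-batch at a time."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        total += grad_sum(w, micro_batch)  # only one micro-batch is live here
    return total / len(data)


# Toy data for y = 3x + 1, and an arbitrary starting weight.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
w = 0.5

print(full_batch_grad(w, data))
print(accumulated_grad(w, data, micro_batch_size=2))
```

Both calls print the same value. The accumulated version needs memory for only two samples at a time but takes four passes instead of one, which is the memory-versus-time tradeoff in miniature.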

Havoc · on June 2, 2023

Nice to see movement in the 8GB space. Not as sexy as the bigger stuff but still matters. As a reminder, these are the Steam hardware survey stats:

6gb - 19%

8gb - 28%

12gb - 11%

rest, mostly <6gb

So staying below 8 helps open things up to a lot of tinkerers in a sorta democratic sense.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

Zemtomo · on June 2, 2023

That would only be true if gamers have a relevant overlap.

While it's a nice side bonus for people gaming, I would argue a better point is Nvidia's current strategy of starving their 40* line of memory, which makes it much easier to get an 8gb card than anything else.

I got a 4090 primarily for the 24gb.

Havoc · on June 2, 2023

If you want to encourage grassroots learning interest, millions of gamers with 8gb cards is a better starting point. Even if only a small percentage go for it, that's still a lot of people.

Meanwhile the number of people that have the luxury of shelling out for a 4090 straight away for AI specifically is pretty small.

It's a bit like raspberry pis helped the tinkering world even though they're realistically slow AF. (and now unobtainable sigh)

Zemtomo · on June 2, 2023

I totally agree.

I just wanted to highlight the Rtx memory issue.

d4rkp4ttern · on June 2, 2023

Impressive feat. Honest question though — What are the reasons to pay attention to BERT, when there is GPT4 for various use cases? Does it come down to cost, latency and privacy (vs using OpenAI API)?

Genuinely curious, if I am missing a compelling use case outside of those reasons.

simonw · on June 2, 2023

If we want to learn to build something better than GPT-4, research that shows how to train older models like BERT using a fraction of the hardware that was used when that model was originally trained is enormously valuable.

d4rkp4ttern · on June 2, 2023

Ok yes totally agreed!

rolisz · on June 2, 2023

BERT is 500x smaller than GPT3, it doesn't hallucinate (it might make wrong predictions, but it won't make unexpectedly wrong predictions), and it can run on a CPU in 500ms. I really like BERT style models (there are a couple of newer models that are better than BERT but are very similar).

d4rkp4ttern · on June 2, 2023

This is very interesting to know, thank you. I know there are some frameworks to easily switch between closed and open/local models, so I will have to compare how this does vs GPT-4

rolisz · on June 2, 2023

BERT is not a generative model (you might be able to force it to do that, but that's not what it's meant for). It's good for text classification tasks, for named entity recognition and other tasks like this

curiousgal · on June 2, 2023

I am sure a lot of work (and time) went into this but what's the point? If the model is not as good as the original then the applications are limited and the training time is moot.

TacticalCoder · on June 2, 2023

> I am sure a lot of work (and time) went into this but what's the point? If the model is not as good as the original then the applications are limited and the training time is moot.

If the several orders of magnitude (two in this case) improvement can be used to train other models, that's quite a thing no?

potatoman22 · on June 2, 2023

It might not be as good on the same benchmark as the original BERT, but pretraining means you can expose it to entirely new corpuses of text e.g. medical or obscure programming languages. This has the potential to make it much easier to create domain-specific adaptations of BERT.

bcatanzaro · on June 2, 2023

Does anyone know whether this training run used FP16 or BF16? This GPU has tensor cores which dramatically accelerate DL - if they are being used. The post doesn’t mention it.

sireat · on June 2, 2023

This is impressive indeed.

I trained BERT from scratch on Colab's K80s spending a few hours (not 100h as in the article) on a much smaller corpus.

My results were understandably rather horrible.

zakki · on June 2, 2023

If we do the training from scratch, what happens when the power goes down in the middle of training? Does the training have to be restarted from the beginning?

babas · on June 2, 2023

You usually save checkpoints at some interval. You can use them to continue the training.
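A minimal sketch of that checkpoint-and-resume pattern, using only the standard library. The file format, the `save_every` interval, and the toy "weight" update are all hypothetical stand-ins; a real run would also store optimizer, scheduler, and RNG state (in PyTorch, typically via `torch.save` and `load_state_dict`).

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, save_every=100):
    state = load_checkpoint(path)  # resumes where the last run stopped
    for step in range(state["step"], total_steps):
        state["weight"] += 0.01    # stand-in for a real optimizer step
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)   # final save at the end of the run
    return state
```

Calling `train("checkpoint.json", total_steps=1000)` again after an interruption picks up from the last saved step, so at most `save_every` steps of work are lost.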

ChuckNorris89 · on June 2, 2023

Should specify Nvidia GPU. AMD is notoriously absent from all this ML progress.

regularfry · on June 2, 2023

It is, but pay attention to which GPU. It's a 3060Ti - good, but not exactly top of the line.

ChuckNorris89 · on June 2, 2023

That wasn't my point.

My point was that many people want to buy or already own AMD GPUs with 8gigs or more and they're basically excluded from playing with the hip ML stuff unless they jump through insane hoops because all the hip ML stuff is done on Nvidia GPUs.

regularfry · on June 2, 2023

It wasn't your point, but it was mine. In terms of raw power this should be in range of AMD hardware. Needs porting work, yes, but I have a feeling the hoops ought to be less insane than all that. It's not "recompile with a different library", quite, but if you take a look at the sort of portability work going on in (for instance) llama.cpp you can see reasons for optimism.

giomasce · on June 2, 2023

Is that because AMD GPUs are technically inferior for ML, or because the tooling is only written for NVIDIA?

jacquesm · on June 2, 2023

A bit of both, NV has put a ton of work in their software libraries which makes doing some stuff on NV a viable option, you'd have to do a lot of work to get to an equivalent level on AMD.

But if all you need is PyTorch support then AMD may well be a viable option.

Zetobal · on June 2, 2023

Maybe AMD should start supporting their older cards in ROCm...

Trapais · on June 5, 2023

Maybe they should support their modern consumer cards in ROCm. Maybe their ROCm documentation should not suck balls.

I'd say there is a reason AMD is a laughing stock in ML, but it's incorrect. There's not a reason, there are tons of reasons to never touch it.

inawarminister · on June 2, 2023

Hopefully tinygrad's new drivers will fix this.

Also a lot of people were porting stable diffusion/Llama to AMD I think

2-718-281-828 · on June 2, 2023

i've always been curious about looking at the statistics of comments/points of hn posts. so this is up 10h, has 57 points and not a single comment so far. fascinating.

dwrodri · on June 2, 2023

For what it's worth, I think the lack of comments is due to the highly academic nature of the experiment displayed in TFA.

LLMs are all of the rage now, and the pre-training techniques used for BERT ended up laying the foundation for what became GPT pre-training runs. But I would assume even among HN users, there really are only a small handful of people interacting with this side of the NLP craze on a regular basis.

Compare that with the number of users who probably have at least dabbled in using a programming language, and suddenly you see why the opinions fly everywhere.

A lot of the practical aspects of training neural nets are still more "I'm doing it this way because Radford et al did it" or "when you picture the problem in three dimensions, this generally seems like it should work" instead of "This was demonstrated to be the best way and I can show you it's the best way from first principles." because quite frankly that's the best MO we have when dealing with the statistical principles guiding billions of matrix multiplies for trillions of different token sequences.

dx034 · on June 2, 2023

I'm interested in the topic but have no in-depth understanding. I like upvoting stories like these to revisit them after a day when there are hopefully some interesting comments and/or clarifications.

I really like the high signal to noise ratio in HN comments.

gpderetta · on June 2, 2023

So do I. Often I upvote interesting stories with few or no comments if I would like to read a discussion on the topic but don't have anything worth contributing personally.

Zemtomo · on June 2, 2023

US is still sleeping btw.