Super fascinating post. For those who go straight to the comments, it appears that someone managed to train a BERT model[1] to 90% of the GLUE score reported in the original BERT paper on a single GPU in ~100 hours. Note that this includes pre-training!
I can't find a clear source on time and compute used for the original BERT pretraining run, but it's clear that this is at least two orders of magnitude less hardware and roughly similar wall time.
I wonder how much of this could be translated over to the pretraining phase for a GPT?
I wonder if the SOPHIA[1] optimizer would also help here?
I'd argue that the research work being done to push these ML models into the realm of practicality on smaller hardware is just as important as the foundation that it relies on.
Not just a single GPU, a very middle-of-the-road GPU. It's not like it's even a 40X0. A naive spec comparison says a 4090 would halve the training time for the same result, although I'm not 100% sure it would quite play out like that. It's Amdahl's law versus more GPU RAM, and I don't know which would win.
> While we can see that BERT-Base performed better at every task; the results for this model would have been very good (possibly SOTA for a few tasks) in early 2018.
SOTA - meaning no models of _any_ size at that time could do this. And they're achieving that now on a single consumer GPU with 1/30th the data and 1/40th the epochs. That's huge for improvements to training efficiency.
The achievement of training a BERT model to 90% of the GLUE score on a single GPU in ~100 hours is indeed impressive. As for the original BERT pretraining run, the paper [1] mentions that the pretraining took 4 days on 16 TPU chips for the BERT-Base model and 4 days on 64 TPU chips for the BERT-Large model.
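For a rough sense of scale, here is a back-of-the-envelope sketch that turns the figures quoted above (4 days on 16 or 64 TPU chips, versus ~100 hours on one GPU) into device-hours. The helper name `device_hours` is made up for illustration, and a TPU chip-hour is not the same amount of work as a GPU-hour, so treat the ratios as a loose guide only.

```python
# Device-hours for the runs quoted in this thread. A TPU chip-hour and a
# GPU-hour are different units of work, so the ratios are only indicative.

def device_hours(devices: int, days: float) -> float:
    """Total device-hours for `devices` accelerators running for `days` days."""
    return devices * days * 24

bert_base = device_hours(devices=16, days=4)   # 16 TPU chips for 4 days
bert_large = device_hours(devices=64, days=4)  # 64 TPU chips for 4 days
single_gpu = 100.0                             # ~100 hours on one GPU (from the post)

print(f"BERT-Base:  {bert_base:.0f} chip-hours, {bert_base / single_gpu:.1f}x the single-GPU run")
print(f"BERT-Large: {bert_large:.0f} chip-hours, {bert_large / single_gpu:.1f}x the single-GPU run")
```

On these numbers the gap in raw device-hours is roughly 15x for BERT-Base and 61x for BERT-Large, before accounting for how much faster each TPU chip was than the GPU used here.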
Regarding the translation of these techniques to the pretraining phase for a GPT model, it is possible that some of the optimizations and techniques used for BERT could be applied to GPT as well. However, the specific architecture and training objectives of GPT might require different approaches or additional optimizations. With the help of MirrorThink.ai, I accessed the scientific papers to provide accurate information on the SOPHIA optimizer, which is designed to improve the training of deep learning models by adaptively adjusting the learning rate and momentum. According to the paper [2], SOPHIA has shown promising results in various deep learning tasks. It is possible that the SOPHIA optimizer could help improve the training of BERT and GPT models, but further research and experimentation would be needed to confirm its effectiveness in these specific cases.
It's just been quite useful in my own R&D work, so thought I would try it a few times today to connect the discussion to actual primary sources from research.
Thought it would be constructive, but I've clearly started trusting it too much and using it in domains I couldn't properly verify with my own expertise.
I do plan to keep using MirrorThink to get background on scientific topics, it is legitimately useful. But yeah, definitely not ready for actually generating proper replies.
Conceptually there should be a predictable tradeoff curve between memory size and execution time. This could be quite useful (e.g. if 100hrs is ok, so are 200hrs), after all you don't train such a model every day.
But in practice this curve is probably very non-linear in the low end. This writeup shows nicely how lumpy the various steps of loading kernels, model etc. are.
Memory size mostly depends on the batch size. With a smaller batch size you usually learn more per sample but it takes more time per sample. When I was trying to make a foundation model for clinical notes around 2015 with LSTM I found a batch size of 1 worked best on a CPU but a much larger batch size was ideal on a GPU. That is, the CPU would not be faster in terms of samples/second if I increased the batch size but the GPU would get better. What I really cared about was calendar time, not sample efficiency: the CPU did best when I optimized for sample efficiency, while the GPU sped up enough with large batches that I didn't mind showing the network more samples.
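One common way to trade memory for time is gradient accumulation: process a large batch as several small micro-batches, sum their gradients, and update once. Nothing in the thread says the post uses it; this is only a minimal sketch on a toy one-parameter model (`y = w * x` with squared error), with all function names invented for the example, showing that the accumulated gradient matches the full-batch one.

```python
# Toy model y = w * x with squared-error loss. Only one micro-batch needs to be
# "in memory" at a time, at the cost of more sequential steps.

def grad_sum(w, batch):
    """Sum of per-sample gradients of (w*x - y)**2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def full_batch_grad(w, data):
    """Mean gradient computed over the whole batch at once."""
    return grad_sum(w, data) / len(data)

def accumulated_grad(w, data, micro_batch_size):
    """Same mean gradient, accumulated one micro-batch at a time."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        total += grad_sum(w, data[start:start + micro_batch_size])
    return total / len(data)

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples, true w = 3
w = 0.5

print(full_batch_grad(w, data))
print(accumulated_grad(w, data, micro_batch_size=2))
```

The update is the same either way; what changes is peak memory and how many sequential passes it takes, which is the memory-versus-time tradeoff being discussed.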
That would only be true if gamers have a relevant overlap.
While it's a nice side bonus for people gaming, I would argue a better point is the current Nvidia strategy to starve their line of 40* with memory, and that it's much easier to get an 8gb card than anything else.
If you want to encourage grass roots learning interest, millions of gamers with 8gb cards is a better starting point. Even if only a small percentage go for it that's still a lot of people.
Meanwhile the number of people that have the luxury of shelling out for a 4090 straight away for AI specifically is pretty small.
It's a bit like raspberry pis helped the tinkering world even though they're realistically slow AF. (and now unobtainable sigh)
Impressive feat. Honest question though —
What are the reasons to pay attention to BERT, when there is GPT4 for various use cases? Does it come down to cost, latency and privacy (vs using OpenAI API)?
Genuinely curious, if I am missing a compelling use case outside of those reasons.
If we want to learn to build something better than GPT-4, research that shows how to train older models like BERT using a fraction of the hardware that was used when that model was originally trained is enormously valuable.
BERT is 500x smaller than GPT3, it doesn't hallucinate (it might make wrong predictions, but it won't make unexpectedly wrong predictions), it can run on a CPU in 500ms. I really like BERT style models (there are a couple of newer models that are better than BERT but are very similar).
This is very interesting to know, thank you. I know there are some frameworks to easily switch between closed and open/local models, so I will have to compare how this does vs GPT-4.
BERT is not a generative model (you might be able to force it to do that, but that's not what it's meant for). It's good for text classification tasks, for named entity recognition and other tasks like this
I am sure a lot of work (and time) went into this but what's the point? If the model is not as good as the original then the applications are limited and the training time is moot.
> I am sure a lot of work (and time) went into this but what's the point? If the model is not as good as the original then the applications are limited and the training time is moot.
If the several orders of magnitude (two in this case) improvement can be used to train other models, that's quite a thing no?
It might not be as good on the same benchmark as the original BERT, but pretraining means you can expose it to entirely new corpuses of text e.g. medical or obscure programming languages. This has the potential to make it much easier to create domain-specific adaptations of BERT.
Does anyone know whether this training run used FP16 or BF16? This GPU has tensor cores which dramatically accelerate DL - if they are being used. The post doesn’t mention it.
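For anyone unsure why the FP16 vs BF16 distinction matters, here is a small stdlib-only sketch of the numeric difference: FP16 has more precision but a narrow range, while BF16 keeps the float32 range with fewer mantissa bits. The helpers `to_fp16` and `to_bf16` are made up for this example, and the BF16 emulation truncates instead of rounding, which is a simplification. It says nothing about which format the post actually used.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits, 7 mantissa bits) by keeping only the
    top 16 bits of the float32 encoding. Truncation is a simplification;
    hardware typically rounds to nearest."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Precision: FP16 keeps more digits near 1.0 than BF16 does.
print("fp16(1.001) =", to_fp16(1.001))
print("bf16(1.001) =", to_bf16(1.001))

# Range: 70000 is above the FP16 maximum (65504) but well within BF16's range.
try:
    to_fp16(70000.0)
except OverflowError:
    print("70000 does not fit in fp16")
print("bf16(70000) =", to_bf16(70000.0))
```

The narrow FP16 range is why FP16 training usually needs loss scaling to keep gradients representable, whereas BF16 generally does not.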
If we train from scratch, what happens if the power goes out in the middle of training? Does the training have to be restarted from the beginning?
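The usual answer is periodic checkpointing: save the training state every so often and resume from the last save after a crash. Below is a toy, stdlib-only sketch of the pattern; the function names, the JSON format, and the single `weight` standing in for model and optimizer state are all assumptions for illustration, not how the post does it. In PyTorch the saved state would typically be the model and optimizer `state_dict()`s.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint

def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}

def train(path, total_steps, save_every=10, stop_at=None):
    """Toy loop; `stop_at` simulates losing power at that step."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            return state  # "power loss": progress since the last save is gone
        state["weight"] += 0.5  # stand-in for one optimizer update
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "ckpt.json")
    train(ckpt, total_steps=100, stop_at=37)         # interrupted at step 37
    print("resuming from:", load_checkpoint(ckpt))   # last save was at step 30
    print("finished at:", train(ckpt, total_steps=100))
```

Only the work since the last checkpoint is lost, so the save interval is a tradeoff between I/O overhead and how much training you are willing to redo.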
My point was that many people want to buy or already own AMD GPUs with 8gigs or more and they're basically excluded from playing with the hip ML stuff unless they jump through insane hoops because all the hip ML stuff is done on Nvidia GPUs.
It wasn't your point, but it was mine. In terms of raw power this should be in range of AMD hardware. Needs porting work, yes, but I have a feeling the hoops ought to be less insane than all that. It's not "recompile with a different library", quite, but if you take a look at the sort of portability work going on in (for instance) llama.cpp you can see reasons for optimism.
A bit of both, NV has put a ton of work in their software libraries which makes doing some stuff on NV a viable option, you'd have to do a lot of work to get to an equivalent level on AMD.
But if all you need is PyTorch support then AMD may well be a viable option.
i've always been curious about looking at the statistics of comments/points of hn posts. so this is up 10h, has 57 points and not a single comment so far. fascinating.
For what it's worth, I think the lack of comments is due to the highly academic nature of the experiment displayed in TFA.
LLMs are all of the rage now, and the pre-training techniques used for BERT ended up laying the foundation for what became GPT pre-training runs. But I would assume even among HN users, there really are only a small handful of people interacting with this side of the NLP craze on a regular basis.
Compare that with the number of users who probably have at least dabbled in using a programming language, and suddenly you see why the opinions fly everywhere.
A lot of the practical aspects of training neural nets are still more "I'm doing it this way because Radford et al did it" or "when you picture the problem in three dimensions, this generally seems like it should work" instead of "This was demonstrated to be the best way and I can show you it's the best way from first principles." because quite frankly that's the best MO we have when dealing with the statistical principles guiding billions of matrix multiplies for trillions of different token sequences.
I'm interested in the topic but have no in-depth understanding. I like upvoting stories like these to revisit them after a day when there are hopefully some interesting comments and/or clarifications.
I really like the high signal to noise ratio in HN comments.
So do I. Often I upvote interesting stories with few or no comments if I would like to read a discussion on the topic but I don't have anything worth contributing personally.
1: https://arxiv.org/pdf/2305.14342.pdf