Understanding the generalization of ‘lottery tickets’ in neural networks (facebook.com)
98 points by hongzi on Nov 26, 2019 | hide | past | favorite | 16 comments


For anyone looking for a quick summary of what a lottery ticket is, how they're found, and some implications, here's what I remember from a talk I saw about lottery tickets at EMNLP 2019:

- The core hypothesis is that within "over-parameterized"[0] networks, a small subnetwork (a small set of weights) is often doing most of the work. A weight can be thought of as an edge in the neural network graph, and so subsets of weights can be thought of as subnetworks.

- You can find these subnetworks by initializing weights randomly, training for some number of steps, identifying the least-contributing weights, and then retraining from the same initial parameters as before, except with the least-contributing weights from the previous run zeroed out.

- People have observed that you can achieve something like 99% weight pruning with relatively little loss in performance. After that, things get very unstable.

- This has implications for understanding how neural nets do what they do and for shrinking model sizes.

This is all from memory, so forgive any errors.
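The loop described above can be sketched end to end on a toy problem. This is a hypothetical, minimal sketch (a plain NumPy logistic regression stands in for the network, and the 30% per-round pruning rate and all names are made up for illustration), not the paper's actual procedure:

```python
import numpy as np

# Toy version of iterative magnitude pruning with weight rewinding:
# a logistic regression stands in for the "network", and only 3 of
# its 20 weights actually matter for the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -3.0, 1.5]            # only the first 3 features matter
y = (X @ true_w > 0).astype(float)

def train(w0, mask, steps=500, lr=0.1):
    w = w0 * mask
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad * mask                 # pruned weights stay zero
    return w

w_init = rng.normal(scale=0.1, size=20)  # saved initialization ("the ticket")
mask = np.ones(20)
for _ in range(3):
    w = train(w_init, mask)
    alive = np.flatnonzero(mask)
    k = int(0.3 * len(alive))            # prune 30% of surviving weights...
    cut = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[cut] = 0.0                      # ...then rewind to w_init and retrain

w_final = train(w_init, mask)
acc = (((1.0 / (1.0 + np.exp(-(X @ w_final)))) > 0.5) == y).mean()
```

After three rounds, 7 of the 20 weights survive, and the rewound-and-retrained sparse model still classifies well, because the informative weights tend to have the largest magnitudes and so escape pruning.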

[0] - Networks have grown to enormous numbers of parameters lately, and there's reason to think that even before the era of 1B+ parameter networks, neural nets were over-parameterized. Why do we use such large networks then? For a given trained neural net, there might be a much smaller one that does the same thing, but in practice, it's difficult to get the same performance by training a smaller network. This may be because our typical optimization methods aren't well suited for finding these lottery tickets.


Maybe the additional parameters give the entire network more "leeway" to find such subnetwork structures, i.e. ease gradient descent by smoothing the loss landscape?


That's intuitive, but it doesn't explain the result that the lucky subnetwork (once found and re-initialized) trains faster and outperforms the original.


This does not seem to be a contradiction. Once you are in the right region of solution space training is expected to be faster and easier. Re-initialization could have a regularizing effect, explaining the better performance.


They re-use the same initialization, so it appears that the initial weights are inherently coupled with the nonzero structure.


> Why do we use such large networks then? For a given trained neural net, there might be a much smaller one that does the same thing, but in practice, it's difficult to get the same performance by training a smaller network.

Also, because it's cheap to run a network with a billion parameters. Lots of real-world applications are constrained not by compute, but by size of the training set.


Are the results different from just doing something like L1 regularization?


I've always wondered this and can never find a satisfactory answer online. You'd think that if ReLU works, then LASSO shouldn't present any problems either, right?

In addition, L2 tends to discourage sparsity by spreading out the influence of weights, which seems antithetical to the mission of pruning. (For example, if you run a ridge regression with two identical features, the L2 penalty will assign equal coefficients to both instead of zeroing one out like L1 does.)
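That ridge-vs-lasso behavior is easy to check directly. A small sketch with scikit-learn (the data and alpha values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Two identical features: L2 (ridge) splits the coefficient between
# them, while L1 (lasso) tends to zero one of them out entirely.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x, x])                      # duplicated column
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)   # roughly equal, each near 1.5
print(lasso.coef_)   # sparse: one coefficient carries the signal, the other ~0
```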


What does parameter mean in this context? Clearly we aren't looking at 1B+ 'features' in order to see which are predictive, right? That seems infeasible.


Parameters are what we update during training. One parameter is one scalar among the learned variables of the model.

Audio-visual inputs do sometimes end up having 1B+ individual values, but there isn't necessarily a 1:1 relationship between input size and parameter count. In many deep neural nets, most of the weights are internal. They're used to process the outputs of earlier layers of the network. This is where the term "deep" comes from.
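To make that concrete, here's a back-of-the-envelope parameter count for a small hypothetical MLP (the layer sizes are chosen arbitrarily): each dense layer contributes an in×out weight matrix plus a bias vector, and every one of those entries is a separately learned scalar.

```python
# Back-of-the-envelope parameter count for a hypothetical MLP: each
# dense layer has an (n_in x n_out) weight matrix plus n_out biases,
# and every one of those entries is a separately learned scalar.
layer_sizes = [784, 512, 256, 10]   # e.g. an MNIST-sized classifier

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    total += n_in * n_out + n_out   # weights + biases
print(total)   # 535818 -- far more parameters than the 784 input values
```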


How is this different from a sparse autoencoder?


While the concept of "compression" connects them, these are quite different types of thing.

In an autoencoder, you learn a representation for your data. So you'll get a function mapping from, say, an image to a representation vector. Then you can optimize for desirable qualities for this representation vector, such as sparsity. This will hopefully result in meaningful axes in the representation space, which kind of indicate the presence or absence of different aspects in the input (e.g. whether the input contains a face with glasses or without, etc.). This is an unsupervised approach and results in a (typically lossy) compressed version of the input.

In network pruning you take a previously trained (typically, but not always, in a supervised manner) network and remove individual parameters (weights) from it (i.e. set them to zero) while trying to preserve as much accuracy as possible. Here the trained network itself is compressed.
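A minimal sketch of that kind of magnitude pruning, assuming a random matrix as a stand-in for the trained weights:

```python
import numpy as np

# Minimal sketch of magnitude pruning: zero out the 90% of entries
# with the smallest absolute value in a (stand-in, random) trained
# weight matrix. The trained network itself is what gets compressed.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))               # stand-in for trained weights

threshold = np.quantile(np.abs(W), 0.9)   # 90th percentile of |W|
mask = np.abs(W) >= threshold
W_pruned = W * mask                       # keep only the largest ~10%
print(mask.mean())                        # fraction of weights kept, ~0.1
```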


It seems like that should be expected. The point is to have a ton of data with possible patterns to find and you have no idea which data links to what other data. You'd expect most data to not correlate. You'd expect strong patterns to emerge between a subset of the parameters. You'd expect to be able to prune off the useless parameters after the fact.


The next question would be: is it possible to compare across multiple lottery ticket networks, and look for commonality? Maybe the lottery ticket params all give rise to some small set of particularly efficient configurations that we can add directly to other networks?


The main question I have about these smaller, "lottery ticket" networks is that they are trained over and over on the same problem as the bigger network and evaluated on the same dataset, which leads me to believe the model will fail to generalize to different but still related problems. For example, if the model were trained on ImageNet and a winning ticket were found with ridiculous accuracy for a relatively small network, I would expect it to be heavily overfitted.


The article discusses this as its main theme and shows generalization across datasets.



