I get where you're coming from, but I think you're missing the point. The issue seems to be that even if good contributions are coming in, their benefit is far outweighed by the work and disruption caused by the large volume of spammy PR submissions.
As an open source maintainer, I'd gladly sift through a mountain of spammy PRs (heck, closing 4 per hour, as called out in the article, is almost zero trouble) if it yields even a handful of significant improvements, fixed issues, and potential future maintainers.
+1. I feel the same way. If one out of 20 drive-by contributors sticks around and becomes a regular, that would be a real win for me. (I'm currently maintaining a project with 20k GitHub stars and we have four regular contributors.)
A one-year snapshot of participation in that specific period isn't very relevant to what we are discussing, because it doesn't track sustained future contributions by those same users.
In theory they are. In practice the lack of a test dataset - and the lack of access to their dataset - means it's virtually impossible for a third party to make any significant contribution to the data processing code.
Such an effort would have to start with them volunteering a test dataset and/or schema.
I'm not missing that point; I'm asking whether it's true. The right way to answer that would probably be to find out how many of the new submitters from previous years went on to become valuable contributors.
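For what it's worth, here is a rough sketch of the kind of check I mean, assuming you can export (author, year) pairs for merged PRs from somewhere. The function name `cohort_retention` and the sample data are made up for illustration, and "contributed again in a later year" is only a crude stand-in for "valuable contributor".

```python
from collections import defaultdict


def cohort_retention(contributions, cohort_year):
    """Split first-time contributors of `cohort_year` into all vs. returning.

    `contributions` is an iterable of (author, year) pairs, one per merged PR.
    Returns (newcomers, retained) as two sets of author names.
    """
    years_by_author = defaultdict(set)
    for author, year in contributions:
        years_by_author[author].add(year)

    # Newcomers: authors whose earliest contribution falls in the cohort year.
    newcomers = {
        author
        for author, years in years_by_author.items()
        if min(years) == cohort_year
    }
    # Retained: newcomers who contributed again in any later year.
    retained = {
        author
        for author in newcomers
        if max(years_by_author[author]) > cohort_year
    }
    return newcomers, retained


# Made-up sample data, purely for illustration (not real project history).
sample = [
    ("alice", 2018), ("alice", 2019), ("alice", 2020),
    ("bob", 2018),
    ("carol", 2018), ("carol", 2020),
    ("dave", 2017), ("dave", 2018),
    ("erin", 2019),
]

newcomers, retained = cohort_retention(sample, 2018)
rate = len(retained) / len(newcomers) if newcomers else 0.0
print(f"{len(retained)} of {len(newcomers)} newcomers from 2018 came back ({rate:.0%})")
```

Running this per year on real data would give the retention number that this argument actually hinges on, rather than a single-year participation count.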