The social side of science seen in the research on programming language quality (hillelwayne.com)
137 points by hwayne on March 10, 2020 | 84 comments


Programming and programming languages are in large part a subject for the social sciences anyway. They are made for the needs of humans and used by humans in groups, with practices and methods addressing the particularities of humans and human groups. Not ordinary humans, but humans indeed. Programming is probably more of a social science than mathematics or other human-independent matters: a social science that draws on other disciplines, similar to economics, or at the very least an interdisciplinary subject.


So true! Worth noting that Larry Wall - creator of Perl - is a linguist, and it shows in how the language permits (and encourages) diverse modes of expression ;-). Also, Yukihiro Matsumoto's motivation in creating Ruby was focused on fulfilling a basic human need - to bring programmers happiness.


Python binds semantic structure to visual appearance (indentation defines blocks)


I just cannot agree with this concept. By the standard you propose, architecture, for instance, is a social science, and I’d argue most non-computer engineering fields would be classified as social sciences as well. Computer Science is by tradition applied mathematics; at least the theoretical courses of study treat it so. But there is a reason Computer Engineering is the predominantly understood title of the overarching software field. Programming languages do not exist in some philosophical vacuum; they are created to be implemented in the real world, on real hardware, and deal with the very real physical capabilities and drawbacks of that real-world hardware. While programming, and software design as it is generally practiced, is not as rigorous as other engineering fields, it is the application of the sciences to create or implement technology. So I just cannot agree that programming languages and programming are in any way a social science.

Also, just to be clear, there can be social science fields which inquire into aspects of the engineering field of Computer Science, but that doesn’t affect the underlying character of the field.

Also, also, I would strenuously argue against the proposition that mathematics is in any way ‘human independent.’ But that argument is not entirely relevant.


Architecture is very much about culture. I'm not sure I'd call it a social science, but the science part of architecture - making sure that buildings don't fall down - is left to structural engineers.

If you read how architects justify their designs, they're full of references to culture, philosophy, and occasionally - too rarely - practical research into how people move through and live/work in spaces.

It's a social activity in that architects design for other architects. Clients are secondary.

CS is actually no different, it's just far less culturally self-aware. There's nothing objective about CS. It grew out of a specific philosophical current, and it could easily have gone in other directions, emphasising other kinds of entity relationships - e.g. less atomised and sequential, more contextual and associative. That in turn would have influenced the design of the hardware.


You can say the exact same thing about mathematics.


> But there is a reason Computer Engineering is the predominantly understood title of the overarching software field.

This is not true. The software side of the field also falls under computer science. Computer engineering is understood to mean the engineering of computers. Someone with a degree in computer engineering will have a strong electrical engineering background as well as digital circuit design, plus the lower-level side of software (operating systems, device drivers, general systems).


I half agree and half disagree.

I’m assuming that by architecture you’re referring to system/enterprise architecture and not, e.g., CPU architecture. For a given system, the viable solution space is massive, and the decision on which approach to use is a mixture of science (analyzing and meeting numerically-defined requirements), art (meeting soft requirements), politics (making everyone happy about the unquantifiable choices), and taste (choosing between different similar solutions when no other constraint makes one better than the other). Two organizations facing the exact same problem can come up with two vastly different solutions that both effectively solve the same problem.

How those decisions get made is, I believe, an extraordinarily interesting social sciences question!

There are huge swaths of computer science that do, to me, fall under the umbrella of “hard science”, but not all of it. Philosophy, taste, and social dynamics play a huge role in “industrial computer engineering”, and to try to pretend that’s not the case is setting ourselves up for failure.


This is a great take, and one that I wish framed more of the research in programming languages than is currently the case.


Where I see social sciences is in the formalization of ontologies for specific applications or in laying the groundwork for general AI. Perhaps some thought will help us go from higher concepts of knowledge to imperative languages, which most programming languages still are. We often talk about learning systems that gain knowledge and improve themselves. But there is probably more to it than adding another table row. Granted, that is probably very hard.

That said, I think the article mentions science being "social" in the context of how we weigh scientific research depending on the author's standing and expression, not the absolute truth behind it. Which actually is probably quite a shame.


I wonder why you decided to contrast programming and programming languages with mathematics. Your second sentence makes perfect sense if you replace programming with math.


Wow, what a bad idea. Do Haskell and PHP get used for the same sort of projects? Or Erlang and Ruby? No, they get used for very different sorts of projects. You would need to only compare similar kinds of projects (e.g. web frameworks, or statistical analysis, or device drivers) to get anything useful out of this data source. But, that's probably not possible, because in most cases there aren't enough projects in very many languages. This also assumes that bugs are all equally likely to get FOUND in all languages. What if bugs are easier to find in language X than in language Y? Then you will get MORE, not fewer, commits that mention fixing them in language X, which is the opposite of what you would want the signal to indicate. I could go on, but there's no need, as even these points are enough to totally invalidate the premise of this research. Getting into the weeds of the analysis of the original study or the rebuttal, is not going to give you anything useful, if the original premise that you are measuring how language compares to bugs is wrong, and it almost certainly is.


Confounding variables don't invalidate research, they merely inform us on how to weigh the evidence, and shape future research.

Studies can have interesting conclusions even in the presence of confounding variables. Sometimes the state of the art in a certain hypothesis space hasn't progressed past one of these yet, but that's still better than gut feeling.

In some areas like public policy research you will never completely eliminate all of these factors, but that's no reason not to attempt to create knowledge.


Sure, you will never completely eliminate all of these factors. But this is far, far short of "not completely eliminated". This is "the confounding variables are almost certainly much larger effects than what you are trying to measure". The difference between Python, Ruby, and PHP in making a web framework is much, much smaller than the difference between web frameworks, statistical analysis packages, machine learning, and devops scripts. Python is used for the last three; Ruby and PHP much less so. The difference between Python and Ruby and PHP is therefore going to be dominated by the kinds of tasks they are used on, and the difference between the languages is small compared to that.

Smokers are routinely excluded from analysis of cancer rates, for a similar reason; the effect is so large that if (for example) you're looking at diet, and the difference in diet is correlated with a difference in smoking rates, any findings are invalid.
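
To make the confounding argument concrete, here is a minimal sketch with entirely made-up numbers. The language names, task domains, and commit counts are all hypothetical and are not taken from the study or from GitHub. Both languages are given identical bug-fix rates within each domain, yet the pooled rates differ because the two languages are used for a different mix of work.

```python
# Hypothetical counts: (bug_fix_commits, total_commits) per language and domain.
# The per-domain rates are deliberately identical for both languages.
commits = {
    "lang_a": {"web": (100, 1000), "ml": (900, 3000)},
    "lang_b": {"web": (300, 3000), "ml": (300, 1000)},
}


def pooled_rate(by_domain):
    """Bug-fix rate when all domains are lumped together."""
    fixes = sum(f for f, _ in by_domain.values())
    total = sum(t for _, t in by_domain.values())
    return fixes / total


def per_domain_rates(by_domain):
    """Bug-fix rate computed separately for each domain."""
    return {domain: f / t for domain, (f, t) in by_domain.items()}


pooled = {lang: pooled_rate(d) for lang, d in commits.items()}
stratified = {lang: per_domain_rates(d) for lang, d in commits.items()}

for lang in commits:
    print(lang, "pooled:", pooled[lang], "per domain:", stratified[lang])
```

In this toy data the pooled comparison makes lang_a look worse, even though within each domain the two are indistinguishable. The whole gap comes from lang_a being used mostly for the domain with the higher bug-fix rate.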


I don't think GitHub mining is state of the art, and there are opportunities, as the article says, to separate the signal from the noise. I agree in the general sense, but there should be some lines drawn at which point we can say research is invalid so people don't run away with the results.


Yeah, the original study is clearly so methodologically unsound that its conclusions shouldn't be repeated.

But frankly, most of the posts here are people almost calling the original authors heretical for even trying to data mine commit messages.


> What if bugs are easier to find in language X than in language Y?

I am half with you on this, but think of SQL: it's probably harder to find bugs in SQL than in an imperative language, but it's also a lot harder to create bugs. It's a declarative language, so it doesn't allow you to create whole classes of bugs that you will see in an imperative language.

Personally I try to push logic towards the database for the very reason that it's less buggy in general in my experience.


Logic in the database has been less buggy in my experience too... except the time when someone had the idea to push a lot more logic into that layer, at which point it became equally buggy, and much harder to debug.


At the very end of this grand tour:

> Science has its problems, but it’s still the best we got.

If you're doing particle physics and looking for p < 0.0000003, then I would absolutely agree. Science is a great tool for the kinds of investigations of the natural world which can be understood by science. If you're doing social science by scraping some data from the internet and looking for p < 0.05, that's a very different story.

I'm not actually sure what aspect of this qualifies as "science", besides using p-values to analyze the data. Even if all the categorization and sampling and analyses were perfect, it's using commit messages and known bugs with known fixes, all of which are essentially self-reported.
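
For a sense of scale between those two thresholds, here is a rough sketch. It assumes the particle-physics figure is the conventional "five sigma" threshold read as a one-sided tail of a normal distribution; that reading is an assumption, and the function name is just illustrative.

```python
import math


def one_sided_p(sigma):
    """Upper-tail probability of a standard normal beyond `sigma` standard deviations."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))


p_five_sigma = one_sided_p(5.0)    # roughly the p < 0.0000003 threshold
p_social = one_sided_p(1.645)      # roughly the p < 0.05 threshold (one-sided)
ratio = 0.05 / p_five_sigma        # how much looser the 0.05 cutoff is

print(f"5 sigma:     p = {p_five_sigma:.3e}")
print(f"1.645 sigma: p = {p_social:.3f}")
print(f"0.05 is about {ratio:,.0f} times looser than 5 sigma")
```

Under that assumption the 0.05 cutoff is looser by a factor of more than a hundred thousand, and that is before asking whether the data itself is any good.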


It's worth remembering that even physics isn't immune to this. It doesn't hit headlines often, but there's certainly been criticism and dialogue orbiting the various domains of physics that exist at a largely theoretical, impossible-to-disprove level. At the same time, the history of physics has actual examples of situations where theory was decades ahead of the discoveries that would facilitate validating the claims and predictions being made. It all boils down to whether an approach or school of thought provides utility or other value beyond what's currently available.

Ironically, philosophy had almost the opposite problem at one time: the concern that theory was gaining precision but becoming unhelpful for formulating comprehensible statements (in the non-mathematical sense) about the human experience.

The greater problem is the politicization of science, for which I'll use Gender Studies as the example, though this is not to say Gender Studies is the only politicized field nor that it is the most egregious offender in any way. Having a cohesive theory of gender is useful, especially for comparing and contrasting different cultures, even if it's at a level that's superficial compared to other fields like physics. At the same time, some social movements have realized that they can use aspects of gender theory to grant themselves legitimacy while leveraging the exclusionary nature of academia to silence criticism and largely control the public dialogue.

A good example of the utility gained from having a theory of gender is the various African refugee groups who have come to the United States over the years. I had the luck to interview a psychologist who did work helping to evaluate and integrate some of The Lost Boys of Sudan. Many of the Lost Boys exhibited signs of mental trauma, and her role in helping them integrate into American life was trying to provide relief for this trauma. She noted that a major challenge initially was that there was a common mindset, stemming from a mix of age and a group outlook rooted in the need to survive, where personal problems weren't viewed or even necessarily understood the way they are in the U.S. The result was that many of the refugees she was working with not only didn't see the point in many of her questions (and by extension how to answer them), they also had difficulty engaging with her at all in a manner conducive to therapy as it's practiced in the U.S.

Handwaving a lot for brevity, her initial approach was to explore a trait she had observed and viewed as a quiet but severe sort of masculinity that was extremely common amongst the boys she was working with. Making early progress was largely about coming to understand this quality better and speaking to it in ways that introduced new concepts by breaking them down into smaller ideas which could then be explored by discussing them as contrasts. Is 'quiet masculinity' an especially scientific descriptor? Probably not, but it was nonetheless useful to her as an ad-hoc compass concept for breaking down barriers to providing effective therapy and it was made possible by synthesizing her understanding of gender theory with her broader knowledge of human psychology.

Contrasting the previous anecdote is my experience with undergraduate Gender Studies courses which are often a mix of course work on theoretical fundamentals and lecture presentations. The presentations are rarely outright superfluous from what I've observed but it'd be dishonest to say that they aren't also laced heavily with normative political statements masquerading as sensibility and which offer nothing extra to the lecture in terms of information or utility. This sort of soapbox sophistry is unhelpful because when it's eventually brought to the spotlight it erodes the public's trust in academia which is often too compartmentalized and arcane for a layperson to see where one subfield ends and another begins.

This is, in my opinion, why the social sciences in general make for such an easy punching bag: it quickly becomes unclear why in one situation an idea is treated as ridiculous while in another it might be presented as a key component of some broader framework.

The penultimate point to all of this being, it's because science deals with running into hard problems that it's important to recognize when we're mostly paving familiar territory and when we're operating mainly with ideas that are 'the best we've got'. The latter is no less helpful in the face of total ignorance but it can be much more vulnerable to biases of perspective and culture. Vigilant introspection is therefore a constant necessity.


I hate to respond to an essay with a nitpick, but "penultimate" means "second last." If there is an ultimate point that follows your penultimate point, it's not clear where it begins.


From my POV the real problem is that there are phenomena that apparently cannot be studied scientifically. Like Reiki. Now I'm not a scientist but I am a huge proponent of science. Carl Sagan is among my personal heroes, okay? I am in favor of science over superstition and irrationality.

One day I encountered a phenomenon that is as real and physically dramatic as, say, the electric field of a Van de Graaff generator. Being a fan of science I went looking for the math/science/engineering for this phenomenon. Long story short, after a few years of research, I discovered that: 1) this phenomenon has been known since antiquity in a pre-scientific way; 2) several researchers had independently (re-)discovered it multiple times going back about 2.5 centuries; 3) each time it had achieved a certain notoriety but then been debunked and subsequently ignored or forgotten. The first was Mesmer & "animal magnetism", but there are several others including Baron Von Reichenbach & the "Odic force", and Wilhelm Reich & "Orgone". This is repeating today with several kinds of "energy healing" including Reiki (which is distinguished by being Japanese and unabashedly theistic (Rei means Divine)) where practice is widespread, but lacking scientific validation. As I say, I've felt Reiki, so I know it's real, but so far I've personally never been able to interest any scientists in studying it, despite offering to demonstrate it. And the studies that have been done haven't helped. The Wikipedia article for Reiki just straight up calls it pseudoscience.


To the extent that Reiki or any other alternative medicine technique actually does anything, it's easily explainable by the placebo effect.

You know what we call alternative medicine that works? Medicine.


I've heard that joke before, and it's not as funny as you might imagine.

But anyway, well, see, there you go. Meaning no disrespect, my whole point is that you and others like you cannot or will not "see" this thing, yet it's definitely "there".

I think it's got to be the "social side of science" somehow preventing otherwise-rational minds from seeing what's right in front of them (even during a scientific study of Reiki.)


I disagree: I've known some poorer folks, who swear by Reiki, which if I understand correctly is massage (proven in studies to accelerate certain types of healing) and maybe some herbal rubs. I and my rich coworkers, with good insurance, go to professional physical therapists and get real, proven Medicine, which mainly consists of massage and some prescription topicals.

I need neither the placebo effect nor mystical energy to explain Reiki. Ditto for most alternative medicines I've seen friends swear by.


The placebo effect itself is fairly spooky.


Not really.

It's all Descartes' fault, essentially, for postulating the "mind" as distinct from matter.

More specifically, placebo pain-relief has been associated with endogenously produced opioids for approximately 40 years now (1978, I believe).

More recent research suggests that much of the variance in the effects of pain-killers can be associated with these endogenously produced opioids (when the patient was not aware of the pain-relief, responses were similar).

So essentially, the placebo effect is just a normal part of your body responding to medical treatment, regardless of the efficacy of said treatment.

I'm talking about pain because most of the research is on pain, but the principles are similar wherever placebos tend to be effective (not against infections, sadly, but pain and depression seem to be extremely placebo-responsive).


How does feeling Reiki establish it is real? What sort of objective effect did it have?


> How does feeling Reiki establish it is real?

That's a good question. The subjective sensations correlated with various external events. What I mean is, I satisfied myself that I wasn't hallucinating.

> What sort of objective effect did it have?

Accelerated healing.

I've never tried to make a physical instrument to measure "chi", but the subjective effects of it on living tissue would be measurable with scientific instruments, I'm sure.

- - - -

edit to say: I don't want to hijack the thread but if you're genuinely interested AMA.


>> What sort of objective effect did it have?

> Accelerated healing.

Clinical trials of products designed to accelerate healing are so difficult to do that few products are developed with such an endpoint. Your own experience may satisfy you, and that is great, but it's extremely difficult to set a baseline against which to measure any speed up in "healing".


I have only anecdotes.

I can hold my hand a centimeter away from another person's hand and "channel" Reiki to them and they can feel it clear up to their elbow. But I don't know how to make a scientific instrument to detect that.

(BTW, it is great! I was careless taking a pan out of the oven a few months ago and burned my thumb pretty good. I "beamed" it with Reiki and the pain went away and didn't come back. "So I got that going for me, which is nice."

Quoting myself from back then:

> I burned the tip of my thumb the other day taking a pan out of the oven, 450°F. I was careless and didn't quite hold the potholder correctly, burn reflex overridden so I wouldn't drop the pan, it melted the fingerprint whorls. After I finished hopping up and down, pissed off at my own carelessness, I took a few moments to "beam" Reiki to my thumb, while concentrating on communicating with my healing and immune responses telling them to "get in there and clean things up". The pain went away and didn't come back. A very modest blister formed under the melted skin. The spot was tender, and it stung a bit if I let hot water touch it the first day, but after that there was no tenderness. Two days later the blister was gone. The skin is slightly rough, but it's not peeling, nor is it a scab. Day three I woke up and forgot that I had burned myself. The flesh apparently reconstituted itself. I'm not sure it will even leave a scar. [It didn't.] The skin still hasn't peeled.

> In sum, a minor second-degree burn, apply ten seconds of Reiki, result: no pain and accelerated healing.)


Can you explain in more detail the whole experience, along with how the effect was activated? Personal experience counts for something, since that's how we know everything else :)


My initial "initiation" (that's the jargon used) was unexpected. I was in Seattle, walking down the street, and I passed one of those shops that sells crystals, incense, and Tarot decks. On a whim I went in, and there was a sign about Reiki classes downstairs. I went downstairs and there was a kind of class going on, and the folks in the class were having some sort of exam or test. The teacher of the class looked like if Santa Claus ditched the red reindeer-hide coat and dressed as a pirate. He grabbed me and sat me down in a chair and some woman started waving her arm at me from about a meter away.

I felt a certain ineffable something "filling me up" with a kind of immaterial pressure or presence. I could feel it within the volume of my body and (somehow) extending into the space around me. It's difficult to explain in words what it feels like; people describe both warmth and coolness, presence and emptiness, tingling and relaxation, sometimes in combinations. The best I can do is that it felt like a high-tension electric field if you could somehow feel the field directly. Anyway, she started at the base of my spine and went up the center line to the top of my head, at which point I felt a flower- or fountain-shaped field extend out from the crown of my head into the space around my body. Simultaneous with this I experienced a deep psychological change, comparable in intensity to a modest dose of "magic mushrooms" or LSD, that was deeply pleasant and healing. At that time in my life I was suffering from severe depression and now it lifted and I felt wonderful for the rest of the evening. The effect wore off but the depression was never as severe again after that.

So that was my first experience with this thing called Reiki. Nobody had told me anything. I had no beliefs about it going in. And it sure wasn't the "placebo effect" (whatever that is.) I had no idea what to expect, or even to expect anything at all.

After that I found a group that practiced and taught Reiki without charging the traditional fees and worked with them. I also, as I said, went looking for anyone who had "done science" to it, with the results I mentioned above.

There's a lot of variation (in my experience) between different Reiki sessions even with the same people. Most are pretty mellow, and other than a deep and abiding sense of peace and health there's not much you can point to and say, "there, that's the effect".


I can believe what you are saying. I know people who've experienced pretty incredible healing. I've healed quickly from running injuries, but I don't know what to attribute it to. Anyways, I find your account believable.


Cheers! Thank you. :-)

(Really, all I want or expect is for folks to not insult me to my face for sharing the experience. It's not like I'm claiming I rode in a UFO or had dinner with Bigfoot, y'know?)


It is strange how aggressive people can be against such ideas.


I'm confused. So the original paper was shown to be flawed, but then the rebuttal was also shown to be flawed. But this article argues the flaws in the rebuttal were not as glaring as the flaws in the original paper?

But the approach of counting commits on GitHub just seems fundamentally flawed. The first step should be to show that the GitHub dataset can be used to say anything about quality and productivity of things like language choice. Arguing about how to classify commit messages seems pointless without this foundation.


He talks about that in the end. Here’s the relevant section:

Is this even a good idea?

We’ve just spent several thousand words on methodology to show how the original FSE paper was flawed. But was that all really necessary? Most people point to a much bigger issue: the entire idea of measuring a language’s defect rate via GitHub commits just doesn’t make any sense. How the heck do you go from “these people have more DFCs” to “this language is more buggy”? I saw this complaint raised a lot, and if we just stop there I could have skipped rereading all the papers a dozen times over.

Jan Vitek emphasizes this in his talk: if they just said “the premise is nonsense”, nobody would have listened to them. It’s only by showing that the FSE authors messed up methodologically that TOPLAS could establish “sufficient cred” to attack the premise itself. The FSE authors put a lot of effort into their paper. Responding to all that with “nuh-uh your premise is nonsense” is bad form, even if the premise is nonsense. Who is more trustworthy: the people who spent years getting their paper through peer review, or a random internet commenter who probably stopped at the abstract?

It all comes back to science as a social enterprise. We use markers of trust and integrity as heuristics on whether or not to devote time to understanding the claim itself. This is the same reason people are more likely to listen to a claim coming from a respected scientist than from a crackpot. Pointing out a potential threat to validity isn’t nearly as powerful as showing how that threat actually undermines the paper, and that’s what FSE had to do to be taken seriously.


Interesting, but as a layman this seems totally backwards. If the underlying premise is flawed, who cares if the methodology is correct? If the original study hadn't made errors like misclassifying some projects or commit messages, the result would still be based on an unfounded premise.


The factor you're missing is the potential for a criticism to advance the discussion.

Everyone believes that they occupy the intellectual high ground, and that it's everyone else who's guilty of confirmation bias and wooly thinking. Therefore, anybody else's accusations of confirmation bias and wooly thinking are merely a product of their own confirmation bias and wooly thinking, and can safely be ignored.

Attacking the methodology mitigates this effect. By demonstrating that they spent serious time digging into it and trying to replicate it, the authors give a strong signal that they're not just exercising in armchair skepticism. This, in turn, lends rhetorical weight to their attack on the premise. Without it, it would be as easy to ignore, and as quickly forgotten, as your average comment on Hacker News.


> If the underlying premise is flawed, who cares if the methodology is correct?

How do we know the premise is flawed?

> The first step should be to show that the github dataset can be used to say anything about quality and productivity of things like language choice.

How do you concretely do that? The initial paper, even if flawed, did a lot of work to try to control for confounding factors. To do what you said, you would have to put as much effort, if not more, to convince people that the results can be falsified to claim anything, with the same level of control in place.

It's not even clear that counting commits is entirely meaningless. Getting some signal out of it does seem tremendously difficult, but not impossible. So questioning the methodology is really the only way to rebut the experiment.


> If the underlying premise is flawed, who cares if the methodology is correct?

The post you're replying to includes an explanation addressing that point.


To put it simply: if the methodology is sound then the premise can't be unfounded, because the premise is the source of the methodology.


I think you may misunderstand what others mean when they say "premise".

For this paper, the premise is that GitHub commit history indicates language quality.

The methodology is then the implementation of the study: how do we classify each commit? How do we identify the base language in a repository? What would lead us to discard a repository as a data source?

The methodology may be perfect, but you may still end up with nonsense because GitHub is not a comprehensive or representative source of data for studying programming.
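
To show what "how do we classify each commit?" can look like in practice, here is a minimal sketch of keyword matching on commit messages. The keyword list, function name, and sample messages are hypothetical illustrations, not the classifier either paper actually used.

```python
import re

# Hypothetical keyword list; the real studies' lists are not reproduced here.
BUG_KEYWORDS = ("bug", "fix", "error", "defect")

# Match any keyword at the start of a word, ignoring case.
BUG_PATTERN = re.compile(r"\b(" + "|".join(BUG_KEYWORDS) + r")", re.IGNORECASE)


def looks_like_bug_fix(message):
    """Return True if the commit message contains a bug-related keyword."""
    return BUG_PATTERN.search(message) is not None


messages = [
    "Fix null pointer crash in parser",  # genuine bug fix, flagged
    "Add test fixture for login",        # not a bug fix, but "fixture" matches
    "Handle empty input correctly",      # genuine bug fix, but no keyword
    "Update docs",                       # not a bug fix, not flagged
]

flags = [looks_like_bug_fix(m) for m in messages]
for message, flagged in zip(messages, flags):
    print(f"{flagged!s:5}  {message}")
```

Even this tiny example produces one false positive and one false negative, which is why the labelling step drew so much scrutiny. It says nothing about whether a perfectly labelled dataset would support the premise.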


No, pretty obviously the claim is that language properties will affect the type of commits being made, this is the premise they start from.

The CACM paper then did a bad job of testing whether this is true by not being very careful with the labelling and cleaning of their data, a point on which they were then attacked.

Please notice that they were critiqued on their methodology, not for thinking that commits are causally influenced by the design of the language they modify.


Their premise is even more deeply flawed than that. AFAICT, they didn't consider selection bias seriously enough.

I'm not sure if it's politically correct to say this, but, e.g., it's entirely possible that particularly meticulous coders are both more likely to select a strong, statically typed functional language and generate bugs at a lower rate. And that the effect turns out to be almost entirely due to the person and not the programming language.

There are also some fundamentally unanswerable questions. Like, would dynamic typing come off looking better if the selection of dynamic languages weren't dominated by warty messes like php and ecmascript?


> I'm not sure if it's politically correct to say this, but, e.g., it's entirely possible that particularly meticulous coders are both more likely to select a strong, statically typed functional language

I don't think you need to worry; this isn't "politically incorrect". It's just wrong.

Agree with what you said afterwards, though: Meticulous coders probably are less likely to write bugs. But that will be true regardless of the languages they choose or prefer.


> It's just wrong.

It's something everyone has a strong opinion on, but nobody's ever really figured out how to measure. Leaving it a subject that's still firmly in the province of conjecture. Typically self-affirming conjecture.

Me, I like dynamic languages, but, given that this subject is pretty much a complete knowledge vacuum, populated only by occasional iffy, easily criticized studies like the one in question, I'm inclined to say that the only position that offers any decent chance of not being informed primarily by confirmation bias is to assume that, for the most part, none of us has any firm idea what they're talking about. Least of all oneself.


> Like, would dynamic typing come off looking better if the selection of dynamic languages weren't dominated by warty messes like php and ecmascript?

It already looks better if you consider that dynamic typing is more expressive and such languages take much less code to do the same thing and less time. This is not something that can be researched via commit messages, but it is rather obviously true, so the claim from the paper that "static typing is better" is also obviously false and is way out of the paper's reach to even claim. But then again this is CS, and claims that go way beyond the data and the authors' expertise are super common in CS papers. Personally I stopped taking CS papers seriously a long time ago.


> This is not something that can be researched via commit messages, but it is rather obviously true

I am inclined to disagree. Both static and dynamic languages suffer from the same problem here: The most well-known examples from each camp are warty messes that make terrible case studies, let alone poster children.


> The first step should be to show that the github dataset can be used to say anything about quality and productivity of things like language choice.

The dataset can be used to say how defect rates compare across programming languages, which is important given that in CS and PL research these things are usually claimed left and right without any evidence. What the dataset can't say is how that translates to quality and productivity, as that would require a lot more than just commit messages. But at the same time it's something people already have some rough ideas about, such as that people writing in low-level, rigid, inexpressive languages are less productive than those writing in high-level, flexible, expressive languages.


Number of commits as a metric is a poor idea, because it will immediately be gamed. I constantly advocate with D that people break up larger PRs into smaller self-contained ones, for several reasons. Undermining this with a metric that draws unfounded conclusions isn't helpful.


Just make sure you don't use the word "fix" in the commit messages, then the reliability of D will be through the roof!


That was funny, but in fairness can you imagine how expensive and long drawn a real study into this question would be?


All it takes to avoid that is looking into commit histories of projects in each language and figuring out how they mark bugfixes. If it's suspiciously reliable you would know you made a mistake classifying bugfixes somewhere, or your sample is too small, or the programming practice in the language is unusual and needs a closer look.
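
A rough sketch of that kind of sanity check, in Python. The keyword list and the sample commit messages below are invented for illustration; they are not taken from the paper or from any real repository, and a real classifier would need per-project validation.

```python
import re

# Hypothetical keyword list (an assumption, not the paper's actual list).
BUGFIX_KEYWORDS = ("fix", "bug", "defect", "error")
_PATTERN = re.compile(r"\b(" + "|".join(BUGFIX_KEYWORDS) + r")", re.IGNORECASE)

def looks_like_bugfix(message: str) -> bool:
    """Return True if a word in the message starts with a bug-fix keyword."""
    return _PATTERN.search(message) is not None

def dfc_rate(messages: list[str]) -> float:
    """Fraction of commits flagged as defect-fixing by the keyword match."""
    if not messages:
        return 0.0
    return sum(looks_like_bugfix(m) for m in messages) / len(messages)

# Invented examples: project_b marks bug fixes by ticket number only.
project_a = ["Fix null deref in parser", "Add CSV export", "Bugfix: off-by-one in pager"]
project_b = ["Closes #4121", "Add CSV export", "Resolve #4133"]

rate_a = dfc_rate(project_a)
rate_b = dfc_rate(project_b)
```

With these made-up inputs, `project_b` gets a rate of zero. That is the "suspiciously reliable" signal: the keyword match missed the project's convention for marking fixes, which says nothing about how many bugs the project actually had.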


Interesting rundown, thanks!

I'm a bit puzzled with the "not a dox" tweet, what is going on here? I know that this is probably the least interesting part of the article, but I'm confused and curious.


Ooops, didn't quite explain that properly! The tinyurl links to Berger's Facebook page.


It's amazing the paper got published in the first place – the idea that comparing open source codebases could tell you all that much about the underlying language is sort of bizarre when comparing such different projects to begin with.


It is not the role of peer-review to do this kind of gatekeeping (although it would be nice if it could filter out bad methodology and data errors). This is the role of exchanges just like the one discussed here.


Journals can reject a paper on editorial grounds – for example, Nature doesn't publish every valid paper submitted to it.


True, but it is not the role of peer-review as a whole to prevent papers whose premise you can argue with from being published anywhere. The goal of peer-review is to filter out papers that use bad methodology. Debates over premises are carried out in other papers.


Related from a few months ago: https://news.ycombinator.com/item?id=21637411


Every project grows until it becomes too complex and hence unmanageably buggy. Some languages/static analyzers/linters/architectures/frameworks let a project become bigger and more complex before becoming unmanageably buggy.

So if you look at two projects, and both have a bunch of bugs, that doesn't tell you much. You have to know bugs-per-unit-complexity, which is pretty damn close to unmeasurable.

It's telling, though, that the companies managing the world's most complex systems run them in C++, Java, and TypeScript. And more recently, Go and Rust.

With much finance research, if the results were real, the researchers would be making billions on Wall Street rather than be academics. So, too, with software research. If the results were real, the researchers could become gazillionaires advising FAANG instead of being academics.


This is something I come back to repeatedly when attempting to evaluate claims about certain technologies (e.g. programming languages) being extravagantly effective: people's sentiments about how great some technology is often seem out of proportion to what their gains were through using it.

Of course that's difficult to evaluate. But if their gains were in proportion, I think you'd see these crazy spikes like what you're saying about these researchers becoming gazillionaires. And it just doesn't seem to happen.

Which is not to say there's no difference—but 1) the improvements seem to be by smaller factors than what engineers believe 2) they don't seem to generalize as much as is often claimed: afaict, a significant aspect of the gains has to do with reducing impedance mismatch between problem domain and language (but advocates will typically adopt a stance that the language is just intrinsically superior).

I would be very interested to hear counter-examples if people have them (I know the PG/Viaweb/Lisp story).


I think the folks really fascinated with which language is intrinsically superior are missing out on the fact that the language itself is <10% of the software development process. There's tooling, architecture, libraries, monitoring, etc. The language's feature set or verbosity is rarely the bottleneck.

I'm skeptical that the PG/Viaweb/Lisp story generalizes at all to non-geniuses. Like, sure, if you gave Paul Graham raw assembly he could probably do some pretty impressive stuff even so. Doesn't mean anyone else can. He's a genius. So is Robert Morris.


It's astounding how discussing programming languages can devolve into flamewars, even when the discussion starts as actual academic research and the participants are dedicated scientific researchers.


This is a truly terrific post, but it says:

> They’re underselling the effect here. While the dominant factor is number of commits, language still matters a lot, too. If you choose C++ over TypeScript, you’re going to have twice as many DFCs! That doesn’t necessarily mean twice as many bugs, but it is suggestive. Further, while they say the effect is “overwhelmingly dominated by […] project size, team size, and commit size”, that doesn’t actually bear out. Only the number of commits is a bigger factor in language choice.

This is inaccurate. "Effect" is not the expected difference (i.e. difference between the means) but, roughly, the expected difference divided by variance. Just looking at the expected difference is insufficient to determine if the effect is large (and undersold) or small.

Expected difference is not an interesting statistical property, just as the mean isn't (by itself). If you're looking at ten Clojure projects and ten C++ projects, and all ten Clojure projects have 10 DFCs, while eight C++ projects have 8 DFCs and two have 500, then the expected difference is huge, but the effect is small. Indeed, when looking at the variance, Clojure, the "best"-performing language in this dataset, and C++, the "worst"-performing language in this dataset, the two were not very distinguishable, supporting everyone's finding of a very small effect.
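
To make the toy example concrete, here is a small Python sketch that computes both the raw difference in means and a standardized effect size. Note the assumption: it uses Cohen's d, which divides by the pooled standard deviation rather than the variance mentioned above, and the data is only the hypothetical ten-project example from this comment, not the paper's dataset.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical DFC counts from the example above (not real data).
clojure = [10] * 10
cpp = [8] * 8 + [500] * 2

# Raw (unstandardized) difference in means.
mean_diff = mean(cpp) - mean(clojure)

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var)

d = cohens_d(clojure, cpp)
```

Here `mean_diff` comes out at about 96.4 DFCs while `d` is about 0.66. Both numbers are driven entirely by the two 500-DFC projects, which is why neither should be read without looking at the spread.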


> Expected difference is not an interesting statistical property, just as the mean isn't (by itself).

Difference in means can meet the definition of effect size, and is listed as an example right away in the wikipedia article on it [0]! The key is that it captures the phenomenon of interest (and generally, the magnitude shouldn't be a function of the number of observations). In psychology, where scales often have arbitrarily defined ranges (eg IQ scales), means usually are not useful effect sizes.

The case you list is a separate issue from effect size (probably the violation of distributional assumptions in some implicit or explicit model). Even using difference in means / variance (say, Cohen's d), two observations that extreme would make the interpretation of both mean and variance calculations pretty dubious (a problem not solved by dividing them).

[0] https://en.wikipedia.org/wiki/Effect_size


Which example? The six examples that talk about "difference in means" really show the difference in means divided by some measure of variance. Who would consider the same expected difference to mean the same effect size when the population distributions are narrow and when they're wide?

> two observations that extreme would make the interpretation of both mean and variance calculations pretty dubious

Yeah, it's a bad example, but I think it at least gives an intuition for why expected difference is not a good measure of effect. This shows two data sets with the same expected difference but one shows a large effect and one shows a small one: https://imgur.com/a/hg2NNl2 (the rebuttal paper also draws the expected value with variance and shows a similar situation).


> Which example?

First paragraph: "Examples of effect sizes include the correlation between two variables,[2] the regression coefficient in a regression, the mean difference, or the ..."

> Who would consider the same expected difference to mean the same effect size when the population distributions are narrow and when they're wide?

For the case where the unit of measurement has a sensible, relevant interpretation (e.g. if a study measured dollar value of two interventions), and attempts to capture uncertainty via CI or another means, I would consider it one measure of effect size.

The key to understanding effect size is that its most common focus is around making measures over arbitrary scales become scale invariant. But scales are not always arbitrary.

Daniel Lakens has a great article on ES and puts the motivation for calculating them well:

> First, they allow researchers to present the magnitude of the reported effects in a standardized metric which can be understood regardless of the scale that was used to measure the dependent variable. Such standardized effect sizes allow researchers to communicate the practical significance of their results (what are the practical consequences of the findings for daily life), instead of only reporting the statistical significance (how likely is the pattern of results observed in an experiment, given the assumption that there is no effect in the population).

https://www.frontiersin.org/articles/10.3389/fpsyg.2013.0086...

To the degree that the thing measured has an inherently meaningful scale to the researchers (eg sometimes dollars, time), then it is already an effect size measure.

There is some nuance here, since the level on which you want to interpret something matters (eg the spread could be part of the value, especially to an individual who will have 1 and not many codebases).

You might also want to render it comparable to other studies that measure something else, and it's unclear how to convert it to your scale, so you use standardized ES measures (eg many meta analyses).

But an important point is that whether something qualifies is really a question of the scale. (What the best ES is for your specific question is another important issue!)


> the mean difference

Yes, but later the article shows that "mean difference" is really mean difference divided by variance.

> an important point is that whether something qualifies is really a question of the scale

It's one of the concerns, but I would say it's the main one. An effect size needs to distinguish between the two cases here, both having the same expected difference: https://imgur.com/a/hg2NNl2

If you care about units, you can talk about the expected value/difference, but that doesn't make that a meaningful effect size. What you need to do in those cases is to, at least, mention both the expected difference and the variance.


> Yes, but later the article shows that "mean difference" is really mean difference divided by variance.

I would say later the article shows examples of standardized effect size, so divides by variance. Whether that means effect size can't be a difference in means (and explains the wiki intro comment) I disagree with--see my Lakens comment for why.

> If you care about units, you can talk about the expected value/difference, but that doesn't make that a meaningful effect size. What you need to do in those cases is to, at least, mention both the expected difference and the variance.

This seems like it is begging the question. If you look at the definition and rationale for effect size, both in the wikipedia and in Lakens, a claim this strong is not there. For example, a CI over difference in means is one way to capture what a standardized effect size that uses a specific variance term might be doing (what Lakens describes as the second ES viewpoint: statistical significance, distinguishing between effect and power).

In any event, thanks for discussing--it's been really helpful to think about and revisit this topic!


The difference in means might be a useful measure of effect size if you are interested in comparing the results of two experiments, each with large n. It tells you what difference to expect if you repeated those experiments with new samples.

If, however, you are interested in the effect on a single observation, then the variability in samples/populations needs to be taken into account.


DFC = "Defect-Fixing Commits"


> the entire idea of measuring a language’s defect rate via GitHub commits just doesn’t make any sense. How the heck do you go from “these people have more DFCs” to “this language is more buggy”?

If GitHub is used by ordinary programmers (what are the reasons to assume otherwise?), i.e., if the sample is unbiased, then what is the issue with going from "these [random] people" to "this language"?


I'm surprised someone takes my github saved games seriously. It certainly wasn't me.

I think this science project would be more fun if assumptions were discarded. Just gather the data, do it without a hypothesis. Something interesting might come up worthy of one.

I liked the bit of crankology in the article. If you sound like a crank no one will take you seriously? I had one day pondered what if the cranks are right? How would we know? It follows that we can't know sounding crank equals being wrong and that our idea of "sounding serious" is based on a flawed data set, if it isn't noise entirely. The topic of the study here certainly doesn't inspire my confidence.


Such post hoc hypothesis trawling essentially is p hacking. Try a large enough set of explanations and your data will show something significant. Might be hard to replicate.
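
A back-of-the-envelope Python sketch of why that happens. It assumes the tests are independent, each run at the same significance level, and that none of the hypotheses is actually true; real post hoc analyses are rarely that clean, so treat it as an illustration only.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests with true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests

# Print how quickly the chance of a spurious "finding" grows with the number of tests.
for k in (1, 5, 20, 100):
    print(k, round(familywise_error_rate(k), 3))
```

Under those assumptions, trying 20 unrelated explanations at the usual 0.05 threshold gives roughly a 64% chance that at least one looks "significant" by luck alone.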


One of my favorite examples is this tour de force on "proving" that chocolate helps with weight loss: https://io9.gizmodo.com/i-fooled-millions-into-thinking-choc...


or the forever classic: https://www.xkcd.com/882/


Replication would indeed be necessary. Objectivity is the stuff to make the science!


> But it still seems intuitive that language should matter.

Am I the only one that thinks this is a naive assumption? They aren't really languages like English is a language--so much for intuition. Don't definitions matter to science? Does science carry a selfie stick? I'm suddenly keen on the antithesis of science or a less social science.


If all you cared about was fibonacci, or some other "quality", then fine, use that as benchmark. But others might care more about productivity, cost, etc. Then it's also a matter of taste and what is currently fashionable.


There's talk and comments about the measure ("DFCs"). What would be better? And is it possible to use this one to eke out a signal even though the measurement has some known problems?


Two comments:

"FSE was a preprint: a paper that hadn’t necessarily gone through peer-review, editing, and publication in a journal. Academics share preprints because 1) the process of getting a paper journal-ready can take years, so this gets it out faster, and 2) academics can share their research with people who aren’t able to afford the ludicrous journal fees."

To my knowledge, all ACM conferences including the "ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering" are peer-reviewed. In CS, in my experience, peer-review for conferences is more thorough than journal review. Conferences are more important. The idea that they are preprints is something from linguistics or some other field.

"How did FSE take the rebuttal rebuttal? Not well.

"So that’s technically not a dox, because he didn’t publish Berger’s private information, but still. That’s a really asshole thing to do!"

I'm not sure I would call linking to someone's Facebook page from Twitter to be "doxxing" in any sense. Calling a fellow researcher a "donkey" is a bit of an asshole thing, too. (I have no link to said comment, but I know Emery Berger. He's very positive he's right, the smartest in the room, and very abrasive.)

Tl;dr: Software engineering research is a garbage fire. But we knew that.



