GitHub's boast that Copilot produces high-quality code challenged
We're shocked – shocked – that Microsoft's study of its own tools might not be super-rigorous
GitHub's claim that the quality of programming code written with its Copilot AI model is "significantly more functional, readable, reliable, maintainable, and concise" has been challenged by software developer Dan Cîmpianu.
Cîmpianu, based in Romania, published a blog post in which he assails the statistical rigor of GitHub's Copilot code quality data.
If you can't write good code without an AI, then you shouldn't use one in the first place
GitHub last month cited research indicating that developers using Copilot:
- Had a 56 percent greater likelihood of passing all ten unit tests in the study (p=0.04);
- Wrote, on average, 13.6 percent more lines of code without a code error (p=0.002);
- Wrote code that was more readable, reliable, maintainable, and concise by 1 to 3 percent (p=0.003, p=0.01, p=0.041, p=0.002, respectively);
- Were 5 percent more likely to have their code approved (p=0.014).
The first phase of the study relied on 243 developers with at least five years of Python experience, randomly assigned either to use GitHub Copilot or to work without it. Only 202 submissions ended up being valid: 104 from the Copilot group and 98 from the control group.
Each group created a web server to handle fictional restaurant reviews, supported by ten unit tests. Thereafter, each submission was reviewed by at least ten of the participants – a process that produced only 1,293 code reviews rather than the 2,020 that 10x multiplication might lead one to expect.
GitHub declined The Register's invitation to respond to Cîmpianu's critique.
Cîmpianu takes issue with the choice of assignment, given that writing a basic Create, Read, Update, Delete (CRUD) app is the subject of endless online tutorials and therefore certain to have been included in training data used by code completion models. A more complex challenge would be better, he contends.
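To see how routine the assignment is, consider a minimal sketch of the sort of endpoint a restaurant-review CRUD app implies. GitHub hasn't published the exact task specification, so the framework (Flask) and route shapes here are assumptions – but code of roughly this shape appears in countless tutorials:

```python
# Hypothetical sketch of the study's restaurant-review CRUD task -- the
# framework (Flask) and route shapes are assumptions, since GitHub did not
# publish the exact specification. The point: this is tutorial-grade code.
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
reviews: dict[int, dict] = {}   # in-memory store standing in for a database
next_id = 1

@app.post("/reviews")
def create_review():
    global next_id
    review = {"id": next_id, **request.get_json()}
    reviews[next_id] = review
    next_id += 1
    return jsonify(review), 201

@app.get("/reviews/<int:review_id>")
def read_review(review_id: int):
    if review_id not in reviews:
        abort(404)
    return jsonify(reviews[review_id])

@app.put("/reviews/<int:review_id>")
def update_review(review_id: int):
    if review_id not in reviews:
        abort(404)
    reviews[review_id].update(request.get_json())
    return jsonify(reviews[review_id])

@app.delete("/reviews/<int:review_id>")
def delete_review(review_id: int):
    if reviews.pop(review_id, None) is None:
        abort(404)
    return "", 204
```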
He then questions an inadequately explained GitHub graph showing that 60.8 percent of developers using Copilot passed all ten unit tests, while only 39.2 percent of developers not using Copilot did.
That would be about 63 Copilot-using developers out of 104 and about 38 non-Copilot developers out of 98, based on the firm's cited developer totals. But GitHub's post then reveals: "The 25 developers who authored code that passed all ten unit tests from the first phase of the study were randomly assigned to do a blind review of the anonymized submissions, both those written with and without GitHub Copilot."
Cîmpianu observes that something doesn't add up here. One possible explanation is that GitHub misapplied the definite article "the" and simply meant that 25 of the roughly 101 developers who passed all the tests were selected to do code reviews.
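For anyone checking the math at home, the mismatch is easy to reproduce – a quick sketch of the arithmetic, using only the group sizes and pass rates as GitHub presents them:

```python
# Reproducing Cîmpianu's objection from the figures as GitHub presents them.
copilot_group, control_group = 104, 98       # valid submissions per group
copilot_rate, control_rate = 0.608, 0.392    # pass rates as the graph reads

copilot_passers = round(copilot_rate * copilot_group)    # about 63
control_passers = round(control_rate * control_group)    # about 38

# Roughly 101 developers passing all ten tests -- not the 25 GitHub mentions.
print(copilot_passers + control_passers)     # 101
```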
More significantly, Cîmpianu takes issue with GitHub's claim that devs using Copilot produced significantly fewer code errors. As GitHub put it, "developers using GitHub Copilot wrote 18.2 lines of code per code error, but only 16.0 without. That equals 13.6 percent more lines of code with GitHub Copilot on average without a code error (p=0.002)."
Cîmpianu argues that 13.6 percent is a misleading use of statistics because it amounts to only about two additional lines of code. While allowing that one might argue such gains add up over time, he points out that the supposed error reduction is not a reduction in actual errors: what's being counted are coding style issues and linter warnings.
As GitHub acknowledges in its definition of code errors: "This did not include functional errors that would prevent the code from operating as intended, but instead errors that represent poor coding practices."
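The arithmetic behind both framings is easy to check – a quick sketch, using only the per-error figures quoted above:

```python
# Both readings of GitHub's per-error figures, side by side.
lines_per_error_with = 18.2      # lines of code per "code error", with Copilot
lines_per_error_without = 16.0   # lines of code per "code error", without

# GitHub's framing: a relative gain in clean lines per error.
relative_gain = lines_per_error_with / lines_per_error_without - 1
print(f"{relative_gain:.2%}")    # ~13.75% from these rounded figures;
                                 # GitHub reports 13.6%, presumably from unrounded data

# Cîmpianu's framing: in absolute terms, barely two extra lines per error.
print(f"{lines_per_error_with - lines_per_error_without:.1f} lines")  # 2.2 lines
```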
Cîmpianu is also unhappy with GitHub's claim that Copilot-assisted code was more readable, reliable, maintainable, and concise by 1 to 3 percent. He notes that the metrics for code style and code reviews can be highly subjective, and that details about how code was assessed have not been provided.
Cîmpianu goes on to criticize GitHub's decision to have the same developers who submitted code samples perform the code evaluation, rather than an impartial group.
"At the very least, I can appreciate they only made the developers who passed all unit tests do the reviewing," he wrote. "But remember, dear reader, that you're baited with a 3 percent increase in preference from some random 25 developers, whose only credentials (at least mentioned by the study) are holding a job for five years and passing ten unit tests."
Cîmpianu points to a 2023 report from GitClear that found GitHub Copilot reduced code quality.
Another paper by researchers affiliated with Bilkent University in Turkey, released in April 2023 and revised in October 2023, found that ChatGPT, GitHub Copilot, and Amazon Q Developer (formerly CodeWhisperer) all produce errors. And to the extent those errors produced "code smells" – poor coding practices that can give rise to vulnerabilities – "the average time to eliminate them was 9.1 minutes for GitHub Copilot, 5.6 minutes for Amazon CodeWhisperer, and 8.9 minutes for ChatGPT."
That paper concludes, "All code generation tools are capable of generating valid code nine out of ten times with mostly similar types of issues. The practitioners should expect that for 10 percent of the time the generated code by the code generation tools would be invalid. Moreover, they should test their code thoroughly to catch all possible cases that may cause the generated code to be invalid."
Nonetheless, a lot of developers are using AI coding tools like GitHub Copilot as an alternative to searching for answers on the web. Often, a partially correct code suggestion is enough to help inexperienced coders make progress. And those with substantial coding experience also see value in AI code suggestion models.
As veteran open source developer Simon Willison observed in a recent interview [VIDEO]: "Somebody who doesn't know how to program can use Claude 3.5 artefacts to produce something useful. Somebody who does know how to program will do it better and faster and they'll ask better questions of it and they will produce a better result."
For GitHub, maybe the message is that code quality, like security, isn't top of mind for many developers.
Cîmpianu contends it shouldn't be that way. "[I]f you can't write good code without an AI, then you shouldn't use one in the first place," he concludes.
Try telling that to the authors who don't write good prose, the recording artists who aren't good musicians, the video makers who never studied filmmaking, and the visual artists who can't draw very well. ®