Microsoft's GitHub Gets Sued Over AI Code Writer
Copilot is pirating software, the plaintiffs say
A group calling itself GitHub Copilot Litigation has filed a class action lawsuit against Microsoft and OpenAI for what boils down to software piracy. Copilot is a product of Microsoft’s GitHub division that helps software developers write code faster by suggesting code as they type, powered by OpenAI Codex, a descendant of GPT-3. That code-suggestion capability is the problem, according to Matthew Butterick, who has organized the effort against Microsoft and OpenAI.
Software Piracy Enabled by AI?
The suit was filed in U.S. federal court in San Francisco to challenge the “legality of GitHub Copilot (and a related product, OpenAI Codex, which powers Copilot).” The plaintiffs allege in their filing:
Copilot ignores, violates, and removes the Licenses offered by thousands—possibly millions—of software developers, thereby accomplishing software piracy on an unprecedented scale. Copilot outputs text derived from Plaintiffs’ and the Class’s Licensed Materials without adhering to the applicable License Terms and applicable laws.
Butterick is co-counsel for the class action lawsuit and lists his background as a “writer, designer, programmer, and lawyer.” He has designed several typefaces and created a programming language for web publishing, and he holds degrees from Harvard and UCLA. In an open letter on githubcopilotlitigation.com, he summarizes the effort:
By training their AI systems on public GitHub repositories (though based on their public statements, possibly much more) we contend that the defendants have violated the legal rights of a vast number of creators who posted code or other work under certain open-source licenses on GitHub. Which licenses? A set of 11 popular open-source licenses that all require attribution of the author’s name and copyright, including the MIT license, the GPL, and the Apache license.
The Verge offered some interesting insights into this story, including a Twitter post in which computer science professor Tim Davis showed Copilot reproducing sizable chunks of his copyrighted code without attribution.
Butterick told The Verge’s James Vincent, “This is the first step in what will be a long journey. As far as we know, this is the first class-action case in the US challenging the training and output of AI systems. It will not be the last. AI systems are not exempt from the law. Those who create and operate these systems must remain accountable.”
How Code Differs from Prose and Art
When GPT-3 first emerged, there was concern that the text generator was simply plagiarizing written work from the web. That has largely not proven to be the case, and some of it may come down to a simple fact: you can express the same idea in many different ways in written language. Software code doesn’t offer the same latitude.
There are certainly multiple ways to formulate code for a variety of tasks, but coding is closer to a deterministic exercise because compilers and interpreters require instructions in a rigid syntax. Writing a poem, essay, or email doesn’t demand the same adherence to a narrow set of rules, as the sketch below illustrates.
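To see how little latitude code allows, consider a routine as common as a binary search. The snippet below is a hypothetical illustration of my own, not code from the case: most working implementations converge on nearly these exact lines, because the algorithm and the language’s syntax leave so little room for variation.

def binary_search(items, target):
    # Return the index of target in a sorted list, or -1 if absent.
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # probe the midpoint
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1              # discard the lower half
        else:
            hi = mid - 1              # discard the upper half
    return -1

Ask a dozen developers to write this function and you will get a dozen near-duplicates; ask a dozen writers to summarize the same news story and you will get a dozen distinct pieces.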
The same is likely true for artwork, though there is also controversy over text-to-image generators that mimic the styles of living artists too closely. “People are pretending to be me. I'm very concerned about it; it seems unethical,” said artist Greg Rutkowski in an interview with Business Insider. At the time of that article, some 93,000 images had been generated with Rutkowski’s name in the prompt. Granted, this is replicating a style as opposed to providing an exact copy.
It is not surprising, then, that OpenAI Codex offers exact replicas of code blocks. If there is a “standard” way to write code in a programming language to execute a function, Codex’s training data will contain that code in the same format over and over again. The model is also built to produce output that is highly likely to be correct rather than to explore variations. Together, the training of the system and the rules governing its output bias a text-to-code generator toward an exact replica, whereas this is more easily avoided with prose and art.
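As a rough sketch of that mechanic, consider the toy Python below. It assumes only that the generator favors its highest-probability completion; a frequency table stands in for a real language model, so this illustrates the bias rather than how Codex actually works.

# Toy illustration: likelihood-maximizing decoding surfaces the most
# common training-set formulation verbatim. This is NOT OpenAI's
# implementation; a frequency table stands in for the model.
from collections import Counter

# Pretend these are the completions seen for one prompt during training.
training_completions = [
    "for i in range(n):",    # the "standard" form, seen most often
    "for i in range(n):",
    "for i in range(n):",
    "i = 0\nwhile i < n:",   # a rarer, equally valid variant
]

def most_likely_completion(completions):
    # Greedy choice: return the single highest-frequency completion.
    return Counter(completions).most_common(1)[0][0]

# Prints the canonical form, character for character (an exact replica).
print(most_likely_completion(training_completions))

A model that rewards the safest, most probable output will keep reproducing the canonical form verbatim; sampling variations would risk producing code that does not run.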
What’s Next?
The issue of generative AI and the ownership rights of content used in the training data is a new legal frontier. In fact, this issue may pose the biggest threat faced by the industry. The technology is working well, and adoption of the applications based on generative AI is growing quickly. That means the technical risk and demand risk for these solutions have been significantly mitigated. The legal risk is unknown. Precedents set by the outcome of this class action lawsuit could have far-reaching implications.
The fact that Microsoft, the long-standing champion of the proprietary software model, is at the center of this controversy as a defendant is noteworthy. Microsoft has, in recent years, heavily promoted the use of open source software. Its acquisition of GitHub and its interest in hosting open source software and services on Azure have shifted the company away from its traditional bias toward licensed software.
You have to wonder how Microsoft would react if code from Microsoft Excel suddenly started showing up in Copilot for developers creating the next spreadsheet application. I suspect this would be addressed swiftly.
You might think: what’s the issue here? It’s open source software, which means it’s free, so no one is getting hurt. That may well be the foundation of the legal defense. However, even free software often comes with terms that must be followed, and those terms extend beyond price. And don’t be surprised if discovery unearths many instances of proprietary software code showing up in Copilot as well. When you crawl large data sets, you never quite know what is in there.