Please see below. This is from the OG, "first generation" Copilot, from 2022. If I can find any more from my dusty trove, I'll edit or reply to this very comment. I can't do more digging now, because I'm in a pinch.

> Re: GPL, there are other open access datasets of git repos that make some distinctions between copyleft licenses but those are older resources now.

Supposedly, "The Stack" contains only permissively licensed code, but two repositories of mine are in it. One is a very simple logging library without any license (which implies "All Rights Reserved"), and the other is a fork of LightDM that I worked on, which is GPL-licensed.

So any "permissively licensed" dataset probably contains at least one copylefted or strongly copyrighted codebase, which makes these datasets highly suspect.

== EDIT ==

Found some. Kagi's date-constrained search to the rescue.

1. Should GitHub be sued for training Copilot on GPL code?: https://news.ycombinator.com/item?id=31847931

2. GitHub Copilot, with “public code” blocked, emits my copyrighted code: https://news.ycombinator.com/item?id=33226515

3. AI-Powered GitHub Copilot Leaves Preview, Now Costs $100 a Year: https://developers.slashdot.org/story/22/06/25/0334207/ai-po...

4. GitHub Copilot is trained on all languages that appear in public repositories (CTRL+F on the page): https://web.archive.org/web/20260428180443/https://github.co...
