The data dump diversion

I’ve been writing non-stop about the Stack Overflow strike and there’s a little side-story that might slip by if you aren’t paying attention. I’m talking about the decision to not upload the Stack Exchange Data Dump to the Internet Archive. If you aren’t deep in the weeds of Stack Exchange, it’s easy to miss this esoteric topic that looks inconsequential from the outside, but might sit at the core of the conflict.

It all starts with a programmer’s favorite legal topic: licensing. There’s a cultural angle, but for our purposes I want to focus on the practical reason for caring about licenses. If a programmer finds a bit of code, can they use it in their own software? The answer in the early days of programming was “Uh. Why not?” But as companies started paying people to write software, it became clear this answer might end up costing a lot of money. If I pay for some software tool, I won’t be happy with the programmer turning around and giving the code away to a rival.

Without getting into a long history, Stack Overflow settled on a Creative Commons license before there was even a site to post on. Specifically, Stack Exchange contributions are licensed under CC BY-SA,¹ which allows anyone to share and adapt content as long as:

Appropriate credit is given to the original author.
Derived content is shared under the same license as the original.
There are no other restrictions.

That means, for instance, you can sell a book that collects Stack Exchange posts. It also means that Stack Overflow, the company, can host the content and try to make money off it. That includes selling the content as long as it points back to the freely-available version. IANAL and as far as I can tell none of this has been tested in court, but it does seem like a win for content creators (who care more about sharing their content than getting paid) and content hosts (who don’t have to pay for content).² Nobody benefits from disrupting the status quo.

So, of course, ChatGPT³ had to come along and disrupt everything. One side of the problem is that people are posting machine-generated content on Stack Exchange sites. How to handle that (or even whether) stands at the crux of the company – moderator conflict. Another side is that people are turning to ChatGPT rather than Google and/or Stack Overflow, thereby reducing participation on the site and hurting the company’s bottom line. These are complicated issues which will take some time to sort out.

But the third way Large Language Models (LLMs), such as ChatGPT, disrupt is that they use Stack Exchange content as input. On April 20, Wired published “Stack Overflow Will Charge AI Giants for Training Data: The programmer Q&A site joins Reddit in demanding compensation when its data is used to train algorithms and ChatGPT-style bots” which quoted CEO Prashanth Chandrasekar:

Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive. We’re very supportive of Reddit’s approach.

Reddit’s approach, incidentally, resulted in a subreddit strike due to the company charging to use their API, which locked out community-maintained apps and bots. Before that, Twitter raised their API prices in response to LLMs, which locked out non-AI researchers. One more company would be a pattern.

When asked about it on Meta, the site where the Stack Exchange community can discuss the network, Philippe Beaudette, VP of Community, wrote:

But the community is - you are - being denied your rightful attribution as it stands right now. Prashanth is saying that you should at least get to benefit from the financial impact. This is about protecting your interest in the content that you have created.

This answer conflates the attribution right provided by the license with the financial benefit that can be derived from the content. Given that Bard, Bing and Phind can credit sources, it doesn’t seem like charging for data is the appropriate response to the attribution problem. It’s unfortunate that Stack Overflow was late to the game, but nothing stops it from using CC BY-SA licensed data to create its own ChatGPT clone. Charging for the data suggests Stack Overflow contemplated taking away its free data sources:

Stack Exchange Data Explorer (SEDE)
API
Data Dump, which gets sent to the Internet Archive every quarter.

Nothing seemed to come from that announcement, however, since access to these sources was not removed and no other announcement was made. Then on June 7, shortly after the strike started, a user named Data Dude pointed out that the June 2023 Data Dump was missing. On June 9, AMtwo, a recently laid-off Stack Overflow DBA, answered:

The job that uploads the data dump to Archive.org was disabled on 28 March, and marked to not be re-enabled without approval of senior leadership. Had it run as scheduled, it would have completed on the first Monday after the first Sunday in June.

I mention the timing, as this change long pre-dated the current moderator strike and related policy changes. Some comments have suggested otherwise, so I thought it an important detail.
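
As an aside, the schedule AMtwo describes is easy to work out. Here is a minimal Python sketch, assuming the rule is read literally: find the first Sunday of the month, then take the Monday that follows it. The function name is hypothetical and is not taken from the actual upload job.

```python
from datetime import date, timedelta


def first_monday_after_first_sunday(year: int, month: int) -> date:
    """Return the Monday that follows the first Sunday of the given month.

    Hypothetical helper for illustration only; it assumes the schedule
    quoted above is meant literally.
    """
    first_of_month = date(year, month, 1)
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    days_until_sunday = (6 - first_of_month.weekday()) % 7
    first_sunday = first_of_month + timedelta(days=days_until_sunday)
    return first_sunday + timedelta(days=1)


print(first_monday_after_first_sunday(2023, 6))  # 2023-06-05
```

Read that way, the June 2023 dump was due to finish on June 5, two days before Data Dude noticed it was missing.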

Later in the day, Jody Bailey, Stack Overflow’s Chief Technology Officer, answered:

Stack Overflow senior leadership is working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place.

He also mentioned “looking for ways to gate access to the Dump, APIs, and SEDE”.

Finally, Philippe answered:

Our intention was never to stop posting the data dump permanently, only to pause it while we begin to collect more information on how it was being used and by whom - especially in light of the rise of LLMs and questions around how genAI models are handling attribution. However, it’s clear that many individual users (academics, researchers, etc) have an immediate need to access updated versions of the dumps. So we are re-enabling the automatic data dumps (and uploading the one that’s about a week overdue). We believe that this can happen by end of the day Friday. We will continue to work toward the creation of certain guardrails (for large AI/LLM companies) for both the dumps and the API, but again - we have no intention of restricting/charging community members or other responsible users of the dumps or the API from accessing them.

And then Jeff Atwood, co-founder of Stack Overflow, commented:

I have confirmation via email from Prashanth that this is, indeed, the new official policy. I’m glad to see it. Creative Commons is part of our contract with the community, and it should never be broken – however, CC does need to address the AI issue in an updated license, in my personal opinion.

So all’s well that ends well, right?

Just one more thing . . .

It’s hard to object to forcing large companies to pay for the data they need to build LLMs. But shutting down the data dump in March 2023 is a year and a half too late for ChatGPT:

Me: Who was the National League batting champion in 2022?

ChatGPT: I apologize for any confusion caused. As an AI language model, my knowledge is based on information available up until September 2021, and I do not have access to real-time data or the ability to browse the internet. Therefore, I cannot provide you with the name of the National League batting champion for 2022. I recommend checking reliable sports news sources or conducting an online search to find the most up-to-date information on the 2022 National League batting champion.

Newer models almost certainly use updated data, but it’s not clear how valuable new data is compared to the bulk of data that was available for free long before these models were designed. This is especially true if the newer data includes output from LLMs. To make matters worse, if Stack Overflow cuts off data dumps, API and SEDE, LLMs could still get all the data they need from good, old-fashioned web-scraping.⁴

It’s also unclear what Stack Overflow could do to prevent LLMs from using the data dump without cutting off everyone who uses it. For all the discussion of “guardrails”, it’s unlikely the company could block access to the data, which is distributed via BitTorrent, a decentralized peer-to-peer protocol. My best idea would be to add a copyright trap, but that’s not a useful technique when the data is processed via an LLM.

Stack Overflow is in a bind with ChatGPT and no mistake. Unfortunately, the company seems to be invested in pretending it knows how to get out of its predicament when all evidence suggests nobody has a solution.

My source explained that while the data dump was discontinued in March, most of the developers only found out about it in June when it was revealed publicly on Meta. As you might imagine, this prompted questions. Philippe wrote in the #Engineering Slack channel:⁵

To add a bit more context to the data dumps issue, the team is working on a strategy in light of the fact that LLMs have been actively data mining our dumps and reusing the information in ways that we believe violate the spirit (if not the letter) of the Creative Commons license. We’re in conversation with Creative Commons and other peers about the impact of this active data mining on the license and we are evaluating this and other significant questions (how to assure that our community has their role in content creation respected and assure appropriate access control for LLMs while remaining transparent with the community).

We’ll share updates once we have them. Thank you for your patience. Please do not respond to the meta thread about this. A response will come from the communications and community teams. Jody has more as well, which he’ll be posting.

Three minutes later Jody followed up:

Hi all, the lack of internal communication to you is on me. As Philippe point out, the senior leadership team has been working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place. This decision was made a couple of months ago, and for whatever reason, I didn’t think to share the decision with all of you. I never tried to hide it, but it didn’t occur to me to share beyond the people involved. This was a huge oversight and a mistake on my part. I am sorry for that.

What I can tell you is that we are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow people access to the data while attempting to prevent misuse by organizations looking to profit from the work of our community. I am working with Legal, Product, and Community leadership to design and implement appropriate safeguards. As this project is in flight, and it seems things change almost daily, we’re still sorting out the details, and I do not have a specific timeline. But I do commit to providing regular updates on our progress. Right now, we’re working on requirements which I anticipate will be complete by the end of the month. I will provide a status update on this next week and each week following.

In the meantime, please remember not to comment publicly unless asked to do so.

Apparently people in Engineering were instrumental in reversing the data dump decision. As developers, they share many of the values of the most dedicated users of Stack Overflow and that includes creating a freely available repository of knowledge. Given that Stack Overflow laid off 30% of engineering in May, tension with management seems inevitable.

At the risk of indulging in Kremlinology, I suspect Jody and Philippe are falling on their swords to protect the CEO. Prashanth seems obsessed with AI technology in his public-facing communication, which is perfectly understandable given how disruptive it has already been to the business. But the rather erratic way Stack Overflow has responded to this problem fits the pattern of how it responded to the 2019–2020 community crisis.

Specifically, the CEO ordered a bunch of new projects and changes without listening to input from the people who were tasked with implementing his ideas. It’s possible Jody didn’t think to inform Engineering (and Community) of the change. It’s also possible the CEO made an uninformed decision that doesn’t stand up to scrutiny. Pushing back on a CEO’s decisions takes effort and I can easily imagine not thinking the data dump worth sticking your neck out for.

On June 14, users discovered another project pushed by the CEO: Stack Overflow Labs.⁶ According to my source, few people in the company (outside of the ~10% who are working on AI projects) knew it was coming. The site collects blog posts and announcements about AI. Three of the listed projects are very early stage. A fourth project is a fairly dry report of the 2023 Developer Survey questions around AI/ML. The key purpose of the page appears to be accompanying the CEO’s talk at the WeAreDevelopers World Congress in July where he’s likely to announce Stack Overflow is using an LLM from Prosus, its parent company.

Years ago, when I was still a relatively new employee at Stack Exchange, there was an infamous meetup where long-subdued conflicts between Engineering and Marketing erupted. Stack Overflow was founded as a technology company and neglected the important step of telling potential customers about what it had built. From that point on, leadership tended to be suspicious of engineering’s influence over the company. In my time as an employee, I saw the pendulum swing to the extent we sold products that didn’t solve customer problems in hopes that they would start working before the customers stopped paying.

The strategy seems to have worked with Talent and Teams to some degree. It might work with Collectives and whatever AI offering is coming if given enough time. Leadership probably thinks the limiting factor is funding. They are starting to find out they’ve nearly spent their community trust, which isn’t so easy to raise.

¹ There have been three versions of the license. In addition, the company considered licensing code under the MIT License, but that plan fell through.
² Plenty of people don’t care about how their contributions are licensed. College Confidential, where I currently work, owns all content straight up.
³ There are other tools available, but ChatGPT is the clear favorite that’s driving much of the hype.
⁴ Obviously this would be less convenient and potentially more expensive in terms of computational power and bandwidth. In addition, Stack Overflow can block the bot, if they wanted to. As of today, however, they haven’t.
⁵ This channel is seen by the majority of the company. When I was still an employee, it was hosted on the company’s own chat system and was called “HardCore”. It was a different time.
⁶ The page has a “Sign up” button, but it’s really a way for the site to collect email addresses. The email you’ll receive says:

We’ll use this email address to keep you up to date around what we’re working on and share previews and opportunities for early access to some of our new and upcoming features.