OpenAI signs agreement to train AI using Reddit data


In a blog post on OpenAI’s press relations site, the company announced a partnership with Reddit that grants access to “real-time, structured, and unique content” from Reddit, including posts and replies. This collaboration aims to enhance OpenAI’s tools and models, particularly ChatGPT, to better understand and represent Reddit content. Additionally, the partnership will introduce new AI-powered features for Reddit users and moderators.

“Reddit will leverage OpenAI’s AI models to bring its vision to life,” OpenAI stated in the blog post. “By using large language models (LLMs), machine learning (ML), and AI, Reddit aims to enhance the user experience for all.”

This deal is part of OpenAI’s broader strategy, which includes similar agreements with various content providers, from stock media libraries to news publishers. A notable aspect of this partnership is that Sam Altman, OpenAI’s CEO, is a former Reddit board member and holds an 8.7% stake in the company, making him its third-largest shareholder. To address potential conflicts of interest, OpenAI emphasized that the partnership was led by its COO, Brad Lightcap, and approved by an independent board of directors.

In its IPO prospectus, Reddit disclosed contracts worth over $200 million with customers, including Google, for licensing its data. Reddit’s first earnings report as a public company showed a 450% year-over-year increase in non-ad revenue, largely due to these agreements. Following the announcement of the OpenAI deal, Reddit’s stock rose 11% in extended trading.

Reddit CEO Steve Huffman highlighted the value of authentic content during a recent earnings call, stating, “As more content on the internet is written by machines, there’s an increasing premium on content from real people. We have nearly two decades of authentic conversation.”

Reddit’s platform, hosting over 1 billion posts and 16 billion comments, offers a rich resource for generative AI companies, whose models learn from examples of existing content to create new material. However, this move may face backlash from users concerned about the monetization of their data.

A parallel situation occurred with Stack Overflow, which partnered with OpenAI to supply data. Some users deleted their top-rated answers in protest. Stack Overflow restored the deleted posts and banned those users, citing non-compliance with its terms of service.

Reddit has already shown resistance to initiatives that aim to give users more control over their data. For instance, Reddit banned the subreddit of Vana, a blockchain startup attempting to create a data DAO (Decentralized Autonomous Organization) to let users collectively decide on the use or sale of their data. In a statement to the media, Reddit accused Vana of exploiting its data export controls.