The Problem With Open Sourcing AI Datasets

Paul Watson
5 min read · Sep 3, 2024
An open dataset with issues: paperswithcode.com/dataset/the-pile

Godwin Josh asked: "How do you envision fostering a culture of genuine openness within the AI community?"

I don’t envision genuine openness within the subset of AI that is LLMs (Large Language Models), of which OpenAI’s ChatGPT is the most infamous product.

Open sourcing the data used to train LLMs must be a requirement of open source AI. Open source is more specific than openness: it means the freedom to use, study, modify, and redistribute the source. Anything less is openwashing.

But there is no safe and responsible way to open source the estimated tens of trillions of words used to train OpenAI’s GPT model line.

To understand the scale of the problem, imagine the size of Wikipedia. After 23 years and millions of volunteers, the English-language portion is estimated at four billion words. You would need three thousand or more Wikipedias to match the datasets used by OpenAI.
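A rough back-of-the-envelope check of that ratio, assuming a training corpus of around twelve trillion words (a placeholder figure within the "tens of trillions" estimate above) and four billion words for English Wikipedia:

# Back-of-the-envelope scale comparison; both figures are estimates, not exact counts.
training_words = 12_000_000_000_000   # assumed ~12 trillion words ("tens of trillions")
wikipedia_words = 4_000_000_000       # estimated English Wikipedia word count

wikipedias_needed = training_words / wikipedia_words
print(f"Roughly {wikipedias_needed:,.0f} English Wikipedias")  # prints roughly 3,000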

How did OpenAI do it? They used datasets scraped from the world wide web. These datasets include everything from Wikipedia[2] to Reddit, from your government’s websites to Twitter, from The New York Times to The Sun. Hundreds of thousands of websites produced by over a billion people with every interest and intention. Think of a website you love, of a website you detest, of a website you are ashamed of…
