Open-source AI is everywhere right now. The problem is, no one agrees on what it actually is. Now we may finally have an answer. The Open Source Initiative (OSI), the self-appointed arbiter of what it means to be open source, has released a new definition, which it hopes will help lawmakers develop regulations to protect consumers from AI risks.

Though OSI has published much about what constitutes open-source technology in other fields, this marks its first attempt to define the term for AI models. It asked a 70-person group of researchers, lawyers, policymakers, and activists, as well as representatives from big tech companies like Meta, Google, and Amazon, to come up with the working definition. 

According to the group, an open-source AI system can be used for any purpose without securing permission, and researchers should be able to inspect its components and study how the system works.

It should also be possible to modify the system for any purpose—including to change its output—and to share it with others to use, with or without modifications, for any purpose. In addition, the standard attempts to define a level of transparency for a given model’s training data, source code, and weights. 

The previous lack of an open-source standard presented a problem. Although we know that the decisions of OpenAI and Anthropic to keep their models, data sets, and algorithms secret make their AI closed source, some experts argue that Meta and Google’s freely accessible models, which are open to anyone to inspect and adapt, aren’t truly open source either, because of licenses that restrict what users can do with the models and because the training data sets aren’t made public. Meta, Google, and OpenAI were contacted for their response to the new definition but did not reply before publication.

“Companies have been known to misuse the term when marketing their models,” says Avijit Ghosh, an applied policy researcher at Hugging Face, a platform for building and sharing AI models. Describing models as open source may cause them to be perceived as more trustworthy, even if researchers aren’t able to independently investigate whether they really are open source.

Ayah Bdeir, a senior advisor to Mozilla and a participant in OSI’s process, says certain parts of the open-source definition were relatively easy to agree upon, including the need to reveal model weights (the parameters that help determine how an AI model generates an output). Other parts of the deliberations were more contentious, particularly the question of how public training data should be.

The lack of transparency about where training data comes from has led to innumerable lawsuits against big AI companies, from makers of large language models like OpenAI to music generators like Suno, which do not disclose much about their training sets beyond saying they contain “publicly accessible information.” In response, some advocates say that open-source models should disclose all their training sets, a standard that Bdeir says would be difficult to enforce because of issues like copyright and data ownership. 

Ultimately, the new definition requires that open-source models provide information about the training data to the extent that “a skilled person can recreate a substantially equivalent system using the same or similar data.” It’s not a blanket requirement to share all training data sets, but it also goes further than what many proprietary models or even ostensibly open-source models do today. It’s a compromise.

“Insisting on an ideologically pristine kind of gold standard that actually will not effectively be met by anybody ends up backfiring,” Bdeir says. She adds that OSI is planning some sort of enforcement mechanism, which will flag models that are described as open source but do not meet its definition. It also plans to release a list of AI models that do meet the new definition. Though none are confirmed, the handful of models that Bdeir told MIT Technology Review are expected to land on the list are relatively small names, including Pythia by Eleuther, OLMo by Ai2, and models by the open-source collective LLM360.
