PARIS — This is as close as you can get to a rock concert in AI research. Inside the supercomputing center of the French National Center for Scientific Research, on the outskirts of Paris, rows and rows of what look like black fridges hum at a deafening 100 decibels.
They form part of a supercomputer that has spent 117 days gestating a new large language model (LLM) called BLOOM that its creators hope represents a radical departure from the way AI is usually developed.
Unlike other, more famous large language models such as OpenAI’s GPT-3 and Google’s LaMDA, BLOOM (which stands for BigScience Large Open-science Open-access Multilingual Language Model) is designed to be as transparent as possible, with researchers sharing details about the data it was trained on, the challenges in its development, and the way they evaluated its performance. OpenAI and Google have not shared their code or made their models available to the public, and external researchers have very little understanding of how these models are trained.
BLOOM was created over the last year by over 1,000 volunteer researchers in a project called BigScience, which was coordinated by AI startup Hugging Face using funding from the French government. It officially launched on July 12. The researchers hope developing an open-access LLM that performs as well as other leading models will lead to long-lasting changes in the culture of AI development and help democratize access to cutting-edge AI technology for researchers around the world.
The model’s ease of access is its biggest selling point. Now that it’s live, anyone can download it and tinker with it free of charge on Hugging Face’s website. Users can pick from a selection of languages and then type in requests for BLOOM to do tasks like writing recipes or poems, translating or summarizing texts, or writing programming code. AI developers can use the model as a foundation to build their own applications.
At 176 billion parameters (variables that determine how input data is transformed into the desired output), it is bigger than OpenAI’s 175-billion-parameter GPT-3, and BigScience claims that it offers similar levels of accuracy and toxicity as other models of the same size. For languages such as Spanish and Arabic, BLOOM is the first large language model of this size.
But even the model’s creators warn it won’t fix the deeply entrenched problems around large language models, including the lack of adequate policies on data governance and privacy and the algorithms’ tendency to spew toxic content, such as racist or sexist language.
Out in the open
Large language models are deep-learning algorithms that are trained on massive amounts of data. They are one of the hottest areas of AI research. Powerful models such as GPT-3 and LaMDA, which produce text that reads as if a human wrote it, have huge potential to change the way we process information online. They can be used as chatbots or to search for information, moderate online content, summarize books, or generate entirely new passages of text based on prompts. But they are also riddled with problems. It takes only a little prodding before these models start producing harmful content.
The models are also extremely exclusive. They need to be trained on massive amounts of data using lots of expensive computing power, which is something only large (and mostly American) technology companies such as Google can afford.
Most big tech companies developing cutting-edge LLMs restrict their use by outsiders and have not released information about the inner workings of their models. This makes it hard to hold them accountable. The secrecy and exclusivity are what the researchers working on BLOOM hope to change.
Meta has already taken steps away from the status quo: in May 2022 the company released its own large language model, Open Pretrained Transformer (OPT-175B), along with its code and a logbook detailing how the model was trained.
But Meta’s model is available only upon request, and it has a license that limits its use to research purposes. Hugging Face goes a step further. The meetings detailing its work over the past year are recorded and uploaded online, and anyone can download the model free of charge and use it for research or to build commercial applications.
A big focus for BigScience was to embed ethical considerations into the model from its inception, instead of treating them as an afterthought. LLMs are trained on tons of data collected by scraping the internet. This can be problematic, because these data sets include lots of personal information and often reflect dangerous biases. The group developed data governance structures specifically for LLMs that should make it clearer what data is being used and who it belongs to, and it sourced different data sets from around the world that weren’t readily available online.
The group is also launching a new Responsible AI License, which is something like a terms-of-service agreement. It is designed to act as a deterrent from using BLOOM in high-risk sectors such as law enforcement or health care, or to harm, deceive, exploit, or impersonate people. The license is an experiment in self-regulating LLMs before laws catch up, says Danish Contractor, an AI researcher who volunteered on the project and co-created the license. But ultimately, there’s nothing stopping anyone from abusing BLOOM.
The project had its own ethical guidelines in place from the very beginning, which worked as guiding principles for the model’s development, says Giada Pistilli, Hugging Face’s ethicist, who drafted BLOOM’s ethical charter. For example, it made a point of recruiting volunteers from diverse backgrounds and locations, ensuring that outsiders can easily reproduce the project’s findings, and releasing its results in the open.
This philosophy translates into one major difference between BLOOM and other LLMs available today: the vast number of human languages the model can understand. It can handle 46 of them, including French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indic languages (such as Hindi), and 20 African languages. Just over 30% of its training data was in English. The model also understands 13 programming languages.
This is highly unusual in the world of large language models, where English dominates. That’s another consequence of the fact that LLMs are built by scraping data off the internet: English is the most commonly used language online.
The reason BLOOM was able to improve on this situation is that the team rallied volunteers from around the world to build suitable data sets in other languages even if those languages weren’t as well represented online. For example, Hugging Face organized workshops with African AI researchers to try to find data sets such as records from local authorities or universities that could be used to train the model on African languages, says Chris Emezue, a Hugging Face intern and a researcher at Masakhane, an organization working on natural-language processing for African languages.
Including so many different languages could be a huge help to AI researchers in poorer countries, who often struggle to get access to natural-language processing because it uses a lot of expensive computing power. BLOOM allows them to skip the expensive part of developing and training the models in order to focus on building applications and fine-tuning the models for tasks in their native languages.
“If you want to include African languages in the future of [natural-language processing] … it’s a very good and important step to include them while training language models,” says Emezue.
Handle with caution
BigScience has done a “phenomenal” job of building a community around BLOOM, and its approach of involving ethics and governance from the beginning is a thoughtful one, says Percy Liang, an associate professor in computer science at Stanford who specializes in large language models.
However, Liang doesn’t think it will lead to significant changes to LLM development. “OpenAI and Google and Microsoft are still blazing ahead,” he says.
Ultimately, BLOOM is still a large language model, and it still comes with all the associated flaws and risks. Companies such as OpenAI have not released their models or code to the public because, they argue, the sexist and racist language that has gone into them makes them too dangerous to use that way.
BLOOM is also likely to incorporate inaccuracies and biased language, but since everything about the model is out in the open, people will be able to interrogate the model’s strengths and weaknesses, says Margaret Mitchell, an AI researcher and ethicist at Hugging Face.
BigScience’s biggest contribution to AI might end up being not BLOOM itself, but the numerous spinoff research projects its volunteers are getting involved in. For example, such projects could bolster the model’s privacy credentials and come up with ways to use the technology in different fields, such as biomedical research.
“One new large language model is not going to change the course of history,” says Teven Le Scao, a researcher at Hugging Face who co-led BLOOM’s training. “But having one good open language model that people can actually do research on has a strong long-term impact.”
When it comes to the potential harms of LLMs, “ Pandora’s box is already wide open,” says Le Scao. “The best you can do is to create the best conditions possible for researchers to study them.”