In the back room of an old and graying building in the northernmost region of New Zealand, one of the most advanced computers for artificial intelligence is helping to redefine the technology’s future.
Te Hiku Media, a nonprofit Māori radio station run by life partners Peter-Lucas Jones and Keoni Mahelona, bought the machine at a 50% discount to train its own algorithms for natural-language processing. It’s now a central part of the pair’s dream to revitalize the Māori language while keeping control of their community’s data.
Mahelona, a native Hawaiian who settled in New Zealand after falling in love with the country, chuckles at the irony of the situation. “The computer is just sitting on a rack in Kaitaia, of all places—a derelict rural town with high poverty and a large Indigenous population. I guess we’re a bit under the radar,” he says.
The project is a radical departure from the way the AI industry typically operates. Over the last decade, AI researchers have pushed the field to new limits with the dogma “More is more”: Amass more data to produce bigger models (algorithms trained on said data) to produce better results.
The approach has led to remarkable breakthroughs—but to costs as well. Companies have relentlessly mined people for their faces, voices, and behaviors to enrich bottom lines. And models built by averaging data from entire populations have sidelined minority and marginalized communities even as they are disproportionately subjected to the technology.
Over the years, a growing chorus of experts has argued that these impacts are repeating the patterns of colonial history. Global AI development, they say, is impoverishing communities and countries that don’t have a say in its development: the same communities and countries already impoverished by former colonial empires.
Peter-Lucas Jones (left) and Keoni Mahelona (right) attend an Indigenous AI Workshop in 2019.
This has been particularly apparent for artificial intelligence and language. “More is more” has produced large language models with powerful autocomplete and text analysis capabilities now used in everyday services like search, email, and social media. But these models, built by hoovering up large swathes of the internet, are also accelerating language loss, in the same way colonization and assimilation policies did previously.
Only the most common languages have enough speakers—and enough profit potential—for Big Tech to collect the data needed to support them. Relying on such services in daily work and life thus coerces some communities to speak dominant languages instead of their own.
“Data is the last frontier of colonization,” Mahelona says.
In turning to AI to help revive te reo, the Māori language, Mahelona and Jones, who is Māori, wanted to do things differently. They overcame resource limitations to develop their own language AI tools, and created mechanisms to collect, manage, and protect the flow of Māori data so it won’t be used without the community’s consent, or worse, in ways that harm its people.
Now, as many in Silicon Valley contend with the consequences of AI development today, Jones and Mahelona’s approach could point the way to a new generation of artificial intelligence—one that does not treat marginalized people as mere data subjects but reestablishes them as co-creators of a shared future.
Like many Indigenous languages globally, te reo Māori began its decline with colonization.
After the British laid claim to Aotearoa, the te reo name for New Zealand, in 1840, English gradually took over as the lingua franca of the local economy. In 1867, the Native Schools Act then made it the only language in which Māori children could be taught, as part of a broader policy of assimilation. Schools began shaming and even physically beating Māori students who attempted to speak te reo.
In the following decades, urbanization broke up Māori communities, weakening centers of culture and language preservation. Many Māori also chose to leave in search of better economic opportunities. Within a generation, the proportion of te reo speakers plummeted from 90% to 12% of the Māori population.
In the 1970s, alarmed by this rapid decline, Māori community leaders and activists fought to reverse the trend. They created childhood language immersion schools and adult learning programs. They marched in the streets to demand that te reo have equal status with English.
In 1987, 120 years after actively supporting its erasure, the government finally passed the Māori Language Act, declaring te reo an official language. Three years later, it began funding the creation of iwi, or tribal, radio stations like Te Hiku Media, to publicly broadcast in te reo to increase the language’s accessibility.
Many Māori I speak to today identify themselves in part by whether or not their parents or grandparents spoke te reo Māori. It’s considered a privilege to have grown up in an environment with access to intergenerational language transmission.
This is the gold standard for language preservation: learning through daily exposure as a child. Learning as a teen or adult in an academic setting is not only harder but also narrower: a textbook often teaches only a single, or “standard,” version of te reo when each iwi, or tribe, has unique accents, idiomatic expressions, and embedded regional histories.
Language, in other words, is more than just a tool for communication. It encodes a culture as it’s passed from parent to child, from child to grandchild, and evolves through those who speak it and inhabit its meaning. It also influences as much as it is influenced, shaping relationships, worldviews, and identities. “It’s how we think and how we express ourselves to each other,” says Michael Running Wolf, another Indigenous technologist who’s using AI to revive a rapidly disappearing language.
To preserve a language is thus to preserve a cultural history. But in the digital age especially, it takes constant vigilance to yank a minority language out of its downward trajectory. Every new communication space that doesn’t support it forces speakers to choose between using a dominant language and forgoing opportunities in the larger culture.
“If these new technologies only speak Western languages, we’re now excluded from the digital economy,” says Running Wolf. “And if you can’t even function in the digital economy, it’s going to be really hard for [our languages] to thrive.”
With the advent of artificial intelligence, language revitalization is now at a crossroads. The technology can further codify the supremacy of dominant languages, or it can help minority languages reclaim digital spaces. This is the opportunity that Jones and Mahelona have seized.
Long before Jones and Mahelona embarked on this journey, they met over barbecue at their swimming club’s member gathering in Wellington. The two instantly hit it off. Mahelona took Jones on a long bike ride. “The rest is history,” Mahelona says.
In 2012, the pair moved back to Jones’s hometown of Kaitaia, where Jones became CEO of Te Hiku Media. Because of its isolation, the region remains one of the most economically impoverished of Aotearoa, but by the same token, its Māori population is among the country’s best protected.
Over its 20-odd years of broadcasting history, Te Hiku had amassed a rich archive of te reo audio materials. It includes gems like a recording of Jones’s own grandmother Raiha Moeroa, born in the late 19th century, whose te reo remained largely untouched by colonial influence.
Jones saw an opportunity to digitize the archive and create a more modern equivalent of intergenerational language transmission. Most Māori no longer live with their iwi and can’t rely on nearby kin for daily te reo exposure. With a digital library, however, they’d be able to listen to te reo from bygone elders whenever and wherever they wanted.
The local Māori tribes granted him permission to proceed, but Jones needed a place to host the materials online. Neither he nor Mahelona liked the idea of uploading them to Facebook or YouTube. It would give the tech giants license to do what they wanted with the precious data.
(A few years later, companies would indeed begin working with Māori speakers to acquire such data. Duolingo, for example, sought to build language-learning tools that could then be marketed back to the Māori community. “Our data would be used by the very people that beat that language out of our mouths to sell it back to us as a service,” Jones says. “It’s just like taking our land and selling it back to us,” Mahelona adds.)
The only alternative was for Te Hiku to build its own digital hosting platform. With his engineering background, Mahelona agreed to lead the project and joined as CTO.
The digital platform became Te Hiku’s first major step to establishing data sovereignty—a strategy in which communities seek control over their own data in an effort to ensure control over their future. For Māori, the desire for such autonomy is rooted in history, says Tahu Kukutai, a cofounder of the Māori data sovereignty network. During the earliest colonial censuses, after a series of devastating wars in which they killed thousands of Māori and confiscated their land, the British collected data on tribal numbers to track the success of the government’s assimilation policies.
Data sovereignty is thus the latest example of Indigenous resistance—against colonizers, against the nation-state, and now against big tech companies. “The nomenclature might be new, the context might be new, but it builds on a very old history,” Kukutai says.
In 2016, Jones embarked on a new project: to interview native te reo speakers in their 90s before their language and knowledge were lost to future generations. He wanted to create a tool that would display a transcription alongside each interview. Te reo learners would then be able to hover on words and expressions to see their definitions.
But few people had enough mastery of the language to manually transcribe the audio. Inspired by voice assistants like Siri, Mahelona began looking into natural-language processing. “Teaching the computer to speak Māori became absolutely necessary,” Jones says.
But Te Hiku faced a chicken-and-egg problem. To build a te reo speech recognition model, it needed an abundance of transcribed audio. To transcribe the audio, it needed the advanced speakers whose small numbers it was trying to compensate for in the first place. There were, however, plenty of beginning and intermediate speakers who could read te reo words aloud better than they could recognize them in a recording.
So Jones and Mahelona, along with Te Hiku COO Suzanne Duncan, devised a clever solution: rather than transcribe existing audio, they would ask people to record themselves reading a series of sentences designed to capture the full range of sounds in the language. To an algorithm, the resulting data set would serve the same function. From those thousands of pairs of spoken and written sentences, it would learn to recognize te reo syllables in audio.
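How a set of reading prompts might be chosen to cover a language's sounds can be sketched in a few lines of Python. This is an illustration only, not Te Hiku's actual method, which the reporting does not describe. It assumes that written letters are a rough stand-in for sounds, treats the te reo digraphs "ng" and "wh" as single units, and uses a simple greedy rule; the function names `sound_units` and `pick_prompts` are hypothetical.

```python
# Illustrative sketch: greedily pick prompt sentences until every
# sound unit seen in the candidate pool is covered at least once.
# Assumption: letters approximate sounds, with "ng" and "wh" kept whole.

DIGRAPHS = ("ng", "wh")


def sound_units(sentence):
    """Return the set of rough sound units found in a sentence."""
    text = sentence.lower()
    found = set()
    i = 0
    while i < len(text):
        if text[i:i + 2] in DIGRAPHS:
            found.add(text[i:i + 2])
            i += 2
        else:
            if text[i].isalpha():
                found.add(text[i])
            i += 1
    return found


def pick_prompts(candidates):
    """Choose sentences greedily so that together they cover every unit."""
    remaining = set().union(*(sound_units(s) for s in candidates))
    chosen = []
    while remaining:
        # Take the sentence that adds the most units not yet covered.
        best = max(candidates, key=lambda s: len(sound_units(s) & remaining))
        chosen.append(best)
        remaining -= sound_units(best)
    return chosen


if __name__ == "__main__":
    pool = ["Kia ora", "Tēnā koe", "Ngā mihi", "Kei te pēhea koe"]
    for sentence in pick_prompts(pool):
        print(sentence)
```

A real prompt list would work from a phoneme inventory and pronunciation rules rather than spelling, and would likely aim to capture each sound many times and in varied contexts rather than just once.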
The team announced a competition. Jones, Mahelona, and Duncan contacted every Māori community group they could find, including traditional kapa haka dance troupes and waka ama canoe-racing teams, and revealed that whichever one submitted the most recordings would win a $5,000 grand prize.
The entire community mobilized. Competition got heated. One Māori community member, Te Mihinga Komene, an educator and advocate of using digital technologies to revitalize te reo, recorded 4,000 phrases alone.
Money wasn’t the only motivator. People bought into Te Hiku’s vision and trusted it to safeguard their data. “Te Hiku Media said, ‘What you give us, we’re here as kaitiaki [guardians]. We look after it, but you still own your audio,’” says Te Mihinga. “That’s important. Those values define who we are as Māori.”
Within 10 days, Te Hiku amassed 310 hours of speech-text pairs from some 200,000 recordings made by roughly 2,500 people, a level of engagement unheard of among researchers in the AI community. “No one could’ve done it except for a Māori organization,” says Caleb Moses, a Māori data scientist who joined the project after learning about it on social media.
The amount of data was still small compared with the thousands of hours typically used to train English language models, but it was enough to get started. Using the data to bootstrap an existing open-source model from the Mozilla Foundation, Te Hiku created its very first te reo speech recognition model with 86% accuracy.
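The reporting does not say how that accuracy figure was measured. One common convention in speech recognition, assumed here purely for illustration, is to compute the word error rate (the number of word insertions, deletions, and substitutions needed to turn the model's transcript into the reference transcript, divided by the reference length) and report one minus that value. A minimal sketch, with the hypothetical function name `word_error_rate`:

```python
# Illustrative sketch of word error rate (WER) using word-level edit distance.
# Assumption: "accuracy" is taken to mean 1 - WER; the article does not
# confirm that this is the metric Te Hiku used.


def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("kei te pēhea koe", "kei te koe")
    print(f"WER: {wer:.2f}, accuracy under this convention: {1 - wer:.2f}")
```

Under this convention, a transcript that drops one word from a four-word sentence has a word error rate of 0.25. Other metrics, such as character-level error rates, would give different numbers for the same model.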
From there, it branched out into other language AI technologies. Mahelona, Moses, and a newly assembled team created a second algorithm for auto-tagging complex te reo phrases, and a third for giving real-time feedback to te reo learners on the accuracy of their pronunciation. The team even experimented with voice synthesis to create the te reo equivalent of a Siri, though it ultimately didn’t clear the quality bar to be deployed.
Along the way, Te Hiku established new data sovereignty protocols. Māori data scientists like Moses are still few and far between, but those who join from outside the community cannot just use the data as they please. “If they want to try something out, they ask us, and we have a decision-making framework based on our values and our principles,” Jones says.
It can be challenging. The open-source, free-wheeling culture of data science is often antithetical to the practice of data sovereignty, as is the culture of AI. There have been times when Te Hiku has let data scientists go because they “just want access to our data,” Jones says. It now seeks to cultivate more Māori data scientists through internship programs and junior positions.
Te Hiku has since made most of its tools available as APIs through its new digital language platform, Papa Reo. It’s also working with Māori-led organizations like the educational company Afed Limited, which is building an app to help te reo learners practice their pronunciation. “It’s really a game changer,” says Cam Swaison-Whaanga, Afed’s founder, who is also on his own te reo learning journey. Students no longer have to feel shy about speaking aloud in front of teachers and peers in a classroom.
Te Hiku has begun working with smaller Indigenous populations as well. In the Pacific region, many share the same Polynesian ancestors as the Māori, and their languages have common roots. Using the te reo data as a base, a Cook Islands researcher was able to train an initial Cook Islands language model to reach roughly 70% accuracy using only tens of hours of data.
“It’s no longer just about teaching computers to speak te reo Māori,” Mahelona says. “It’s about building a language foundation for Pacific languages. We’re all struggling to keep our languages alive.”
But Jones and Mahelona know there will come a time when they will have to work with more than Indigenous communities and organizations. If they want te reo to truly be ubiquitous—to the point of having te reo–speaking voice assistants on iPhones and Androids—they’ll need to partner with big tech companies.
“Even if you have the capacity in the community to do really cool speech recognition or whatever, you have to put it in the hands of the community,” says Kevin Scannell, a computer scientist helping to revitalize the Irish language, who has grappled with the same trade-offs in his research. “Having a website where you can type in some text and have it read to you is important, but it’s not the same as making it available in everybody’s hand on their phone.”
Jones says Te Hiku is preparing for this inevitability. It created a data license that spells out the ground rules for future collaborations based on the Māori principle of kaitiakitanga, or guardianship. It will only grant data access to organizations that agree to respect Māori values, stay within the bounds of consent, and pass on any benefits derived from its use back to the Māori people.
The license has yet to be used by an organization other than Te Hiku, and there remain questions around its enforceability. But the idea has already inspired other AI researchers, like Kathleen Siminyu of Mozilla’s Common Voice project, which gathers voice donations to build public data sets for speech recognition in different languages. Right now those data sets can be downloaded for any purpose. But last year, Mozilla began exploring a license more similar to Te Hiku’s that would give greater control to language communities that choose to donate their data. “It would be great if we could tell people that part of contributing to a data set leads to you having a say as to how the data set is used,” she says.
Margaret Mitchell, the former co-lead of Google’s ethical AI team who conducts research on data governance and ownership practices, agrees. “This is exactly the kind of license we want to be able to develop more generally for all different kinds of technology. I would really like to see more of it,” she says.
In some ways, Te Hiku got lucky. Te reo can take advantage of English-centric AI technologies because it has enough similarity to English in key features like its alphabet, sounds, and word construction. The Māori are also a fairly large Indigenous community, which allowed them to amass enough language data and find data scientists like Moses to help make their vision a reality.
“Most other communities are not big enough for those happy accidents to occur,” says Jason Edward Lewis, a digital technologist and artist who co-organizes the Indigenous AI Network.
At the same time, he says, Te Hiku has been a powerful demonstration that AI can be built outside the wealthy profit centers of Silicon Valley—by and for the people it’s meant to serve.
Te Hiku Media receives a New Zealand innovation award for its language revitalization work.
The example has already motivated others. Michael Running Wolf and his wife, Caroline, also an Indigenous technologist, are working to build speech recognition for the Makah, an Indigenous people of the Pacific Northwest coast, whose language has only around a dozen remaining speakers. The task is daunting: the Makah language is polysynthetic, which means a single word, composed of multiple building blocks like prefixes and suffixes, can express an entire English sentence. Existing natural-language processing techniques may not be applicable.
Before Te Hiku’s success, “we didn’t even consider looking into it,” Caroline says. “But when we heard the amazing work they’re doing, it was just fireworks going off in our head: ‘Oh my God, it’s finally possible.’”
Mozilla’s Siminyu says Te Hiku’s work also carries lessons for the rest of the AI community. In the way the industry operates today, it’s easy for individuals and communities to be disenfranchised; value is seen to come not from the people who give their data but from the ones who take it away. “They say, ‘Your voice isn’t worth anything on its own. It actually needs us, someone with a capacity to bring billions together, for each to be meaningful,’” she says.
In this way, then, natural-language processing “is a nice segue into starting to figure out how collective ownership should work,” she adds. “Because regardless of how widely spoken they are, languages belong to a people.”