Date added: 15.04.26
Recording and transcript – Kathy Reid's 2026 Technologist Talk
In this lecture, Kathy Reid explores Mozilla Common Voice, language bias in speech tech and Mozilla Data Collective’s approach to ethical data stewardship for cultural organisations.
Contributor:
Title:
Recording and transcript – Kathy Reid's 2026 Technologist Talk
Last updated:
16.04.26
Description:
A presentation by Kathy Reid on Mozilla Common Voice, language bias and fairer ways to steward cultural data.
Tags:
This lecture was delivered on Wed 18 March 2026 in the Create Quarter at State Library Victoria.
Transcript
*This edited machine-generated transcript may contain errors.
Kathy Reid
I'm Kathy Reid. I work as the head of applied AI and ML (Machine Learning) at the Mozilla Data Collective, and I'm going to be talking today about two very interesting projects from Mozilla: the Mozilla Common Voice program, which is a piece of speech data infrastructure, and I'll also be talking about the Mozilla Data Collective, which aims to really try and reshape the data ecosystem.
I'd also like to acknowledge Country, and to acknowledge that I've travelled over Wadawurrung Country today from Geelong to be here. In particular, I want to acknowledge the languages spoken on this land for 2,000 unbroken generations. I'll be giving the talk today in English, specifically Australian English, but it's something that we don't think about when we use English: just how Anglo-centric our language and our speech and our talks and our culture are. And I want to acknowledge that there are over 250 other languages spoken in Australia today, and that many of the people who speak those languages are their very last speakers.
I've recently completed my PhD with ANU. My PhD was in the very fascinating topic, to me, of speech data and speech data bias. And I looked at things like how well the Whisper speech recognition algorithm runs with different accents of English. It does very, very well with Australian English. It does even better with New Zealand English, but it fails on a lot of other accents.
I'm a knitter, which I suspect is a hobby for many others in this community as well. Please give your dog a belly rub for me – I'm a huge dog lover. And being in a library, I thought it was really appropriate to let you all know what I'm reading at the moment.
So I got into technology because of William Gibson. So I read Neuromancer, the 1984 Hugo Award-winning novel that introduced the term "cyberspace", and I was hooked. And I've been a huge William Gibson fan ever since. And now I'm reading Martha Wells and the Murderbot series. I just think it's incredible. And I've been on a bit of a Becky Chambers binge.
Over the last few months I've just finished reading her Monk & Robot series. It's one of the best pieces of writing I've read in the last few years. So I'm a knitter, I'm a dog lover and I'm a reader. Today I'll be talking a little bit about the Common Voice platform, I'll be talking a little bit about the Mozilla Data Collective platform, and I also want to provide a lot of time for questions.
So first up, I want to take you back to 2017. In 2017, the Machine Learning that we know now was still very much in its infancy. ChatGPT hadn't been invented yet – ChatGPT didn't come out until late 2022. Whisper speech recognition wasn't released until August 2022. So we're still five years before the mainstream AI that we see today. But Machine Learning was really starting to take off, and we were starting to see some of the foundational work that we rely on today. And in 2017, the key to accelerating Machine Learning was having more data. The algorithms that we used for building speech recognition products in 2017 are very different to the algorithms we use today.
They're considered incredibly outdated – by several generations, just in nine years. But at that time, the bottleneck for making speech recognition work better was data. We didn't have enough speech data that was labelled and transcribed to build speech recognition models. And so there were very few openly accessible speech datasets. In fact, really the only openly accessible speech dataset that we had was something called LibriSpeech.
And LibriSpeech has an incredible story behind it. LibriSpeech was never intended as a speech recognition dataset. LibriSpeech is all of the data that came out of another open culture project called LibriVox. LibriVox is an open audiobook project where volunteers read out-of-copyright works, like those from Project Gutenberg, to create free audio resources.
And the LibriSpeech dataset, which was never intended as a speech recognition dataset, came out of the LibriVox project. So the only real speech recognition data that we had at the time was never, ever intended as speech recognition data. And so the problem that we had was that we were bottlenecked on having data to create Machine Learning models.
So we had a data scarcity problem. And the big problem that we had with speech recognition in 2017 was coverage. We had some speech recognition models that worked in some languages – English, French, Spanish, Russian, Japanese – but we didn't have anything for Kiswahili, or Kinyarwanda, or Bahasa Indonesia, or Bahasa Sunda or Bahasa Jawa.
There are 7,000 languages still spoken in the world today, and in 2017 there were 7,200. And we had speech recognition tools for maybe 100 to 150. So Common Voice was born out of a desire to help fill the data bottleneck and to help create speech recognition in more of the world's 7,000 languages.
When I talk about languages that don't have a lot of speech resources, often we'll call those under-resourced languages or we'll call them low resource languages. And I don't like that term. I am being political. The reason I don't like that term is because calling a language under-resourced or calling it a low resource language, frames language data as a commercial resource, as a capital resource.
If I call them underserved languages, it flips the script a little bit because I like to think that technology serves the language. Language doesn't serve technology. So that's why I use the term underserved languages.
So 2017, 7,200 languages were spoken. Very few speech recognition tools. 8 billion people, 6 billion of whom don't speak English. What are we going to do?
The first release of Common Voice was at the end of 2017, and it was released in two languages: English and Mandarin. Common Voice released 500 hours of speech data in English, which was donated by 20,000 speakers from across the world, all reading sentences and recording the sentences, which were uploaded to a platform. That doesn't sound like a lot of data.
And considering that tools like Whisper were trained on 680,000 hours of speech data, it really isn't. We don't know exactly where Whisper got all that speech data from – probably YouTube.
At the time, we were able to train a new speech technology, a new speech recognition engine called Deep Speech. Deep Speech is now considered an artefact – it's a museum piece in the history of Machine Learning, so it's incredibly outdated. But at the time it was state of the art, and Deep Speech could get to a word error rate – an accuracy rating for speech recognition – of about 8 to 9 words correct out of 10, which sounds absolutely rubbish now, because we can get the error rate to less than 3%. But at the time that was considered state of the art, and it was considered incredible. Moreover, the proof point of Deep Speech and the proof point of Common Voice was that we could use the same approach to gathering data and creating speech recognition models for many languages.
So we didn't have to create a different pipeline, a different set of tooling, for different languages. We could collect data in English, train English; collect data in Mandarin, train Mandarin; collect data in Kinyarwanda, train Kinyarwanda. So we were able to scale speech recognition to many more languages through this approach.
One of the things you might not know about Deep Speech is that the first author of the Deep Speech paper is a person called Dario Amodei. You might be familiar with that name because he's now the CEO of Anthropic. And the key thing to take away is that Machine Learning technologists tend to have long histories, and they pop up in different spots from time to time.
So over the last nine years, Common Voice has had a release almost every three months. And we've gone from two languages in 2017 to 350 languages as of this week. It's still not 7,000 languages; it's still not every language in the world, but 350 is a lot more than two. We now support languages as diverse as Pashto, spoken in Afghanistan and Pakistan.
This year – this week, in fact – we brought on Bahasa Sunda and Bahasa Jawa, which are both spoken in Indonesia as first languages by over 100 million people. We've brought on a lot of languages that are endangered in the Pakistani highlands – the languages of Balochistan. We've brought on board indigenous, endangered languages of the Americas – a lot of languages of Oaxaca, and Nahuatl languages in central Mexico. It's still not 7,000, but it's a lot more than two.
Common Voice has also worked very closely with philanthropic organisations. In 2021, we were funded to the tune of USD$3.5 million to create speech technologies in the Kiswahili language. Kiswahili is spoken by about 100 to 150 million people in Tanzania, Kenya, Uganda, even South Africa. But we had less speech recognition support for Kiswahili – which is a trading language, the lingua franca in that region – than we did for Icelandic. Icelandic is spoken by 330,000 people in Iceland, but the GDP of Iceland is a lot higher than the GDP of Kenya or the GDP of Tanzania. And this is why I reframe technology as serving languages, rather than languages being seen as a capital resource. Money speaks. And we'd like to change that equation a little bit.
One thing I want to talk about that we've done in the last couple of years with Common Voice is introduce a lot of functionality around accents and variants. And I want to distinguish between what an accent is and what a variant is. So you can probably tell that I speak with an Australian accent – an ‘Austraaayan’ accent – and I say things like "vase" to rhyme with "Mars".
I don't say "vase" to rhyme with "maize". So you can tell that I'm not American. That's phonetic – those are acoustic properties of my speech. And you can tell that I'm probably someone who identifies as female because of the pitch of my voice. If I identified as male, I might drop my pitch by about 20Hz and I might speak in a much deeper tone of voice.
So you can identify things like gender expression from my voice. They're all acoustic properties; they're all parts of my accent. But we also have a lot of lexical variety, a lot of variance, in how we speak in different languages. So I could say something to you like "there's been a bingle at Broady, the Western’s chocka back to the servo, I'm going to be late for bevvies at Tommo’s". You can understand what I'm saying, right? But if I say that to my American colleagues, they just wouldn't get it – "the silly Australian speaking Australian again". So we have variance in the words that we use and the phrases that we use.
I'm nearly 50, and so when I hear younger people speak, when I hear the youths speak, and they say things like "six, seven" or "skibidi ohio rizz like no cap or slay", well, obviously I'm confused because I'm old, right? But it's also a different pattern of speaking. People are choosing to use different words and different phrases. And the key thing here is that language is evolving. Language is ever changing. And if we want our automated systems to be contemporary, to keep up with the language that we use, to keep up with the accents that we use, those technologies need data that speaks that language as well.
We implemented accents in Common Voice in 2024, as part of my PhD work to analyse the coverage of accents in Common Voice. And each of those circles [points to diagram] represents a distinct accent in the Common Voice English dataset. There are 15,000 unique representations of accent in Common Voice, and part of my PhD work was to normalise those to about 285 and link them together.
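To give a rough sense of what that kind of normalisation step can look like in code, here is a minimal Python sketch; the mapping below is hypothetical and far smaller than the real taxonomy, purely for illustration.

    # Minimal sketch of normalising free-text accent descriptions to canonical
    # labels. The mapping is hypothetical, not the taxonomy used in Common Voice
    # or in the PhD work described above.
    CANONICAL = {
        "aussie": "Australian English",
        "australian": "Australian English",
        "strine": "Australian English",
        "geordie": "Geordie (Tyneside English)",
        "eastern european": "Eastern European English",
    }

    def normalise_accent(free_text: str) -> str:
        """Map a self-described accent string to a canonical label."""
        key = free_text.strip().lower()
        return CANONICAL.get(key, "unmapped: " + free_text)

    print(normalise_accent("Aussie"))            # Australian English
    print(normalise_accent("Eastern European"))  # Eastern European English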
So we know that many people who speak with an English accent also have an Australian accent, and we know that some people describe their accent very differently. When we give people the ability to describe their own accent, rather than have a so-called expert describe their accent for them, some of the really interesting findings were that people often describe their accent using a very regional indicator. People would say, "I have an Eastern European accent", not "I have a Russian accent", which shows some of the links between language, power and geopolitics. So that's one of the key pieces of functionality we have in Common Voice now.

And with the variant function, we're able to navigate political tensions a lot more easily. So for example, in northwestern China there's a region, the Uyghur region, which is home to a very unique, distinct ethnic population that has its own language, Uyghur. The Uyghur language can be written in either Arabic script or in Cyrillic script.
And one of the changes we implemented in Common Voice – we did not want to take a political position on how a language community's language should be represented; we wanted to provide tools for people to represent their own language – so we changed how Common Voice worked, so that the same speech could be represented in two different writing systems. So Uyghur in Common Voice is now recorded in both Perso-Arabic script and Cyrillic script, and it meets the needs of two different language communities. And that means that we're able to stay – I wouldn't call it politically neutral, because political neutrality is also a political position – but it's a political position of supporting the broadest language group possible. So that's a little bit of our work on accents and variants.
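As a purely hypothetical illustration (not the actual Common Voice schema), a single clip record that carries the same speech in two writing systems might look something like this:

    # Hypothetical record structure (not the real Common Voice schema) showing
    # how one Uyghur recording can carry transcriptions in two scripts.
    clip = {
        "clip_id": "ug_000123",
        "language": "Uyghur",
        "audio_path": "clips/ug_000123.mp3",
        "transcripts": {
            "perso_arabic": "...",  # placeholder for the Perso-Arabic orthography
            "cyrillic": "...",      # placeholder for the Cyrillic orthography
        },
    }

    # A downstream pipeline can pick whichever orthography its language
    # community needs, without forcing one writing system on everyone.
    for script, text in clip["transcripts"].items():
        print(script, text)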
How does Whisper do? Has anyone used Whisper speech recognition before? Yeah. How did it work for you? Pretty good. Where did it fail? Where did it fall down?
Audience Member
Indigenous, First Nations Australian terms.
Kathy Reid
Yeah, definitely. And one of the broader problems we have with speech recognition systems like Whisper is the data that they're trained on. So let's say – I guess, I don't know for sure – that Whisper was trained on YouTube. If we think about the data that's on YouTube – "please like and subscribe" at the end of every video – Whisper will often hallucinate things like "thank you for listening".
There's not as much indigenous content on YouTube. And so the data that things are trained on has huge implications for the models that are built. And the other thing that Whisper tends to get wrong is Australian place names. So I was living in Canberra during the pandemic in 2021, and you might remember the 'Ken Behrens' meme, when Andrew Barr – the Chief Minister, I think he is – stood up and congratulated Canberrans, because Canberrans had been really good at complying with all of the COVID restrictions: well done, Canberrans.
And the transcript at the bottom of the news article said "well done, Ken Behrens". Ken Behrens is a Madagascan wildlife photographer. It got it wrong because there was more training data that had Ken Behrens, the Madagascan wildlife photographer, than people saying "Canberrans", because it just isn't common – we don't get the Chief Minister congratulating Canberrans very often. And so as part of my PhD work, I used the Common Voice accent work to figure out how well Whisper did on particular accents, because when Whisper was released, it was evaluated on languages – how well did it work on particular languages?
We know that languages are not monoliths. I speak English differently to Geordies. I speak English differently to people who come from other countries. Whisper does pretty well with Australian English, and it does even better with New Zealand English from Aotearoa. If you speak English with a Malaysian accent or a Singaporean accent or a Thai accent, Whisper fails abysmally.
And so this told us that there are things that we need to do to improve speech recognition technologies. Even if you speak English, if you speak English with particular accents, these technologies won't work well for you. And so it's an equity and diversity issue.
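A minimal sketch of what a per-accent evaluation like this can look like, assuming the openai-whisper and jiwer Python packages; the clips, reference transcripts and accent labels below are made-up placeholders rather than real Common Voice data.

    # Sketch: transcribe clips with Whisper and compute word error rate (WER)
    # per self-reported accent group. All file names and labels are placeholders.
    from collections import defaultdict
    import whisper  # pip install openai-whisper
    import jiwer    # pip install jiwer

    clips = [
        {"path": "clips/au_01.wav", "reference": "well done canberrans", "accent": "Australian English"},
        {"path": "clips/sg_01.wav", "reference": "the library opens at nine", "accent": "Singaporean English"},
    ]

    model = whisper.load_model("small")
    refs, hyps = defaultdict(list), defaultdict(list)

    for clip in clips:
        result = model.transcribe(clip["path"], language="en")
        refs[clip["accent"]].append(clip["reference"])
        hyps[clip["accent"]].append(result["text"].strip().lower())

    for accent in refs:
        # Lower WER is better; comparing groups shows which accents are underserved.
        print(accent, jiwer.wer(refs[accent], hyps[accent]))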
One of the things I want to talk about here is what the implications of speech recognition and language diversity are for GLAM. And there are a couple of pieces here. A lot of cultural institutions do oral histories and oral archives, but more than ever we're reliant on speech recognition and transcription to do transcripts of those oral archives and oral histories. And if those oral histories are in accents that are not well understood by speech recognition, they're not going to be transcribed well, or they will have errors in them – Ken Behrens, not Canberrans. And I've seen some horrendous transcriptions of Australian Indigenous pieces. For a long time, when someone said that Gadigal land in Sydney is part of the Eora nation, Whisper would transcribe that as "urination" – so not only wrong, but deeply culturally offensive. So oral histories are one connection.

And we also know that we need more diverse data to create technologies that work for more diverse people. So as libraries start to use more AI and more Machine Learning, we need to check that they're working for all of the people that we serve, not just some people.
And so diversity, equity, inclusivity are core missions of GLAMs. And so there's a values alignment here with what we're doing. Libraries already champion intellectual freedom. They champion rights to knowledge. They champion diversity and knowledge.
One question that I'd like to end the Common Voice piece on is: what do we lose if we can't represent everyone? We're losing indigenous languages globally at a rate of one a fortnight. And in fact, there's a massive UNESCO effort called the Indigenous Languages Decade (2022–2032), which is aimed at stemming the loss of indigenous languages worldwide. If we don't capture speech data now from many of these indigenous languages, we won't have an opportunity to do so. Of the 250 languages still spoken in Australia, over 100 of those are spoken by the last speakers of those languages. Another 20 years and they won't be spoken.
What I am going to talk about a little bit is Mozilla Data Collective. So if you've worked in ML or AI or the data space, you're probably really familiar with the concept of data extraction and data harvesting. Does anyone here work with a web server where you have to block scrapers or bots all day?
I can see it's a familiar thing. We think data harvesting is really extractive. We think it's a very one-sided equation: one company gets the data, and there's no benefit going to the people who created the data, or curated the data, or cleaned the data – which is often GLAM institutions. And it also means that only a small group of players gets to play in the AI space.
And one of the things that you've probably started to feel as a cultural institution, as a GLAM institution, is that you're a little bit powerless in terms of which AI company you go with. It's about reframing the question: it's not just "which AI do I go with?", it's "can I grow my own?" We're starting to have our choices in the AI space constrained.
And billions of people are not represented. Most of the language on the Internet is English. Most of the language on the internet is a particular style – have you read Reddit? It doesn't represent many different forms of culture, many different forms of expression. It underrepresents many cohorts. And so as generative AI reproduces the data that it's trained on, and reproduces the biases and the cultural frames of reference, we're not representing many people in our AI future. And current approaches to data extraction also bulldoze rights.
So things like the CARE principles for indigenous data and indigenous data governance are getting bulldozed. Libraries were one of the first cultural institutions to release huge amounts of data – I'm thinking here of Trove and the huge amount of data that is released via Trove under a Creative Commons licence. Trove has been scraped to create AI models and ML models, but none of the benefits of those models go back to the institutions, or the organisations, or the individuals who did all the data work. ChatGPT doesn't pay you a royalty every time an answer uses something from Trove – it'd be nice, wouldn't it?
And so our vision: we want a multimodal, multicultural, multilingual AI future. We want a future that isn't just English, that doesn't just represent Western worldviews. And so we're building a platform where people who create data can get paid for that data. And so we want to create a platform of ethical data sharing under conditions that people control, under conditions where your data isn't scraped and used without your consent. And we want to make sure that the people using that data pay a fair price for it.
And so at the moment we've got 16,000 users. We have had 53,000 downloads. And we're talking to a lot of big players in the space about how we can work with them to create a better data future, create a data future that's more representative, that's more diverse.
What I will do is pass over the microphone and I will answer any questions you have.
Audience Member
One of these things that I guess always comes up with a lot of these discussions around the ethics around this stuff is the classic utilitarian quote of using the master’s tools to dismantle the master’s house. We've seen what these AI companies do with this data. They create things and this value doesn't come back. Are we sort of playing to their benefit and sort of creating a sort of hierarchy where they're still in control of our labour and they are still in control of information by sort of just creating these systems?
Kathy Reid
Fantastic question, and I love the Audre Lorde quote too: the master's tools will never dismantle the master's house. And I think, if I'm reframing your question: am I using the master's tools to try and dismantle the master's house? Partially yes, and partially no. At the moment, it's very, very difficult for players outside big AI to create models in the ML and AI space, because the need for data is so huge.
And so in order to create something that is like a Mistral or a DeepSeek, I have to first of all scrape huge amounts of data off the web. The only thing that's ready-made and intended for that purpose is Common Crawl. And because of the large-scale scraping, a lot of the people who used to be okay with having data in Common Crawl have said no, we're no longer okay with having that data in Common Crawl, and have pulled out of that initiative. So I think one of the ways we can dismantle the master's house is by having a different set of tools and by sharing those tools with other people – because it's not just about having a different set of tools, it's about knowing how to use those tools against the master and [unintelligible]
Yeah – Pedagogy of the Oppressed here. [unintelligible]. And so I think we're giving people tools that force big AI players to create revenue streams for the data that they're using, instead of just scraping it and just using it. So, for example, I look at ChatGPT and I look at Claude and I look at Gemini – just wholesale scraping of web servers. And yes, copyright allows that; it's not technically illegal, because copyright laws haven't kept up with the technologies of the 21st century. But let's have a go at trying a different system. It may not work, but let's try something, because the other part of oppression is a mindset, and thinking that you can't change the system. So yeah, excellent quote. Excellent question.
Audience Member
I wonder if you could give us any case studies or examples of how libraries and other GLAM organisations are getting involved in revenue generation.
Kathy Reid
So I can't name names at the moment, because we don't have signed agreements, but we are in talks with major GLAM institutions – two in the US and one national institution in Australia – to set up a revenue stream. And that's driven by a couple of different pieces.
One is that a lot of GLAMs have incredible data resources that have been curated, labelled and tagged, and an incredible amount of public money and public effort has gone into creating those collections. And at the same time, many institutions are facing increased funding pressure for a variety of reasons – a difficult economic climate, political shifts. So there are two pieces there.
They're sitting on incredibly useful data, but they also have no mechanism at the moment to turn that into a revenue stream, which could then be used to fund additional public programs or expansion or building works – because, you know, maintenance is really expensive. And so that's what we're in talks about with two major institutions in the US and one in Australia at the moment.
But yeah – different institutions, different histories, different data. So, some institutions have incredible challenges around what their obligations and rights are for surfacing indigenous data, particularly where there may be no indigenous family members or indigenous cultural presence left, because of forces of colonisation and because of erosion of language and culture. And so that's a key challenge: how do we surface data that may have indigenous implications, respectfully, in a culturally appropriate way that respects CARE protocols for indigenous data?
That's a huge challenge for Australian organisations. And it's also an increasing challenge for many North American organisations – particularly in Canada, which has a very strong indigenous heritage and culture, and increasingly elsewhere in North America, where the indigenous culture that existed for a long period of time is being recognised and given increasing weight.
So that's one of the key challenges for cultural institutions. And one of the challenges for us as a platform is: how do we provide the tools for things like dataset supply chains, data governance and dataset provenance that respect indigenous protocols, where those protocols may be different even in different parts of the same country?
There may be different indigenous protocols to uphold. So that's a key challenge for us as well. But yeah, there are many different GLAM institutions across the world, and they all tend to be facing very similar challenges, unfortunately. And one more challenge that I think we are facing is the challenge of truth. GLAMs are generally institutions of cultural record and institutions of cultural heritage, and they record truths from many different perspectives.
There are many different truths from many different angles. But in an age of generative AI, how do we determine what is the truth and what is a true cultural record? And that's also something that we're grappling with on the platform. How do we ensure truthfulness of the data that we're storing? Because if we can't guarantee the truthfulness of the data, we can't make claims about what is generated from that data.
Audience Member
I have another very different question. [unintelligible] from audio semantic meaning, semantic text. But my understanding of a lot of the sort of, you know, dictation tools proposes .... about extracting phonetics. [unintelligible] There's a lot of obviously attention into Machine Learning in neural network and stuff. But is there some similar movements around phonetic generation?
Kathy Reid
So first of all, great question. Phonemes are the building blocks of speech, and in English we have 44 of them. So when I say the word "bat", that's one syllable – so that's one – but it's actually three phonemes: 'beh', 'ah', 't'. And in speech recognition systems in days gone by, what would happen is that the speech recognition system would get the audio recording and would try to break the audio recording down into separate phonemes.
And then what would happen is that you would have a model that went over the phonemes and said, well, these phonemes together spell out this word, and those phonemes together spell out that word. Modern speech recognition systems like Whisper work in a completely different way. They work with a different Machine Learning architecture called transformers, and transformers use a very different approach.
A transformer uses links that it can find between parts of words to predict the next word that's going to come out of a sequence – that's about the best way I can explain it without a whiteboard and far too much maths. But at the moment we don't have very good grounding in phonetics for Whisper, and that matters because when we have accent variation in speech, it's the phonetics that are changing.
I can say the same sentence with an Australian accent, or I can say the same sentence with, you know, a Boston accent, or I can say it with a Geordie accent – it's the same words, but it's the phonetics that get tripped up. It's a really good question, but there's not a lot of work going into phonetics at the moment.
Could there be? Potentially, but I don't think it would work well with the underlying architecture – the transformer models that we use for things like Whisper today.
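To make the phoneme idea concrete, here is a small sketch using the CMU Pronouncing Dictionary via NLTK. It shows the older, explicitly phonetic view of words that pre-transformer recognisers were built around (American English pronunciations in ARPAbet symbols).

    # Sketch of the phoneme view of speech: look up how words decompose into
    # phonemes using the CMU Pronouncing Dictionary. Requires: pip install nltk
    import nltk

    nltk.download("cmudict", quiet=True)
    pronunciations = nltk.corpus.cmudict.dict()

    for word in ["bat", "vase", "canberra"]:
        # A word can have several pronunciation variants; take the first.
        print(word, pronunciations.get(word, [["<not in dictionary>"]])[0])

    # "bat" comes back as three phonemes (roughly B AE T, the 'beh', 'ah', 't'
    # building blocks above). Older recognisers modelled these units explicitly;
    # transformer models like Whisper map audio to text with no phoneme layer.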
Audience Member
I'm wondering about the [unintelligible]
Kathy Reid
So I'm getting some incredible questions today and both of those questions are great questions. The first question you asked is really about how can we map, how can we create a link between an input of data into a model and the output that the model produces?
So how do we determine that, exactly? That's an area of study called explainable AI. One of the challenges that we have with AI models at the moment is that, because under the hood they all use probabilities and statistics, it's very difficult for us to understand the influence of one data item in the training set on how the model is created. But there's a lot of work going on at the moment in the field of explainable AI to better understand that relationship: if we change the data inputs, how does it change the outputs of a model? So explainable AI is the field of research, and there are many different ways to do that.
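One crude way to illustrate that input–output question is a leave-one-out comparison: retrain a small model without one training example and see how much its predictions move. The sketch below uses scikit-learn on synthetic data and is only an illustration, not a specific explainable-AI method referred to here.

    # Leave-one-out sketch: retrain a small model without one training example
    # and measure how much the predictions shift. Synthetic data throughout.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    full_model = LogisticRegression().fit(X, y)

    # Drop training example 0 and retrain.
    X_loo, y_loo = np.delete(X, 0, axis=0), np.delete(y, 0)
    reduced_model = LogisticRegression().fit(X_loo, y_loo)

    probe = rng.normal(size=(5, 3))
    shift = np.abs(full_model.predict_proba(probe)[:, 1]
                   - reduced_model.predict_proba(probe)[:, 1])
    print("max change in predicted probability:", shift.max())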
The second question is about how we could modularise different models so that they're using different pieces. And there are two things that I want to talk about there. There's a technique called a mixture of experts, or an ensemble model, where we have multiple Machine Learning models that work together, and what they do is operate like a council of seniors or a council of wise people.
And they say: well, this model thinks this, that model thinks that, and this other model has a different opinion, and we're going to aggregate the outputs that all the models give us and then take the majority answer. So that's an ensemble model, or a mixture-of-experts model, which we're using a lot to take outputs from diverse models and try to get to an aggregate or a consensus answer.
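A minimal sketch of that majority-vote idea, with three toy "models" standing in for real trained ones; everything here is illustrative.

    # Majority-vote ensemble sketch. The three "models" are toy stand-ins that
    # return canned labels; a real ensemble would wrap trained models.
    from collections import Counter

    def model_a(x): return "cat"
    def model_b(x): return "cat"
    def model_c(x): return "dog"

    def ensemble_predict(x, models):
        """Ask every model for its answer and return the majority vote."""
        votes = [model(x) for model in models]
        return Counter(votes).most_common(1)[0][0]

    print(ensemble_predict("some input", [model_a, model_b, model_c]))  # cat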
Yeah. But the problem is it's always regression to the mean. It's always the average answer; it's always what multiple models think is the average. So we get further and further away from outliers or new things. And then the second piece is something that's called federated learning. Federated learning is something that we're looking at for very sensitive datasets.
In speech recognition, children's speech is very well guarded and very well protected. And the reason that we guard and protect children's speech is because we don't want to create synthetic speech of children, because it can be used in ways where we don't want it to be used. I don't want to talk too much about that, but there are reasons we don't allow children's synthetic voices to be generated.
So if we want to create a children's synthetic voice for a very legitimate, well-governed purpose, we need ways to protect the sensitive data that it's trained on. And what federated Machine Learning does is train on data that is never seen by the central model: the model asks the data a question and it will get a computed result or a computed outcome, but it never actually sees the data.
And so I can have children's speech here and children's speech there and create a model, but I don't ever see that data, and the privacy of the people who provide that data is protected and guarded. So we're seeing federated Machine Learning being used for very highly sensitive data – data that is governed by privacy law or by regulations, by medical requirements, and data from protected and vulnerable populations.
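A very stripped-down sketch of the federated-averaging idea, with a toy linear model and synthetic numbers: each site computes an update against its own local data, and only the model updates travel to the server – never the raw data.

    # Federated averaging sketch with a toy linear model. Each site keeps its
    # raw data local and shares only a model update; the server averages them.
    import numpy as np

    rng = np.random.default_rng(1)
    true_w = np.array([2.0, -1.0])

    def make_local_data(n):
        X = rng.normal(size=(n, 2))
        y = X @ true_w + 0.1 * rng.normal(size=n)
        return X, y

    sites = [make_local_data(50) for _ in range(3)]  # three separate data holders
    w = np.zeros(2)                                  # global model held by the server

    for _ in range(20):
        local_models = []
        for X, y in sites:                           # this loop runs at each site
            grad = -2 * X.T @ (y - X @ w) / len(y)   # local gradient of squared error
            local_models.append(w - 0.1 * grad)      # one local update step
        w = np.mean(local_models, axis=0)            # server averages the local models

    print("learned weights:", w)                     # approaches [2, -1]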
Audience Member
So yeah, it seems like there's a tension between those two things.
Kathy Reid
Definitely – Machine Learning is trade-offs all the way down. It's a perfect discipline for trade-offs all the way down.