Voice assistants and auto-captions are everywhere now, but they can quickly fall apart when someone speaks with an accent the system wasn’t trained to recognise, or in a language it rarely “hears”. In this episode of the SLV LAB Conversations podcast, we were joined by technologist and researcher Kathy Reid who explained why that happens and what can be done about it. She introduces Mozilla Common Voice, a community-led project that collects speech in hundreds of languages and dialects. Kathy also talks about the work of Mozilla Data Collective and what it could mean for libraries and other cultural institutions facing constant scraping of online collections.
Transcript
0:00:02 - 0:01:03
Sotirios Alpanis
Have you ever heard a voice assistant mispronounce a place name, or read an automated transcript packed with nonsense words that don't match what was actually said? You're not alone. It's a common problem affecting speech technologies. A lot of speech technology has been trained on a pretty narrow slice of how people around the world actually speak. For example, Google leans on YouTube videos to train its Gemini AI models and audio generation tools. So don't be surprised when your voice assistant starts sounding a little too much like your favourite YouTuber. In other words, AI voice technologies have a representation problem. I'm Sotirios Alpanis, Innovation Lead at State Library Victoria, and you're listening to an SLV LAB podcast, part of its Conversations series, where we discuss emerging technology, digital experimentation and library futures with artists, technologists and workers in the cultural sector.
0:01:03 - 0:01:31
Kathy Reid
There are 8 billion people in the world, and there are 7,000 languages spoken in the world by 8 billion people. If I go back 2 or 3 years, there were 7,200 languages spoken. And there's a huge piece here about language loss and about cultural loss and identity loss. And I think a lot of the work that is done in Common Voice is driven by the desire to ensure that cultures survive, because culture is grounded in language.
0:01:31 - 0:02:46
Sotirios Alpanis
In this episode, we spoke to technologist and researcher Kathy Reid. Kathy's the Head of AI and Machine Learning at Mozilla Data Collective, which aims to develop a fairer model for data stewardship, including how organisations like galleries, libraries, archives and museums can protect digital collections from indiscriminate scraping while still enabling responsible reuse and possibly generating some revenue. Kathy introduced the work of Mozilla Common Voice, a community-led project to collect speech data across languages and dialects and make it available for building more representative voice technologies. We spoke about the gaps, variations, and biases in language, and we learned about Kathy's own pathway into voice tech and research. The following conversation was recorded on the unceded lands of the Wurundjeri Woi Wurrung people of the Kulin Nation. We acknowledge the traditional lands of all the Victorian Aboriginal clans and their cultural practices and knowledge systems. We'd also like to acknowledge and pay our respects to any First Nations listeners joining us for this conversation. So to begin with, could you tell us a bit about Mozilla Common Voice and what it aims to do?
0:02:46 - 0:05:07
Kathy Reid
So, Common Voice started around 2017. And if we cast back to 2017, which I can't believe is, you know, nine years ago now, there were a lot of things happening in artificial intelligence and machine learning around 2017. It's where we really saw this step change in how machine learning and artificial intelligence operated, and some of the things that we take for granted now, or words that are very familiar to us now, like Transformers or ChatGPT. So the T in ChatGPT is Transformers. Those things were actually just being built in 2017. So that's how far the field has moved in under a decade. And so in 2017, when we went to build speech technologies like speech recognition, which transcribes spoken audio into text, the tools that we had at that stage were very, very much built around needing a lot of speech data that was very well transcribed in order to build machine learning models. And one of the big issues that we had around 2017 was a data scarcity issue. So in 2017, it took around 10,000 hours of recorded speech to build a speech recognition model that had any sort of level of accuracy. And so Common Voice is really a response to a data scarcity problem. So the first release of Common Voice was at the end of 2017, and at the end of 2017 there was about 500 hours of English speech, which by today's standards isn't very much at all. If we consider that speech recognition programs like Whisper are trained on... well, Whisper was originally trained on about 680,000 hours of speech data. So, you know, 500 hours now doesn't seem like a lot. But it was a lot then. And that enabled a speech recognition program called Deep Speech to be created. At the time it was written in, or worked for, both English and Mandarin. And then Common Voice grew from there. So it expanded from the original two languages.
And as of – well, I'm not sure if I'm officially allowed to say this – but we're releasing Common Voice 25 this week, and it will have 350 languages supported, from, you know, the two that we started with in 2017. So yeah, Common Voice is originally a response to a data scarcity problem.
0:05:07 - 0:05:16
Sotirios Alpanis
So I guess Common Voice is a bit of an umbrella term for a platform, a dataset. Can you talk a bit about those different aspects?
0:05:16 - 0:09:49
Kathy Reid
So there's many elements that make up Common Voice. And I really want to underscore here that Common Voice isn't just a technical piece. Common Voice is also a community piece. So Common Voice is a platform that allows data contributors – people who speak a particular language or dialect – to contribute speech samples. And so originally Common Voice was designed to have what in the industry we call elicited speech or read speech. And it's where I might go onto the platform and read out a sentence like “the quick brown fox jumped over the lazy dog”. And that's one type of data that's used for speech recognition. But we've also expanded the platform recently into spontaneous speech. And so when we're talking with each other like we are now, if I try to transcribe the speech that we're using in the podcast at the moment, we're not using read speech. And I'm not talking like I'm reading from a book. We're having a really natural, free-flowing conversation. And so speech recognition engines tend to work better if they're trained on spontaneous conversational speech. So that's something that we've added to the platform recently. So it's a platform for data capture, for people to contribute speech samples. Common Voice is also a set of datasets. So every language that is supported in the Common Voice platform gets a dataset release about every three months. And so that dataset contains the speech samples. It contains the transcription. And if people have volunteered to give us their demographic data – so their gender expression, their age, the accent that they speak with – that's something that also forms part of the dataset, and that information is really important because what we found is that speech recognition tends to have a lot of biases. So it tends to work better for some people rather than others. And a lot of those biases come from the data that the speech technologies were originally trained on.
So Common Voice is a platform, Common Voice is a dataset, and Common Voice is also community. And so people don't just wake up one morning and think, gee, I'd like to go and donate some speech samples to Common Voice. I wish people felt that way, but unfortunately they don't. And so we do a lot of work with communities, or language communities, to encourage speech contributions and to help people build up language resources in their own community. And we've got some incredible examples of that. So, one that's reasonably recent: there's a language group, or a language community, for the Pashto language, which is in both Afghanistan and Pakistan. Obviously, an area that's seen a huge amount of conflict and challenges with economic development. And when I say that Common Voice has 350-odd languages in its infrastructure, the importance of what the Pashto community has done will sort of shine through here. In less than two months Pashto has taken the number three spot on the Common Voice leaderboard for the amount of data that's been collected. So, the number one spot goes to the Catalan community from northern Spain. English is number two. And as of this month, Pashto is number three in terms of the amount of data that's being collected. And the Pashto community has done that over just, you know, 2 to 3 months of work. And that's been through community coordination, grassroots campaigns, really getting into the community and letting the community know, hey, if you give us a sample of your voice, these are the speech technologies that are able to follow from that. So, yeah, Common Voice is a platform, Common Voice is datasets, Common Voice is community. And when we put a wrapper around all of that, Common Voice is shared public infrastructure.
And I don't want to get too deep into the weeds here about theories of infrastructure, because I don't want everyone to fall asleep when I start talking about different French philosophers. But when I talk about infrastructure, what I really mean is built technical environments that fade into the background, that are taken for granted. And we don't know that they're there until they're missing or they're broken or they're not there. And so Common Voice is not just a platform. It's not just datasets. It's not just a community. Common Voice is a digital public good. It's digital public infrastructure.
0:09:49 - 0:10:22
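The shape Kathy describes – transcribed speech samples plus voluntary demographic fields, shipped as a periodic dataset release – can be sketched in a few lines. The column names and rows below are illustrative assumptions modelled on a typical Common Voice release TSV, not the exact schema:

```python
import csv
import io
from collections import Counter

# Toy rows shaped like a Common Voice release TSV. Column names and values
# are assumptions for illustration; real releases are far larger.
SAMPLE_TSV = """client_id\tsentence\tage\tgender\taccents
spk1\tThe quick brown fox jumped over the lazy dog.\ttwenties\tmale\tAustralian English
spk2\tPlease validate this clip.\tsixties\tfemale\tCatalan-accented English
spk3\tGood morning everyone.\ttwenties\tmale\t
"""

def demographic_counts(tsv_text, field):
    """Count how often each value of a demographic field appears,
    treating blank fields as 'unstated' (demographics are voluntary)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[field] or "unstated" for row in reader)

print(demographic_counts(SAMPLE_TSV, "gender"))   # Counter({'male': 2, 'female': 1})
print(demographic_counts(SAMPLE_TSV, "accents"))
```

This is the kind of check that Kathy's later point about representation relies on: because contributors can volunteer age, gender and accent, anyone can ask whose voices a release actually contains.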
Sotirios Alpanis
One of the things that's really struck me, lurking on the Common Voice Discord, is you just see these people joining and talking about their passion for their particular language or the datasets they're working with. And I find myself going down rabbit holes, languages I've never heard of. And it's just, yeah, it really sort of adds a more human element to it. And to see such enthusiasm for the work that's being done. And yeah, it's kind of really refreshing. And I think, you know, as someone who works and lives predominantly in English, it's a really good reminder that, you know, so much of the world doesn't converse in English.
0:10:23 - 0:11:58
Kathy Reid
Oh, absolutely. I couldn't agree more, Sotirios. I think when we think about the stats – because, of course, I'm going to pull out some data here – there are 8 billion people in the world, and there are 7,000 languages spoken in the world by 8 billion people. If I go back 2 or 3 years, there were 7,200 languages spoken. So there's a huge piece here about language loss and about cultural loss and identity loss. And I think a lot of the work that is done in Common Voice is driven by the desire to ensure that cultures survive, because culture is grounded in language. So in English, we don't have words for things that are in other cultures, and other cultures have a lot more nuance and a lot more description for things that we can't express particularly well in English. And this is where technologies like machine translation, that translate from one language to another, often fall down. And I think of different idioms from different languages. And sometimes I think English is actually an incredibly clumsy language that lacks a lot of expressiveness. And so I think it's being able to recognise that there are 7,000 languages spoken in the world by 8 billion people, and only 1.2 billion people speak English. So there's, you know, 6 billion people who don't speak English.
0:11:58 - 0:12:05
Sotirios Alpanis
So focusing a bit more on yourself, Kathy, what was your contribution, or what is your ongoing contribution, to Common Voice?
0:12:05 - 0:13:55
Kathy Reid
So I've bounced around a little bit with Common Voice. I think I've had two different stints with Common Voice. And now I'm in a third role with Mozilla Data Collective. So, when I started my PhD at ANU in 2020, which seems like the before times now, my original PhD topic was focusing on voice bias and bias in speech data that's used to build speech technology. And it's through a series of incredibly serendipitous connections. One of my PhD advisors ended up being Dr Jofish Kaye, who at the time was the Emerging Technologies Lead for Common Voice, and so I was able to do some part-time work with the Common Voice project, which was, you know, an incredible opportunity. And so my role there was as a voice data specialist, and I did a lot of work with metadata and how we were representing speech data. So some of my work there looked at, well, we have all of this language data and speech data in Common Voice, but whose voices are actually being represented? And when we started to do some of that analysis work, we realised that for many languages – not all, there are some exceptions; off the top of my head, some of the Southern African languages and I think Japanese are the exceptions – but for most languages on Common Voice, most of the data samples are from younger people and from people who identify as male. And so when we think about things like representing older women in speech technologies, that's something that, you know, we have to consider: how do we have more representation from older women in speech technology, so that we don't inculcate or reproduce some of the societal biases that we already have?
0:13:55 - 0:14:00
Sotirios Alpanis
What sorts of techniques have you used to try and overcome those sorts of selection biases?
0:14:00 - 0:15:55
Kathy Reid
So a lot of this comes down to... so Common Voice is a tiny team, you know, a handful of people who are working within Mozilla. This is where we really leverage that community aspect of the Common Voice community. So we often work with regional researchers who do a lot of work within their language communities and who have a real understanding of the different community groups and community organisations who have an interest in recording speech data. And one of the really interesting case studies there is from a region in northern Spain called Catalonia. And it's a really interesting intersection of language, culture, identity and independence. So for a long time – just like in Australia, where indigenous languages were marginalised and oppressed for a long time, and that's led to a break in generational transmission, and colonial practices, which lead to linguistic and cultural erasure – in northern Spain, you know, there was a similar process for many years, where the Catalan language, which is close to Spanish but not the same, was oppressed for many years. You know, it wasn't allowed to be spoken. It wasn't an official language of government. And so as part of the Catalan independence movement, we saw this incredible desire for people from Catalonia to record Catalan. And so in the metadata that we have around the Catalan dataset in Common Voice, we actually find that for Catalan, many of the speakers are much older, and we have a much better gender balance in the speakers, because the younger generation never learned to speak Catalan because of this process of colonisation and language oppression. And that's been a huge eye-opener for me in terms of the links between language, culture and power.
0:15:55 - 0:16:02
Sotirios Alpanis
Yeah. That's fascinating. Does this happen a lot? Do you sort of pick up these really interesting human stories and factoids about, you know, language across the world?
0:16:03 - 0:18:34
Kathy Reid
Oh, absolutely. Like, well, one of the things we ask people when they join the Common Voice project or Mozilla Data Collective is what languages they speak. Because people often come to us with, you know, a fascination for languages and a fascination for cultures. And, you know, one of my colleagues, Professor Fran Tyers, who, I think he's up to 20 languages that he can speak fluently – just incredible, our Fran. So, yeah, often people come to us with a very strong languages background. But there's lots of stories like that in the Common Voice project. You know, this is Common Voice as digital public infrastructure, as cultural infrastructure. Another really interesting story there is the story of Uyghur. So the Uyghur community is geographically in China. They are culturally and ethnically very different to many other Chinese groups. And the Uyghur language has gone through, again, very similar forces of constraint. And so the Uyghur language can be written in two different scripts. It can be written in Cyrillic script, and it can be written in Perso-Arabic script, and different parts of the community have preferences for writing in different ways. And rather than having a particular linguistic ideology, or a particular view of how Common Voice believes that Uyghur should be written, one of the things that we did to try and accommodate what is very natural language variation across communities is change the Common Voice platform so that you can have speech data that is recorded in multiple writing systems, multiple orthographies. And so by making what is really a small technical change, we're better able to meet the needs of different subcommunities within a language community. And that's something we're really proud of as well. Like, we didn't realise when we started the project just how intertwined language is with concepts of power, concepts of governance, concepts of diversity and equity across language communities.
So yeah, it's been a real eye-opener for me in terms of, you know – perhaps I was overly naive, but here I was thinking that, well, speech data is just, you know, signal processing, and it has, you know, decibel readings, and then we align it with a transcript and it's data. You know, what is the politics and the culture and the power interplay here? And I think that's been a huge learning for me.
0:18:34 - 0:18:41
Sotirios Alpanis
Speaking of messy data, what sort of opportunities do you see for applications of Common Voice in the GLAM sector?
0:18:41 - 0:20:30
Kathy Reid
The first thing that I want to say about messy data and Common Voice and the GLAM sector is really about diversity, equity, and really about representation. And if I think about the work that many GLAM institutions do and what their mission alignment is – so they're often government funded, they operate in the public interest – that public is diverse and that public is broad, and that public in Australia speaks 250 languages, not just English. So I think that there's a lot of values alignment that I see between Common Voice and GLAM institutions. And I see a lot of the work that Common Voice does in creating digital public infrastructure and digital public goods also being part of the work that GLAM institutions do in creating digital public goods and digital public infrastructure. So if I step back to 2017, and I step back to what I see as the birth of the current era of artificial intelligence and machine learning, and I think about where the labelled data came from that goes into things like ChatGPT, that labels images. And I think about the very well labelled text data for things like Wikipedia and Wikidata. A lot of the data that has gone into foundation models for artificial intelligence now actually came out of GLAM institutions. So GLAM institutions have done a lot of the hard work for AI and machine learning. And I'm not sure that they're getting a lot of the benefits that have come out of the AI and ML advancements that we're seeing.
0:20:30 - 0:20:38
Sotirios Alpanis
I was wondering if there are any sort of existing case studies or partnerships for Mozilla Common Voice with GLAM or research institutes.
0:20:38 - 0:22:55
Kathy Reid
With Common Voice, not specifically. So we have individual research agreements with individual language researchers. And one of my colleagues in this space is a person called Meesum Alam. He does a lot of work with Pakistani languages. That's an incredibly interesting use case, because if we think about how languages develop geographically, often what you find is that geographical boundaries, like mountains, lead to linguistic isolation, because people can't get over the mountains to share languages. And we don't get, like, the languages mixing together. And there's, like, a whole series or whole piece of research here around how the internet has allowed language spread, and how language is flattening, because everybody speaks the same sort of, you know, everybody speaks the same form of English, or everybody speaks the same sort of language now, because the internet's flattened everything. Meesum's work is fascinating because in Pakistan, especially in northern Pakistan and around the Afghan borders, Pakistan is all mountains: mountain, mountain, mountain, mountain. And so you go over one mountain and you drop into a different language, and you go over another mountain and you drop into, you know, a related and possibly mutually intelligible variant of the same language, but it's different enough that speaker 1 might not be able to understand speaker 2. Anyway, Meesum's work is incredibly fascinating because he's documenting and recording a lot of these very, very, very diverse languages of Pakistan and Afghanistan, where we have only thousands of speakers of those languages. But often they don't speak any other language. And so there's all sorts of questions around economic participation and societal participation, marginalisation in a technical society. So again, we get back to the concept of language is data, language is power, language is social inclusion, language is economic inclusion. And we get back to the, you know, language is power.
0:22:55 - 0:23:07
Sotirios Alpanis
In terms of Mozilla Common Voice, you've already touched on a couple of the challenges, particularly around, you know, the biases towards English. I wonder if there are any other sorts of broad topics around challenges you've encountered.
0:23:07 - 0:28:45
Kathy Reid
So the way I'd like to frame this is not so much as challenges, because I think framing it as challenges frames it as a problem, when I don't want to frame it as a problem. I want to frame it as something that is a natural occurrence. I don't want to problematise something that I don't see as a problem. In language, we have an enormous amount of natural variation. So I'm a woman, and you can probably hear that I'm a woman because I speak with a higher register than you, Sotirios. And so, you know, if I were a man, I would probably speak down here and I'd speak, you know, 20 Hz lower. So there are gender differences in how we speak. There are accent differences in how we speak as well. So you can see – oh, you can't see – you can hear that I speak with an Australian accent, because I say things like vase to rhyme with Mars, whereas my American colleagues say vase to rhyme with ways. And so we have all of this natural variation in language. So we have that acoustic variation, the accent variation. But we also have lexical variation in how we speak as well. So as an Australian, it would be completely normal for me to say to one of my friends, yeah nah, there's been a bingle at Broady, the Western's chokkas back to the servo, and I'm gonna be late for bevvies at Tommo's. So that Australian slang and those Australian contractions are completely natural speaking in an Australian context. But my American and my British friends might say, Kathy, what on earth are you talking about? And so we have lexical variation in language as well. And we also have variation over time. So if we listen to news broadcasts of the 1950s and 1960s – “welcome to the ABC news” – the accents differ, the wording that's used differs, even spelling changes over time. So it's like programme, double m e, versus program with one m. So we have all of this linguistic variation. We have speech variation. And we also have what I call time variation in language.
And one of the things that we've had to work through with Common Voice is that, when Common Voice was originally established, it recorded read speech or elicited speech. So you read a sentence, and because of the way Common Voice is licensed – because it's digital public infrastructure and very openly licensed – all of the sentences are public domain as well. And one of the ways in which we get sentences that are licensed public domain is from projects that are out of copyright. So one of the biggest out-of-copyright projects is Project Gutenberg, which has, like, text works that are out of copyright – and copyright only lasts 70 years in Australia. And so you get things like tenements and buggy carts in the read speech sentences. You don't get things like, I'm running this on a GPU with 28 gigs of RAM, right? Because those things don't occur in Project Gutenberg, because they weren't invented yet. And so we have a challenge of temporality with speech. So if I think about how Gen Z speaks, you know, like skibidi ohio rizz – nothing that's out of copyright is going to have the phrase skibidi ohio rizz for somebody to read in Common Voice, right? So one of the ways that we've tackled this: instead of having read speech, we're now moving to spontaneous speech. So instead of asking people to read a sentence, we'll ask them to answer a question. And that question might be, you know, tell us about your shopping trip, or tell us about what you did with your friends on the weekend. And so we're eliciting a much more conversational, a much more naturalistic use of phrases and expressions. I don't know, somebody might use the words skibidi ohio rizz in their response. So that's the challenge. There's another one, sort of coming back to natural language variation: many languages exhibit what's called code switching. So code switching is where there are elements of two or more languages that are used at the same time.
So for example, in Singapore you will often hear English mixed with Malay or Tamil or various forms of Chinese, because of the ethnic diversity in Singapore. And so we call that Singlish. But it's English which is code-switched with many other languages. So for example, in Nairobi, in Kenya, you have a dialect called Sheng, which is Kiswahili code-switching with English. And in parts of Africa that were colonised by the French – like in Rwanda – you'll have Kinyarwanda, which code-switches with French. And so you have French words dropped into Kinyarwanda. And so one of the challenges that we have in machine learning and artificial intelligence is: how do you train a model that understands that this might contain, like, both Kiswahili and English? Or it might contain English that is Malay-accented, that has Tamil words as well as Chinese. So that's also another example of very natural language variation. But how do we represent that variation in machine learning and artificial intelligence?
0:28:45 - 0:28:54
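A toy sketch of why code-switched speech defeats a single language label. The mini word lists here are made-up stand-ins purely for illustration; real systems use language-identification models learned from data, not word lookups:

```python
# Hypothetical mini-lexicons; real language ID is learned from data,
# not looked up word by word.
KISWAHILI = {"asante", "sana", "habari", "yako"}
ENGLISH = {"thanks", "bro", "meeting", "late", "for", "the"}

def tag_words(sentence):
    """Tag each word with a guessed language code."""
    tags = []
    for word in sentence.lower().split():
        if word in KISWAHILI:
            tags.append((word, "sw"))
        elif word in ENGLISH:
            tags.append((word, "en"))
        else:
            tags.append((word, "??"))
    return tags

# A Sheng-like mixed utterance ends up with two language codes at once,
# so a single "language" metadata field can't describe it.
print(tag_words("asante sana bro"))   # [('asante', 'sw'), ('sana', 'sw'), ('bro', 'en')]
```

The point isn't the lookup itself, it's that the right answer for one utterance is a set of languages, which is exactly what a one-language-per-clip schema can't express.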
Sotirios Alpanis
So you come along with maybe a more rigid metadata structure and say, you know, what language is this? And it's actually, well, it's not one language. It could be 2 or 3 in the same sentence.
0:28:54 - 0:30:29
Kathy Reid
Yeah, exactly. And I'm reminded that, when I was doing my PhD, I did a stint for another large company – and I'm not going to name them in the podcast, but they're very big – and I was working in their speech recognition team, and we were collecting data for a speech recognition project, and somebody was giving me the list of requirements, because it was my job to go and find the data. And they were listing the languages that they needed speech data for, you know, English, Japanese, Spanish. And I said, well, which Spanish do you want? And I got a bit of a blank look back. And I said, well, you know that Spanish is spoken in Spain, Castilian Spanish, but there's huge variation between Barcelona in the north and Valencia in the south – there's a huge variation of Castilian Spanish. And then when we go over to Latin America, we've got Spanish-speaking Mexico, we've got Spanish-speaking Guatemala, which often code-switches with Guarani languages. And then if you go to Miami in Florida in the US, or to Texas or the southern U.S. states, you often have English which is code-switched with Southern Latin American Spanish as well. So there's many different forms of Spanish. And I think that was a huge eye-opener for me in terms of the mental models that people have of languages as categories, and the huge amount of human language variation that there is in the world. So that's my “which Spanish?” question that I keep coming back to.
0:30:29 - 0:30:35
Sotirios Alpanis
So that feels like a good opportunity to dig a bit more into your background and your work experience.
0:30:35 - 0:33:34
Kathy Reid
So there were two things I was very good at when I was in high school. I was very good at computers, and I was very good at languages. And, well, if I go back even further, I wanted to be an astronaut when I was a little girl. So I was obsessed with space. And this is, like, the 1980s. And then Challenger happened, and, like, you know, maybe not a good longevity career option there. So I ended up being very good at computers and very good at languages in high school. And when I got to university, I didn't want one or the other. And I ended up getting into a double degree program at Deakin in Geelong, and I was able to do both information systems and get an arts degree in languages. And, as things transpired, I ended up not working in languages for a long time. I ended up working in IT for a university for many, many years. But then I left the university and I ended up working for a voice assistant company. And now you can start to see the languages and the technology coming together. That was a privacy-focused voice assistant company that operated in different languages. And so I was drawing on my language skills again and my understanding of how linguistics and languages work. And then I was actually here in Melbourne in 2018 at a conference called DDD Melbourne, which is a developer conference. And Genevieve Bell had just come back from Silicon Valley, back from Intel, and two of my friends, Sae Ra Germaine and Cameron Tudball, were with me at the conference. And I said, look, Genevieve Bell's doing this new master's program, and you should totally apply. And I'm like, I'll never get in, like, I'm not going to apply. And so Sae Ra and Cameron, being the incredible humans they are, said, right, we dare you to apply. And so, yeah, how do I say no to a dare?
You know, your friends dare you – it's an Australian thing to do. And so I applied to ANU – at the time it was the 3Ai Institute's Masters in Applied Cybernetics program – on a dare, and ended up getting through the first round. Ah, I fluked this. Ended up getting through the second round – maybe I'm in with a chance. And I got out of the interview, the third round interview, and the penny dropped that that wasn't a selection interview. It was a confirmation interview. And about ten minutes later, I picked up the phone and it was Genevieve Bell on the other end, and she said, Kathy! Genevieve, how are you? She said, I need you to move to Canberra, you're starting a new master's course next year. And so I packed up from Geelong, and I headed to Canberra for what I thought would be a year. Did well in the master's program. It was a, you know, incredible opportunity going into the PhD program from that, and ended up doing a PhD in voice data and speech data and voice data bias. So, that's how two very disparate threads of language and technology have resulted in me working for Mozilla.
0:33:34 - 0:34:05
Sotirios Alpanis
I don't think I've ever heard of anyone doing a PhD basically on a dare. Well, the consequences of a dare. What struck me there was, and it's something I've experienced myself and talked to other people about, is that sort of false dichotomy, I guess, between arts and science that's forced on you from a very young age. And it kind of reminded me of, you know, when you were talking about, you know, which Spanish. It's a really, yeah, it's a false way of codifying things. And actually, it's just great that you've been able to combine those two interests in your career.0:34:05 - 0:35:28
Kathy Reid
I couldn't agree more. I think if we think about the problems that we're solving in the world today, if we think about problems that are wicked problems, that are intractable, they're problems of systems, they're problems of ecosystems, they're problems of collective action, they're problems of getting many people with many different opinions and many different viewpoints to work constructively together for a shared objective. There's a beautiful quote by Norbert Wiener. He's part of the foundation of the cybernetics discipline. And when he was talking about cybernetics in the 1960s, you know, he was saying, the poets shall have to become engineers or the engineers shall have to become poets. And that's something that I reflect on quite a lot. I think we impose false dichotomies on many categories in the world because it helps us deal with them individually, whereas if we were able to break down some of those boundaries and work across those categories, I think we would see a lot more situations where the whole was greater than the sum of the parts. It's not arts or science, it's arts and science.0:35:28 - 0:35:31
Sotirios Alpanis
Can you tell us a little bit about Mozilla Data Collective, who you work for currently?0:35:32 - 0:38:51
Kathy Reid
So I'm currently the Head of AI and Machine Learning at Mozilla Data Collective. And the Mozilla Data Collective was really born out of some of the challenges that we saw around data being used for AI and machine learning in ways that weren't true to the values of what Mozilla stood for and weren't true to the values of what we saw as collective action and digital public infrastructure. One of the challenges that we have with the data that's being used to build machine learning models is that it's been scraped from everywhere. So if we think about how ChatGPT or Claude or Gemini has been trained, the data for that training is enormous, absolutely enormous. And so we measure training data for machine learning and artificial intelligence in a unit called tokens. And tokens are basically just bits of chopped-up words. And so when people train things like Claude or GPT, we're now at a scale where it takes trillions of tokens, so that's, you know, trillions of web pages, trillions of books, trillions of pieces of content, and all of that content comes from somewhere. And what we're finding at the moment is that that content is getting scraped without permission, without consent, and more importantly, without any of the benefit or any of the value that has gone into creating that data going back to the organisations or the communities or the people who created that data. And with Mozilla Data Collective, we wanted to try and create a platform for ethical data sharing and a platform that created better value exchange. So some organisations may have a mission or may have an objective to freely share the data that they create. That might be part of the mission and the objective. That's absolutely fair and reasonable.
But there are also organisations who create a lot of data in the process of their business activities, where that data really should be a revenue generation piece for the organisation. And in particular, we saw media organisations who have a lot of content online, but they weren't generating any revenue from having that content online. That content was getting scraped to feed artificial intelligence, and they were also losing traffic that was sort of being captured by various different search engines as well. So we saw some market changes that led us to think, well, is there a way to have fairer value exchange in the data market: AI and machine learning data that is consented to, data that is collected ethically, tied to returning benefits to the people and the organisations who create that data. And so that's really the driver for the Mozilla Data Collective. It's an evolution of Common Voice. All of the Common Voice data is now held on the Mozilla Data Collective, but it's also an attempt to try and overcome some of the challenges that we saw in the data infrastructure ecosystem.0:38:51 - 0:39:03
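Kathy's description of tokens as "bits of chopped up words" can be sketched in a few lines. Production models use learned subword vocabularies such as byte-pair encoding; this toy fixed-width splitter is illustrative only and not how any real tokeniser works.

```python
# Toy illustration of "tokens as chopped-up words". Real tokenisers learn
# a subword vocabulary from data; this naive splitter just chops each word
# into fixed-width chunks to convey the general idea.
def toy_tokenize(text: str, chunk: int = 4) -> list[str]:
    tokens = []
    for word in text.lower().split():
        # split each word into pieces of at most `chunk` characters
        tokens.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return tokens

print(toy_tokenize("trained on trillions of tokens"))
```

Even with real tokenisers the arithmetic is similar: a web page of a few thousand words becomes a few thousand tokens, which is why training runs measured in trillions of tokens imply ingesting a very large fraction of the public web.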
Sotirios Alpanis
One small thing you dropped in there is your resistance to using the more popular LLMs. I was just wondering, you know, do you have alternatives, and how do you make those decisions and how do you think about them?0:39:03 - 0:44:02
Kathy Reid
The way that I'd like to answer this question is by coming back to conscientious consumption. I think nowadays very few of us would go and buy a leather jacket or go and buy, say, a fur coat without understanding where the materials for that item came from, whether we were happy that the production of that material was sustainable and sourced ethically. It's like having Fairtrade cocoa in your chocolate. It's like having Fairtrade coffee beans. When I think about conscientious consumption in the AI space, I think there's still a gap in understanding around what conscientious consumption means when I'm typing a query into Claude or into ChatGPT or into Gemini, around how the model underneath is being built and how it's being powered. I think there's all sorts of conscientious consumption pieces to unpack there. For example, where did the data come from that was used to build the model? Was it sourced ethically? Was it sourced sustainably? Who benefited from that data? And if we think about some of the news stories that have become headlines, with at least one of the organisations in question, I won't name them specifically so that we stay very neatly away from any legal issues. But we've seen large big tech companies buy up books, scan them, ingest them into their models, because this is all legal, because our copyright law hasn't really caught up with AI and machine learning, and then just dispose of those books. But the authors don't get any revenue, and the authors don't get any royalties when a query is answered that uses data taken from their book. We don't have a good understanding of what conscientious consumption means for AI and ML models. So that's the commercial models. And if we think about open source models, there are still challenges and concerns there as well. So one of the big datasets that's used to build open source large language models is Common Crawl.
So Common Crawl is a huge dataset that is a crawl, essentially a scraping, of the entire World Wide Web. It's intended not to be... it's not for profit. So Common Crawl doesn't generate a profit from crawling and creating the Common Crawl dataset, but the sites that are scraped by Common Crawl still don't get any revenue or any benefit from having been in Common Crawl. So if a model is then trained on Common Crawl – like Mistral is trained on Common Crawl – and you use Mistral locally on your laptop, you install it and configure the GPU and drink lots of coffee while doing that, there's still a conscientious consumption gap there, because there's still no value exchange for the people whose data went into Common Crawl. So I can see both sides of the equation here. The people who see Common Crawl as digital public infrastructure, building an alternative to commercial models that people might be very heavily dependent on. But the people whose data is in Common Crawl, I think, also have a fair point: well, we're not getting any value out of Common Crawl. And so as a response to that, we've seen the Common Crawl scraper be blocked from a lot of websites, which changes the composition of what's in Common Crawl. So we're starting to see people block their websites from crawlers, and we're starting to see the use of network technology. Like now, when you go on many websites, you have to check a box to say, I'm human. And that technology is actually trying to keep out the web scrapers so that the content on that website isn't scraped and doesn't end up in datasets for training AI and machine learning models. So I think there's a lot of nuance here. I think it's a very...
I think there's many different issues to consider in conscientious consumption for LLMs. And I think, well, I think what we need to be doing – this is Kathy speaking; Mozilla obviously doesn't have an official position on this, because it's Australia and the Australian space – I think we've seen a lot of backwards and forwards around regulation of AI and machine learning in Australia. So originally we were going to have some stronger legislation that legislated a lot of data provenance and a lot of, sort of, dataset supply chain requirements. That legislation has been softened. That's not going to go through anymore. We've got voluntary AI guardrails in Australia, but they're voluntary and they're guardrails. They're not legislation and they're not law. And no one can be, sort of, held to account, because they're voluntary guardrails. And so I think we're missing an opportunity for legislation which would help bring to the fore some of the conscientious consumption pieces around AI and machine learning, and where the data that feeds them comes from.0:44:02 - 0:44:10
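The crawler blocking Kathy describes is typically done through a site's robots.txt file. A minimal sketch using Python's standard library: the user-agent names CCBot and GPTBot are the publicly documented crawlers for Common Crawl and OpenAI, and the rules shown are an example policy, not a recommendation.

```python
from urllib import robotparser

# An example robots.txt that refuses two well-known AI training crawlers
# while leaving the site open to everything else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("CCBot", "https://example.org/collection/item1"))        # False
print(rp.can_fetch("SomeBrowser", "https://example.org/collection/item1"))  # True
```

Note that robots.txt is purely advisory – a scraper can simply ignore it – which is part of why sites have escalated to the "check a box to say I'm human" network-level defences mentioned above.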
Sotirios Alpanis
I'm struck that maybe Mozilla Data Collective is, you know, an attempt to sort of clarify exactly where the data is coming from.0:44:10 - 0:46:24
Kathy Reid
Absolutely. Let's say you and I, we had a coffee this morning when we caught up. If we wanted to, we could ask some really pointed questions about where the coffee beans came from. Where did the soy milk come from? In Melbourne, where did the water come from? And we would have reasonably clear answers to each of those questions. Like, the soybeans were sourced here, the water came from, you know, Melbourne Water, the coffee beans came from, you know, wherever they came from. And we would get reasonably clear answers to that supply chain. If I ask where the data comes from for, you know, a popular LLM like ChatGPT or Claude or Mistral, it's much more opaque. What transformation did the data go through? Who consented to giving that data? Where did the value of the data go? So I think if we start to think about data as a supply chain, it helps us to think about how we might regulate the dataset supply chain. And when I start to think about some of the standards that are coming onto the scene. So there was a new artificial intelligence standard – because I'm a nerd, I like reading artificial intelligence standards when I'm not, you know, digging into metadata. Yeah, I like that, I'm in a safe space for talking about artificial intelligence standards. So there's a new standard called AS42001. And it's the international artificial intelligence standard. I won't put everyone to sleep by talking about it in detail. But a large part of this standard is really about data provenance and dataset supply chain. Where does the data that you're using in your artificial intelligence models come from? And I think that's going to be something we're going to have to think about at a much broader scale in terms of regulation and compliance and governance, particularly for large organisations who might be using multiple different LLMs or multiple different artificial intelligence tools in their suite of operations. How do you track that?
How do you know you don't have a supply chain risk with using those models?0:46:25 - 0:46:29
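Treating data as a supply chain implies recording, for every dataset, at least where it came from, under what terms, and what was done to it. A minimal sketch of such a provenance record follows; the field names and the dataset name are my own illustration, not drawn from AS42001 or from Mozilla's platform.

```python
from dataclasses import dataclass, field

# Hypothetical provenance record of the kind a dataset supply chain
# audit might track. Field names are illustrative, not from any standard.
@dataclass
class ProvenanceRecord:
    dataset: str
    source: str                   # where the raw data came from
    licence: str                  # terms it was collected and shared under
    consented: bool               # did contributors consent to this use?
    transformations: list = field(default_factory=list)

record = ProvenanceRecord(
    dataset="common-voice-sample",   # hypothetical name
    source="crowdsourced recordings via commonvoice.mozilla.org",
    licence="CC0",
    consented=True,
    transformations=["community validation", "train/dev/test split"],
)
print(record.licence, record.consented)
```

An organisation using several LLMs could then answer "where does our model's data come from?" by walking the records attached to each model, in the same way a coffee roaster can name its growers.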
Sotirios Alpanis
What's coming up next for Mozilla Data Collective and what does the future look like?0:46:29 - 0:49:06
Kathy Reid
One of the big things that we really want to do is provide the ability for people to create a revenue stream from their data. And so, within the next month or two, we will be rolling out payments, so people will be able to host a dataset on the Mozilla Data Collective platform, and they'll be able to charge money for that dataset. Obviously, the larger the dataset, the more niche it is, the higher the quality, the more they'll be able to charge for that dataset. But we want to be able to help people create a revenue stream from their data. And some of the biggest interest we've seen has been from cultural institutions. So cultural institutions, both in Australia but also in Europe and the United States, are facing two very different pressures. So they're facing the pressure to maintain their cultural mission, which might be to share knowledge openly or to, you know, preserve a cultural heritage record, depending on the type of institution, whether it's a gallery, a library, an archive or a museum. But by and large, they're also facing incredible funding pressures. So, you know, with the general economic climate, we're seeing the retraction of funding from cultural institutions and the retraction of the ability to deliver on that mission. And so we've got a lot of interest from cultural institutions who are looking to generate revenue from the cultural data that they hold in order to preserve and protect and uphold the mission, you know, their public institution mission.
And moreover, what we're hearing from a lot of cultural institutions, particularly those that have large repositories of data, is that they're facing a massive infrastructure risk. So one of the challenges that is presented by large-scale and unauthorised data scraping on the web is, if you have a web server, if you have, like, a data repository on the web, what's happening at the moment is that there are so many scrapers and bots who want to suck all that data out for machine learning. People are spending all their time trying to stop the bots and scrapers. And so this is an added workload of having to try and stop the bots and scrapers from getting to the data that it's your mission to create and to protect. It's like not only are they scraping the data, they're creating more work for people in trying to stop the scrapers. So it's this double whammy that cultural institutions are facing.0:49:06 - 0:49:29
Sotirios Alpanis
This may be me projecting, but one thing that struck me, having followed Common Voice and Data Collective over the past year or so, is that it's a nice contrast with a lot of modern technology, where the releases seem a lot more considered and, it sounds like a value judgment, but at a slower pace. And I just wonder, you know, is that part of how you develop things?0:49:29 - 0:52:14
Kathy Reid
It's a great observation. What we try to do with both Common Voice and the Mozilla Data Collective is be very considered and be very values-aligned. So for example, with Common Voice, we do a release, a dataset release, every three months. That's a trade-off between, you know, the infrastructure time it takes to do a release of data for 350 languages. It's still, you know, we've automated a lot of it, but there's still a lot of work involved in doing that. But it's also about making sure that there's a significant improvement every three months, and not every language will have a lot more data every three months. But for many languages, they've got set up on the platform, they've got sentences, they've got data in the platform. And so they're really eager, particularly for new languages, they're really eager to get their first release. So we wanted a cadence that supported that. And I've had the honour of working with a partner in the APNIC Foundation. So APNIC does a lot of the low-level networking infrastructure in the Asia-Pacific region. And part of their foundation work is getting better participation across technology in Asia-Pacific languages. And so we've worked with the APNIC Foundation to bring in Sundanese and Javanese, which are both languages spoken on the island of Java in Indonesia. And we're bringing those languages onto the platform for the first time. And, you know, they're going to get released in this release. That's incredibly exciting. There are also some other, what I'd call values-considered, pieces in the Mozilla Data Collective and how we think about data. And one of the other pieces that we do here is something that we call slow consent. So in an era where you have to check terms of service and you have to check boxes to get anywhere on the internet.
We've created – when I say we, like, our techno-social systems have created – a default where we just tick and we don't think about the consequences of the terms and conditions. And so we wanted to be really deliberate about what we call slow consent. We want people to consider, when you're signing up to download a dataset, here are the terms and conditions and the constraints and the restrictions on using that data. And we want people to be really comfortable that those are the restrictions that you're agreeing to when you choose to download a dataset. So that's what we think about when we think about slow consent and making sure that people are thinking deeply about the data that they're downloading. They're not just set-and-forget, you know, ending up with a hard drive to then train a model on.
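One way to read "slow consent" in implementation terms is that a download isn't released until each term has been individually acknowledged, rather than hidden behind a single pre-ticked box. The sketch below is entirely hypothetical – the terms, the function, and the URL are invented for illustration and are not Mozilla Data Collective's actual code.

```python
# Hypothetical "slow consent" gate: every licence term must be
# explicitly acknowledged before a download link is issued.
TERMS = (
    "I will not attempt to re-identify speakers in this dataset.",
    "I will honour the dataset's licence in any derived work.",
)

def request_download(acknowledged: set) -> str:
    # Refuse to issue a link while any term remains unacknowledged.
    missing = [t for t in TERMS if t not in acknowledged]
    if missing:
        raise PermissionError(f"{len(missing)} term(s) not yet acknowledged")
    return "https://example.org/dataset.tar.gz"   # placeholder URL
```

The design point is the friction itself: each term is surfaced as a separate, deliberate step, which is the opposite of the tick-and-forget default described above.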
Sotirios Alpanis
I wondered if you could talk a bit about Mozilla's presence in Australia and how that impacts your work life and what your typical workday looks like.0:52:23 - 0:54:02
Kathy Reid
So, I can't speak for other areas of Mozilla, I can only speak to the area of Mozilla that I work in. And I've met a lot of my Mozilla Australian colleagues, and there's quite a few in Melbourne, and we catch up over coffee and they're fantastic. Hi, Paul. Hi, Reem. So one of the things that I do, which is perhaps a little bit unusual for people who work in Australia, is that I work an altered work pattern. So I start my workday at about 6:00 at night, and I go through to about 2 or 3:00 in the morning, which works perfectly for me because I'm a night owl and I don't do mornings. I know that it wouldn't suit everybody, but it's sort of the best of both worlds. So I get to do the stuff I need to do during the day, and some of my family commitments. But then I'm able to work for a multinational organisation remotely. So I'm based in Geelong. And when I think about the world of work in AI, and the world of work in the knowledge economy of, you know, the 2030s – we're going to be in the 2030s in about three years' time – we need to be thinking much more broadly and much more flexibly about how we bring great people together to do great things, you know, facilitated by the internet. I remember being in high school in the early 1990s and thinking, one day I'm going to log in from a terminal, and I'll be working with people in the States, and I won't have to get on a plane. And I won't have to travel for 26 hours to get somewhere. I'll be able to work with some of the, you know, the brightest minds in the country. And that's what I've been able to do, working remotely with Mozilla.
Resources
| Title | Type | Author(s) | Tags |
|---|---|---|---|
| | documentation | | |
[Image: Frances Densmore recording Mountain Chief]
[Image: Kathy Reid, Mozilla Common Voice]