Justin
UX Researcher & Designer

Mozilla Common Voice

UX Research: Understanding biases in artificial intelligence and voice datasets.

Developing artificial intelligence without bias is important for product development, diversity, inclusion, and equality in machine learning models and AI applications.

Mozilla Common Voice is an open-source voice dataset used to train speech recognition models, such as Mozilla's DeepSpeech, for conversational AI and digital assistants.

This project involves real people who either record themselves reading a prompt in their browser or listen to and verify prompts recorded by others. My focus on this project was to research how users listen to, judge, and either accept or reject a recorded prompt.
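
For context on where validated clips end up: approved recordings are periodically packaged into public dataset releases that anyone can use to train speech models. The sketch below is a hypothetical illustration (not part of this study) of loading such a release with the Hugging Face datasets library; the repository name, field names, and gated-access requirements are assumptions and may differ between releases.

```python
# Hypothetical sketch: stream a Common Voice release for training, assuming the
# Hugging Face `datasets` library and the gated `mozilla-foundation/common_voice_11_0`
# repository (field names such as `sentence`, `up_votes`, `down_votes` vary by release).
from datasets import load_dataset

# Stream the English split so the multi-gigabyte archive is not downloaded up front.
cv = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "en",
    split="train",
    streaming=True,
)

for clip in cv.take(5):
    # Each clip pairs audio with its prompt text and the community's validation votes.
    print(clip["sentence"], clip["up_votes"], clip["down_votes"])
```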

Role

One of two User Experience Researchers.

Project

Research how users listen to, judge, and either accept or reject a recorded prompt on Common Voice.

Duration

May 2022, 4 weeks total

Tools

Recorder, Computer, Mozilla Forum, Mozilla Common Voice, Speakers

UX Research Activities
  • Usability Tests

  • Heuristic Evaluations

  • Literature Reviews

  • Semi-Structured Interviews

  • Contextual Inquiries

  • Task Analysis

Initial Questions
  • How do people judge, deny, or accept spoken prompts on Mozilla Common Voice? 

  • Are there any inherent effects of individual bias in judging these prompts?

Importance of unbiased AI and Voice Technology

Tomato, tomahto. Language is diverse, and people have biases. Specifically, people make judgements based on accents, dialects, semantics, and syntax. These judgements and biases should NOT be inherent in artificial intelligence.

Therefore, developing conversational artificial intelligence without bias is important for product development, diversity, inclusion, and equality in machine learning and artificial intelligence.

Speech Recognition

“Certain words mean certain things when certain bodies say them, and these [speech] recognition systems really don't account for a lot of that.” 

— Safiya Noble, Associate Professor in the Departments of Information Studies and African American Studies at UCLA and author of the best-selling book Algorithms of Oppression

Linguistic Diversity

“What language hierarchies are we reinforcing if we don’t design them for linguistic diversity?” 

—Hillary Juma, Common Voice’s Community Manager

Race Gap

“[A study found that a] ‘race gap’ was just as large when comparing the identical phrases uttered by both black and white people. This indicates that the problem lies in the way the systems are trained to recognize sound.”

—New York Times, There Is a Racial Divide in Speech-Recognition Systems, Researchers Say

Research Objective

Discover how people contribute to, judge, deny, or accept spoken prompts on Mozilla’s Common Voice through contextual inquiries.

Participant Demographics
  • Six Interviews (3 Women, 3 Men) between the ages of 18 and 32.

    • Languages (4 native English speakers, 1 native Farsi speaker, 1 native Chinese speaker)

    • Occupations (Ph.D. Candidate, IT Professional, High School Student, Natural Language Processing Scientist, UX Designer)

    • Ethnicity (3 white, 1 Persian, 1 Asian American, 1 Asian)

Research Methods
Interviewee on the left and interviewer on the right. I was directing the interview while the other researcher took notes.

Contextual Inquiry

  • Ask the participants to look at Common Voice and interact with the service.

  • Observe how participants either accept or reject audio recordings.

  • Discover what usability issues exist with the Common Voice service.

Semi-Structured Interview Prompts

  1. Search for Common Voice on a common search engine.

  2. What are your initial impressions of Common Voice?

  3. What would you want to do first? (thoughts/feelings/reactions)

  4. Speak or listen to at least five prompts.

  5. Navigate to the contribution criteria. (What would you do differently after reading the contribution criteria?)

  6. Would you sign up for an account? Why?

  7. Log in to Common Voice, show the dashboard, and ask for feedback and reactions.

Interviews lasted about 45 minutes each. Each interview was recorded with an iPhone and analyzed afterward. Ask me about the specific results!

Participant Results
  1. Search for Common Voice on a common search engine.

    1. Five out of six participants found Common Voice through a popular search engine. The participant who didn't find Common Voice thought the researcher said "Mozzarella" instead of "Mozilla".

  2. What are your initial impressions of Common Voice?

    1. “Beautiful interface”

    2. “The site is fairly ‘self-explanatory’”

    3. “Makes me feel like they are trying to use me to make robots more human [...]”

    4. A good way to do “curation”: two parts of the website, data validation and data collection

    5. “Teaching machines how real people speak: accents, gender (maybe), tone (emotion)”

    6. “[Anticipates] to hear voices from around the world”

  3. What would you want to do first? (thoughts/feelings/reactions)

    • 1/6 navigated to the about us page

    • 5/6 clicked Listen or Speak to contribute a recorded voice

    • 1/6 viewed Datasets

    • 5/6 were drawn to the listen section most because it felt more “bold” and because it is green

    • 1/6 was drawn immediately to the speak and listen section because "they like to hear themselves talk."

  4. Speak or listen to at least five prompts.

    1. Ask me about this section! It's quite lengthy :)

  5. Navigate to the contribution criteria. (What would you do differently after reading the contribution criteria?)

    1. Five out of six participants were not aware of the contribution criteria standards and did not scroll down to read the remaining criteria (design recommendation!).

  6. Would you sign up for an account? Why?

    1. 100% of participants would not sign up for an account.

      1. Why? Trust and transparency were the leading factors.

  7. Log in to Common Voice, show the dashboard, and ask for feedback and reactions.

    1. The gamification aspect of the dashboard was enticing to all participants.

    2. All participants said they would have been more inclined to sign up if they had known about the dashboard beforehand.

    3. A design recommendation from participants: they would like to see personal statistics for their recorded voices (e.g., how many were accepted or denied).

Research Results
  • The main research result is that people have biases and judge the accents and dialects of non-native speakers. Beyond that, some participants denied spoken prompts because the speaker sounded like they were from a different region of the same country (e.g. southern China versus northern China).

  • Additionally, our research surfaced design implications for other aspects of the service that could affect the AI’s dataset, such as trustworthiness and use of collected data, market exposure, and visible and upfront Contribution Criteria.

Design Opportunities

Trustworthiness and use of data collection

  • Transparency about the system, its real-world applications, and who uses the collected data

  • More mention of privacy and importance of diversity in language

  • Add articles from academic researchers and press coverage such as The New York Times or The Economist

Market exposure

  • Advertisements on Reddit and other social media platforms

Visible and upfront Contribution Criteria 

  • Menu navigation for the contribution criteria

  • Add a contribution criteria questionnaire to denied recordings to help with machine learning

  • Potential tutorial for new users to quickly overview the criteria

  • Navigation indicator (e.g. pagination or table of contents) so users either scroll beyond the first criteria or can tab through.

  • Video tutorial, interface walkthrough, recorded voice examples (accept vs reject)

Product expansion into educational platforms for second language learners

  • Listen to native speakers (users who are verified as fluent in their native language)

  • Include statistics (accepted and denied prompts)

  • Gamification (accomplishments, goals, and language proficiency)

Contribution Criteria UI Recommendation
Above image is a quick low-to-high fidelity mockup of Common Voice's Contribution Criteria page.

Minor UI adjustments for readability and information architecture

  • Scrolling errors and complications in multiple places: Home, About, Contribution criteria.

  • Add navigation to the contribution criteria so users can find them more easily and judge the accents and dialects of recorded voices with less bias.

Limitations
  • Time (more UX Design iteration and experimental design with 30 to 40 participants).

    • Six participants is a small sample for this type of research. With more time and effort, I would like to run a small experimental study with 40-plus similar prompts: 20 for reading and 20 for recording. From there, I could measure biased judgements more accurately with quantifiable data.

Questions?

If you have questions about my research, please reach out to me.

Mozilla is a non-profit

Please support Mozilla in any way you can.