Research Paper Published: 'Our Collective Voices' - The Social and Technical Values of StammerTalk Dataset
We are excited to announce the publication of our research paper “Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset” at the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25)! 🎉
A Collaborative Achievement
We are deeply grateful to our partners at AImpower.org for their incredible collaboration on this research. The first and second authors, Jingjin Li and Qisheng Li, are from AImpower, while the third and fourth authors, Rong Gong and Lezhi Wang, are from the StammerTalk community. This partnership exemplifies the power of community-driven research and inclusive AI development.
Key Findings
The research demonstrates how grassroots, community-led data efforts can expose and rectify fluency bias in speech AI systems while fostering self-advocacy and community-building. Our fine-tuned ASR models showed substantial improvements:
- Mild stuttering: Error rate reduced from 16.34% to 5.8%
- Moderate stuttering: Error rate dropped from 21.72% to 9.03%
- Severe stuttering: Error rate decreased from 49.24% to 20.46%
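To put these figures on a common scale, the short sketch below computes the relative error reduction for each severity level from the before and after rates listed above. It is only an illustrative calculation, not part of the paper's evaluation code; the names `error_rates` and `relative_reduction` are our own.

```python
# Error rates (percent) quoted in the list above,
# stored as (before fine-tuning, after fine-tuning).
error_rates = {
    "mild": (16.34, 5.8),
    "moderate": (21.72, 9.03),
    "severe": (49.24, 20.46),
}


def relative_reduction(before: float, after: float) -> float:
    """Return the error reduction as a fraction of the original rate."""
    return (before - after) / before


# Relative reduction per severity level (hypothetical helper structure).
reductions = {
    severity: relative_reduction(before, after)
    for severity, (before, after) in error_rates.items()
}

for severity, reduction in reductions.items():
    before, after = error_rates[severity]
    print(f"{severity}: {before}% -> {after}% ({reduction:.1%} relative reduction)")
```

Read this way, the error rate is cut by more than half at every severity level.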
The paper also reveals significant social challenges faced by people who stutter in China, including stigma, workplace discrimination, and limited access to professional support.
For citations, please use:
ACM Format:
Jingjin Li, Qisheng Li, Rong Gong, Lezhi Wang, and Shaomei Wu. 2025. Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). Association for Computing Machinery, New York, NY, USA, 2768–2783. https://doi.org/10.1145/3715275.3732179
Access the Paper: https://dl.acm.org/doi/10.1145/3715275.3732179
This milestone demonstrates how community-led research can produce both technical innovation and meaningful social impact. Thank you to all participants, the StammerTalk community, and especially our collaborators at AImpower.org who helped bring this research to the world stage! 💪
This work was supported by NSF Award #2427710 and the Patrick J. McGovern Foundation.
StammerTalk Dataset Now Available on Hugging Face
We are thrilled to announce that the StammerTalk Mandarin Stuttered Speech Dataset is now publicly available on Hugging Face! 🎉
Dataset Overview
The StammerTalk dataset represents a significant milestone in stuttering research. It contains 43 hours of speech from 64 Mandarin Chinese speakers who stutter, covering both unscripted conversations and dictation of 200 voice commands, and provides a valuable resource for automatic speech recognition and stuttering event detection research.
Note: This publicly available dataset is a subset of the complete AS-70 dataset, as it includes only the data from participants who provided consent for public sharing. The full AS-70 dataset contains additional recordings that remain private because participants signed differing consent agreements.
A Heartfelt Thank You to AImpower.org
We extend our deepest gratitude to AImpower.org for their incredible partnership in making this dataset publicly available. Their dedication to advancing AI research for social good has been instrumental in bringing this resource to the global research community. The collaborative effort between StammerTalk and AImpower.org demonstrates the power of community-driven research initiatives.
What Makes This Dataset Special
- Authentic voices: Speech data collected by StammerTalk volunteers who also stutter, creating a comfortable and understanding environment for participants
- Comprehensive annotations: Verbatim transcriptions with five types of stuttering events annotated in embedded markup
- Community-driven: Created by the StammerTalk (口吃说) community at stammertalk.net
- Research-ready: Professional transcription and annotation, reviewed by StammerTalk volunteers
Access the Dataset
The dataset is now available at:
https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech
This release marks an important step forward in making stuttering research more accessible and inclusive. We hope this dataset will enable researchers worldwide to develop better tools and technologies that support the stuttering community.
Thank you to all the participants who generously shared their voices, the StammerTalk volunteers who conducted the data collection, and AImpower.org for their unwavering support in making this dataset available to the world. Together, we’re building a more inclusive future for speech technology! 💪
Participation in Interspeech 2024
Rong presented our dataset work “AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection” at Interspeech 2024 in Kos, Greece. He was very happy to meet our co-authors Shaomei Wu from AImpower and Hongfei Xue from the ASLP lab at Northwestern Polytechnical University. It was a very nice experience, and we hope to see you all again soon!

AIShell-Stammertalk Mandarin stuttered speech dataset open for download requests
To request a download of the dataset, click this link. For detailed information about the dataset, please visit the AIShell-Stammertalk dataset page.
AIShell-Stammertalk Mandarin stuttered speech dataset
Dataset paper published! For details please check our Projects page.