African Language Infrastructure

Building Open Language Infrastructure for African Languages

While our long-term vision is synthetic intelligence, we are currently working on critical infrastructure for African languages. We believe these technologies must be built with communities, not extracted from them.

🇷🇼 🇿🇦 🇸🇿
Flagship Project

siSwatiVoice (2026)

We are collaborating with partners across Rwanda, South Africa, and Eswatini to build a large-scale, open speech dataset for siSwati, a language spoken by millions but dramatically underrepresented in AI systems.

siSwati (also known as Swazi) is a Bantu language of the Nguni group, spoken primarily in Eswatini and South Africa. Despite being a national language with millions of speakers, it remains severely underrepresented in speech recognition technology.

"We don't just collect data; we build capacity. Local researchers and communities are partners in every stage of our work."

  • 150 hours of speech
  • 80+ native speakers
  • 3 countries
  • 100% open access

Dataset Features

  • Diverse, high-quality speech recordings across multiple dialects
  • Speakers from multiple regions ensuring representation
  • Open access via Zenodo, Hugging Face, and GitHub
  • Full reproducibility documentation

Our Philosophy

Open Data Commitment

Open Datasets

All speech datasets released under permissive licenses. Free to use for research, education, and commercial applications.

Open Code

Training scripts, evaluation frameworks, and preprocessing tools available on GitHub. Fully documented and reproducible.

Open Benchmarks

Standardized evaluation metrics for African language automatic speech recognition (ASR), enabling fair comparison and progress tracking.

"All datasets, code, and benchmarks we create will be released openly under permissive licenses. We believe African language technologies must be built with communities, not extracted from them."

Roadmap

Future Work

ASR Benchmarks

Standardized automatic speech recognition benchmarks for African languages, enabling researchers worldwide to measure progress and compare models.

Coming 2026
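
To make "measure progress and compare models" concrete, here is a minimal sketch of word error rate (WER), a metric commonly reported for ASR systems. The project has not said which metrics its benchmarks will use, so WER is an assumption here, as are the function names `edit_distance` and `word_error_rate` and the choice to lowercase and split on whitespace before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word inserted in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """Corpus-level WER: total word edits / total reference words.

    `pairs` is an iterable of (reference, hypothesis) transcript strings.
    Lowercasing and whitespace tokenization are illustrative assumptions,
    not a normalization scheme specified by the project.
    """
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_words
```

Pooling errors over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the score. A benchmark would also need to fix text normalization rules so that results from different models are comparable.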

Evaluation Frameworks

Comprehensive evaluation tools that account for linguistic diversity, dialectal variation, and real-world usage patterns.

Coming 2026

Model Fine-tuning

Fine-tuned ASR models for under-resourced African languages, built on open foundations and freely available.

Coming 2027

Additional Languages

Expanding beyond siSwati to other under-resourced African languages, prioritized by community needs and partnership opportunities.

Ongoing

Collaboration

Our Partners

Interested in partnering with us? Get in touch →

Stay Updated on Our Progress

Follow our journey as we build open infrastructure for African languages. We share updates on datasets, benchmarks, and research findings.