Tone-preserving Igbo text-to-speech via transfer learning
IkengaTTS is a text-to-speech system for Igbo, a tonal language spoken by 30–45 million people in Nigeria and its diaspora. We fine-tuned F5-TTS (335M parameters) on 112,521 Igbo clips (255 hours) from the African Voices dataset. The result is the first open, evaluated Igbo TTS system with tone-aware text processing.
Multiple from-scratch approaches failed: autoregressive collapse, non-autoregressive silence, duration compression. Transfer learning from a large English-trained model was the one approach that worked.
Igbo is a register tone language. The same sequence of consonants and vowels can mean completely different things depending on pitch. A TTS system that ignores tone marks doesn't just sound wrong—it says the wrong word.
The base F5-TTS model (trained on English) cannot speak Igbo at all. After fine-tuning on African Voices data, the model produces intelligible Igbo speech. Listen to the difference:
The model clones the voice of a reference speaker. Here are samples across different speakers from the eval set, showing that the model generalizes across voices, genders, and speaking styles.
The same text synthesized at different training checkpoints. The model goes from noise to English-accented babble to intelligible Igbo over ~210K gradient updates (~8 epochs). Training loss plateaus early (~0.69 by epoch 2) but perceptual quality keeps improving.
Because IkengaTTS is fine-tuned from an English-trained base, it retains some English speech capability. It can also handle Igbo-English code-switching, common in everyday Nigerian speech. Prosody in mixed-language sentences is a known limitation.
Evaluated on 987 held-out samples from the African Voices dev_test set (sampled from 6,937 clips). We measure ASR transcribability (WER under the MMS multilingual ASR model), speaker similarity (WavLM embedding cosine), and predicted speech quality (UTMOS). The key metric is ΔWER (generated minus ground truth), which controls for the ASR model's baseline errors on Igbo.
| Model | MMS WER | ΔWER | SIM-o | UTMOS |
|---|---|---|---|---|
| Ground truth recordings | 55.6% | — | — | 1.28 |
| IkengaTTS (ours) | 50.3% | −5.3 pp | 0.963 | 1.55 |
| Base F5-TTS (no fine-tuning) | 100.5% | +44.9 pp | 0.958 | 1.55 |
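The ΔWER column is a difference of word error rates between generated and ground-truth audio, scored against the same reference text. A minimal stdlib sketch of that computation (hypothetical helpers, not the project's actual evaluation code; real pipelines typically normalize text before scoring):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(r)][len(h)] / max(len(r), 1)

def delta_wer(ref_text: str, asr_on_generated: str, asr_on_ground_truth: str) -> float:
    """ΔWER in percentage points: WER(generated) − WER(ground truth)."""
    return 100.0 * (wer(ref_text, asr_on_generated) - wer(ref_text, asr_on_ground_truth))
```

A negative ΔWER, as in the table, means the ASR model transcribes the synthesized audio more accurately than the original recording.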
Key finding: Our fine-tuned model is more transcribable by MMS than the original recordings (WER 50.3% vs 55.6%), likely because synthesis regularizes pronunciation. The base F5-TTS model fails completely (WER > 100%), confirming that fine-tuning is necessary. All differences are statistically significant (Wilcoxon signed-rank, p < 10⁻¹⁵⁹).
Caveat: UTMOS is trained on English and unvalidated for Igbo. Human listening tests with native Igbo speakers are essential and planned as future work.
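The significance test pairs per-utterance scores for the two conditions. As a sketch of that design, here is a minimal Wilcoxon signed-rank test using the normal approximation (no zero/tie continuity correction; the paired WER values below are illustrative, not the evaluation's data):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples x, y
    (normal approximation; ties in |d| share an average rank)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to runs of equal |d|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Illustrative per-utterance WERs: generated vs. ground-truth audio.
gen = [0.48, 0.52, 0.45, 0.50, 0.47, 0.51, 0.44, 0.49, 0.46, 0.53]
gt  = [0.56, 0.58, 0.53, 0.57, 0.55, 0.60, 0.52, 0.54, 0.51, 0.59]
stat, p = wilcoxon_signed_rank(gen, gt)
```

With 987 paired samples and a consistent direction of difference, the p-value becomes astronomically small, which is how a result like p < 10⁻¹⁵⁹ arises.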
IkengaTTS fine-tunes the full 335M-parameter F5-TTS model on Igbo data. F5-TTS uses flow-matching (a type of diffusion) with a Diffusion Transformer (DiT) backbone. It takes a reference audio clip and target text, then generates speech that matches the reference speaker's voice while saying the target text.
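In flow matching, the network learns a velocity field v(x, t) along paths from noise x₀ to data x₁, and sampling integrates dx/dt = v(x, t) from t = 0 to t = 1. A toy sketch of that sampling loop, with a closed-form field standing in for the trained DiT (illustrative only, not F5-TTS's inference code):

```python
import random

def sample(v_field, n, steps=32, seed=0):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = v_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Closed-form conditional field pointing along the straight path to a
# fixed target x1 (a stand-in for a mel-spectrogram frame):
x1 = [1.0, -0.5, 2.0, 0.0]
v = lambda x, t: [(a - b) / max(1.0 - t, 1e-6) for a, b in zip(x1, x)]
out = sample(v, len(x1))  # converges to x1 regardless of the noise start
```

In the real model the field is conditioned on the reference audio and target text, which is what steers the trajectory toward speech in the reference speaker's voice.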
Text processing: We use character-level tokenization that preserves Igbo diacritics and tone marks. The vocabulary was extended from 2,546 to 2,604 characters to cover Igbo-specific characters (ị, ọ, ụ, ṅ, and their toned variants).
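Preserving tone marks at the character level hinges on Unicode normalization: the same tone-marked vowel can arrive as one precomposed codepoint or as a base letter plus combining marks. A minimal sketch of tone-safe vocab lookup (hypothetical code, assuming NFC normalization before tokenization; not the project's actual tokenizer):

```python
import unicodedata

def tokenize(text: str, vocab: dict) -> list:
    """Character-level tokenization after NFC normalization, so that
    'o' + dot-below + grave and precomposed 'ọ' + grave map identically."""
    # NFC composes base letters with diacritics where a precomposed
    # codepoint exists (o + U+0323 -> ọ); combining tone marks with no
    # precomposed form stay as their own vocabulary entries.
    chars = unicodedata.normalize("NFC", text)
    return [vocab[c] for c in chars]

# Extend a base vocabulary with Igbo-specific characters.
base_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz .")}
extra = "ịọụṅ\u0300\u0301"  # dotted vowels, ṅ, combining grave/acute
vocab = dict(base_vocab)
for ch in extra:
    vocab.setdefault(ch, len(vocab))
```

Without a normalization step, visually identical strings can tokenize differently, which would silently split training data for the same tone-marked vowel across multiple token sequences.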
Training: 8 epochs on 4×H100 GPUs, ~15 hours total. Loss plateaus early but perceptual quality continues improving through later epochs.
What failed: Before this approach, we tried Kokoro (autoregressive collapse on 934 clips), FastSpeech2 at 25M and 1.1M parameters (silent output), and VITS from scratch on 112K clips (recognizable speech but 40–55% duration compression). Transfer learning was the only approach that produced usable results.