IkengaTTS

Tone-preserving Igbo text-to-speech via transfer learning

Emmanuel Chimezie · Mexkoy Labs

Paper (PDF) · F5-TTS backbone, 335M params · 112K clips, 255 hours

What is this?

IkengaTTS is a text-to-speech system for Igbo, a tonal language spoken by 30-45 million people in Nigeria and its diaspora. We fine-tuned F5-TTS (335M parameters) on 112,521 Igbo clips (255 hours) from the African Voices dataset. The result is the first open, evaluated Igbo TTS system with tone-aware text processing.

Multiple from-scratch approaches failed—autoregressive collapse, non-autoregressive silence, duration compression. Transfer learning from a large English-trained model was the key that worked.

335M parameters · 112K training clips · 255 hours of speech · 987 eval samples

Why tone matters

Igbo is a register tone language. The same sequence of consonants and vowels can mean completely different things depending on pitch. A TTS system that ignores tone marks doesn't just sound wrong—it says the wrong word.

The word "akwa" has four meanings:

ákwá   "cry"     high-high
àkwà   "cloth"   low-low
ákwà   "bed"     high-low
àkwá   "egg"     low-high
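At the Unicode level, the distinction lives entirely in the combining tone marks. The short Python sketch below (illustrative only, not part of the IkengaTTS pipeline) shows that the four forms remain distinct under NFC normalization, while stripping diacritics collapses them all to the same string, which is what a tone-blind text frontend effectively does:

```python
import unicodedata

# The four meanings of "akwa", distinguished only by tone marks
# (acute = high tone, grave = low tone).
words = {
    "ákwá": "cry",    # high-high
    "àkwà": "cloth",  # low-low
    "ákwà": "bed",    # high-low
    "àkwá": "egg",    # low-high
}

# After NFC normalization the four strings are still distinct,
# so a tokenizer that keeps diacritics can tell them apart.
normalized = {unicodedata.normalize("NFC", w) for w in words}
assert len(normalized) == 4

def strip_tones(word: str) -> str:
    """Remove combining marks, as an accent-stripping pipeline would."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Stripping the tone marks collapses all four words to "akwa".
assert {strip_tones(w) for w in words} == {"akwa"}
```

This is why a text frontend for Igbo must never apply accent-stripping normalization before tokenization.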

Base model vs. fine-tuned

The base F5-TTS model (trained on English) cannot speak Igbo at all. After fine-tuning on African Voices data, the model produces intelligible Igbo speech. Listen to the difference:

Female voice (IECT1F80 — top-ranked speaker)

Base F5-TTS (no fine-tuning)
Attempts Igbo text — unintelligible output
IkengaTTS (fine-tuned)
ΔWER −12.3 pp, SIM-o 0.977 (best female speaker)

Male voice (IECT1M41 — best male UTMOS)

Base F5-TTS (no fine-tuning)
Base model attempting Igbo
IkengaTTS (fine-tuned)
ΔWER −5.7 pp, SIM-o 0.976, UTMOS 2.53 (highest quality male)

Speaker diversity

The model clones the voice of a reference speaker. Here are samples across different speakers from the eval set, showing that the model generalizes across voices, genders, and speaking styles.

Female speakers

IECT1F80 — Rank #1 (ΔWER −12.3 pp, n=75)
IECT2F16 — Rank #4 (ΔWER −10.1 pp, n=64)
IECT2F60 — Rank #7 (ΔWER −7.8 pp, n=98)
Female — long-form narration

Male speakers

IECT1M25 — Rank #2 (ΔWER −11.4 pp, n=14)
IECT1M41 — Rank #10, best UTMOS 2.53 (n=19)
IECT1M62 — Rank #13 (n=13)
Male — news style narration

Training progression

The same text synthesized at different training checkpoints. The model goes from noise to English-accented babble to intelligible Igbo over ~210K gradient updates (~8 epochs). Training loss plateaus early (~0.69 by epoch 2) but perceptual quality keeps improving.

500 updates: noise
1K updates: noise with rhythm
5K updates: speech-like
10K updates: English accent
25K updates: Igbo emerging
50K updates: recognizable Igbo
85K updates: clearer diction
120K updates: natural prosody
150K updates: refined
210K updates: final checkpoint

Bilingual capability

Because IkengaTTS is fine-tuned from an English-trained base, it retains some English speech capability. It can also handle Igbo-English code-switching, common in everyday Nigerian speech. Prosody in mixed-language sentences is a known limitation.

English (fine-tuned model)
English sentence spoken by the Igbo-adapted model
English (base model)
Same sentence from the original English-only model
Code-switching (Igbo + English)
Mixed-language sentence, common in everyday Nigerian speech

Evaluation results

Evaluated on 987 held-out samples from the African Voices dev_test set (sampled from 6,937 clips). We measure ASR-transcribability (WER via MMS), speaker similarity (WavLM cosine), and predicted speech quality (UTMOS). The key metric is WER delta (generated minus ground truth), which controls for the ASR model's baseline errors on Igbo.
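The ΔWER arithmetic can be sketched in a few lines. The `wer` and `wer_delta` helpers below are hypothetical stand-ins (the actual evaluation transcribes audio with MMS); they show the core idea: score ASR transcripts of both the generated audio and the ground-truth recording against the reference text, then subtract so the ASR model's baseline errors on Igbo cancel out.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def wer_delta(asr_on_generated, asr_on_ground_truth, transcripts):
    """Mean ΔWER in percentage points: WER(ASR(generated)) minus
    WER(ASR(ground truth)). Subtracting cancels the ASR model's own
    baseline errors on Igbo."""
    gen = sum(wer(t, h) for t, h in zip(transcripts, asr_on_generated))
    gt = sum(wer(t, h) for t, h in zip(transcripts, asr_on_ground_truth))
    return 100 * (gen - gt) / len(transcripts)
```

A negative ΔWER, as in the table below, means the synthesized speech is *more* transcribable than the original recordings.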

Model                           MMS WER   ΔWER       SIM-o   UTMOS
Ground truth recordings         55.6%     –          –       1.28
IkengaTTS (ours)                50.3%     −5.3 pp    0.963   1.55
Base F5-TTS (no fine-tuning)    100.5%    +44.9 pp   0.958   1.55

Key finding: Our fine-tuned model is more transcribable by MMS than the original recordings (WER 50.3% vs 55.6%), likely because synthesis regularizes pronunciation. The base F5-TTS model completely fails (WER >100%), confirming that fine-tuning is necessary. All differences are statistically significant (Wilcoxon signed-rank, p < 10⁻¹⁵⁹).

Caveat: UTMOS is trained on English and unvalidated for Igbo. Human listening tests with native Igbo speakers are essential and planned as future work.

How it works

IkengaTTS fine-tunes the full 335M-parameter F5-TTS model on Igbo data. F5-TTS uses flow-matching (a type of diffusion) with a Diffusion Transformer (DiT) backbone. It takes a reference audio clip and target text, then generates speech that matches the reference speaker's voice while saying the target text.
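As a rough sketch of the flow-matching objective (scalars here stand in for mel-spectrogram frames, `model` stands in for the DiT backbone conditioned on text and the masked reference audio, and the function names are ours, not F5-TTS's):

```python
import random

def flow_matching_step(x1, model, cond):
    """One conditional flow-matching training step (scalar sketch).

    Sample noise x0 and a time t, form the straight-line interpolant
    x_t, and regress the model's output toward the constant velocity
    (x1 - x0) that carries x0 to the data sample x1.
    """
    x0 = random.gauss(0.0, 1.0)          # noise sample
    t = random.random()                  # time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # straight-line interpolant
    target_velocity = x1 - x0
    pred = model(xt, t, cond)
    return (pred - target_velocity) ** 2  # MSE loss for this sample

def euler_sample(model, cond, steps=32):
    """At inference, integrate the learned velocity field from noise
    (t=0) toward data (t=1) with simple Euler steps."""
    x = random.gauss(0.0, 1.0)
    for i in range(steps):
        t = i / steps
        x = x + model(x, t, cond) / steps
    return x
```

In the real model, `x` is a mel-spectrogram and `cond` carries the target text plus the reference audio; the generated mel is then vocoded to a waveform.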

Text processing: We use character-level tokenization that preserves Igbo diacritics and tone marks. The vocabulary was extended from 2,546 to 2,604 characters to cover Igbo-specific characters (ị, ọ, ụ, ñ, and their toned variants).
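A minimal sketch of what tone-preserving, character-level tokenization and vocabulary extension can look like (the helper names and vocab handling are illustrative assumptions, not the actual F5-TTS code):

```python
import unicodedata

# Igbo-specific base letters, per the vocabulary extension described above.
IGBO_EXTRAS = ["ị", "ọ", "ụ", "ñ"]
TONES = ["\u0301", "\u0300"]  # combining acute (high), grave (low)

def extend_vocab(base_vocab):
    """Append Igbo letters and their toned variants to a char vocab."""
    vocab = list(base_vocab)
    seen = set(vocab)
    for ch in IGBO_EXTRAS:
        for tone in [""] + TONES:
            # Store NFC forms; letters without a precomposed toned
            # codepoint keep the combining mark inside one token.
            tok = unicodedata.normalize("NFC", ch + tone)
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
    return vocab

def tokenize(text):
    """Character-level tokenization that keeps tone marks attached
    to their base letter instead of emitting them as stray tokens."""
    text = unicodedata.normalize("NFC", text)
    tokens = []
    for ch in text:
        if unicodedata.combining(ch) and tokens:
            tokens[-1] += ch  # glue tone mark onto its base character
        else:
            tokens.append(ch)
    return tokens
```

Keeping each tone mark attached to its base letter means "ákwá" and "àkwà" tokenize to different sequences, so the model can learn distinct pronunciations for them.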

Training: 8 epochs on 4×H100 GPUs, ~15 hours total. Loss plateaus early but perceptual quality continues improving through later epochs.

What failed: Before this approach, we tried Kokoro (autoregressive collapse on 934 clips), FastSpeech2 at 25M and 1.1M params (silent output), and VITS from scratch on 112K clips (recognizable speech but 40-55% duration compression). Transfer learning was the only approach that produced usable results.