Tone-preserving Igbo text-to-speech via transfer learning
IkengaTTS is a text-to-speech system for Igbo, a tonal language spoken by 30–45 million people in Nigeria and its diaspora. We fine-tuned F5-TTS (335M parameters) on 112,521 Igbo clips (255 hours) from the African Voices dataset. The result is the first open, evaluated Igbo TTS system with tone-aware text processing.
Multiple from-scratch approaches failed: autoregressive collapse, non-autoregressive silence, duration compression. Transfer learning from a large English-trained model was the one approach that worked.
Igbo is a register tone language. The same sequence of consonants and vowels can mean completely different things depending on pitch. A TTS system that ignores tone marks doesn't just sound wrong—it says the wrong word.
The base F5-TTS model (trained on English) cannot speak Igbo at all. After fine-tuning on African Voices data, the model produces intelligible Igbo speech. Listen to the difference:
The model clones the voice of a reference speaker. Here are samples across different speakers from the eval set, showing that the model generalizes across voices, genders, and speaking styles.
The same text synthesized at different training checkpoints. The model goes from noise to English-accented babble to intelligible Igbo over ~210K gradient updates (~8 epochs). Training loss plateaus early (~0.69 by epoch 2) but perceptual quality keeps improving.
Because IkengaTTS is fine-tuned from an English-trained base, it retains some English speech capability. It can also handle Igbo-English code-switching, common in everyday Nigerian speech. Prosody in mixed-language sentences is a known limitation.
Evaluated on 987 held-out samples from the African Voices dev_test set (sampled from 6,937 clips). We measure ASR transcribability (WER under the MMS multilingual ASR model), speaker similarity (WavLM embedding cosine), and predicted speech quality (UTMOS). The key metric is ΔWER (generated minus ground truth), which controls for the ASR model's baseline errors on Igbo.
| Model | MMS WER | ΔWER | SIM-o | UTMOS |
|---|---|---|---|---|
| Ground truth recordings | 55.6% | — | — | 1.28 |
| IkengaTTS (ours) | 50.3% | −5.3 pp | 0.963 | 1.55 |
| Base F5-TTS (no fine-tuning) | 100.5% | +44.9 pp | 0.958 | 1.55 |
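The ΔWER column is a difference of word error rates between generated and ground-truth audio, scored against the same reference text. A minimal stdlib sketch of that computation (hypothetical helpers, not the project's actual evaluation code; real pipelines typically normalize text before scoring):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(r)][len(h)] / max(len(r), 1)

def delta_wer(ref_text: str, asr_on_generated: str, asr_on_ground_truth: str) -> float:
    """ΔWER in percentage points: WER(generated) − WER(ground truth)."""
    return 100.0 * (wer(ref_text, asr_on_generated) - wer(ref_text, asr_on_ground_truth))
```

A negative ΔWER, as in the table, means the ASR model transcribes the synthesized audio more accurately than the original recording.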
Key finding: Our fine-tuned model is more transcribable by MMS than the original recordings (WER 50.3% vs 55.6%), likely because synthesis regularizes pronunciation. The base F5-TTS model fails completely (WER > 100%), confirming that fine-tuning is necessary. All differences are statistically significant (Wilcoxon signed-rank, p < 10⁻¹⁵⁹).
Caveat: UTMOS is trained on English and unvalidated for Igbo. Human listening tests with native Igbo speakers are essential and planned as future work.
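The significance test pairs per-utterance scores for the two conditions. As a sketch of that design, here is a minimal Wilcoxon signed-rank test using the normal approximation (no zero/tie continuity correction; the paired WER values below are illustrative, not the evaluation's data):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples x, y
    (normal approximation; ties in |d| share an average rank)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to runs of equal |d|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Illustrative per-utterance WERs: generated vs. ground-truth audio.
gen = [0.48, 0.52, 0.45, 0.50, 0.47, 0.51, 0.44, 0.49, 0.46, 0.53]
gt  = [0.56, 0.58, 0.53, 0.57, 0.55, 0.60, 0.52, 0.54, 0.51, 0.59]
stat, p = wilcoxon_signed_rank(gen, gt)
```

With 987 paired samples and a consistent direction of difference, the p-value becomes astronomically small, which is how a result like p < 10⁻¹⁵⁹ arises.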
IkengaTTS fine-tunes the full 335M-parameter F5-TTS model on Igbo data. F5-TTS uses flow-matching (a type of diffusion) with a Diffusion Transformer (DiT) backbone. It takes a reference audio clip and target text, then generates speech that matches the reference speaker's voice while saying the target text.
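In flow matching, the network learns a velocity field v(x, t) along paths from noise x₀ to data x₁, and sampling integrates dx/dt = v(x, t) from t = 0 to t = 1. A toy sketch of that sampling loop, with a closed-form field standing in for the trained DiT (illustrative only, not F5-TTS's inference code):

```python
import random

def sample(v_field, n, steps=32, seed=0):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = v_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Closed-form conditional field pointing along the straight path to a
# fixed target x1 (a stand-in for a mel-spectrogram frame):
x1 = [1.0, -0.5, 2.0, 0.0]
v = lambda x, t: [(a - b) / max(1.0 - t, 1e-6) for a, b in zip(x1, x)]
out = sample(v, len(x1))  # converges to x1 regardless of the noise start
```

In the real model the field is conditioned on the reference audio and target text, which is what steers the trajectory toward speech in the reference speaker's voice.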
Text processing: We use character-level tokenization that preserves Igbo diacritics and tone marks. The vocabulary was extended from 2,546 to 2,604 characters to cover Igbo-specific characters (ị, ọ, ụ, ṅ, and their toned variants).
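Preserving tone marks at the character level hinges on Unicode normalization: the same tone-marked vowel can arrive as one precomposed codepoint or as a base letter plus combining marks. A minimal sketch of tone-safe vocab lookup (hypothetical code, assuming NFC normalization before tokenization; not the project's actual tokenizer):

```python
import unicodedata

def tokenize(text: str, vocab: dict) -> list:
    """Character-level tokenization after NFC normalization, so that
    'o' + dot-below + grave and precomposed 'ọ' + grave map identically."""
    # NFC composes base letters with diacritics where a precomposed
    # codepoint exists (o + U+0323 -> ọ); combining tone marks with no
    # precomposed form stay as their own vocabulary entries.
    chars = unicodedata.normalize("NFC", text)
    return [vocab[c] for c in chars]

# Extend a base vocabulary with Igbo-specific characters.
base_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz .")}
extra = "ịọụṅ\u0300\u0301"  # dotted vowels, ṅ, combining grave/acute
vocab = dict(base_vocab)
for ch in extra:
    vocab.setdefault(ch, len(vocab))
```

Without a normalization step, visually identical strings can tokenize differently, which would silently split training data for the same tone-marked vowel across multiple token sequences.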
Training: 8 epochs on 4×H100 GPUs, ~15 hours total. Loss plateaus early but perceptual quality continues improving through later epochs.
What failed: Before this approach, we tried Kokoro (autoregressive collapse on 934 clips), FastSpeech2 at 25M and 1.1M parameters (silent output), and VITS from scratch on 112K clips (recognizable speech but 40–55% duration compression). Transfer learning was the only approach that produced usable results.