In today’s multilingual world, automatic speech recognition (ASR) systems must handle code-switching, the phenomenon where speakers mix two or more languages within a single conversation. This is especially critical for a low-resource language like Setswana, whose speakers frequently mix in English. In this article, we’ll explore how to fine-tune OpenAI’s Whisper model for Setswana code-switching using a custom tokenizer and PEFT’s LoRA adapters. We’ll walk through the key code snippets and explain why building such models is so important.
Why Setswana Code‑Switching Models Matter
Many native Setswana speakers naturally incorporate English phrases during conversation. A robust ASR system must be able to accurately transcribe both languages and handle transitions seamlessly. Traditional models trained on monolingual data might miss nuances or mis-transcribe English words, resulting in higher error rates. Fine‑tuning a multilingual model like Whisper on a Setswana dataset (with code‑switched examples) offers several benefits:
- Improved Accuracy: Adapting the model to the linguistic and phonetic characteristics of Setswana (with embedded English) reduces transcription errors.
- Better User Experience: Users get more accurate, natural transcriptions that reflect real-world speech patterns.
- Inclusive Technology: Local languages and dialects are better represented, ensuring that technology serves diverse communities.
- Enhanced Downstream Applications: More accurate transcriptions improve performance in voice assistants, subtitling, and language learning tools.
Fine‑Tuning Whisper for Setswana Code‑Switching
The approach involves three main steps:
- Custom Tokenizer: We add a custom token (<|tn|>) to mark Setswana language segments.
- LoRA-based Fine‑Tuning: We use Parameter‑Efficient Fine‑Tuning (PEFT) with LoRA to adapt only a small subset of model parameters (see the sketch after this list).
- Evaluation and Inference: We deploy the fine‑tuned model in an ASR pipeline for evaluation and real‑world usage.
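For context, here is a minimal sketch of how the first two steps might look at training time. The <|tn|> token string and the base checkpoint come from this article; the LoRA hyperparameters (r, lora_alpha, lora_dropout) and the target modules are illustrative assumptions, not necessarily the exact configuration used for the hosted adapter.
from transformers import WhisperForConditionalGeneration, WhisperTokenizer
from peft import LoraConfig, get_peft_model
# Step 1: extend the base tokenizer with a Setswana language token
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3-turbo")
tokenizer.add_tokens(["<|tn|>"], special_tokens=True)
# Make room for the new token in the embedding table
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
model.resize_token_embeddings(len(tokenizer))
# Step 2: attach a LoRA adapter so only a small set of weights is trained
# (r, lora_alpha, lora_dropout and target_modules are illustrative values)
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only the adapter weights are trainable
From here the wrapped model can be trained with a standard sequence-to-sequence training loop, and only the small adapter weights plus the custom tokenizer need to be pushed to the Hub.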
Below is a code snippet that shows how to load a custom tokenizer and fine‑tuned model from the Hugging Face Hub, create an ASR pipeline, and run inference on an audio file.
The fine‑tuned model (with LoRA adapters) and the custom tokenizer are hosted on the Hugging Face Hub as "kesbeast23/whisper-large-turbo-setswana-lora" and "kesbeast23/whisper-large-turbo-setswana-lora-tokenizer", respectively.
import torch
from transformers import (WhisperFeatureExtractor, WhisperTokenizer,
WhisperForConditionalGeneration, WhisperProcessor, pipeline)
from peft import PeftModel

# Set device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)
# Load the custom tokenizer from Hugging Face Hub
custom_tokenizer = WhisperTokenizer.from_pretrained("kesbeast23/whisper-large-turbo-setswana-lora-tokenizer")
# Load the feature extractor (usually the same as the base model)
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
# Load the base model and then load the LoRA adapter from HF Hub
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
# Resize token embeddings to match your custom tokenizer
base_model.resize_token_embeddings(len(custom_tokenizer))
# Load the fine-tuned LoRA adapter into the base model
ft_model = PeftModel.from_pretrained(base_model, "kesbeast23/whisper-large-turbo-setswana-lora")
ft_model.to(device)
ft_model.eval()
# Create a processor that combines the feature extractor and custom tokenizer
processor = WhisperProcessor(feature_extractor=feature_extractor, tokenizer=custom_tokenizer)
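As an optional sanity check (not part of the original snippet), you can confirm that the custom token resolved to a dedicated id and that the resized embedding table lines up with the tokenizer:
# Optional sanity check: the custom token should map to a valid id,
# and the resized embedding table should cover the full vocabulary
print("<|tn|> id:", custom_tokenizer.convert_tokens_to_ids("<|tn|>"))
print("Tokenizer size:", len(custom_tokenizer))
print("Embedding rows:", ft_model.get_input_embeddings().num_embeddings)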
Using the ASR Pipeline for Inference
Once the model and tokenizer are loaded, you can create a Hugging Face ASR pipeline. This allows you to transcribe an audio file with a single function call. The example below shows both a standard inference call and an inference call using forced decoder tokens to ensure Setswana transcription.
from transformers import pipeline
from IPython.display import Audio as IPyAudio, display

# Create an ASR pipeline with the fine-tuned model and custom tokenizer
asr_pipeline = pipeline(
"automatic-speech-recognition",
model=ft_model,
tokenizer=custom_tokenizer,
feature_extractor=feature_extractor,
device=0 if device=="cuda" else -1
)
# Define the test audio file path (update with your file path)
base_dir = "/setswana-asr/"
test_audio_path = base_dir + "audio/setswana.wav"
# Standard inference
result = asr_pipeline(test_audio_path)
print("Transcription via pipeline:", result["text"])
# Optionally, force the model to use Setswana by providing forced decoder tokens:
bos_id = ft_model.config.decoder_start_token_id
tn_id = custom_tokenizer.convert_tokens_to_ids("<|tn|>")
transcribe_id = custom_tokenizer.convert_tokens_to_ids("<|transcribe|>")
notimestamps_id = custom_tokenizer.convert_tokens_to_ids("<|notimestamps|>")
forced_ids = [(0, bos_id), (1, tn_id), (2, transcribe_id), (3, notimestamps_id)]
ft_model.config.forced_decoder_ids = forced_ids
result_forced = asr_pipeline(test_audio_path)
print("Transcription with forced Setswana:", result_forced["text"])
# Play the audio file for reference
display(IPyAudio(filename=test_audio_path))
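To cover the evaluation side of the third step, word error rate (WER) is the standard ASR metric. Below is a minimal sketch using the jiwer library; reference_text is a placeholder you would replace with the ground-truth transcription of your test clip.
from jiwer import wer

# Placeholder: substitute the ground-truth (code-switched) transcript of the clip
reference_text = "..."
hypothesis_text = result["text"]

print(f"WER: {wer(reference_text, hypothesis_text):.2%}")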
Why Use Forced Decoder Tokens?
By default, the model uses its internal language detection to decide which language to transcribe. In code‑switched contexts, you might want to ensure the model stays in Setswana mode. Forced decoder tokens let you provide a hint by forcing the model to start with specific tokens (e.g., <|tn|>, <|transcribe|>, <|notimestamps|>), which helps maintain consistency in transcription, especially when the data consists predominantly of Setswana speech with occasional English words.