Welcome to the Arabic ASR page!

This page summarizes the components of the Arabic ASR project: the training process, and the pipeline, which takes as input a file (or a set of files) and outputs the recognition results as subtitle files.


Training

The project uses Kaldi to train a speech recognizer on Arabic data. This section describes the acoustic and language modeling approaches used.

Language model

Language modeling is done using VariKN. The script produce_n_gram_lm.sh produces an n-gram language model from a VariKN corpus (to convert a Kaldi text file or plain text to a VariKN corpus, see the utilities Kaldi_text2variKN_corpus.py and plain_text2variKN_corpus.py below, respectively).
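As a rough illustration, the conversion these utilities perform amounts to dropping the utterance ID and wrapping each sentence in VariKN's sentence-boundary tags. This is a minimal sketch, not the actual script; the `<s>`/`</s>` tag names are an assumption, so check Kaldi_text2variKN_corpus.py for the exact format it emits:

```python
def kaldi_text_to_varikn_corpus(kaldi_lines):
    """Convert Kaldi 'text' lines ("utt-id w1 w2 ...") into VariKN
    corpus lines wrapped in sentence-boundary tags.

    Sketch only: the <s>/</s> tag names are assumptions; see the real
    Kaldi_text2variKN_corpus.py for the exact output format.
    """
    corpus = []
    for line in kaldi_lines:
        parts = line.strip().split()
        if len(parts) < 2:          # skip empty or ID-only lines
            continue
        words = parts[1:]           # drop the utterance ID
        corpus.append("<s> " + " ".join(words) + " </s>")
    return corpus
```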

Acoustic model

The acoustic model is created using Kaldi. The code to create the model is found in run.sh. Note that to create the model, Kaldi requires run.sh and the utils and local directories to be placed inside an example directory under Kaldi's egs directory (e.g.: kaldi_root/egs/arabic_asr/s5). The acoustic modeling component builds multiple models sequentially, using the alignments from each model as input to the next. The script performs the following tasks, in order:

  1. Extract features for the whole data set
  2. Split the data set into training and test sets such that no speaker appears in both
  3. Use the language modeling component to build a language model with VariKN
  4. Train a monophone GMM using the shortest 10,000 utterances from the training set
  5. Train a tri-phone GMM using alignments from the monophone model
  6. Train a tri-phone GMM on top of features transformed with LDA, using alignments from the previous tri-phone model
  7. Train a speaker adaptive training (SAT) model using alignments extracted from the tri-phone GMM, trained on LDA-transformed features
  8. Perform cleaning and segmentation using Kaldi’s clean-and-segment script
  9. Train a SAT tri-phone GMM on the cleaned and segmented data, using alignments generated by the cleaned model
  10. Perform volume and speed perturbation on the training data
  11. Extract features for the perturbed data
  12. Train an LDA+MLLT system using alignments from the previous SAT model
  13. Train a UBM-GMM on 20% of the perturbed data using the LDA+MLLT system
  14. Train an i-vector extractor
  15. Extract i-vectors for the training and test sets
  16. Train a TDNN using features and i-vectors from the training set
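Step 2's speaker-independent split can be sketched as follows. This is a minimal illustration under assumed inputs (a Kaldi utt2spk-style mapping from utterance ID to speaker ID), not the actual logic in run.sh or remove_test_speakers.py:

```python
import random

def speaker_independent_split(utt2spk, test_fraction=0.1, seed=0):
    """Split utterances into train/test sets so that no speaker
    appears in both sets.

    utt2spk: dict mapping utterance ID -> speaker ID (as in a Kaldi
    utt2spk file). Sketch only; the project's real split is done by
    run.sh / remove_test_speakers.py.
    """
    speakers = sorted(set(utt2spk.values()))
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [u for u, s in utt2spk.items() if s not in test_speakers]
    test = [u for u, s in utt2spk.items() if s in test_speakers]
    return train, test
```

Splitting by speaker rather than by utterance is what keeps the test set speaker-independent: every utterance of a held-out speaker lands in the test set.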

The pipeline

The pipeline contains code to take a WAV file (or a group of WAV files within a directory) and output an SRT subtitle file for each file. The script pipeline/run.sh builds an example directory for the specified file(s), extracts features, and finally performs decoding to produce the subtitle files.


Utilities

The utils directory contains a number of useful utilities used in the training process.

  • utils/ctm2srt.py: Produces SRT subtitle files from a CTM file (the file format in which decoding results are stored).
  • utils/Kaldi_lex2variKN_vocab.py: Produces a VariKN vocabulary file from a Kaldi lexicon.
  • utils/kaldi_text2grapheme_lexicon.py: Extracts a grapheme lexicon from a Kaldi text file.
  • utils/Kaldi_text2plain_text.py: Transforms a Kaldi text file to plain text by removing utterance IDs.
  • utils/Kaldi_text2variKN_corpus.py: Transforms a Kaldi text to a VariKN corpus by removing utterance IDs and adding VariKN start and end tags.
  • utils/ldc_corpus2kaldi_dir.py: Transforms an LDC corpus to a Kaldi data directory.
  • utils/plain_text2variKN_corpus.py: Transforms plain text to a VariKN corpus through adding start and end tags.
  • utils/remove_test_speakers.py: Removes speakers in one Kaldi directory from another Kaldi directory (useful for splitting a data set into training and test sets, as it prevents speakers from being shared between the sets).
  • utils/transliteration.py: Contains functions to perform transliteration on a sentence.
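For context, the CTM-to-SRT conversion amounts to turning timed word entries (utterance, channel, start, duration, word) into numbered, timestamped subtitle blocks. The sketch below is an assumption-laden illustration, not the logic of utils/ctm2srt.py; in particular, the gap-based grouping strategy is invented here:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def ctm_to_srt(ctm_lines, max_gap=1.0):
    """Group consecutive CTM words into subtitle blocks, starting a
    new block whenever the silence between words exceeds max_gap
    seconds. CTM fields: utterance, channel, start, duration, word.

    Sketch only: the real grouping in utils/ctm2srt.py may differ.
    """
    blocks, current, prev_end = [], None, None
    for line in ctm_lines:
        _, _, start, dur, word = line.split()[:5]
        start, dur = float(start), float(dur)
        if current is None or start - prev_end > max_gap:
            current = [start, start + dur, [word]]  # begin, end, words
            blocks.append(current)
        else:
            current[1] = start + dur
            current[2].append(word)
        prev_end = start + dur
    out = []
    for i, (begin, end, words) in enumerate(blocks, 1):
        out.append(f"{i}\n{srt_timestamp(begin)} --> {srt_timestamp(end)}\n"
                   + " ".join(words) + "\n")
    return "\n".join(out)
```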