Welcome to the Arabic ASR page!
This page summarizes the components of the Arabic ASR project: the training process and the pipeline, which takes as input a file (or a set of files) and outputs the recognition results as subtitle files.
The project uses Kaldi for training a speech recognizer on Arabic data. This section contains information about the acoustic and language modeling approaches used.
Language modeling is done using VariKN. The script
produce_n_gram_lm.sh produces an n-gram language model from a VariKN corpus (to convert a Kaldi text file or plain text to a VariKN corpus, see the utilities
Kaldi_text2variKN_corpus.py and plain_text2variKN_corpus.py below, respectively).
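A VariKN corpus is essentially one sentence per line with sentence-boundary tags. The following is a minimal sketch of the conversion the plain-text utility is described as performing, assuming `<s>`/`</s>` tags; the real script may handle edge cases differently:

```python
def to_varikn_corpus(lines):
    # Wrap each non-empty sentence in VariKN sentence-boundary tags.
    # Illustrative only; plain_text2variKN_corpus.py is the real tool.
    return ["<s> " + line.strip() + " </s>" for line in lines if line.strip()]
```

For example, `to_varikn_corpus(["marhaba bikum"])` yields `["<s> marhaba bikum </s>"]`, and blank lines are dropped.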
The acoustic model is created using Kaldi. The code to create the model is found in
run.sh. Note that to build the model, Kaldi requires the
utils and
local directories to be placed inside a Kaldi example directory (e.g. kaldi_root/egs/arabic_asr/s5).
The acoustic modeling component builds multiple models sequentially, using the alignments from each model as input to the next model. The program performs the following tasks in the order specified:
- Extract features for the whole data set
- Split the data set into training and test sets with speaker independence
- Use the language modeling component to build a language model with VariKN
- Train a monophone GMM using the shortest 10,000 utterances from the training set
- Train a tri-phone GMM using alignments from the monophone model
- Train a tri-phone GMM on top of features transformed with LDA, using alignments from the previous tri-phone model
- Train a speaker adaptive training (SAT) model using alignments extracted from the tri-phone GMM, trained on LDA-transformed features
- Perform cleaning and segmentation using Kaldi’s clean-and-segment script
- Train a second SAT tri-phone GMM on the cleaned and segmented data using alignments generated by the cleaned model
- Perform volume and speed perturbation on the training data
- Extract features for the perturbed data
- Train an LDA+MLLT system using alignments from the previous SAT model
- Train a UBM-GMM on 20% of the perturbed data using the LDA+MLLT system
- Train an i-vector extractor
- Extract i-vectors for the training and test sets
- Train a TDNN using features and i-vectors from the training set
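The speaker-independent split in the steps above means that speakers, not utterances, are partitioned between the training and test sets. A minimal sketch of the idea, operating on the contents of a Kaldi utt2spk file (utterance ID to speaker ID); the function name and interface are illustrative, not the project's actual code:

```python
import random

def split_speaker_independent(utt2spk, test_fraction=0.1, seed=0):
    # Partition speakers, not utterances, so that no speaker
    # appears in both the training and the test set.
    speakers = sorted(set(utt2spk.values()))
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = sorted(u for u, s in utt2spk.items() if s not in test_speakers)
    test = sorted(u for u, s in utt2spk.items() if s in test_speakers)
    return train, test
```

Splitting by speaker rather than by utterance is what prevents the test-set error from being optimistically biased by speakers seen in training.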
The pipeline contains code to take a wave file (or a group of wave files within a directory) and output an SRT subtitle file for each file. The script
pipeline/run.sh builds an example directory for the specified file(s), extracts features, and finally performs decoding to produce the subtitle files.
The utils directory contains a number of useful utilities to be used in the training process.
utils/ctm2srt.py: Produces SRT subtitle files from a CTM file (the file format in which decoding results are stored).
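A CTM file stores one recognized word per line (`<utt-id> <channel> <start> <duration> <word>`), while SRT needs numbered cues with `HH:MM:SS,mmm` timestamps. The sketch below shows one plausible way to do the conversion for a single recording, grouping words into a new cue whenever the silence between them exceeds a threshold; the grouping heuristic and parameter are assumptions, not necessarily what ctm2srt.py does:

```python
def srt_timestamp(seconds):
    # Format seconds as an SRT timestamp HH:MM:SS,mmm.
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def ctm_to_srt(ctm_lines, gap=1.0):
    # Parse CTM word entries (<utt-id> <channel> <start> <duration> <word>)
    # and start a new subtitle cue whenever the pause between consecutive
    # words exceeds `gap` seconds. Simplified: assumes a single recording.
    words = []
    for line in ctm_lines:
        _utt, _ch, start, dur, word = line.split()[:5]
        words.append((float(start), float(start) + float(dur), word))
    words.sort()
    cues, cur = [], []
    for w in words:
        if cur and w[0] - cur[-1][1] > gap:
            cues.append(cur)
            cur = []
        cur.append(w)
    if cur:
        cues.append(cur)
    out = []
    for i, cue in enumerate(cues, 1):
        out.append(str(i))
        out.append(f"{srt_timestamp(cue[0][0])} --> {srt_timestamp(cue[-1][1])}")
        out.append(" ".join(w for _, _, w in cue))
        out.append("")
    return "\n".join(out)
```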
utils/Kaldi_lex2variKN_vocab.py: Produces a VariKN vocabulary file from a Kaldi lexicon.
utils/kaldi_text2grapheme_lexicon.py: Extracts a grapheme lexicon from a Kaldi text file.
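A grapheme lexicon maps each word to its characters as the pronunciation, which works well for Arabic's largely phonemic orthography. A minimal sketch of the idea, assuming Kaldi text lines of the form `<utt-id> <word> <word> ...`; the real script may normalize or filter further:

```python
def grapheme_lexicon(kaldi_text_lines):
    # Collect the vocabulary from Kaldi text lines, then map each word
    # to its space-separated graphemes. Illustrative sketch only.
    words = set()
    for line in kaldi_text_lines:
        words.update(line.split()[1:])  # drop the leading utterance ID
    return {w: " ".join(w) for w in sorted(words)}
```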
utils/Kaldi_text2plain_text.py: Transforms a Kaldi text file to plain text by removing utterance IDs.
utils/Kaldi_text2variKN_corpus.py: Transforms a Kaldi text file to a VariKN corpus by removing utterance IDs and adding VariKN start and end tags.
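A Kaldi text file has one utterance per line, with the utterance ID as the first token. The sketch below shows the kind of transformation the Kaldi-text conversion utility is described as performing, assuming `<s>`/`</s>` tags; details of the real script may differ:

```python
def kaldi_text_to_varikn(lines):
    # Strip the leading utterance ID from each Kaldi text line
    # and wrap the remaining transcript in VariKN sentence tags.
    out = []
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) == 2:  # skip lines that have an ID but no transcript
            out.append("<s> " + parts[1].strip() + " </s>")
    return out
```

Dropping the `<s>`/`</s>` wrapping gives the plain-text variant produced by Kaldi_text2plain_text.py.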
utils/ldc_corpus2kaldi_dir.py: Transforms an LDC corpus to a Kaldi data directory.
utils/plain_text2variKN_corpus.py: Transforms plain text to a VariKN corpus through adding start and end tags.
utils/remove_test_speakers.py: Removes speakers in one Kaldi directory from another Kaldi directory (useful when splitting a data set into training and test sets, where it helps to avoid speakers being shared between the sets).
utils/transliteration.py: Contains functions to perform transliteration on a sentence.