Week 1: Language Modeling
This is my first week of coding on my Google Summer of Code project with Red Hen Lab. My first task is to prepare the language model that we will use for decoding.
Aalto system’s language model
Baseline system: n-gram model
For the baseline, we will use VariKN [1] to train an n-gram model. The authors of the Aalto system [2] trained an n-gram model with 8 million n-gram contexts (the pruning parameters were tuned to reach this size), and more than 7 million of those contexts were of order three or lower.
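VariKN takes care of the actual model growing and Kneser-Ney smoothing, so I won't reimplement any of that. Still, as a rough illustration of why pruning matters for model size, here is a toy Python sketch (my own example, not VariKN) that counts n-grams up to order 4 and drops rare higher-order contexts; the data and the minimum-count threshold are made up:

```python
# Toy sketch (not VariKN): count n-grams up to order 4 and drop rare
# higher-order contexts. Tuning such a threshold is the same basic idea as
# tuning pruning parameters until the model reaches a target size.
from collections import Counter

def ngram_counts(sentences, max_order=4):
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    # Keep all unigrams; drop higher-order n-grams seen fewer than min_count times.
    pruned = {1: counts[1]}
    for n in range(2, max(counts) + 1):
        pruned[n] = Counter({g: c for g, c in counts[n].items() if c >= min_count})
    return pruned

if __name__ == "__main__":
    data = ["the cat sat", "the cat ran", "a dog ran"]
    full = ngram_counts(data)
    small = prune(full)
    print({n: len(c) for n, c in full.items()})   # n-gram counts before pruning
    print({n: len(c) for n, c in small.items()})  # and after
```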
The lexicon we will use for decoding is grapheme-based.
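Grapheme-based here means that each word's "pronunciation" is simply its sequence of letters instead of phonemes, which avoids the need for a hand-crafted pronunciation dictionary. A tiny sketch of what generating such lexicon entries could look like (the word list and the tab-separated output format are illustrative assumptions, not our decoder's exact lexicon format):

```python
# Toy sketch: build grapheme-based lexicon entries by spelling each word out
# letter by letter. The words and the output format are illustrative only.
def grapheme_lexicon(words):
    return {word: " ".join(word) for word in words}

if __name__ == "__main__":
    for word, graphemes in grapheme_lexicon(["hello", "world"]).items():
        print(f"{word}\t{graphemes}")
    # hello   h e l l o
    # world   w o r l d
```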
Improving the Language Model: The RNNLM
The second task will be to improve the language modeling component. To do this, we will train a recurrent neural network language model (RNNLM) using TheanoLM [3]. This model is to be used for lattice rescoring.
The RNNLM architecture
The authors proposed the following architecture for the RNNLM (a rough code sketch follows the table below).
Module | Characteristics |
---|---|
Projection layer | 200 neurons for the character and subword models; 300 neurons for the word model |
Hidden LSTM layer | 1,000 neurons for the character and subword models; 1,500 neurons for the word model |
Output layer | Softmax activation; the layer size depends on the vocabulary size; words and subwords are grouped into classes with the exchange word-clustering algorithm (the authors used 2,000 classes) |
Training method | Backpropagation |
Optimization algorithm | Adagrad |
Minibatch size | 64 for the character and subword models; 32 for the word model |
Sequence length per minibatch | 100 for the character model; 50 for the subword model; 25 for the word model |
Initial learning rate | 0.1 |
Dropout rate | 0.2 |
Maximum number of iterations | 15 |
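To make the table more concrete, here is a minimal PyTorch sketch wired up with the subword-model numbers (projection of 200, LSTM of 1,000, dropout 0.2, Adagrad with learning rate 0.1, minibatches of 64 sequences of length 50). The actual model will be trained with TheanoLM and uses a class-based softmax over the 2,000 exchange classes; the plain softmax, the vocabulary size, and the random data below are assumptions for illustration only.

```python
# Rough PyTorch approximation of the subword-model configuration in the table.
# The real system is trained with TheanoLM and uses a class-based softmax
# (2,000 exchange-algorithm classes); this sketch uses a plain softmax over
# the vocabulary, so it only illustrates the layer sizes and optimizer.
import torch
import torch.nn as nn

class SubwordRNNLM(nn.Module):
    def __init__(self, vocab_size, proj_size=200, hidden_size=1000, dropout=0.2):
        super().__init__()
        self.projection = nn.Embedding(vocab_size, proj_size)          # projection layer
        self.lstm = nn.LSTM(proj_size, hidden_size, batch_first=True)  # hidden LSTM layer
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(hidden_size, vocab_size)               # output layer (plain, not class-based)

    def forward(self, tokens, state=None):
        x = self.dropout(self.projection(tokens))
        x, state = self.lstm(x, state)
        return self.output(self.dropout(x)), state

if __name__ == "__main__":
    vocab_size = 5000  # hypothetical subword vocabulary size
    model = SubwordRNNLM(vocab_size)
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)  # Adagrad, initial LR 0.1
    criterion = nn.CrossEntropyLoss()

    # One toy training step on random data: a minibatch of 64 sequences of
    # length 50, matching the subword-model settings in the table.
    batch = torch.randint(0, vocab_size, (64, 51))
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits, _ = model(inputs)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    print(f"toy loss: {loss.item():.3f}")
```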
So, wish us luck!
References
[1] Vesa Siivola, Teemu Hirsimäki, and Sami Virpioja, “On growing and pruning Kneser-Ney smoothed n-gram models,” IEEE Transactions on Audio, Speech & Language Processing, vol. 15, no. 5, pp. 1617–1624, 2007.
[2] Peter Smit, Siva Reddy Gangireddy, Seppo Enarvi, Sami Virpioja, and Mikko Kurimo, “Aalto system for the 2017 Arabic multigenre broadcast challenge,” in ASRU, 2017.
[3] Seppo Enarvi and Mikko Kurimo, “TheanoLM - an extensible toolkit for neural network language modeling,” in INTERSPEECH 2016 – 17th Annual Conference of the International Speech Communication Association, San Francisco, September 2016, pp. 3052–3056.