Faster and Smaller N-Gram Language Models

Adam Pauls and Dan Klein
UC Berkeley


Abstract

N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM implementation while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
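The abstract does not spell out the caching mechanism, but the general idea of caching during decoding can be illustrated with a minimal sketch: a direct-mapped cache sits in front of the (slow) n-gram probability lookup, so that repeated queries for the same n-gram hit the cache instead. The class name, cache size, and direct-mapped design below are illustrative assumptions, not the paper's actual implementation.

```python
class CachedLM:
    """Hypothetical sketch: a direct-mapped cache in front of an n-gram
    language model's probability lookups. The paper's actual caching
    scheme may differ; all names here are illustrative."""

    def __init__(self, lookup, size=1 << 16):
        self.lookup = lookup        # underlying (expensive) LM query function
        self.size = size            # number of cache slots (assumed power of two)
        self.keys = [None] * size   # cached n-gram keys
        self.vals = [0.0] * size    # cached log-probabilities

    def logprob(self, ngram):
        # Direct-mapped: each n-gram hashes to exactly one slot, so a
        # lookup is one hash, one compare, and (on a hit) one read.
        slot = hash(ngram) % self.size
        if self.keys[slot] == ngram:
            return self.vals[slot]              # cache hit
        val = self.lookup(ngram)                # cache miss: query the real model
        self.keys[slot] = ngram                 # evict whatever occupied the slot
        self.vals[slot] = val
        return val
```

During decoding, the same n-grams tend to be queried many times (hypotheses share contexts), which is why even a simple cache like this can pay off; a direct-mapped design trades some hit rate for constant-time, allocation-free lookups.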
Full paper: http://www.aclweb.org/anthology/P/P11/P11-1027.pdf