Fully Unsupervised Word Segmentation with BVE and MDL

Daniel Hewlett and Paul Cohen
University of Arizona


Abstract

Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select among them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
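
The selection step described above can be illustrated with a minimal sketch. It assumes a generic two-part code (the cost of spelling out the lexicon plus the cost of encoding the corpus under a unigram model over word tokens), which is not necessarily the description length formulation used in the paper. The names description_length, select_by_mdl, and alphabet_bits are hypothetical, and the candidates are written by hand here rather than produced by a segmentation algorithm.

```python
import math
from collections import Counter


def description_length(segmentation, alphabet_bits=5.0):
    """Two-part code length in bits for a list of word tokens.

    Assumption: a fixed cost of `alphabet_bits` per character, and a
    unigram model over tokens. This is an illustrative formulation only.
    """
    counts = Counter(segmentation)
    total = len(segmentation)
    # Lexicon cost: spell out each distinct word, plus one terminator symbol.
    lexicon_bits = sum((len(word) + 1) * alphabet_bits for word in counts)
    # Corpus cost: each token costs -log2 of its relative frequency.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


def select_by_mdl(candidates):
    """Return the candidate segmentation with the smallest description length."""
    return min(candidates, key=description_length)


# Three hand-written candidate segmentations of the same toy corpus.
text = "thedogthecatthedog"
by_char = list(text)                                   # every character is a word
by_word = ["the", "dog", "the", "cat", "the", "dog"]   # plausible word boundaries
unsegmented = [text]                                   # the whole corpus is one word

best = select_by_mdl([by_char, by_word, unsegmented])
print(best, round(description_length(best), 2))
```

Under these assumptions the word-level candidate is selected for this toy input: the character-level candidate has cheap lexicon entries but encodes many tokens, and the unsegmented candidate has a single token but a long lexicon entry. The search over candidates is only as good as the set supplied, which is why the choice of candidate generator matters.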




Full paper: http://www.aclweb.org/anthology/P/P11/P11-2095.pdf