The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content

Omar F. Zaidan and Chris Callison-Burch
Johns Hopkins University


Abstract

The written form of Arabic, Modern Standard Arabic (MSA), differs substantially from the spoken dialects of Arabic, which are the true "native" languages that Arabic speakers use in daily life. However, because MSA dominates written text, almost all Arabic datasets consist predominantly of MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and the dialect itself) of each sentence in the dataset. So far, we have labeled 108K sentences, 41% of which were marked as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.




Full paper: http://www.aclweb.org/anthology/P/P11/P11-2007.pdf