Local Histograms of Character N-grams for Authorship Attribution

Hugo Jair Escalante (1), Thamar Solorio (2), Manuel Montes-y-Gómez (3)
(1) Universidad Autónoma de Nuevo León, (2) The University of Alabama at Birmingham, (3) The University of Alabama at Birmingham and INAOE


Abstract

This paper proposes the use of local histograms (LHs) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; computed over words, they have been used successfully for text categorization and document visualization. In this work we explore the suitability of LHs over character-level n-grams for AA. We show that LHs are particularly helpful for AA because they provide useful information for uncovering, to some extent, the writing style of authors. Experimental results on AA data sets confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state-of-the-art approaches. We found that LHs are even more advantageous in challenging conditions, such as imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors in related tasks, such as authorship verification and plagiarism detection.
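
To make the contrast between global and local histograms concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes that a local histogram weights each character n-gram occurrence by a Gaussian kernel on its relative position in the document. The function names, the kernel choice, and the parameters `n`, `k`, and `sigma` are all illustrative assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of `text` in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def global_histogram(grams):
    """Normalized n-gram counts over the whole document (order is discarded)."""
    if not grams:
        return {}
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def local_histogram(grams, mu, sigma=0.2):
    """Position-weighted histogram centered at relative location `mu` in [0, 1].

    Assumption: a Gaussian kernel over relative position; other smoothing
    kernels could be substituted.
    """
    if not grams:
        return {}
    last = max(len(grams) - 1, 1)  # avoid division by zero for a single n-gram
    weights = Counter()
    for i, g in enumerate(grams):
        pos = i / last  # relative position of this occurrence in [0, 1]
        weights[g] += math.exp(-0.5 * ((pos - mu) / sigma) ** 2)
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def local_histograms(text, n=3, k=3, sigma=0.2):
    """Return k local histograms at evenly spaced locations in the document."""
    grams = char_ngrams(text, n)
    centers = [j / (k - 1) for j in range(k)] if k > 1 else [0.5]
    return [local_histogram(grams, mu, sigma) for mu in centers]
```

In this sketch, two documents that contain the same n-grams in a different order (for example `"aaaabbbb"` and `"bbbbaaaa"` with `n=1`) have identical global histograms but different local histograms, which illustrates the sequential information that a global histogram discards.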




Full paper: http://www.aclweb.org/anthology/P/P11/P11-1030.pdf