Unsupervised Decomposition of a Document into Authorial Components

Moshe Koppel¹, Navot Akiva¹, Idan Dershowitz², Nachum Dershowitz³
¹Bar-Ilan University, ²Hebrew University of Jerusalem, ³Tel-Aviv University


Abstract

We propose a novel unsupervised method for separating out distinct authorial components of a document. In particular, we show that, given a book artificially “munged” from two thematically similar biblical books, we can separate out the two constituent books almost perfectly. This allows us to essentially recapitulate centuries of biblical scholarship automatically. One of the key elements of our method is exploitation of differences in synonym choice by different authors. Another key element is a novel two-stage clustering method in which we first use one feature set to generate a partial clustering and then leverage this using supervised learning with a different feature set.
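
The two-stage idea described above can be illustrated with a small sketch. This is not the authors' implementation: the seed-based partition in stage 1 and the nearest-centroid classifier in stage 2 are simple stand-ins chosen for brevity, and every function name, the `margin` parameter, the toy synonym sets, and the toy chunks are assumptions made for illustration only. Stage 1 labels only those chunks whose synonym choices clearly side with one of two seed chunks (a partial clustering); stage 2 trains on those labeled chunks using a different feature set (relative word frequencies) and then assigns every chunk to a component.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def synonym_features(tokens, synsets):
    """First feature set: which member of each synonym set a chunk uses."""
    present = set(tokens)
    return {w: 1.0 for group in synsets for w in group if w in present}

def word_features(tokens):
    """Second feature set: relative frequency of every word in the chunk."""
    return {w: c / len(tokens) for w, c in Counter(tokens).items()}

def partial_clustering(vectors, margin=0.5):
    """Stage 1 (stand-in): label only chunks that clearly side with a seed."""
    usable = [i for i, v in enumerate(vectors) if v]
    # Seeds: the two chunks whose synonym choices overlap least.
    a, b = min(combinations(usable, 2),
               key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
    core = {}
    for i in usable:
        score = cosine(vectors[i], vectors[a]) - cosine(vectors[i], vectors[b])
        if abs(score) >= margin:  # ambiguous chunks stay unlabeled
            core[i] = 0 if score > 0 else 1
    return core

def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = {}
    for v in vectors:
        for k, x in v.items():
            total[k] = total.get(k, 0.0) + x
    return {k: x / len(vectors) for k, x in total.items()}

def decompose(chunks, synsets, margin=0.5):
    """Stage 2 (stand-in): nearest-centroid classifier trained on the core."""
    tokens = [c.lower().split() for c in chunks]
    core = partial_clustering(
        [synonym_features(t, synsets) for t in tokens], margin)
    words = [word_features(t) for t in tokens]
    centroids = [centroid([words[i] for i, lab in core.items() if lab == k])
                 for k in (0, 1)]
    return [0 if cosine(w, centroids[0]) >= cosine(w, centroids[1]) else 1
            for w in words]

# Toy data (hypothetical): two "authors" who prefer different synonyms.
synsets = [("big", "large"), ("start", "begin")]
chunks = [
    "the big river did start to flow fast",
    "a large hill will begin to rise slowly",
    "the big river flow fast",
    "a large hill rise slowly",
    "the river did flow fast",   # no synonym evidence; decided in stage 2
    "a hill will rise slowly",   # no synonym evidence; decided in stage 2
]
labels = decompose(chunks, synsets)
```

In this toy run the last two chunks contain none of the listed synonyms, so stage 1 leaves them unlabeled; they are assigned only in stage 2, from their overlap in ordinary vocabulary with the chunks that stage 1 did label.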




Full paper: http://www.aclweb.org/anthology/P/P11/P11-1136.pdf