Natural Language as a Code: Modeling Human Language Using Information Theory
Why is natural language the way it is? Futrell proposes that human languages can be modeled as solutions to the problem of efficient communication under information-processing constraints, in particular constraints on short-term memory. He will present an analysis of dependency treebank corpora covering over 50 languages, in which the syntax of each sentence is represented as a simple graph structure, and show that word orders across languages are optimized to limit short-term memory demands during parsing: words linked by a dependency edge tend to be close together in linear order, an effect called dependency locality. Next, he will develop a general Bayesian, information-theoretic model of human language processing in which short-term memory is modeled as a noisy channel, recovering dependency locality as a special case. Finally, he will combine these insights into a model of human languages as information-theoretic codes for latent tree structures, and show that optimizing these codes for expressivity and compressibility yields grammars that resemble human languages.
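As an illustration (not part of the talk itself), the quantity at the heart of dependency locality can be computed directly from a treebank parse: the total dependency length of a sentence is the sum of linear distances between each word and its syntactic head. The sketch below assumes CoNLL-style head annotation, where each word carries the 1-based index of its head and 0 marks the root.

```python
def total_dependency_length(heads):
    """Sum of linear distances between each word and its head.

    heads[i] is the 1-based index of the head of word i+1;
    0 marks the root, which has no incoming dependency edge.
    """
    return sum(abs(i - h) for i, h in enumerate(heads, start=1) if h != 0)

# "the dog chased the cat":
#   the -> dog (2), dog -> chased (3), chased -> root (0),
#   the -> cat (5), cat -> chased (3)
heads = [2, 3, 0, 5, 3]
print(total_dependency_length(heads))  # 1 + 1 + 1 + 2 = 5
```

Word orders that keep this sum small place syntactically linked words near each other, which is the optimization pressure the corpus analysis is designed to detect.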