Phylogenetic Models of Language Diversification

Abstract: Language diversification is a stochastic process which presents similarities with phylogenetic evolution. Recently, there has been interest in modelling this process to help solve problems which traditional linguistic methods cannot resolve. The problem of estimating and quantifying the uncertainty in the age of the most recent common ancestor of the Indo-European languages is an example. We model lexical change by a point process on a phylogenetic tree. Our model is specifically tailored to lexical data and in particular treats aspects of linguistic change which are hitherto unaccounted for and which could have a strong impact on age estimates: catastrophic rate heterogeneity and missing data. We impose a prior distribution on the tree topology, node ages and other model parameters, give recursions to compute the likelihood and estimate all parameters jointly using Markov Chain Monte Carlo. We validate our methods using an extensive cross-validation procedure, reconstructing known ages of internal nodes. We make a second validation using synthetic data and show that model misspecifications due to borrowing of lexicon between languages and the presence of meaning categories in lexical data do not lead to systematic bias. We fit our model to two data sets of Indo-European languages and estimate the age of Proto-Indo-European. Our main analysis gives a 95% highest posterior probability density interval of 7110–9750 years Before the Present, in line with the so-called Anatolian hypothesis for the expansion of the Indo-European languages. We discuss why we are not concerned by the famous criticisms of statistical methods for historical linguistics leveled by Bergsland and Vogt [1962]. We also apply our methods to the reconstruction of the spread of Swabian dialects and to the detection of punctuational bursts of language change in the Indo-European family.
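
For readers unfamiliar with the interval quoted above, the following is a minimal sketch of how a 95% highest posterior density (HPD) interval can be read off a set of Markov Chain Monte Carlo draws of a root age. It assumes a unimodal posterior and uses the common "shortest window containing 95% of the sorted samples" estimator; the function name `hpd_interval` and the synthetic draws are illustrative assumptions, not the authors' implementation or data.

```python
import math
import random


def hpd_interval(samples, mass=0.95):
    """Return the shortest interval containing a fraction `mass` of the samples.

    This is a standard sample-based HPD estimate; it is only meaningful
    when the underlying posterior is unimodal (an assumption here).
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 < mass <= 1.0:
        raise ValueError("mass must lie in (0, 1]")

    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws the interval must cover

    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    best_lo, best_hi = xs[0], xs[k - 1]
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


if __name__ == "__main__":
    # Synthetic, made-up "root age" draws purely for illustration;
    # these are NOT the posterior samples from the analysis described above.
    rng = random.Random(0)
    draws = [rng.gauss(8400.0, 650.0) for _ in range(5000)]
    low, high = hpd_interval(draws, mass=0.95)
    print(f"95% HPD interval: {low:.0f} to {high:.0f} years BP")
```

Unlike an equal-tailed credible interval, this estimator is not forced to cut 2.5% from each tail, so for a skewed posterior it returns a narrower interval.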
