Garrett Mitchener's Research:

Mathematical approaches to linguistics

Back to Garrett's home page
Abstract
My mathematical interests are in dynamical systems and probability. My application is to linguistics: I'm interested in language change, and understanding why it occurs, how it spreads, and what this can tell us about the human mind. Much of language change happens inside a black box, because the mental and social processes that drive it are not directly accessible. Often, the only available information is a limited corpus of the old manuscripts that happened to survive the ages. Mathematical models and simulations can help fill in the missing information and allow researchers to test and improve theories about how we speak, write, learn, and interact. Math can also help us make the most of the limited data we have and properly connect it to theory.
On this page, I give background in linguistics and describe the problems I hope to solve.
A more detailed research statement: Read this if you are interested in the mathematical details.

What is mathematical linguistics?

My research focuses on the rather unusual combination of mathematics and linguistics. Linguistics is in many ways a social science, involving interviews and studies of documents for example, but it's also mathematical, in that one of the goals of the field is to understand the brain by finding abstract computational machines that approximate its neural machinery. And until recently, the term mathematical linguistics primarily meant exactly that: the study of abstract grammars and abstract machines, for example, regular languages, context free languages, finite state automata, Turing machines, and so forth. Many of these abstract theories of language are not actually useful for understanding human language, and they usually wind up in theoretical computer science rather than linguistics. But there's more.
More recently, mathematics has become increasingly useful for other areas of linguistics. For example, speech processing is very mathematical. Signal processing (which is based on functional analysis) and statistical learning theory are used to study speech, produce automated transcription of speech, and build human-computer interfaces based on speaking rather than typing. I've dabbled in some of this, particularly automated transcription, and believe me it's harder than it looks. Getting a computer to transcribe an utterance into phonetic symbols is extremely difficult; getting it to recognize words from a particular language under limited circumstances is easier.
The areas I'm most interested in are learning processes and population dynamics, and how these relate to actual human language, not the ideal abstractions of theoretical computer science. These areas are related to historical linguistics, which is the study of how languages change and what caused those changes. For example, here's a little bit of Old English (spoken till 1100 AD or so) from Ælfric's Homilies:
On twam þingum hæfde God þæs manes sawle gegodod.
Here's some Middle English (spoken from 1100 to 1600) from the Rule of St. Benet:
In þa dais sal we here sumþing of godes seruise.
The pronunciation, syntax, morphology and spelling of English have all changed dramatically over the centuries. And this is typical of all languages: For some reason, children at some point learn a language that's just a little different from their parents', and over the years those little changes add up into transformations of the entire language. Changes seem to occur because of complex interactions between usage patterns, child and adult learning processes, and population structure. At the moment, my main line of research is to develop mathematical tools for understanding this change process.
Another question of interest is: Assuming an evolutionary origin for the human species, how might language have evolved in humans from a non-linguistic ancestor species? Initially, it seems perfectly sensible to say that language provides tremendous survival benefit, and to attribute its appearance within humans to mutation and natural selection. But, to be satisfied with that little bit of an explanation is naive. A few thought experiments show how much deeper the question goes: If language is so useful, why did it evolve only once? (As far as we know, many other species communicate, but not with anything as complex as human syntax.) There are also bootstrapping problems. If a mutation enabling language appears in one individual, its benefit can't be realized because there's no one to talk to, so there's no selection for the mutation and it's effectively invisible to evolution. Then there's the complex system problem. Language is an extremely complex system, involving structured meanings, speech production, parsing, and modeling of the speaker's mental state. One part alone bears no obvious benefit without the rest. For instance, structured speech is pointless if no one can parse it, and the ability to parse structured utterances is useless if no one produces structured speech. However, it's astronomically unlikely that all the parts appeared at once through a massive mutation. In short, there's a lot left to be explained, and that's just for the origin of language.
We'd also like to know how evolutionary forces may have influenced the form of language, and questions like these come up: Is there some reason that nouns have gender in most languages? Why do we have contractions, irregular forms, etc? Why is it difficult for adults to learn second languages? And all sorts of theories can be proposed to explain these in terms of some survival benefit, but initially they all lie in the realm of pure speculation and “just-so stories.” Unless there is strong evidence linking an aspect of language to a proposed survival benefit, and strong evidence that the proposed benefit was actually helpful to survival at some point, there is no reason to accept that as the correct explanation.
My dissertation began to address some of the issues surrounding language and evolution. Most of it centers around the simplest non-trivial mathematical models of natural selection with imperfect learning and genetic variation in the language faculty. And despite the simplicity, the results are strikingly complex. For example, one version of the model results in chaotic oscillations among grammars. Other instances show that the traditional cartoon of evolution, in which a “better” variant of a species takes over from its ancestor, doesn't necessarily apply to language. A mutation that gives an individual greatly improved language skills that happen to be incompatible with the existing language will most likely die out because its benefit can't be realized. Another instance of the model has a property called accidental stability, because it shows how a mutation can spread or die out depending on which grammar a population chooses before the mutation appears. In short, an essentially accidental choice determines the genetic makeup of future generations, and any straightforward notion of “fitness” is thrown away. These models indicate that there are probably not explanations of the form “Human language has feature X because it provides survival benefit Y” for most of the features of human language, and that we must step back and re-think what it means to have an evolutionary explanation for something.

Why math?

So where does math come in? Historical linguistics and biological anthropology are for the most part observational sciences rather than experimental sciences. We don't get to do experiments such as resetting England to its state in 200 AD, letting history repeat itself, and seeing whether it takes the same path. Nor can we interview people from 8000 years ago to see what their language was like. We also don't get to resurrect pre-human ancestor species and see if we can teach them language (as in the experiments to teach gorillas sign language). That's not to say that no experiments can ever be done, but for the most part, investigators in these fields have to stick to manuscripts and fossil records and just deal with the limited data. And here's where math comes in: Mathematics can provide models, such as the differential equations from my dissertation, that can fill in some of the experiment gap. If nothing else, a model of a hypothesis can improve the precision with which it is stated, check it for consistency, and perhaps uncover predictions that might be testable. Also, powerful tools from probability and statistics allow us to get more and more information from limited data (for example, the constant rate effect in how a change spreads in different parts of a grammar). Detailed simulations are also becoming popular, and they have the side effect of drawing together theories from different parts of linguistics (historical studies, abstract theory, and child and adult learning) and trying to improve each area through considerations from the others.

About my Middle English project

Terminology
syntax: Part of grammar dealing with how words are organized into larger structures. Ex: “John has read those books” is thought of as a sentence built from John[noun, 3rd person singular, agent, nominative], has[inflection, perfect auxiliary, present], read[verb, past participle], those[determiner, demonstrative], and books[noun, 3rd person, plural, theme, accusative]. Also those books is a phrase within the sentence as a whole.
morphology: Part of grammar dealing with how words are assembled from morphemes. Morphemes are stems, prefixes, suffixes, clitics and such. Ex: “shouldn't've” (spoken, but not written in formal English) is formed from the stem should, the clitic -n't derived from the negative particle not, and the clitic 've derived from the auxiliary have.
phonology: Part of grammar dealing with the sound system. Ex: The English plural suffix for nouns is pronounced [s] after voiceless sounds (“cats”), [z] after voiced sounds (“dogs”), and [əz] after sibilant sounds (“foxes”).
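As a rough illustration of how a phonological rule like the plural example can be stated precisely, here is a minimal sketch in Python. The function name plural_suffix and the two sound sets are assumptions made for this illustration, not part of any standard library or of the simulation described on this page. The sketch takes the final sound of the noun as an IPA string and treats the sibilants as the class that triggers [əz].

```python
# Hypothetical helper for illustration only; the sound inventories are simplified.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS_NON_SIBILANTS = {"p", "t", "k", "f", "θ"}


def plural_suffix(final_sound):
    """Return the pronunciation of the English plural suffix after `final_sound`."""
    if final_sound in SIBILANTS:
        return "əz"   # "foxes": the stem ends in the sibilant [s]
    if final_sound in VOICELESS_NON_SIBILANTS:
        return "s"    # "cats": the stem ends in voiceless [t]
    return "z"        # "dogs": the stem ends in a voiced sound


if __name__ == "__main__":
    for word, final in [("cat", "t"), ("dog", "g"), ("fox", "s")]:
        print(word, "->", "[" + plural_suffix(final) + "]")
```

The order of the checks matters: the sibilant test comes first because sibilants such as [s] are also voiceless and would otherwise be assigned plain [s].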
At the moment, I'm trying to understand the word order of Middle English, and how it changed to the modern order. I'm working closely with linguist Anthony Kroch at the University of Pennsylvania. There are a number of reasons for picking this change as a topic of study. First, there is a parsed corpus of Middle English manuscripts that provides data for testing hypotheses. This written record is thought to reflect the spoken language fairly well, as opposed to written Old English, which seems to have become a literary standard maintained in monasteries and divergent from spoken Old English. (Something similar seems to have happened in the case of written Latin, which was maintained in scientific and religious communities long after it ceased to be spoken, and never fully reflected the spoken Latin dialects that eventually gave rise to Italian, Spanish, French, etc.) At any rate, parsed corpora are simply not available yet for other languages over long periods of time. Second, Middle English underwent several changes that appear to be primarily syntactic: The verb-second rule and object-verb order both changed. This means that a model can probably be developed for understanding these changes without including morphology and phonology. Third, the loss of verb-second is particularly interesting, as it didn't occur in similar languages (such as Icelandic) and there are several proposals for what caused the change in Middle English.
What is Verb-Second?
The verb-second rule causes the verb that agrees with the subject to move to the front of the sentence, and something else must move in front of that. That is, it's a combination of verb fronting followed by topic fronting. Modern German and Icelandic use different forms of this operation. It's used in Modern English only to form questions (“What did you see?”) but it was used to form declarative statements in Old and Middle English. This rule was lost in favor of the current subject-verb-object (SVO) word order.
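To make the two-step description concrete, here is a toy sketch in Python. It is only an illustration: the function verb_second, the flat list of constituents, and the way the Middle English sentence quoted earlier is segmented are all assumptions of this sketch, and real syntactic theories work with tree structures rather than flat lists.

```python
def verb_second(constituents, finite_verb, topic):
    """Return the constituents in verb-second order: topic, finite verb, then the rest.

    Hypothetical helper: fronting the verb and then fronting the topic
    leaves the topic first and the finite verb second.
    """
    if finite_verb not in constituents or topic not in constituents:
        raise ValueError("finite_verb and topic must both be constituents")
    rest = [c for c in constituents if c != finite_verb and c != topic]
    return [topic, finite_verb] + rest


if __name__ == "__main__":
    # Assumed segmentation of the Rule of St. Benet example.
    clause = ["we", "sal", "here", "sumþing of godes seruise", "in þa dais"]
    print(" ".join(verb_second(clause, finite_verb="sal", topic="in þa dais")))
    # prints: in þa dais sal we here sumþing of godes seruise
```

When the subject itself is chosen as the topic, the same function yields an order that looks like ordinary subject-verb-object, which is one reason verb-second can be hard to detect from individual sentences.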
Specifically, these facts about Middle English seem to be keys to why it lost verb-second but other Germanic languages did not:
So, my current project is to put together a simulation that includes a significant amount of linguistic realism in the hopes that it will be able to simulate the loss of verb-second in Middle English, while simulating Icelandic and other languages that stably maintain verb-second.
The general formula by which the diachronic linguistics community explains a language change is the following:
Terminology
diachronic: Studying the same language over two or more time periods spanning a change.
synchronic: Studying a language across a population during a narrow time period.
With any luck, the simulation will provide insight into what specific properties of Middle English were most instrumental in driving the loss of verb-second. It should be able to answer questions about whether random chance is enough, given the circumstances, or indicate that we need a further explanation of what drove the change. The simulation also includes a system for dealing with written as well as spoken language, so there's the possibility of comparing the results of the simulation directly to corpus data.
The last important point is that once we know what drove a language change and pinpoint when the change became possible, we're left with the question of why it happened when it did, rather than sooner or later. Imagine for example that you have a fair coin, and you flip it until it comes up heads ten times in a row. You try this one day and it takes 1479 flips. Is this surprising? Would you expect it to take more or fewer flips on average? How big of a range should you expect? Since we can't repeat language changes, simulations can give a partial answer to this question. If repeated runs of the simulation show that the change tends to take one or two centuries, then there's no surprise if this is what the manuscript record shows. If the simulation indicates that the change should generally take place immediately, thus contradicting the manuscript record, then either the simulation is flawed or the manuscript record bears a second look.
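The coin example can be worked out directly. For a fair coin, the expected number of flips needed to see n heads in a row is 2^(n+1) - 2, which is 2046 for n = 10, so 1479 flips is somewhat quicker than average but not remarkable. The following Python sketch checks this by repeated simulation, in the same spirit as repeated runs of a language change simulation. The function names, the number of trials, and the seed are arbitrary choices made for this illustration.

```python
import random
import statistics


def flips_until_run(run_length, rng):
    """Flip a fair coin until `run_length` heads occur in a row; return the flip count."""
    flips = 0
    streak = 0
    while streak < run_length:
        flips += 1
        if rng.random() < 0.5:   # heads extends the current streak
            streak += 1
        else:                    # tails resets the streak
            streak = 0
    return flips


def expected_flips(run_length):
    """Exact expected waiting time for a fair coin: 2^(n+1) - 2."""
    return 2 ** (run_length + 1) - 2


def simulate(run_length=10, trials=1000, seed=0):
    """Repeat the experiment `trials` times and return the list of flip counts."""
    rng = random.Random(seed)
    return [flips_until_run(run_length, rng) for _ in range(trials)]


if __name__ == "__main__":
    samples = simulate()
    print("exact expected flips:", expected_flips(10))
    print("simulated mean:", statistics.mean(samples))
    print("simulated standard deviation:", statistics.stdev(samples))
    print("shortest and longest runs:", min(samples), max(samples))
```

The spread of the simulated waiting times is about as large as the mean itself, which is the point of the example: a single observed history, taken alone, says little about how long a change "should" take.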
Now all I have to do is write and run the simulation...

Last modified: Wed Feb 14 13:27:51 EST 2007
Revision $Id: ResearchSummary.html,v 1.8 2004/11/16 21:56:40 wgm Exp $