What is a word? An introduction to computational linguistics.
What is a word? This question is one of the most deceptively simple ones I know. Everyone will say they know the answer, or at least say they know one when they see one, but even native speakers of a language can and do disagree. The dictionary isn't much help since many dictionaries have multi-sentence, ad hoc definitions which basically boil down to "a word is a unit of language that means something, sort of."
Let's jump ahead and assume we know what a word is, or that we can get native speakers to identify most words most of the time. Furthermore, let's say that our goal is to get a computer to understand a given language. Since humans learn languages initially by learning words and basic grammar, it seems like a good choice to try to get computers to recognize words. So, our goal: given a string of English letters, insert spaces between the words.
What is a word?

To show that the above exercise isn't totally contrived, let's look at some of the subtleties in the idea of the word. This part is only for people interested in the "linguistics" part of "computational linguistics," but if you want to read it then click here.
Words are linguistic constructs, not orthographic ones.

Language precedes writing. Children can speak and comprehend a language before they learn to read and write in that language. That is to say nothing of people who are illiterate or languages which have no formalized writing system. Furthermore, when learning a foreign language one typically learns words and basic grammar long before learning how to write, particularly if the writing system is dramatically different, e.g., an English-speaker learning Chinese or Arabic. This is all to say that words are not just formal orthographic constructs like quotation marks or apostrophes. Words appear to have some linguistic reality and are therefore worth studying from a language perspective.
None of this tells us what a word is, only that speakers of a language believe there is such a thing as a word. English speakers can still say, however, that what we see on paper is more-or-less accurate: spaces represent breaks between words. This dodges the question since the idea of a "word" exists cross-linguistically. The definition of a word, therefore, should encompass all such contingencies.
Synthetic languages

A few definitions, first. A morpheme is the smallest unit of language with a meaning. "Dogs" has two morphemes: "dog" and "s," with the former indicating a canine and the latter indicating multiplicity. Similarly, "looked" has two morphemes, with the suffix "-ed" indicating the tense of the verb "look." Synthetic languages have a high morpheme/word ratio. English-speakers might be familiar with comically long German words. In fact, German allows for essentially arbitrarily long words. Donaudampfschiffahrtsgesellschaftskapitän, for example, means "Danube steamship company captain." Is this one word or four? And even in the English translation, is "steamship" one word, or two?
At another extreme are synthetic languages like Hungarian which have many more grammatical affixes than English or German. The ideas of "conditional," "future," etc. are all marked by single morphemes attached to a word. In English "I would go" is three words, but in Hungarian it would appear to be just one or perhaps two.
Phonetics versus Orthography

Most linguists, however, would say the above isn't even that relevant. At best it just provides us with more evidence that words are something real. When we speak, however, there is no "space" between words in the way that there are spaces between words in written English. If you've ever learned a foreign language you probably remember a point at which you realized you were hearing words rather than sounds. Before that it sounded like one continuous stream of nonsense — and you were right about the "one continuous stream" part, anyhow.
Once you have learned some of the language, however, the patterns become clear in your mind and words jump out at you. That is, humans are able to take a single unbroken string of "characters" (i.e., sounds) and break it into words. Imagine if instead of parsing a string of English characters we were parsing some phonetic representation of a sentence. Suddenly our exercise no longer seems so uninteresting; indeed, we'd be doing something very much like what humans do when they parse a language they understand.
Assumptions

Obviously we can't integrate all of the subtleties above, as that would be tantamount to writing a computer program which actually processed text in the same way humans do. Rather, we will work under the following assumptions: first, we already have a database (called the "lexicon") of words; second, this database is complete. The first assumption isn't totally off-the-wall since it's a general working assumption among linguists that humans have just such a database. The second, however, is much harder to swallow since the lexicon is typically understood to contain root morphemes plus general information about the morphology, phonology, phonotactics, etc. of the language.
If I said "koop" were a verb, you'd know right away that "kooped," "koops," "kooper," etc. were all also valid words. Likewise, even though "cromulent" is not actually an English word, an English speaker knows that it could be (and that, furthermore, it would probably be an adjective), but that "plkdjfhg" could never be an English word. Our database, however, is very dumb and very uncompressed: every form of every word must be present, otherwise that form won't be counted as a word. We're only making this assumption to simplify the problem. I may be a pretty good programmer, but I'm not good enough to write a computer program which automatically learns a language's syntax, morphology, and phonology.
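Under these assumptions the lexicon is nothing more than a big lookup table. A minimal C++ sketch, where the handful of entries is a made-up toy list standing in for the full database:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// A "dumb, uncompressed" toy lexicon: every inflected form is listed
// explicitly, since we assume no knowledge of morphology. (These entries
// are illustrative; the real lexicon would come from training data.)
static const std::unordered_set<std::string> kLexicon = {
    "look", "looks", "looked", "looking",
    "dog",  "dogs",
};

// A string counts as a word only if it appears verbatim in the lexicon.
bool IsWord(const std::string& s) {
    return kLexicon.count(s) > 0;
}
```

Because the table stores whole forms, "looked" is found but an unlisted form like "kooped" would not be, no matter how plausible it sounds to a native speaker.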
Enough chit chat, let's get to the code.
The Algorithm

The algorithm I'm going to use is a simple probabilistic dynamic programming algorithm. Let's say we have a string like "therentisdue" and want to parse it as "the-rent-is-due." Assuming our training data is representative of the language as a whole (a big assumption, for sure), we know the probability of each word: the number of occurrences of the word in the data divided by the total number of words in the data. The idea is that the best parse of a string, given our training data, is the parse which has the highest probability of occurring.
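As a sketch of this probability model in C++ — where the frequency table below is a tiny made-up "corpus," not real training data — the probability of a parse is just the product of its word frequencies:

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Toy training counts (illustrative only): each word mapped to how many
// times it occurred, plus the total number of words seen.
static const std::unordered_map<std::string, int> kCounts = {
    {"the", 3}, {"rent", 1}, {"is", 2}, {"due", 1},
};
static const int kTotal = 7;

// P(word) = occurrences of the word / total words in the training data.
// Unseen strings get probability zero.
double WordProbability(const std::unordered_map<std::string, int>& counts,
                       int total, const std::string& word) {
    auto it = counts.find(word);
    return it == counts.end() ? 0.0 : double(it->second) / total;
}

// P(parse) = product of the probabilities of its words (treating the
// words as independent, which is what this simple model assumes).
double ParseProbability(const std::unordered_map<std::string, int>& counts,
                        int total, const std::vector<std::string>& parse) {
    double p = 1.0;
    for (const auto& w : parse) p *= WordProbability(counts, total, w);
    return p;
}
```

With these toy counts, "the-rent-is-due" gets probability (3/7)(1/7)(2/7)(1/7), while any parse containing a non-word gets probability zero.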
For the CS students out there this should scream "dynamic programming." For everyone else, I'll explain. The most obvious way to find the parse with the highest probability is to enumerate every possible parse and then pick the one with the highest probability. Implementing the algorithm this way is intractable since there are 2^(n-1) parses (why?). Instead we'll do the following. The pseudocode:
    BestParse[0] := ""
    FOR i IN [1..length of StringToParse] DO
        FOR j IN [0..i) DO
            parse := BestParse[j] + StringToParse[j,i]
            IF COST(parse) < COST(BestParse[i]) THEN
                BestParse[i] := parse
            ENDIF
        ENDFOR
    ENDFOR

    DEFINE COST(parse)
        RETURN -LOG2(PROBABILITY(parse))
    END

    DEFINE PROBABILITY(parse)
        RETURN product of the frequencies of each word in parse
    END
Let the input string be s. At each point i, that is, for the initial i-length substring of s, determine what the best parse up to i is. Now, let's say we know what the best parse at i is for some fixed i. To find the best parse at i+1 we try to insert a break after each initial j substring, for j < i+1. Since we've been keeping track of the best parse (and cost) at each such j the whole time, we just see which break insertion is the cheapest.
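Putting the pieces together, here is a minimal C++ sketch of the whole algorithm. The toy frequency table and the flat per-character penalty for non-words are my own illustrative choices (the penalty keeps every string parseable even when some substring is not in the lexicon); they are not dictated by the algorithm itself:

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

struct Segmenter {
    std::unordered_map<std::string, int> counts;  // toy word frequencies
    int total = 0;                                // total words in training data

    // COST(word) = -log2(P(word)). Non-words get a penalty proportional to
    // their length (an arbitrary choice) so that any real word is cheaper.
    double WordCost(const std::string& w) const {
        auto it = counts.find(w);
        if (it == counts.end()) return 20.0 * w.size();
        return -std::log2(double(it->second) / total);
    }

    // Dynamic program: best_cost[i] is the cost of the best parse of the
    // first i characters; back[i] records where that parse's last word begins.
    std::string Segment(const std::string& s) const {
        const size_t n = s.size();
        std::vector<double> best_cost(n + 1,
                                      std::numeric_limits<double>::infinity());
        std::vector<size_t> back(n + 1, 0);
        best_cost[0] = 0.0;
        for (size_t i = 1; i <= n; ++i) {
            for (size_t j = 0; j < i; ++j) {
                // Extend the best parse of s[0,j) with the word s[j,i).
                double c = best_cost[j] + WordCost(s.substr(j, i - j));
                if (c < best_cost[i]) {
                    best_cost[i] = c;
                    back[i] = j;
                }
            }
        }
        // Recover the parse by walking the back-pointers from the end.
        std::vector<std::string> words;
        for (size_t i = n; i > 0; i = back[i])
            words.push_back(s.substr(back[i], i - back[i]));
        std::string out;
        for (auto it = words.rbegin(); it != words.rend(); ++it) {
            if (!out.empty()) out += '-';
            out += *it;
        }
        return out;
    }
};
```

With counts for "the," "rent," "is," and "due," calling `Segment("therentisdue")` recovers "the-rent-is-due," since that is the only parse made entirely of known words.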
Here is an illustration, again with "therentisdue." Let's say we have "therenti" parsed so far. This means we know the best parse for each initial substring of this string, e.g., "t", "th", "the", etc. The best parse will probably be "the-rent-i" since each of these is a word and every other parse contains at least one non-word. Now let's see how the algorithm determines the best parse of "therentis" from this.
After each character in the string we need to decide whether or not to insert a break. Should we insert a space after the first character? Well, yes, since the best parse of a single character is definitely that character. So at the first step we get "t-|herentis." If we're favoring single letters over non-words (it's our choice to make) then the best parse after the second character would be "t-h-|erentis." After the third, however, the parse is "the-|rentis" since "the" is a word and therefore the best parse of the first three letters is "the" (we know this because, by assumption, we have already computed the best parse for "the"). Next we get "the-r-|entis," followed by "the-re-|ntis," and so on, until we get to "the-ren-|tis." After this step we try "the-rent-|is." This is a very good parse since we have three words. Finally, we try "the-rent-i-|s," which has a lower probability than the previous parse because "s" is not a word. Therefore "the-rent-is" is the parse which we save as the best parse of "therentis."
I implemented this algorithm in C++; you can download the code here. By default it uses the KJV Bible as training data, which means what it considers words can be a little funny. For example, "sin" is considered a very common word.