What is a word? An introduction to computational linguistics.
What is a word? This question is one of the most deceptively simple ones I know. Everyone will say they know the answer, or at least say they know one when they see one, but even native speakers of a language can and do disagree. The dictionary isn't much help since many dictionaries have multi-sentence, ad hoc definitions which basically boil down to "a word is a unit of language that means something, sort of."
Let's jump ahead and assume we know what a word is, or that we can get native speakers to identify most words most of the time. Furthermore, let's say that our goal is to get a computer to understand a given language. Since humans learn languages initially by learning words and basic grammar, it seems like a good choice to try to get computers to recognize words. So, our goal: given a string of English letters, insert spaces between the words.
What is a word?

To show that the above exercise isn't totally contrived, let's look at some of the subtleties in the idea of the word. This part is only for people interested in the "linguistics" part of "computational linguistics," but if you want to read it then click here.
Words are linguistic constructs, not orthographic ones.

Language precedes writing. Children can speak and comprehend a language before they learn to read and write in that language. That is to say nothing of people who are illiterate or languages which have no formalized writing system. Furthermore, when learning a foreign language one typically learns words and basic grammar long before learning how to write, particularly if the writing system is dramatically different, e.g., an English-speaker learning Chinese or Arabic. This is all to say that words are not just formal orthographic constructs like quotation marks or apostrophes. Words appear to have some linguistic reality and are therefore worth studying from a language perspective.
None of this tells us what a word is, only that speakers of a language believe there is such a thing as a word. English speakers can still say, however, that what we see on paper is more-or-less accurate: spaces represent breaks between words. This dodges the question since the idea of a "word" exists cross-linguistically. The definition of a word, therefore, should encompass all such contingencies.
Synthetic languages

A few definitions, first. A morpheme is the smallest unit of language with a meaning. "Dogs" has two morphemes: "dog" and "s," with the former indicating a canine and the latter indicating multiplicity. Similarly, "looked" has two morphemes, with the suffix "-ed" indicating the tense of the verb "look." Synthetic languages have a high morpheme/word ratio. English-speakers might be familiar with comically long German words. In fact, German allows for essentially arbitrarily long words. Donaudampfschiffahrtsgesellschaftskapitän, for example, means "Danube steamship company captain." Is this one word or four? And even in the English translation, is "steamship" one word, or two?
At another extreme are synthetic languages like Hungarian which have many more grammatical affixes than English or German. The ideas of "conditional," "future," etc. are all marked by single morphemes attached to a word. In English "I would go" is three words, but in Hungarian it would appear to be just one or perhaps two.
Phonetics versus Orthography

Most linguists, however, would say the above isn't even that relevant. At best it just provides us with more evidence that words are something real. When we speak, however, there is no "space" between words in the way that there are spaces between words in written English. If you've ever learned a foreign language you probably remember a point at which you realized you were hearing words rather than sounds. Before that it sounded like one continuous stream of nonsense — and you were right about the "one continuous stream" part, anyhow.
Once you have learned some of the language, however, the patterns become clear in your mind and words jump out at you. That is, humans are able to take a single unbroken string of "characters" (i.e., sounds) and break it into words. Imagine if instead of parsing a string of English characters we were parsing some phonetic representation of a sentence. Suddenly our exercise no longer seems so uninteresting; indeed, we'd be doing something very much like what humans do when they parse a language they understand.
Assumptions

Obviously we can't integrate all of the subtleties above, as that would be tantamount to writing a computer program which actually processed text in the same way humans do. Rather, we will work under the following assumptions: first, we already have a database (called the "lexicon") of words; second, this database is complete. The first assumption isn't totally off-the-wall since it's a general working assumption among linguists that humans have just such a database. The second, however, is much harder to swallow since the lexicon is typically understood to contain root morphemes plus general information about the morphology, phonology, phonotactics, etc. of the language.
If I said "koop" were a verb, you'd know right away that "kooped," "koops," "kooper," etc. were all also valid words. Likewise, even though "cromulent" is not actually an English word, an English speaker knows that it could be (and that, furthermore, it would probably be an adjective), but that "plkdjfhg" could never be an English word. Our database, however, is very dumb and very uncompressed: every form of every word must be present, otherwise that form won't be counted as a word. We're only making this assumption to simplify the problem. I may be a pretty good programmer, but I'm not good enough to write a computer program which automatically learns a language's syntax, morphology, and phonology.
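Under these assumptions the lexicon is nothing more than a big lookup table. A minimal C++ sketch, where the handful of entries is a made-up toy list standing in for the full database:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// A "dumb, uncompressed" toy lexicon: every inflected form is listed
// explicitly, since we assume no knowledge of morphology. (These entries
// are illustrative; the real lexicon would come from training data.)
static const std::unordered_set<std::string> kLexicon = {
    "look", "looks", "looked", "looking",
    "dog",  "dogs",
};

// A string counts as a word only if it appears verbatim in the lexicon.
bool IsWord(const std::string& s) {
    return kLexicon.count(s) > 0;
}
```

Because the table stores whole forms, "looked" is found but an unlisted form like "kooped" would not be, no matter how plausible it sounds to a native speaker.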
Enough chit chat, let's get to the code.
The Algorithm

The algorithm I'm going to use is a simple probabilistic dynamic programming algorithm. Let's say we have a string like "therentisdue" and want to parse it as "the-rent-is-due." Assuming our training data is representative of the language as a whole (a big assumption, for sure), we know the probability of each word: the number of occurrences of the word in the data divided by the total number of words in the data. The idea is that the best parse of a string, given our training data, is the parse which has the highest probability of occurring.
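As a sketch of this probability model in C++ — where the frequency table below is a tiny made-up "corpus," not real training data — the probability of a parse is just the product of its word frequencies:

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Toy training counts (illustrative only): each word mapped to how many
// times it occurred, plus the total number of words seen.
static const std::unordered_map<std::string, int> kCounts = {
    {"the", 3}, {"rent", 1}, {"is", 2}, {"due", 1},
};
static const int kTotal = 7;

// P(word) = occurrences of the word / total words in the training data.
// Unseen strings get probability zero.
double WordProbability(const std::unordered_map<std::string, int>& counts,
                       int total, const std::string& word) {
    auto it = counts.find(word);
    return it == counts.end() ? 0.0 : double(it->second) / total;
}

// P(parse) = product of the probabilities of its words (treating the
// words as independent, which is what this simple model assumes).
double ParseProbability(const std::unordered_map<std::string, int>& counts,
                        int total, const std::vector<std::string>& parse) {
    double p = 1.0;
    for (const auto& w : parse) p *= WordProbability(counts, total, w);
    return p;
}
```

With these toy counts, "the-rent-is-due" gets probability (3/7)(1/7)(2/7)(1/7), while any parse containing a non-word gets probability zero.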
For the CS students out there this should scream "dynamic programming." For everyone else, I'll explain. The most obvious way to find the parse with the highest probability is to enumerate every possible parse and then pick the one with the highest probability. Implementing the algorithm this way is intractable since there are 2^(n-1) parses (why?). Instead we'll do the following. The pseudocode:
    BestParse[0] := ""
    FOR i IN [1..length of StringToParse] DO
        FOR j IN [0..i) DO
            parse := BestParse[j] + StringToParse[j,i]
            IF COST(parse) < COST(BestParse[i]) THEN
                BestParse[i] := parse
            ENDIF
        ENDFOR
    ENDFOR

    DEFINE COST(parse)
        RETURN -LOG2(PROBABILITY(parse))
    END

    DEFINE PROBABILITY(parse)
        RETURN product of the frequencies of each word in parse
    END
Let the input string be s. At each point i, that is, for the initial i-length substring of s, determine what the best parse up to i is. Now, let's say we know what the best parse at i is for some fixed i. To find the best parse at i+1 we try to insert a break after each initial j substring, for j < i+1. Since we've been keeping track of the best parse (and cost) at each such j the whole time, we just see which break insertion is the cheapest.
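Putting the pieces together, here is a minimal C++ sketch of the whole algorithm. The toy frequency table and the flat per-character penalty for non-words are my own illustrative choices (the penalty keeps every string parseable even when some substring is not in the lexicon); they are not dictated by the algorithm itself:

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

struct Segmenter {
    std::unordered_map<std::string, int> counts;  // toy word frequencies
    int total = 0;                                // total words in training data

    // COST(word) = -log2(P(word)). Non-words get a penalty proportional to
    // their length (an arbitrary choice) so that any real word is cheaper.
    double WordCost(const std::string& w) const {
        auto it = counts.find(w);
        if (it == counts.end()) return 20.0 * w.size();
        return -std::log2(double(it->second) / total);
    }

    // Dynamic program: best_cost[i] is the cost of the best parse of the
    // first i characters; back[i] records where that parse's last word begins.
    std::string Segment(const std::string& s) const {
        const size_t n = s.size();
        std::vector<double> best_cost(n + 1,
                                      std::numeric_limits<double>::infinity());
        std::vector<size_t> back(n + 1, 0);
        best_cost[0] = 0.0;
        for (size_t i = 1; i <= n; ++i) {
            for (size_t j = 0; j < i; ++j) {
                // Extend the best parse of s[0,j) with the word s[j,i).
                double c = best_cost[j] + WordCost(s.substr(j, i - j));
                if (c < best_cost[i]) {
                    best_cost[i] = c;
                    back[i] = j;
                }
            }
        }
        // Recover the parse by walking the back-pointers from the end.
        std::vector<std::string> words;
        for (size_t i = n; i > 0; i = back[i])
            words.push_back(s.substr(back[i], i - back[i]));
        std::string out;
        for (auto it = words.rbegin(); it != words.rend(); ++it) {
            if (!out.empty()) out += '-';
            out += *it;
        }
        return out;
    }
};
```

With counts for "the," "rent," "is," and "due," calling `Segment("therentisdue")` recovers "the-rent-is-due," since that is the only parse made entirely of known words.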
Here is an illustration, again with "therentisdue." Let's say we have "therenti" parsed so far. This means we know the best parse for each initial substring of this string, e.g., "t", "th", "the", etc. The best parse will probably be "the-rent-i" since each of these is a word and every other parse contains at least one non-word. Now let's see how the algorithm determines the best parse of "therentis" from this.
After each character in the string we need to decide whether or not to insert a break. Should we insert a space after the first character? Well, yes, since the best parse of a single character is definitely that character. So at the first step we get "t-|herentis." If we're favoring single letters over non-words (it's our choice to make) then the best parse after the second character would be "t-h-|erentis." After the third, however, the parse is "the-|rentis" since "the" is a word and therefore the best parse of the first three letters is "the" (we know this because, by assumption, we have already computed the best parse for "the"). Next we get "the-r-|entis," followed by "the-re-|ntis," and so on, until we get to "the-ren-|tis." After this step we try "the-rent-|is." This is a very good parse since we have three words. Finally, we try "the-rent-i-|s," which has a lower probability than the previous parse because "s" is not a word. Therefore "the-rent-is" is the parse which we save as the best parse of "therentis."
I implemented this algorithm in C++; you can download the code here. By default it uses the KJV Bible as training data, which means what it considers words can be a little funny. For example, "sin" is considered a very common word.