Thursday, February 09, 2006

It's a really long path to a semantically-annotated New Testament, and i'm not sure if we'll get there (or even that we know how). But, just as each journey begins with a single step, i've been thinking about some of the early steps it will take, and how to keep moving in the right direction.

One important milestone along the way is a sense-marked text, where each (content) word is annotated with its meaning. Let's take Matt.7.23 as an example:

And then will I declare to them, 'I never knew you; depart from me, you workers of lawlessness.' (Matt.7.23, ESV)

Some of these are simple function words (like "to") that don't really carry independent meaning. "declare" and "know", on the other hand, are verbs with important semantics. This is the first step toward a sense-marked text: indicating content words, their dictionary forms ("know", not "knew"), and their parts-of-speech (nouns, verbs, adjectives, adverbs, etc.). No English New Testament text provides this information today, as far as i know (though something approximating it is available in texts that are indexed to Strong's Greek lexicon, namely the KJV and the New American Standard Bible).

The next challenge is determining the actual semantics of each (content) word. You can't just look this up in a dictionary because most words are ambiguous, some very much so. This is harder for us to grasp because our amazing ability to understand language in its context makes the process so transparent. For example, the verb "declare" has the following definitions in WordNet 2.1, a semantically-oriented lexicon of English that groups words with similar meanings:

  1. state emphatically and authoritatively; "He declared that he needed more money to carry out the task he was charged with"
  2. announce publicly or officially; "The President declared war"
  3. state firmly; "He declared that he was innocent"
  4. declare to be; "She was declared incompetent"; "judge held that the defendant was innocent"

(and a few others, like making a declaration of dutiable goods to a customs agent). It's pretty easy to eliminate senses 2 and 4. The examples don't help quite enough in distinguishing 1 and 3, but the other information in WordNet (number of examples of usage, and the semantic hierarchy) makes it pretty clear that sense 3 is a more formal kind of declaration (so sense 1 is actually a hypernym, or more general superordinate, of this sense!).

This kind of annotation is a little trickier: it requires judgement about what words mean in their context (without context, any of the various senses is possible, though the earlier ones are more likely a priori). But this kind of sense tagging, particularly when coupled with a semantic hierarchy like WordNet, gives you some really interesting new capabilities. Any search engine will let you find Bible verses with the term "declare": some will let you use wild cards, so "declar*" would find "declare", "declares", "declaring", etc. This doesn't solve the "know" vs. "knew" problem: for that, you still need the dictionary forms. This is what the Hyper-concordance does, and while i don't think it's a "killer app" for Bible search, the data underneath it really is important for these purposes.
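
Here's a minimal sketch of the difference between wild-card search and dictionary-form search. The lemma table and the abbreviated word lists are hypothetical stand-ins for illustration, not the actual Hyper-concordance data or code.

```python
from fnmatch import fnmatchcase

# Hypothetical lemma table: surface form -> dictionary form.
LEMMAS = {"knew": "know", "knows": "know", "known": "know",
          "declares": "declare", "declared": "declare"}

# A few words from two verses (abbreviated, lowercased for the example).
VERSES = {
    "Matt.7.23": ["declare", "never", "knew", "you"],
    "John.10.14": ["know", "my", "own"],
}

def wildcard_search(pattern):
    """Verses containing a word that matches a shell-style pattern."""
    return sorted(ref for ref, words in VERSES.items()
                  if any(fnmatchcase(w, pattern) for w in words))

def lemma_search(lemma):
    """Verses containing any form whose dictionary form is `lemma`."""
    return sorted(ref for ref, words in VERSES.items()
                  if any(LEMMAS.get(w, w) == lemma for w in words))

print(wildcard_search("know*"))  # misses "knew" in Matt.7.23
print(lemma_search("know"))      # finds both verses
```

The wild card only helps when the forms share a prefix; the lemma lookup works for irregular forms too, at the cost of needing the table in the first place.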

Going beyond these cases, though, you can do much better with this kind of lexical semantic annotation: for example, you can search for instances of communication verbs (declare, announce, answer) that have the meaning of communication with words (as opposed to answer meaning the answer to a problem). Of course, we don't have this data for any English versions yet (though some Greek search tools approximate it), so we don't have this kind of search tool either. Solving this problem in a general and automatic way for unrestricted English is still a hard research problem. But we don't need that for the New Testament, because it's a closed corpus: we just need to do the work of marking it up once. That sounds very hard, but there are ways to bootstrap the process (using resources like WordNet, and the fact that we have some versions that are keyed to their Greek terms, which can help with disambiguation).
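
To make the idea concrete, here's a rough sketch of what a search by semantic class might look like once the text is sense-tagged. Everything in it is assumed for illustration: the sense labels, the little hierarchy, and the tagged tokens are made up, and they are not actual WordNet identifiers or real annotation data.

```python
# Hypothetical hierarchy (specific sense -> more general sense).
HYPERNYM = {
    "declare/state-firmly": "declare/state",
    "declare/state": "communicate-with-words",
    "announce/make-known": "communicate-with-words",
    "answer/reply": "communicate-with-words",
    "answer/solve": "solve",
}

# Hypothetical sense-tagged tokens: (location, surface form, sense label).
TOKENS = [
    ("Matt.7.23", "declare", "declare/state-firmly"),
    ("example-1", "answered", "answer/reply"),
    ("example-2", "answer", "answer/solve"),
]

def ancestors(sense):
    """Yield the sense itself, then each more general sense above it."""
    while sense is not None:
        yield sense
        sense = HYPERNYM.get(sense)

def search_by_class(semantic_class):
    """Tokens whose sense is, or falls under, the given class."""
    return [(loc, word) for loc, word, sense in TOKENS
            if semantic_class in ancestors(sense)]

print(search_by_class("communicate-with-words"))
```

The point is that "answer" in the sense of replying is found, while "answer" in the sense of solving a problem is not, even though the surface word is the same.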

With that lengthy preamble, then, here are some thoughts on how to jump-start the process of sense-tagging the New Testament.

  1. Take a version like the New American Standard that has annotations of the Greek dictionary forms (via Strong's numbers).
  2. Add the annotation of the English base form (the technical term for this is lemma) so you know that "knew" is an instance of "know". I have code to do this behind the Hyper-concordance: i've only configured it for the vocabulary of the RSV and ESV at this point, and it's not perfect, but it's a pretty good start.
  3. Assigning parts-of-speech can also be usefully approximated through a combination of several factors:
    • the fact that some English words have only one part-of-speech (for example, "declare" is only a verb)
    • for ambiguous words like "light" (which can be a noun, a verb, or an adjective), the mapping to Greek will often assign a correct part-of-speech, especially for more literal translations like NASB. This isn't perfect, though: in Matt.5.15, "it gives light to all who are in the house", the Greek part-of-speech is a verb, not a noun.
    • there are some automated and open source approaches to part-of-speech tagging with good performance (like Eric Brill's tagger), though they might do a little worse on the New Testament, which has some fairly unusual vocabulary
  4. Now the tricky part: how to assign senses (assuming you're using WordNet senses as the target)? There are several sources of information:
    • Using WordNet, you look up the possible senses (technically synsets) for the English term in its relevant part-of-speech, which are ordered by frequency of occurrence. Each synset gives you a set of one or more terms: for example declare#1 and #3 have only the term "declare", while declare#2 has ["announce", "declare"].
    • the NASB lexicon also has a set of English glosses for the Greek term, ordered by their frequency of use for translation: for "declare", that's Strong's ID G3670, homologeo. homologeo is variously translated throughout the NASB as "acknowledge", "admit", "confess"/"confessed"/"confesses"/"confessing", "declare", "give thanks", "made", "profess", and "promised". Each of these also has a frequency, though the frequencies need to be collapsed across variants (like in step #2 above) so you get the count for all forms of "confess".

    At this point we need some metrics and heuristics to decide between the various possible WordNet senses. Two easy ones: more frequent senses (in general English) should be more likely in the NT too, and more general senses should be more likely than their subordinate ones. Is the case here with "declare", where one sense is a hypernym of another, unusual for the subset of English represented in a New Testament translation? I don't know. But just making the first choice on these grounds is likely to be right a very large percentage of the time.

  5. There will probably be plenty of rough edges still to work out: what if a NT term isn't in WordNet at all? (but hey, they've got propitiation and leaven, so the coverage must be pretty complete) In general, i haven't had enough experience with WordNet to know how to use it most effectively: but there must be some benefit here. When the glosses from the NASB lexicon overlap with the glosses for a particular WordNet sense, that sense is more likely. The most challenging case is where the WordNet synset only provides one term (like "declare" senses 1 and 3): then what do you do? You could try the approach of taking the immediate hypernyms of each term, and measure overlap (or semantic distance) from the NASB glosses here as well.
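
The ranking idea in steps 4 and 5 can be sketched roughly as follows. This is only one possible heuristic under stated assumptions: the function name and the scoring rule (count overlapping terms, break ties by WordNet order) are mine for illustration. The data is limited to what's quoted above: the terms for "declare" senses 1 to 3 and the NASB glosses for homologeo, with the "confess" variants collapsed.

```python
# Synset terms for "declare", in WordNet order (most frequent sense first).
DECLARE_SENSES = [
    (1, ["declare"]),
    (2, ["announce", "declare"]),
    (3, ["declare"]),
]

# NASB glosses for homologeo (G3670), variants collapsed to base forms.
HOMOLOGEO_GLOSSES = ["acknowledge", "admit", "confess", "declare",
                     "give thanks", "made", "profess", "promised"]

def rank_senses(word, senses, greek_glosses):
    """Rank sense numbers: most gloss overlap first, ties by WordNet order."""
    # The word itself appears in every candidate synset, so it is ignored.
    glosses = set(greek_glosses) - {word}
    scored = []
    for position, (number, terms) in enumerate(senses):
        overlap = len(set(terms) & glosses)
        scored.append((-overlap, position, number))
    return [number for _, _, number in sorted(scored)]

print(rank_senses("declare", DECLARE_SENSES, HOMOLOGEO_GLOSSES))
```

For "declare" none of the synset terms overlap with the glosses, so the ranking simply falls back to WordNet's frequency order. That's exactly the challenging case described above, where something like hypernym overlap would be needed to do better.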

Bottom line: suppose the best you can do is to identify and rank the choices, and present them in a usable interface that makes it easy to manually select the right sense. I'd estimate this could be reviewed in a matter of a few person-weeks by someone who understood the issues and was willing to do the work. Though that's still a fair amount of effort, it would create a valuable resource for Bible study and search (not to mention linguistic research in general).

8:23:13 AM