Blogos

The Vision of a Semantic New Testament

The Vision

And the Lord answered me:
“Write the vision;
make it plain on tablets,
so he may run who reads it. (Hab 2:2, ESV)

The advent of personal computing and the Internet has brought much wider availability of God's Word, the Bible, in electronic form. Translations are available on-line in many languages and versions (Bible Gateway has an excellent list), and there are numerous search interfaces available. Other sites like crosswalk.com provide a variety of study tools, including support for the original Biblical languages of Greek and Hebrew. The creators of the new English Standard Version translation provide web service interfaces to enable network retrieval of Scripture passages. The Bible Technologies Group has a mission to "maximize production, distribution, access, use, impact, and preservation of the Bible and related texts from all time periods", including developing standards for the markup of Scripture and related texts (the Open Scripture Information Standard (OSIS)). Other groups are providing desktop applications, both commercial and open-source (like the CrossWire Bible Society), to aid personal Bible reading and study.

These and many other projects have made great progress in providing people with the opportunity to read and process the words of the Bible electronically. However, all are limited by the same fundamental restriction that affects all general uses of electronic text: they represent words, not meanings.

For example, suppose you want to search for Bible verses that address the sin of pride. Your only option is to imagine the various words that might express that concept in a particular translation. "pride" is an obvious choice: the adjectival version "proud" requires a little more thought. You'll probably need a thesaurus to come up with other synonyms like "haughty", "conceited", or "arrogant" (but don't forget "arrogance"). Only those with substantial Biblical experience are likely to think of figurative expressions like "puffed up". If you use the Message translation, you'll need to include "head" for 1 Timothy 3:6: "He must not be a new believer, lest the position go to his head ...": but of course, including a general word like this will bring in many other verses that have nothing to do with pride. On top of all this, any such search will mistakenly include a different sense of pride referring to legitimate pleasure in others: "I have great pride in you" (2 Cor 7:4, ESV).
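The problem can be seen in a minimal sketch of word-based search. The term list comes from the paragraph above, and the first two sample texts are the snippets quoted there. The third sample is an invented sentence (not Scripture) that stands in for the many unrelated verses containing "head"; the function name is likewise only illustrative.

```python
import re

# Search terms suggested in the paragraph above.
PRIDE_TERMS = ["pride", "proud", "haughty", "conceited",
               "arrogant", "arrogance", "puffed up", "head"]

# Two snippets quoted in the text, plus one invented sentence (not Scripture)
# standing in for unrelated verses that happen to contain "head".
SAMPLES = {
    "1 Tim 3:6 (Message)": "He must not be a new believer, lest the position go to his head",
    "2 Cor 7:4 (ESV)": "I have great pride in you",
    "invented example": "She rested her head on the table",
}

def word_search(samples, terms):
    """Return labels of samples that contain any term as a whole word or phrase."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [label for label, text in samples.items() if pattern.search(text)]

hits = word_search(SAMPLES, PRIDE_TERMS)
print(hits)
```

All three samples match, although only the first concerns the sin of pride: the second uses "pride" in its legitimate sense, and the third matches only because of the general word "head".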

The goal of the Semantically-Annotated New Testament Project (SemANT) is this:

To annotate the New Testament with a formal semantic representation based on open Internet standards, producing a sharable resource that supports practical applications like meaning-based automated processing and integration with other resources.

The sections below explain the various aspects of this vision.

Formal Semantic Representation

Human languages are designed for communication with intelligent beings, not machines. This is one reason that computer programs are specified in formalized languages like C++ or Java, not ordinary English. There are several well-known problems:

the same term may have multiple meanings which are only disambiguated by syntactic constructions, context, or world knowledge

people naturally interpret in context, and use conjunctions, pronouns, and other linguistic devices as shortcuts to streamline discussion of topics that have already been introduced. Capturing and transmitting this context to machines is extremely difficult.

the human language phenomena of tense, modality, and inter-clausal relationships (to name a few) are complex. Representing the content of the Bible will probably require going beyond existing linguistic research in areas like figurative language.

One can think of the result of the SemANT Project as a Bible translation like the Vulgate, the King James Version, and others. However, for this translation, the target language is a formal semantic representation which is computer readable, rather than a human language meant for consumption by humans. The SemANT Project will draw on existing models and research in knowledge representation, ontology specification, lexical semantics, and translation science to accurately represent the linguistic denotation of the original Greek text. This includes identifying precisely who pronouns refer to, representing ambiguity where it is an inherent part of the original language, and incorporating context which is critical to the linguistic meaning. SemANT will specify as much of the content as is clear from the original, but no more.

For example, Jesus said "If anyone loves me, he will keep my word..." (John 14:23, ESV). This passage describes two actions (loving Jesus and keeping his word), and a conditional relationship between them (IF anyone loves me, THEN he will keep my word). However, it does not describe with full precision what it means to love Jesus, or keep his word, or even what his word is: it merely represents these concepts, using terms in natural language. A good translation stops short of elaborating in further detail than the source text: that is the task of interpretation, not translation.
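As a rough illustration only, the structure of this passage might be sketched as follows. The class names, the concept labels ("love", "keep", "word-of-Jesus"), and the variable notation "?x" are all assumptions made for this example; they are not a proposed SemANT formalism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    predicate: str  # concept label, deliberately left uninterpreted
    agent: str
    patient: str

@dataclass(frozen=True)
class Conditional:
    antecedent: Event
    consequent: Event

# "anyone" is modeled as a variable ?x shared by both clauses,
# so that "he" in the second clause resolves to the same referent.
john_14_23 = Conditional(
    antecedent=Event("love", agent="?x", patient="Jesus"),
    consequent=Event("keep", agent="?x", patient="word-of-Jesus"),
)

def render(c: Conditional) -> str:
    """Render the conditional in a readable IF ... THEN ... form."""
    a, b = c.antecedent, c.consequent
    return (f"IF {a.predicate}({a.agent}, {a.patient}) "
            f"THEN {b.predicate}({b.agent}, {b.patient})")

print(render(john_14_23))
```

The sketch records the two actions, their shared participant, and the conditional link, but says nothing further about what loving Jesus or keeping his word involves, in keeping with the restraint described above.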

Some might object that it is not possible in principle to faithfully represent the meaning of the New Testament. However, if this were true, all English translations would also be inadequate, and every follower of Jesus would need to learn Greek so they could read the New Testament texts in their original language. But if it is possible to faithfully translate the Scripture into other languages, then it must be possible to understand the meaning deeply enough to represent it directly. Eugene Nida, a pioneer in the practice of Bible translation, points out that "since no two languages are identical, there can be no absolute correspondence between languages. Hence, there can be no fully exact translations. The total impact of a translation may be reasonably close to the original, but there can be no identity in detail" (cited in Venuti 2000, p 127). This restriction will be true of SemANT as well, but no more so than for translations into human languages, and perhaps (depending on the details of the semantic formalism) even less so.

Given the complexity of meaning represented in natural language, creating these semantic annotations will require careful manual work by those with deep understanding of the original texts. While existing electronic resources can be harnessed to speed the task, and appropriate tools will improve productivity and decrease the prospect of errors, there is simply no alternative to a great deal of detailed work by intelligent humans.

Sharable Resource

It is important that SemANT be non-proprietary and freely available to support the work of Bible translation, technology development, and personal study. This is only likely if the investment of effort to produce SemANT is not tied to commercial interests.

Just as important as avoiding commercial barriers to sharing is the requirement that SemANT support existing and emerging standards that enable use across the Internet. To this end, SemANT will build on the Semantic Web Activity of the World Wide Web Consortium (W3C), including XML as a syntactic standard for data interchange, RDF for ontology-based representation, and DAML/OWL for additional semantic expressiveness. The W3C vision includes extending these standards to include logic as well, in a layered approach illustrated in Tim Berners-Lee's talk at XML 2000 (from Mike Dean's DAML tutorial).

Practical Applications

While this effort will both draw upon and perhaps extend research results in the areas of Bible translation, knowledge representation, and computational linguistics, its focus is not primarily academic but practical. Just as translating the Bible into a new language makes the message available to new readers, it is my hope that the Semantic Bible will make the content of Scripture available in new ways. These might include

search applications that use semantics rather than words

new applications that bring together the text of the New Testament with other Bible resources

automated or semi-automated techniques for translation to human languages

Human readers will always want words (their natural means of communication), not semantic representations. But SemANT, machine-readable but keyed to the natural language text, will enable machines to organize, search, select, combine, and present the content of Scripture in new ways that are not possible based on words alone.
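A minimal sketch of what searching by meaning could look like, assuming verses have already been annotated with concept identifiers. The identifiers "pride.sinful" and "pride.legitimate" and the two annotations are hypothetical; they simply reuse the two pride examples discussed earlier.

```python
# Hypothetical concept annotations keyed by verse reference.
ANNOTATIONS = {
    "1 Tim 3:6": {"pride.sinful"},
    "2 Cor 7:4": {"pride.legitimate"},
}

def concept_search(annotations, concept):
    """Return verse references annotated with the given concept."""
    return sorted(ref for ref, concepts in annotations.items()
                  if concept in concepts)

print(concept_search(ANNOTATIONS, "pride.sinful"))
```

Because the lookup is keyed to concepts rather than surface words, the result does not depend on whether a given translation says "proud", "puffed up", or "go to his head", and the legitimate sense of pride is not returned.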

Challenges

The single most significant challenge is simply the amount of manual effort required to produce detailed semantic representations for human language. SemANT translators must be able to determine the original meaning and intent of the Biblical authors: while much of this knowledge can come from good English translations, it will also be essential to have some command of the original language as well as various scholarly tools. Translators must also grapple with a wide variety of subtle linguistic and semantic issues, work with a conceptual vocabulary of perhaps 10,000 terms, and understand how to create semantic representations in a formal structured language. Learning specialized computer tools will be important to streamline the work, aid collaboration, and ensure consistency.

Tom Pittman's early experience in the BibleTrans project was that "it takes an experienced Bible scholar a day or two of full-time work to encode a single verse." Given nearly 8000 verses in the New Testament and 225 work days per year, encoding one verse per day would require about 35 person years of effort to complete the task. While this estimate is daunting, it is by no means an impossible prospect. No doubt a hundred times this effort has already been invested in Bible translation by agencies like Wycliffe, and hundreds of person years have gone into the Cyc project, a broadly comparable task of manual semantic representation. However, my goal would be to find ways to bring the rate down to one verse per hour through careful design of the semantic representation and editing tools, for a total of about 5 person years for the actual translation task. Even with this optimistic view, another 5 person years would probably be required for additional development and management tasks, including

Converting existing language resources to a usable form
Selecting an appropriate semantic representation, and extending it as necessary
Developing translation practices appropriate to the task
Providing editing interfaces and other tools to support those who create the translation
Finding like-minded partners with the appropriate skills in Biblical Greek, linguistics, translation, and computer science
Providing mechanisms for distributing the work among collaborators
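The effort estimates above can be checked with a few lines of arithmetic. The verse count of 7,942 is the figure Tom Pittman gives for the New Testament, and 225 work days per year is the figure used above. The 8-hour work day and the reading of "a day or two" as 1.5 days per verse are assumptions made for this sketch.

```python
VERSES = 7942              # New Testament verse count cited by Tom Pittman
WORK_DAYS_PER_YEAR = 225   # figure used in the text
HOURS_PER_DAY = 8          # assumption: one work day is 8 hours

def person_years(days_per_verse):
    """Person years needed to encode every verse at the given rate."""
    return VERSES * days_per_verse / WORK_DAYS_PER_YEAR

estimates = {
    "one verse per day": person_years(1.0),
    "1.5 days per verse": person_years(1.5),
    "one verse per hour": person_years(1.0 / HOURS_PER_DAY),
}

for label, years in estimates.items():
    print(f"{label}: {years:.1f} person years")
```

Under these assumptions the three rates come to roughly 35, 53, and 4.4 person years respectively, consistent with the rounded figures of 35 and 5 used above.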

The effort required will be substantial, comparable at least in scope to that required to translate the New Testament into a new language. However, the value of the resulting resource is well worth the effort. Wycliffe alone plans to translate the Bible into hundreds of additional languages. If SemANT could improve the productivity of current Bible translation practice by even 20%, the return would be positive after 5 more ... In the case of SemANT, however, the key requirement is not expertise in an exotic and perhaps unwritten foreign language, but facility in creating accurate, consistent semantic representations. Furthermore, unlike traditional Bible translation, ...

Development Principles

Many other enterprises have developed key technical and procedural components that make this vision more feasible than ever before. Several key principles which build on others' experience in similar endeavors will be essential to accomplishing this ambitious vision.

Collaborate

It seems certain that I do not personally possess the wisdom, technical skill, or resources to complete this project by myself. Nor is that advisable, since "in abundance of counselors there is victory" (Proverbs 24:6). However, it is my earnest intention to make whatever beginnings I can toward accomplishing the vision, in the hope that others will be encouraged by this to work together and say "Let us rise up and build." (Nehemiah 2:18, ESV)

Once the fundamental approach and standards are clearly defined, and sufficient initial examples have been developed, it is possible the work can be divided along the lines of individual books of the New Testament, with some provision for editorial review to ensure consistency. The Internet provides new opportunities for collaborating despite geographic separation, and appropriate software can make such collaboration both simple and highly productive.

Use Open Standards

We will build on open standards for encoding information to the extent they are adequate for the task. This will enable as many others as possible to benefit from this work, as well as allowing them to investigate and evaluate progress. This includes Internet standards like XML for syntactic structure, RDF for ontologies, and DAML/OWL for additional semantic expressiveness, as well as other Semantic Web activities as they become mature.
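To give a flavor of what an RDF-style encoding involves, here is a small sketch that writes subject-predicate-object triples in the N-Triples line format using only the Python standard library. The namespace http://example.org/semant# and every resource and property name in it are hypothetical placeholders, not an existing vocabulary.

```python
BASE = "http://example.org/semant#"  # hypothetical namespace for illustration

def iri(name):
    """Wrap a local name as a full IRI in N-Triples angle-bracket form."""
    return f"<{BASE}{name}>"

def to_ntriples(triples):
    """Serialize (subject, predicate, object) name triples, one per line."""
    return "\n".join(f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in triples)

# Hypothetical triples describing the first clause of John 14:23.
triples = [
    ("John_14_23_clause1", "hasPredicate", "Love"),
    ("John_14_23_clause1", "hasPatient", "Jesus"),
    ("John_14_23_clause1", "isConditionOf", "John_14_23_clause2"),
]

output = to_ntriples(triples)
print(output)
```

A real effort would more likely rely on an established RDF toolkit and a carefully designed ontology; the point here is only that each statement reduces to a simple, exchangeable triple.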

Publish and Share Results

As results become available, they will be posted to the Internet where others can examine them, comment on them and offer critique, and use them freely for their own purposes. No commercial licenses, restrictive copyrights or proprietary conditions will limit the sharing of the fruits of this effort, other than those necessary to maintain the integrity of the work.

Make Incremental Progress

After initial development produces a draft specification, work can proceed on a few shorter sections of Scripture, perhaps beginning with a short epistle, and then a gospel. I do not believe all problems can be identified and solved in advance, other than by attempting the work and discovering them along the way. In particular, it may take several revisions to develop an adequate semantic representation language and supporting resources.

The layered semantic approach described above may define an appropriate model for incremental progress, starting with a first pass to provide conceptual annotation, for which significant work already exists. Subsequent passes can then proceed to higher-level semantic representation of clausal content, and then to relationships between, and more complex attributes of, combinations of clauses.

Focus on Practical Benefit

There are enough difficult technical problems in any task of translation or semantic representation to provide material for dozens of PhD dissertations. Nevertheless, the focus of this project is not theoretical, but practical, and pragmatic concerns will take priority in hopes of producing something of value, however incomplete it might be. We will focus on walking as far down the path as we can see, trusting more light will be given as we do so.

Build on Existing Work

Many brilliant minds have contributed years of research to the understanding of Scripture, the science of Bible translation, semantic formalisms, knowledge representation, etc. Where appropriate approaches or resources already exist, I intend to reuse and adapt them as much as possible, rather than invent for invention's sake, or for pride of ownership.

Resources and References

Tom Pittman's BibleTrans project, now discontinued, first got me thinking about these ideas. The SemANT Project is really half of what BibleTrans was intended to do, the other half being automatic "first draft" translation into target languages which currently lack a Bible translation. Tom's comment on the productivity of his approach (which I hope to investigate more fully):

Experience over an extended passage suggests that it takes a couple days to do one verse right, including peer review and corrections. As we gain more experience we expect both to get faster at this, and also to pick up on more of the subtle issues we are still leaving out at this early stage, so the net effect will still be that it takes an experienced Bible scholar a day or two of full-time work to encode a single verse. Do the math: there are 7,942 verses in the New Testament ...

If you do the math, assuming 1.5 days per verse and 225 work days per year, it would require about 50 person years of effort. This is a daunting estimate, but still not impossible.

OpenCyc is the open source version of Cyc, "the world's largest and most complete general knowledge base and commonsense reasoning engine." It includes an upper ontology of 6000 concepts, and the CycL language for formal knowledge representation.

The Louw-Nida Greek-English Lexicon of the New Testament uses a unique approach to representing lexical knowledge by organizing the words of the Greek New Testament into semantic domains. I hope, if possible, to obtain an electronic version that can provide the foundational set of conceptual terms for SemANT.

WordNet is a large on-line lexicon of English with synonym sets and other lexical relations. Here's an example of the WordNet information about the term "pride."

The Linguistic Data Consortium has been creating large human language resources in text and speech for over a decade, and has repeatedly demonstrated the feasibility of large-scale corpus annotation, as well as developing useful tools and standards.

Lawrence Venuti (ed.) (1992) Rethinking Translation: Discourse, Subjectivity, Ideology, London and New York: Routledge.
