Blogos

The New Testament Hyper-concordance

This began as a programming exercise to try some corpus linguistics techniques with the Open Scripture Information Standard (OSIS), a recent XML standard for encoding Bible texts. I wanted to see what i could learn from more thoroughly connecting the words of Scripture together. You can jump right in and explore through the index: this article provides background about what, why, and how.

What's a "hyper-concordance"?

The basic idea is to navigate the space of Scripture directly using words. Most Scripture websites have a search box where you enter a word to find verses that use that word. For example, searching the English Standard Version New Testament for the word "pots" finds two verses, Mark 7:4 and Revelation 2:27 (you need to use the advanced search and select "Exact matches only"). From the standpoint of connecting information, this provides a link from a single word to one or more verses of Scripture.

Taking this idea one step further, given the text of the verse, you can just embed a hyperlink from the word in question to other verses, preserving the context. Now here's where the idea takes off: instead of just hyperlinking one word, suppose every word is hyperlinked? This more tightly connects the information and gets you directly from the context of one verse to another with similar content (because of similar words). With some special processing to index the words, every word can link to a list of verses, each word of which is in turn hyperlinked to others, each word of which ... you get the idea.

Here's an example from the page for "Scripture" (the links are live into Hyper-concordance):

2Tim.3.16 All scripture is inspired by God and profitable for teaching, for reproof, for correction, and for training in righteousness,

The word "scripture" isn't hyper-linked, since that would take you back where you already are. This verse is the only place the words "reproof" and "correction" occur in the New Testament, so there's no benefit in linking these: you'd only get to the same verse. The other unlinked words are high-frequency function words: they could be linked, but there would be little added value, and it would take a lot more space (the entire hyper-concordance as static HTML only amounts to about 30 MB).
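The linking rule just described can be sketched in a few lines. This is a minimal illustration in Python, not the Perl the real code is written in; the stopword list, the `render_verse` name, and the `../i/inspire.html` style of link are all assumptions made for the example.

```python
from html import escape

# Hypothetical stopword list; the article does not give the real one.
STOPWORDS = {"all", "is", "by", "and", "for", "in"}

def render_verse(tokens, current_term, base_form, verses_by_term):
    """Return HTML for one verse, linking only the words worth linking."""
    out = []
    for tok in tokens:
        word = tok.lower()
        term = base_form.get(word, word)
        linkable = (
            tok.isalpha()                               # skip spaces and punctuation
            and term != current_term                    # would lead back to this page
            and term not in STOPWORDS                   # high-frequency function word
            and len(verses_by_term.get(term, ())) > 1   # occurs in another verse too
        )
        if linkable:
            # Assumed layout: one page per term, grouped by initial letter.
            out.append(f'<a href="../{term[0]}/{term}.html">{escape(tok)}</a>')
        else:
            out.append(escape(tok))
    return "".join(out)
```

Called on the page for "scripture", a word like "inspired" becomes a link to the page for "inspire", while "scripture" itself, stopwords, and words found in only one verse stay plain text.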

Inflected verbs and plural nouns are linked to their base forms (in this example, "inspired" -> "inspire", "training" -> "train"). Most other Scripture search engines i've seen either match exactly (treating "inspired" and "inspire" as two different words), match substrings (for "pot", this has the peculiar result of matching "spots" and "Mesopotamia"!), or match from the beginning of the word ("pot" matches "pots", "potter", etc.). I wanted to try to do a better job of matching the real dictionary form of words (more about that below).
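To make the difference between these matching strategies concrete, here is a small Python sketch (the real code is Perl). The word list and the `BASE_FORM` table are made up for the illustration.

```python
def exact(query, words):
    """Match only the identical word."""
    return [w for w in words if w.lower() == query]

def substring(query, words):
    """Match the query anywhere inside a word."""
    return [w for w in words if query in w.lower()]

def prefix(query, words):
    """Match from the beginning of the word."""
    return [w for w in words if w.lower().startswith(query)]

def by_base_form(query, words, base_form):
    """Match words whose dictionary form is the query."""
    return [w for w in words if base_form.get(w.lower(), w.lower()) == query]

# Illustrative data only.
WORDS = ["pot", "pots", "potter", "spots", "Mesopotamia"]
BASE_FORM = {"pots": "pot", "spots": "spot"}
```

With this data, `substring("pot", WORDS)` returns all five words, `prefix("pot", WORDS)` returns "pot", "pots", and "potter", and `by_base_form("pot", WORDS, BASE_FORM)` returns just "pot" and "pots".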

I've seen this approach used for dictionaries like the HyperDictionary, but this is the only example of a Scripture concordance built this way that i know of (email me if you know of another, so i can give credit where due).

Creating the Hyper-concordance

The hyper-concordance is programmed in Perl using the XML::Twig module for XML parsing of the OSIS sources: you're free to download the source, and it would be easy to adapt to other uses. The input is simply the OSIS version of the RSV text, with the Old Testament portion removed by hand. The parsing is completely trivial: tokenizing turned out to be harder, though. The RSV indicates the pronunciation of proper names like She-al'ti-el: i didn't want to treat this as 5 tokens, but i did want to split "self-control" into its components. In this one case, i modified the original text to remove the pronunciation marks: otherwise, everything came straight out of the original. I tokenized words, white space, and punctuation alike, so that concatenating the raw tokens would recreate the original text exactly.
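A lossless tokenizer of this kind can be sketched with one regular expression. This is a Python approximation, not the actual Perl code, and the exact token pattern (ASCII letters, with an internal apostrophe kept inside a word) is an assumption.

```python
import re

# Three kinds of token: a word (letters, optionally with internal apostrophes),
# a run of white space, or any single other character (punctuation, digits).
# Every character falls into exactly one kind, so nothing is dropped.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\s+|[^A-Za-z\s]")

def tokenize(text):
    """Split text into word, white space, and punctuation tokens."""
    return TOKEN_RE.findall(text)

def detokenize(tokens):
    """Concatenating the raw tokens recreates the original text."""
    return "".join(tokens)
```

With this pattern "self-control" comes out as "self", "-", "control", and an unedited "She-al'ti-el" comes out as five tokens, which shows why the pronunciation marks get in the way.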

The algorithm simply goes through each word of each verse (other than stopwords), and creates a hash table mapping base forms to the verses they occur in. A static HTML page is generated for each term and placed in a directory under the initial letter to keep things more manageable. The same approach is used to generate the index page, which i find interesting all on its own (but then, i enjoy reading dictionaries too!).
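In outline, the indexing step might look like the following Python sketch (the original is Perl). The stopword list, the function names, and the `s/scripture.html` path scheme are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical stopword list; the real one is not given in the article.
STOPWORDS = {"all", "is", "be", "by", "and", "for", "in", "the"}

def build_index(verses, base_form):
    """Map each base form to the verses it occurs in.

    verses: dict of verse reference -> list of tokens, in canonical order.
    base_form: dict of inflected form -> dictionary form.
    """
    index = defaultdict(list)
    for ref, tokens in verses.items():
        for tok in tokens:
            if not tok.isalpha():
                continue                      # white space or punctuation
            word = tok.lower()
            if word in STOPWORDS:
                continue
            refs = index[base_form.get(word, word)]
            if not refs or refs[-1] != ref:   # list each verse once per term
                refs.append(ref)
    return index

def page_path(term):
    """One static page per term, in a directory named for its initial letter."""
    return f"{term[0]}/{term}.html"
```

Generating the site is then a loop over `index.items()`, writing one page per term to `page_path(term)`; the sorted keys of the same table give the index page.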

Linguistic Issues

As indicated above, i wanted to map inflections and plurals back to their dictionary forms. As a practicing computational linguist, it was tempting to use language processing smarts to figure this out, and that's still one way to go. But there's one significant advantage of the New Testament as a subject of corpus analysis: the vocabulary is fixed, and, by most modern standards, remarkably small. I was surprised to find the 8k verses of the New Testament amount to only about 175k words, with a vocabulary size of less than 7k (the exact numbers depend on how you count, of course). Here's the entire vocabulary list (including inflected verbs and plurals) with counts: as is typical of most natural language problems, vocabulary items aren't normally distributed, but follow Zipf's Law, a topic for another day. By comparison, the University of Pennsylvania Treebank, the foundation of most recent work on trained parsers, contains 1M words of Wall Street Journal text, and these days corpora of 10s of millions of words aren't unusual. With a vocabulary this small, it's entirely feasible to review the whole list in a reasonable amount of time.
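Counts like these are easy to reproduce once the text is tokenized. A minimal Python sketch follows (again, not the original Perl); the function names are made up, and the exact totals will depend on how you tokenize and fold case.

```python
from collections import Counter

def vocabulary_counts(tokens):
    """Count word tokens, ignoring case, white space, and punctuation."""
    return Counter(t.lower() for t in tokens if t.isalpha())

def rank_table(counts):
    """Return (rank, word, count) rows, most frequent word first."""
    return [(rank, word, n)
            for rank, (word, n) in enumerate(counts.most_common(), start=1)]
```

`sum(counts.values())` gives the corpus size and `len(counts)` the vocabulary size. Under Zipf's Law, the count falls off roughly in proportion to 1/rank, so a handful of words at the top of the table account for a large share of the text.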

So i elected to simply create a list mapping inflected forms and plurals to their bases. I probably missed a few, but it only took a couple of hours. Of course, i'd have to re-do this work for another translation, which would make a more principled approach more attractive. But i doubt any morphological parser for English would be able to tell me that "besought" is the past tense of "beseech"!
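The trade-off is easy to see side by side. The Python sketch below compares a hand-built table with a deliberately naive suffix stripper; the table entries come from examples in this article, and `naive_strip` is a strawman written for this comparison, not a real morphological parser.

```python
# A few entries of the kind the hand-built list contains.
BASE_FORM = {
    "besought": "beseech",
    "inspired": "inspire",
    "training": "train",
    "pots": "pot",
}

def base_form(word):
    """Look the word up in the table; unlisted words are their own base."""
    w = word.lower()
    return BASE_FORM.get(w, w)

def naive_strip(word):
    """Strawman alternative: chop off a common suffix, with no dictionary."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:-len(suffix)]
    return w
```

The stripper gets "pots" and "training" right, turns "inspired" into "inspir", and leaves "besought" untouched, while the table handles all four at the cost of listing each form by hand.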

Another problem with this approach is that it's hard to know how far to go. Plurals are easy, though i didn't map those without a corresponding singular (like "pangs"). I wasn't completely consistent with words that could be either derived nouns or -ing verbs: should "saying" go back to "say", or stand on its own? And in a few genuinely ambiguous cases, i wound up conflating things i would rather have kept separate: is "lives" the plural of "life", or the inflected form of "live"? It requires more serious processing and understanding context to get these correct. In general i preferred to group things rather than leave them separate, unless there was good reason to do otherwise. I also decided to stop short of mapping comparatives and superlatives like "better" and "best" back to "good", derived nouns like "happiness" back to "happy", and so forth.

Conclusions

Is this useful? Well, i hope so, but i'm not sure (at least i find it interesting). This is not a general Bible search tool, and i have no plans to make it one. It doesn't cover the whole Bible (and would be unwieldy for some terms if it did), doesn't let you restrict search to specific sections, etc. But i'm already thinking of ways to extend this. For example, you could view the links between verses as defining a graph, with the strength of relationship determined by how many words are linked in each, and how frequently these words occur in the corpus. It might make an interesting link diagram, though i don't have the software tools to generate one. Another possibility is to add links to each term for other terms with "close" semantics: a topic for another day, though the combination of Nave's Topical Bible and statistical learning techniques might provide for some interesting experiments.
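As a rough illustration of the graph idea, the Python sketch below scores each pair of verses by the terms they share, counting rare terms for more than common ones. The logarithmic weighting is an assumption (borrowed from the usual inverse-document-frequency trick), not something worked out for the hyper-concordance, and `verse_graph` is a made-up name.

```python
import math
from collections import defaultdict
from itertools import combinations

def verse_graph(index, n_verses):
    """Return {(verse_a, verse_b): weight} for verses sharing indexed terms.

    index: dict of term -> list of verse references containing it.
    n_verses: total number of verses in the corpus.
    """
    weights = defaultdict(float)
    for term, refs in index.items():
        refs = sorted(set(refs))
        if len(refs) < 2:
            continue                            # a one-verse term links nothing
        rarity = math.log(n_verses / len(refs)) # rarer terms count for more
        for a, b in combinations(refs, 2):
            weights[(a, b)] += rarity
    return dict(weights)
```

A term found in many verses contributes a pair for every two of them, so very common terms would need to be dropped (as the stopwords already are) to keep the graph manageable.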

Having created this, i'm very interested to know what people think about it. Please email me with feedback, errors, or constructive criticism.
