Saturday, July 15, 2006

I've been putting some of the data behind the Hyper-concordance into MySQL, in preparation for computing some statistics on lexical co-occurrence. Along the way, i've been collecting some numbers that i thought others might find interesting. There are a number of other sources for NT statistics: for example, this page from Prof. Felix Just shows words per verse per chapter per book (in the Greek NT).

What's different about the numbers below is that they're based on the Hyper-concordance's approach, which groups various inflected forms under their base form (what linguists call a lemma). For example, 'saying', 'says', and 'said' are all pooled under 'say' (as it turns out, the most common lemma in the New Testament, with 1946 occurrences). In the example from the Hyper-concordance home page (Mark.4.24), there are 10 content lemmas (9 of them unique) in this verse of 30 words: "say", "pay", "attention", "hear", "measure" (twice), "use", "still", "more", "add".
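
This pooling is easy to sketch in a few lines of Python. The lemma table below is a toy stand-in, not the Hyper-concordance's actual mapping:

```python
from collections import Counter

# Toy lemma table: the Hyper-concordance's real mapping covers the whole
# vocabulary; these few entries are purely illustrative.
LEMMAS = {"saying": "say", "says": "say", "said": "say", "measures": "measure"}

def lemmatize(word):
    """Map an inflected form to its base form (lemma), defaulting to the word itself."""
    w = word.lower()
    return LEMMAS.get(w, w)

tokens = ["said", "says", "saying", "measure", "measures"]
counts = Counter(lemmatize(t) for t in tokens)
print(counts["say"])      # 3 -- the three forms pool under one lemma
print(counts["measure"])  # 2
```

A real lemmatizer would of course need to handle irregular forms and ambiguity, which a simple lookup table sidesteps.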

                    Count  Unique
terms               73872    6333
base terms          73872    4526
  name words         6638     593
  non-name words    67234    3933
singletons           1444    1444
  name words          281     281
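
The statistics in the table above can be sketched like this (the token list is invented data, not the actual ESV text, just to show how each column is derived):

```python
from collections import Counter

# Toy token list standing in for the lemmatized text.
tokens = ["say", "god", "say", "come", "mammon", "say", "god"]
counts = Counter(tokens)

instances = sum(counts.values())                       # the "Count" column
unique = len(counts)                                   # the "Unique" column
singletons = [w for w, n in counts.items() if n == 1]  # hapax legomena

print(instances, unique, len(singletons))  # 7 4 2
```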

"Count" is the actual instances, as opposed to the unique values (which we could call the content vocabulary of the New Testament). Some comments:

  • As a textual corpus, the New Testament is relatively small by modern lexico-statistical standards: only about 8000 verses, with a vocabulary of only a few thousand words. I take some consolation from the modest vocabulary size: i'm interested in creating lexical semantics for these terms, and while ~4500 terms is far from trivial, it's not so large as to be completely impossible to consider.
  • "name words" here means nothing more than a word written with a capital letter, about 1 in 10 words, which is actually rather large. I've only found three words that occur both capitalized and uncapitalized. The two obvious ones are God/god and Lord/lord: can you guess the other? (answer at the bottom)
  • the ratio of terms to base terms really measures the compression induced by the lemmatization approach of the Hyper-concordance. I'd expect this difference to be much larger for a larger corpus.
  • "singletons" here means words which occur exactly once (sometimes called hapax legomena). Clearly there can't be any variation in form here, so the instance and unique counts are the same. This is actually rather small, probably another consequence of the small corpus size: as a rule of thumb, for many large and general corpora, roughly half the words occur only once (though that's words, not lemmas), a consequence of Zipf's Law.
  • the 11 most common words:
    • say (1946 instances)
    • God (1343)
    • come (1120)
    • all (1006)
    • Jesus (964)
    • go (749)
    • man (745)
    • Lord (657)
    • see (622)
    • no (569)
    • know (543)


  • this is all based on the ESV text; your mileage will certainly vary for other translations. You could argue (with some merit) that all such counts should be performed on the Greek text rather than an English one. However, since the ESV takes an 'essentially literal' approach, i'd argue that the magnitudes will generally be roughly correct, though of course the exact numbers will differ slightly.
  • Of course, these numbers for base forms depend on how you map forms back to their bases: i think my approach is credible, but certainly not perfect (i doubt 'perfect' here could even be well-defined).
  • the Hyper-concordance omits 44 function words that are very common and not very contentful (in information retrieval terms, stop words). I'd argue this is a good thing, but you might think otherwise.
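
Stop-word filtering like that last point describes is straightforward to sketch. The stop list below is an assumption for illustration only, not the Hyper-concordance's actual list of 44 function words:

```python
# Toy stop list: the real Hyper-concordance omits 44 function words;
# this particular set is invented for illustration.
STOP_WORDS = {"the", "and", "of", "a", "to", "in", "that"}

def content_tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(content_tokens("pay attention to what you hear"))
# ['pay', 'attention', 'what', 'you', 'hear']
```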

(The second word that occurs in both capitalized and uncapitalized forms is much less obvious, though you'll figure it out if you think a lot about it ...)

5:51:54 PM