Blogos

Saturday, July 15, 2006

I've been putting some of the data behind the Hyper-concordance into MySQL, in preparation for computing some statistics on lexical co-occurrence. Along the way, i've been collecting some numbers that i thought others might find interesting. There are a number of other sources for NT statistics: for example, this page from Prof. Felix Just shows words per verse per chapter per book (in the Greek NT).

What's different about the numbers below is that they're based on Hyper-concordance's approach, which groups various inflected forms under their base form (what linguists call a lemma). For example, 'saying', 'says', and 'said' are all pooled under 'say' (as it turns out, the most common lemma in the New Testament, with 1946 occurrences). In the example from the Hyper-concordance home page (Mark.4.24), there are 10 content lemmas (9 of them unique) in this verse of 30 words: "say", "pay", "attention", "hear", "measure" (twice), "use", "still", "more", "add".
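To make the pooling concrete, here is a minimal Python sketch of counting instances versus unique lemmas for that verse. The `LEMMAS` table and the `lemma_counts` helper are hypothetical, hand-built for this example; they are not the Hyper-concordance's actual mapping or code. The sketch also lowercases everything, so it ignores the capitalized "name words" discussed further down.

```python
from collections import Counter

# Hypothetical form-to-lemma table (an assumption for illustration only);
# a real mapping would cover thousands of forms.
LEMMAS = {
    "saying": "say", "says": "say", "said": "say",
    "measured": "measure",
    "added": "add",
}

def lemma_counts(tokens):
    """Pool inflected forms under their base form and count each lemma."""
    return Counter(LEMMAS.get(t.lower(), t.lower()) for t in tokens)

# Content words of Mark 4:24, with function words already removed.
tokens = ["said", "Pay", "attention", "hear", "measure", "use",
          "measured", "still", "more", "added"]

counts = lemma_counts(tokens)
print("instances:", sum(counts.values()))  # the "Count" column
print("unique:", len(counts))              # the "Unique" column
```

Forms missing from the table simply fall through as their own lemma, which is one reason the totals depend on how complete the mapping is.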

	Count	Unique
terms	73872	6333
base terms	73872	4526
  name words	6638	593
  non-name words	67234	3933
singletons	1444	1444
  name words	281	281

"Count" is the number of actual instances; "Unique" is the number of distinct values (which we could call the content vocabulary of the New Testament). The indented rows are subsets of the row above them. Some comments:

As a textual corpus, the New Testament is relatively small by modern lexico-statistical standards: only about 8000 verses, with a vocabulary of only a few thousand words. I take some consolation from the modest vocabulary size: i'm interested in creating lexical semantics for these terms, and while ~4500 terms is far from trivial, it's not so large as to be completely impossible to consider.
"Name words" here means nothing more than words written with a capital letter. They account for about 1 in 10 words, which is actually rather large. I've only found three words that occur both capitalized and uncapitalized. The two obvious ones are God/god and Lord/lord: can you guess the third? (answer at the bottom)
The ratio of terms to base terms is really a measurement of the compression induced by the lemmatization approach of the Hyper-concordance. I'd expect this difference to be much larger for a larger corpus.
"Singletons" here means words which occur exactly once (sometimes called hapax legomena). Clearly there can't be any variation in form here, so the instance and unique counts are the same. This number is actually rather small, probably another consequence of the small corpus size: as a rule of thumb, for many large and general corpora, roughly half the words occur only once (though that's words, not lemmas), a consequence of Zipf's Law.
The 11 most common words:
- say (1946 instances)
- God (1343)
- come (1120)
- all (1006)
- Jesus (964)
- go (749)
- man (745)
- Lord (657)
- see (622)
- no (569)
- know (543)
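The ratios mentioned in these comments can be checked with a little arithmetic on the table's figures. This is only a sketch: the variable names are made up for the example, and it assumes the singleton share is measured against the base-term vocabulary (4526), which the post does not state explicitly.

```python
# Figures copied from the table above.
total_count = 73872      # all instances
terms_unique = 6333      # distinct surface forms
base_unique = 4526       # distinct base terms (lemmas)
name_count = 6638        # instances of name words
name_unique = 593        # distinct name words
singletons = 1444        # base terms occurring exactly once
say_count = 1946         # instances of the most common lemma

# Compression from lemmatization: surface forms per base term.
compression = terms_unique / base_unique

# Name words as a share of instances and of the vocabulary.
name_share_instances = name_count / total_count
name_share_vocab = name_unique / base_unique

# Singletons as a share of the vocabulary (assumed denominator: base terms).
singleton_share = singletons / base_unique

# Share of all instances taken by "say".
say_share = say_count / total_count

print(f"forms per base term:       {compression:.2f}")
print(f"name words (instances):    {name_share_instances:.1%}")
print(f"name words (vocabulary):   {name_share_vocab:.1%}")
print(f"singletons (vocabulary):   {singleton_share:.1%}")
print(f"'say' (instances):         {say_share:.1%}")
```

On these figures the compression is about 1.4 forms per base term, name words are roughly 9% of instances and 13% of the vocabulary, and singletons are a little under a third of the vocabulary.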

Caveats:

This is all based on the ESV text; your mileage will certainly vary for other translations. You could argue (with some merit) that all such counts should be performed on the Greek text, rather than an English one. However, since the ESV takes an 'essentially literal' approach, i'd argue that the magnitude will generally be roughly correct, though of course the exact numbers will be slightly different.
Of course, these numbers for base forms depend on how you map forms back to their bases: i think my approach is credible, but certainly not perfect (i doubt 'perfect' here could even be well-defined).
The Hyper-concordance omits 44 function words that are very common and not very contentful (in information retrieval terms, stop words). I'd argue this is a good thing, but you might think otherwise.
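For readers unfamiliar with stop words, here is a small Python sketch of the filtering step. The `STOP_WORDS` set is a hypothetical stand-in chosen to fit one verse: the actual 44-word list is not given here, and `content_words` is an illustrative helper, not the Hyper-concordance's code. The verse string is intended to be the ESV text of Mark 4:24.

```python
# Hypothetical stop list (an assumption); the real list has 44 entries.
STOP_WORDS = {"and", "he", "to", "them", "what", "you",
              "with", "the", "it", "will", "be"}

def content_words(text):
    """Split on whitespace, trim punctuation, lowercase, drop stop words."""
    words = [w.strip('.,:;"!?').lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

verse = ('And he said to them, "Pay attention to what you hear: '
         'with the measure you use, it will be measured to you, '
         'and still more will be added to you."')

kept = content_words(verse)
print(len(verse.split()), "words in the verse")
print(len(kept), "content words:", kept)
```

A different stop list changes both the instance counts and the vocabulary size, which is why the choice matters for every number above.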

(The third word that occurs in both capitalized and uncapitalized forms is much less obvious, though you'll figure it out if you think a lot about it ...)

5:51:54 PM #


