Thursday, August 31, 2006

I've moved to a new blogging platform (goodbye Radio Userland, hello WordPress).

But if you read through an RSS aggregator (this is really important, so pay attention):

If you read directly from the website, everything will work as before at my preferred URL, http://www.semanticbible.com/blogos/. The new site includes several syndication buttons that make it easy to add Blogos to your Bloglines, MyYahoo!, or other readers.

If you have any problems with this, please send me (sean) an email at semanticbible daht com. I don't want to lose any readers in the transition (there aren't that many to start with!).


7:49:44 AM
 Tuesday, August 08, 2006

In a comment on my recent thoughts on semantic search,  Matt asks a reasonable question: "Wouldn't Louw-Nida help?" Since i've recently gotten a copy of Logos 3 Scholar's Library: Silver (i'll have a lot more to say about that later, but here's the preview: it's a fantastic resource), i tried it out. For this particular question, the answer appears to be no.

Humility is under 88/G, Moral and Ethical Qualities and Related Behavior/Humility (note this is a conceptual label for the passage: the word humility doesn't actually occur). Related words here would include:

  • lord (as in "lord it over"): 37/D, Control, Rule/Rule, Govern.
  • exercise authority: same domain and subdomain
  • servant/serve: 35/B, Help, Care For/Serve
  • slave: either the same subdomain as "lord [it over]", the more figurative sense, or more literally as 87/E, Status/Slave, Free

This isn't too surprising: Louw-Nida is a lexical resource, but the fundamental issue here (and the point of my post) is that there are lots of significant semantic concepts above the level of words. That's exactly what makes notions like "topic" slippery in practice.
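
To make the mismatch concrete, here's a minimal sketch in Python that uses only the domain codes listed above. The dictionary layout and the function names are my own for illustration: they aren't part of Logos or of Louw-Nida itself.

```python
# Louw-Nida domain/subdomain codes for the key words of Matt.20.25-28,
# copied from the list above. The data layout is hypothetical.
PASSAGE_DOMAINS = {
    "lord": {"37/D"},
    "exercise authority": {"37/D"},
    "servant/serve": {"35/B"},
    "slave": {"37/D", "87/E"},
}

HUMILITY = "88/G"  # Moral and Ethical Qualities and Related Behavior/Humility


def words_in_subdomain(domains_by_word, code):
    """Words whose codes include the exact domain/subdomain code."""
    return sorted(w for w, codes in domains_by_word.items() if code in codes)


def words_in_domain(domains_by_word, code):
    """Words sharing at least the top-level domain (the number before the slash)."""
    top = code.split("/")[0]
    return sorted(
        w for w, codes in domains_by_word.items()
        if any(c.split("/")[0] == top for c in codes)
    )


print(words_in_subdomain(PASSAGE_DOMAINS, HUMILITY))  # no word is filed under 88/G
print(words_in_domain(PASSAGE_DOMAINS, HUMILITY))     # nor anywhere in domain 88
print(words_in_subdomain(PASSAGE_DOMAINS, "37/D"))    # the "rule, govern" words
```

Both humility lookups come back empty, while the 37/D lookup finds three of the four entries: a search by lexical domain lands on "ruling", not on the topic the passage is actually about.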


9:09:23 AM
 Saturday, July 29, 2006

I'm preparing a new version of the Composite Gospel Index pages, to standardize around the ESV text, and hopefully provide both more usability and more visual appeal. Designing an interface for this data poses some interesting challenges. There's a wealth of different attributes available, and while some (like traditional verse references) are familiar to most Bible students, i'm hoping to get outside the box a bit and do some novel things.

The whole point of the Composite Gospel is to provide a different way to look at the story of Jesus' life, in particular one that is more oriented around stories, many of which are common to multiple Gospels, and to show how they fit into the whole. So i'm hoping to reinforce this in the new interface. Right now there are two ways to access the Composite Gospel, the typical entry point being the Pericope Index, a traditional single static page listing the pericope ID, title, and references, with hyperlinks to the content pages. It's got a number of faults:

  • as soon as you click through to an individual pericope (here's Pericope 118, Jesus sends out the twelve disciples), you're back to looking through a keyhole, without the view of the whole sequence. It would be better to have a view of the whole index alongside the content for a selected pericope.
  • there's no help for finding pericopes with specific titles or Scriptural references (other than browser search)
  • while you can easily see how many sources are behind a given pericope (it's just a matter of how many columns are filled in its row in the table), the significance (as evidenced by size) is buried. Pericope 153: Jesus teaches about forgiving others is only two verses: the next one, Pericope 154: Jesus tells the parable of the unforgiving debtor, has 13 verses. But there are no visual clues to this in the index.
  • let's face it, it's just ugly :-/

The individual pages themselves have different navigational elements: next/previous pericope, and also next/previous for a given Gospel author. These are okay as far as they go: my major complaint is they don't go far enough. I'm also hoping to add more supplemental information:

  • other pericopes with similar topics or content. For example, though i consider the cleansing of the temple early in John (Pericope 031: Jesus clears the temple) to be different from the one during the Passion Week (Pericope 249: Jesus clears the temple again), clearly one ought to have a "see also" link to the other.
  • a list of names in the pericope in view, with navigation to other pericopes which mention the same name

It will be a while before i can do all this, though!

I've been searching for some time for the right visual metaphor (and corresponding interface code) to provide a much more visual index to replace the current text-heavy index. It would be great if you could scan a clear visualization of which authors covered a particular story, and how much content there is for it (number of tokens). Likewise, when you've selected an individual pericope, you should have a clear view of where it fits into the entire sequence.

[Figure: pericopes-sources-by-token-count.jpg (number of sources by token count)]

In preparing for this, i got interested in the distribution of sources (an individual author's version) by their size. This graph shows that, binned in groups of 10: the black trend line smooths this a little further with a moving average (window of 3). There's quite a bit of variety (no surprise), ranging from a single source with just 9 tokens (Luke's description of the beginning of Jesus' Galilean preaching ministry, "And he was preaching in the synagogues of Judea.", Pericope 048: Jesus preaches throughout Galilee), to a single source with 566 tokens (Pericope 119: Jesus prepares the disciples for persecution, found in Matthew). But there's some approximation of a normal distribution (with an elongated tail on the high side), and clearly the bulk have from 30 to perhaps 270 tokens, with the bins near the median holding around 30-40 sources each (since i'm binning, this number itself isn't very meaningful). This suggests the cases i need to optimize for: i should be able to fit up to about 270 token displays on something close to a single page view (these days that really means 1024 x 768 pixels, though surprisingly i still get 15-20% of my visits from people with 800x600 displays).
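
For anyone who wants to reproduce this kind of graph, here's a rough Python sketch of the two steps: binning token counts in groups of 10, then smoothing with a 3-wide moving average. The sample numbers are made up (only 9 and 566 come from the text above), and the centered window is an assumption: a trailing window would be just as reasonable.

```python
from collections import Counter


def bin_counts(token_counts, width=10):
    """Number of sources per bin; keys are the lower edge of each bin."""
    bins = Counter((n // width) * width for n in token_counts)
    if not bins:
        return {}
    # Fill in empty bins so the x-axis has no gaps.
    return {edge: bins.get(edge, 0)
            for edge in range(min(bins), max(bins) + width, width)}


def moving_average(values, window=3):
    """Centered moving average; at the ends, average whatever neighbors exist."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical token counts per source, just to show the shape of the computation.
sample = [9, 34, 37, 41, 58, 62, 65, 120, 566]
histogram = bin_counts(sample)
trend = moving_average(list(histogram.values()))

print(histogram[30], histogram[60])  # sources with 30-39 and 60-69 tokens
print(trend[:5])
```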

Ultimately, i'd love to have a rich treemap interface to support exploring the data in a variety of different ways (this was the substance of my presentation at the Society of Biblical Literature last year). As publisher Tim O'Reilly notes in a recent post, treemaps are really made to be interfaces, not graphs: their power lies in your ability to interact with them to explore the data. Unfortunately, i don't know how to do this live on my website: i don't have permission to host the Treemap software i use myself from the University of Maryland, and i don't know of a good substitute (O'Reilly's post is about a Rails implementation, but that's outside my current scope).


2:47:48 PM
 Tuesday, July 18, 2006

But Jesus called them to him and said, "You know that the rulers of the Gentiles lord it over them, and their great ones exercise authority over them. It shall not be so among you. But whoever would be great among you must be your servant, and whoever would be first among you must be your slave, even as the Son of Man came not to be served but to serve, and to give his life as a ransom for many." (Matt.20.25-28)

I've been thinking about topic labels for Scripture passages lately: a deceptively simple idea that's quite hard to nail down. The notion of topic includes many different things: a person might be a topic (Jesus talks about John the Baptist in Luke.7.24-30), but every mention of a person probably isn't a topic in quite the same sense (the same passage mentions the Pharisees, but the passage isn't really about them, it simply mentions them). Sometimes key words and phrases are topics ("luxury" is a word in the same passage, and a relatively distinct one at that: it only occurs 4 times in the New Testament). But if that's what you mean by a topic, then word searches will usually find what you want. The toughest cases (and therefore the most interesting ones) are when you don't have a distinctive lexical item for a topic decision.

The classic Librarian Problem is that whatever i call a topic may have different meaning to someone else, or fall outside the conceptual schema they're using for searching (Shirky has a nice overview of this). The kind of folksonomic tagging popularized by del.icio.us works well at a personal level (i know what my "facets" tag means to me, even though you may not), and it works well at the larger level because enough others might happen to use the same tags that aggregation adds value. I expect this kind of tagging for Scripture will start to show up in some interesting ways in the next year under the Web2.0 rubric.

Here's what got me thinking about this: i was reading Humility by Andrew Murray this morning (highly recommended, by the way), and he discusses the passage above as an example of Jesus' teaching about humility. I'd agree (as would Nave's, and most other topic-oriented indexes): but if you wanted to label such passages in some automated fashion, what evidence would you use? The words "humble" and "humility" are nowhere to be found, and neither are their direct antonyms like "proud". Jesus mentions the contrasting examples of Gentiles who "lord it over them" and others who "exercise authority over them": but these complex semantic constructs aren't easy to take apart (and the first one isn't very typical English: the Contemporary English Version's translation of "order their people around" is arguably more natural). Certainly being the servant of others implies the personal trait of humility, but the relationship is quite abstract.

Just another argument for why this kind of annotation of Scripture will probably be done the old-fashioned way (by hand) for the foreseeable future ...


7:49:55 AM
 Saturday, July 15, 2006

I've been putting some of the data behind the Hyper-concordance into MySQL, in preparation for computing some statistics on lexical co-occurrence. Along the way, i've been collecting some numbers that i thought others might find interesting. There are a number of other sources for NT statistics: for example, this page from Prof. Felix Just shows words per verse per chapter per book (in the Greek NT).

What's different about the numbers below is that they're based on Hyper-concordance's approach, which groups various inflected forms under their base form (what linguists call a lemma). For example, 'saying', 'says', and 'said' are all pooled under 'say' (as it turns out, the most common lemma in the New Testament, with 1946 occurrences). In the example from the Hyper-concordance home page (Mark.4.24), there are 10 content lemmas (9 of them unique) in this verse of 30 words: "say", "pay", "attention", "hear", "measure" (twice), "use", "still", "more", "add".
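
Here's a small Python sketch of that counting for Mark.4.24, using the ESV wording of the verse. The stop-word list and the form-to-lemma map are stand-ins that cover only this one verse: they are not the Hyper-concordance's actual data, and lowercasing everything (as this sketch does) would blur the name/non-name distinction in a real run.

```python
import re
from collections import Counter

# Hypothetical stop-word list: just the function words needed for this verse.
STOP_WORDS = {"and", "he", "to", "them", "what", "you",
              "with", "the", "it", "will", "be"}

# Hypothetical map from inflected forms to base forms (lemmas).
LEMMAS = {"said": "say", "measured": "measure", "added": "add"}

VERSE = ('And he said to them, "Pay attention to what you hear: with the '
         'measure you use, it will be measured to you, and still more will '
         'be added to you."')


def tokenize(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def content_lemmas(text):
    """Drop stop words, then map each remaining form to its base form."""
    return [LEMMAS.get(w, w) for w in tokenize(text) if w not in STOP_WORDS]


words = tokenize(VERSE)
lemmas = content_lemmas(VERSE)
counts = Counter(lemmas)

print(len(words), len(lemmas), len(counts))  # words, content lemmas, unique lemmas
print(counts["measure"])                     # "measure" and "measured" pooled
```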

                     Count   Unique
terms                73872     6333
base terms           73872     4526
  name words          6638      593
  non-name words     67234     3933
singletons            1444     1444
  name words           281      281

"Count" is the actual instances, as opposed to the unique values (which we could call the content vocabulary of the New Testament). Some comments:

  • As a textual corpus, the New Testament is relatively small by modern lexico-statistical standards: only about 8000 verses, with a vocabulary of only a few thousand words. I take some consolation from the modest vocabulary size: i'm interested in creating lexical semantics for these terms, and while ~4500 terms is far from trivial, it's not so large as to be completely impossible to consider.
  • "name words" here means nothing more than a word written with a capital letter, about 1 in 10 words, which is actually rather large. I've only found three words that occur both capitalized and uncapitalized. The two obvious ones are God/god and Lord/lord: can you guess the other? (answer at the bottom)
  • the ratio of terms to base terms is really a measurement of the compression induced by the lemmatization approach of the Hyper-concordance. I'd expect this difference to be much larger for a larger corpus.
  • "singletons" here means words which occur exactly once (sometimes called hapax legomena). Clearly there can't be any variation in form here, so the instance and unique counts are the same. This is actually rather small, probably another consequence of the small corpus size: as a rule of thumb, for many large and general corpora, roughly half the words occur only once (though that's words, not lemmas), a consequence of Zipf's Law.
  • the 11 most common words:
    • say (1946 instances)
    • God (1343)
    • come (1120)
    • all (1006)
    • Jesus (964)
    • go (749)
    • man (745)
    • Lord (657)
    • see (622)
    • no (569)
    • know (543)
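
The ratios behind these comments are easy to check from the table. Here's a short Python sketch: the constants are copied from the table above, and i'm assuming the singleton row counts lemmas (base terms) rather than surface forms. The toy counter at the end is invented, just to show how singletons and a "most common" list fall out of lemma counts.

```python
from collections import Counter

# Figures copied from the table above.
TOTAL_COUNT = 73872      # all content-word instances
TERMS_UNIQUE = 6333      # unique surface forms
BASE_UNIQUE = 4526       # unique base forms (lemmas)
NAME_COUNT = 6638        # instances of capitalized words
SINGLETONS = 1444        # words occurring exactly once

compression = TERMS_UNIQUE / BASE_UNIQUE    # surface forms per lemma
name_share = NAME_COUNT / TOTAL_COUNT       # fraction of instances that are names
singleton_share = SINGLETONS / BASE_UNIQUE  # fraction of lemmas occurring once

print(round(compression, 2), round(name_share, 2), round(singleton_share, 2))


def singletons(lemma_counts):
    """Lemmas that occur exactly once (hapax legomena)."""
    return sorted(lemma for lemma, n in lemma_counts.items() if n == 1)


# Invented toy data, not real New Testament counts.
toy = Counter(["say", "say", "come", "go", "go", "see"])
print(toy.most_common(2))
print(singletons(toy))
```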

Caveats:

  • this is all based on the ESV text: your mileage will certainly vary for other translations. You could argue (with some merit) that all such counts should be performed on the Greek text, rather than an English one. However, since the ESV takes an 'essentially literal' approach, i'd argue that the magnitude will generally be roughly correct, though of course the exact numbers will be slightly different.
  • Of course, these numbers for base forms depend on how you map forms back to their bases: i think my approach is credible, but certainly not perfect (i doubt 'perfect' here could even be well-defined).
  • the Hyper-concordance omits 44 function words that are very common and not very contentful (in information retrieval terms, stop words). I'd argue this is a good thing, but you might think otherwise.

(The third word that occurs in both capitalized and uncapitalized forms is much less obvious, though you'll figure it out if you think a lot about it ...)


5:51:54 PM
 Tuesday, July 04, 2006

I noticed quite a few errors in the server log for hyper-concordance hits where the folder name (an initial) was capitalized: this suggested that some browser/user agents were having case-sensitivity issues. So i've posted a minor revision to upper-case the folder names: if you have trouble, please let me know.


7:17:41 PM
 Wednesday, June 28, 2006

I'm still working away (far too slowly for my impatient tastes) on the first complete version of New Testament Names, a semantic knowledgebase of named things in the New Testament and their relationships: you can get a sense of it from these representations of a browser prototype. But i'm also looking beyond to what will come next (one reason these projects take too long! I keep starting new ones ...).

After cataloging the names and their information, clearly the next step is to add Scriptural references. The first pass here can be done automatically (it's largely just string matching). But of course, there are a lot of different Johns, Marys and Simons in the New Testament, and it's a lot more useful to know which one is which: this is something people do so easily they hardly recognize it, but it can be surprisingly tough to do automatically.

As an example, there are 36 mentions of "Joseph" in the ESV NT text (in 35 verses: Acts.7.13 mentions him twice). Obviously in the birth narratives of Jesus, Joseph refers to Jesus' (earthly) father: by my count, that's 14 of the 36 references. Joseph of Arimathea is mentioned in the Passion narratives, since he provided a tomb for Jesus to be buried in: 7 of the 36 are this Joseph. As an aside, here's a case that requires a little more than string matching: Luke.23.50, "a man named Joseph, from the Jewish town of Arimathea". You'd need to be pretty smart about the use of context to figure out which Joseph this is. 

For other cases like John.4.5, where the mention of Joseph refers to the Old Testament figure, only real human understanding of the text can determine the correct reference. This Joseph is more frequent outside the Gospels (though he's in Luke's genealogy), 10 of the NT Josephs in all. There are also two references in Acts to the Joseph whom the apostles nicknamed Barnabas, and then a few others: Jesus' brother (Matt.13.55, perhaps also Matt.27.56), and two Josephs in Luke's genealogy (Luke.3.24 and Luke.3.30).

My guess is maybe 80% of the name references in the New Testament can be easily disambiguated, either because they're not ambiguous in the first place, or because simple heuristics clarify them. But for the rest, somebody will actually have to look at them and make a decision (in a very few cases, some really hard decisions). Maybe Amazon's Mechanical Turk is an appropriate mechanism: the ESV blog reports on an interesting experiment here.
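
As a rough illustration of what "string matching plus simple heuristics" might look like, here's a Python sketch. The cue words, labels, and function names are all hypothetical, and apart from the Luke.23.50 phrase quoted above, the sample snippets are abbreviated stand-ins rather than full verse text. The point is only that a cue word in the same verse settles some cases, and everything else falls through to a human.

```python
import re

# Hypothetical cue words for two of the Josephs: if exactly one set matches
# the verse, we accept that referent; otherwise we give up.
CUES = {
    "Joseph of Arimathea": {"Arimathea"},
    "Joseph called Barnabas": {"Barnabas"},
}


def count_mentions(name, text):
    """Count whole-word occurrences of a name (possessives like Joseph's count too)."""
    return len(re.findall(rf"\b{re.escape(name)}\b", text))


def guess_referent(name, text):
    """Return a referent label, or None when the verse needs a human decision."""
    if count_mentions(name, text) == 0:
        return None
    words = set(re.findall(r"[A-Za-z]+", text))
    matches = [label for label, cues in CUES.items() if cues & words]
    return matches[0] if len(matches) == 1 else None


luke_23_50 = "a man named Joseph, from the Jewish town of Arimathea"
no_cue = "the field given to his son Joseph"            # abbreviated stand-in
twice = "Joseph made himself known, and Joseph's family became known"  # stand-in

print(guess_referent("Joseph", luke_23_50))  # resolved by the Arimathea cue
print(guess_referent("Joseph", no_cue))      # None: no cue, needs a person
print(count_mentions("Joseph", twice))       # two mentions in one verse
```

Notice that the second snippet has no cue at all: telling the Old Testament Joseph from Jesus' father takes knowledge of the surrounding story, which is exactly the part this kind of heuristic can't supply.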


7:50:22 AM