Saturday, January 14, 2006

I spent some time last weekend working on changing the hyper-concordance to a MySQL backend. The current implementation simply generates a large number of static HTML files: easy to implement, but a pain to move that much data around. Since there's a file for each term, that's about 3000 files, and 30+ MB of data. Worse, each verse is repeated for each of its indexed terms:

"For the kingdom of heaven is like a householder who went out early in the morning to hire laborers for his vineyard." (Matt.20.1, RSV)

winds up being stored nine times.

The obviously superior approach, unimplemented not because i'm stupid but because i'm lazy, is to put each verse in a database, create an index of terms to verses, and then serve pages that are generated dynamically and styled on the fly.
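A minimal sketch of that layout, assuming Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere. The table names, column names, and the short list of terms are hypothetical, chosen only to show each verse being stored once and reached through a term index:

```python
import sqlite3

# Assumption: sqlite3 stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE verses (
        ref  TEXT PRIMARY KEY,   -- e.g. 'Matt.20.1'
        body TEXT NOT NULL       -- verse text, stored exactly once
    );
    CREATE TABLE term_index (
        term TEXT NOT NULL,      -- base form of an indexed term
        ref  TEXT NOT NULL REFERENCES verses(ref),
        PRIMARY KEY (term, ref)
    );
""")

verse = ("Matt.20.1",
         "For the kingdom of heaven is like a householder who went out "
         "early in the morning to hire laborers for his vineyard.")
conn.execute("INSERT INTO verses VALUES (?, ?)", verse)

# Illustrative subset of the verse's indexed terms, not the real index.
for term in ("kingdom", "heaven", "householder", "vineyard"):
    conn.execute("INSERT INTO term_index VALUES (?, ?)", (term, verse[0]))

def verses_for(term):
    """Return (ref, body) pairs for every verse indexed under `term`."""
    rows = conn.execute(
        "SELECT v.ref, v.body FROM term_index t "
        "JOIN verses v ON v.ref = t.ref WHERE t.term = ? ORDER BY v.ref",
        (term,))
    return rows.fetchall()
```

Calling `verses_for("vineyard")` returns the single stored copy of the verse, however many terms point at it.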

But my recent ruminations on Web 2.0 buzz got me thinking that it might be time to try building on the ESV Web Service API instead. Here's an outline of my thinking:

  • Use the same perl code i already have to map inflected terms back to their bases (more about this here: by the way, this is the only thing that seems even modestly new to me about the hyper-concordance)
  • a term request gets mapped into a series of verse requests using doPassageQuery (note i can't use doQuery: that would defeat the mapping back to base forms). Looks like you can retrieve multiple passages by specifying something like "matt.15.23, matt.15.24, matt.15.32, matt.15.39" as the reference (four verses from Matt 15 with different forms of "send").
  • the resulting XML gets tokenized to identify the base terms, and then processed to add in the hyperlinks and some CSS styling (for example, bolding the query term)
  • the results get sent to the browser
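The outline above could look roughly like the sketch below, in Python rather than the existing perl. It only covers the local steps: mapping a term to its base form, building the comma-separated reference, and marking up verse text. The call to the web service and the XML parsing are left out, since the response format isn't shown here. The lookup tables, the `/term/` link scheme, and the sample sentence are all made up for illustration:

```python
import html
import re

# Hypothetical inflected-form -> base-form table; the real mapping lives in
# the existing perl code.
BASE_FORMS = {"send": "send", "sends": "send", "sent": "send", "sending": "send"}

# Hypothetical index from base term to verse references.
TERM_INDEX = {"send": ["matt.15.23", "matt.15.24", "matt.15.32", "matt.15.39"]}

def passage_reference(term):
    """Map a requested term to one comma-separated passage reference."""
    base = BASE_FORMS.get(term.lower(), term.lower())
    return ", ".join(TERM_INDEX.get(base, []))

def mark_up(text, query_base):
    """Link each indexed word to its base term; bold forms of the query."""
    def link(match):
        word = match.group(0)
        base = BASE_FORMS.get(word.lower())
        if base is None or base not in TERM_INDEX:
            return html.escape(word)       # not a hyper-term: leave as text
        anchor = '<a href="/term/%s">%s</a>' % (base, html.escape(word))
        return "<b>%s</b>" % anchor if base == query_base else anchor

    # Words go through link(); everything between words is escaped as-is.
    pieces, last = [], 0
    for m in re.finditer(r"[A-Za-z']+", text):
        pieces.append(html.escape(text[last:m.start()]))
        pieces.append(link(m))
        last = m.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)
```

Here `passage_reference("sent")` yields the same four-verse reference string as in the outline, because "sent" maps back to "send" before the index lookup.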

Some remaining practical questions:

  • the largest entries in my (yet unreleased) ESV index have hundreds of verses (i think "say" is the current winner). This can be reduced by adding more things to the stopword list, but only at the cost of losing them as hyper-terms. Will the API hold up when queries reference this many passages?
  • Is this acceptably fast?
  • is this well-behaved enough for the daily limit of 500 queries? I don't think i have that many users based on server logs, but it would be nice to be scalable
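A rough way to reason about the first and last questions together, as a sketch only. The batch size of 50 verses per request is purely an assumption (whatever the API actually tolerates would have to be found by experiment), and the helper names are hypothetical:

```python
import math

DAILY_LIMIT = 500      # daily query limit mentioned above
BATCH_SIZE = 50        # assumption: max verses per request, not a documented figure

def batches(refs, size=BATCH_SIZE):
    """Split a list of verse references into comma-joined request strings."""
    return [", ".join(refs[i:i + size]) for i in range(0, len(refs), size)]

def page_views_per_day(verse_count, size=BATCH_SIZE, limit=DAILY_LIMIT):
    """How many uncached views of one term page fit in the daily limit."""
    queries_per_view = max(1, math.ceil(verse_count / size))
    return limit // queries_per_view
```

Under these assumptions a term with 120 verses needs three requests per page view, so the daily limit would cover 166 such views, while a term that fits in one request gets the full 500.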

I'm looking forward to experimenting with this approach: stay tuned.

12:57:28 PM