Natural Language Processing in Python using NLTK

Sean Boisen, <[myfirstname]@logos.com>

LinuxFest Northwest 2008, April 26


Slides at http://semanticbible.org/other/talks/2008/nltk/nltk.html


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

But First, a Word from our Sponsors ...

Goals

The intersection of an entire technical field, a sophisticated programming language, and a complex toolkit: I can only do so much.

Intended Audience

Outline

Overview: Why Me?

Overview: Why Python?

Overview: What is Natural Language Processing?

Overview: What is Natural Language Processing? (2)

Why we need NLP

Annoying Questions (Computational) Linguists Hear

Overview: The Natural Language Toolkit

Overview: The Natural Language Toolkit (2)

So What Can You Do With NLTK?

So What Else Can You Do With NLTK?

(We won't have time to cover these today)

NLTK Corpora

Task #1: Process a Corpus

>>> from nltk.corpus import gutenberg
>>> print gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
>>> len(gutenberg.words('chesterton-brown.txt'))
89090

Task #2: Word Frequency Analysis

>>> from nltk.probability import FreqDist
>>> fd = FreqDist()
>>> fd
<FreqDist with 0 samples>
>>> for word in gutenberg.words('chesterton-brown.txt'):
...   fd.inc(word)
...
>>> fd.N()       # total number of word tokens counted
89090
>>> fd.B()       # number of bins (distinct tokens)
8839

Task #2: Word Frequency Analysis (2)

>>> fds = fd.sorted()
>>> for word in fds[:10]:
...   print word, fd[word]
...
the 4399
, 4251
. 2889
of 2151
and 2119
a 2103
" 1484
to 1439
in 1233
was 1144
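
The same counting pattern can be sketched with only the standard library. This is a minimal Python 3 sketch, not the NLTK API: `collections.Counter` stands in for `FreqDist`, and the `sample` token list is a made-up placeholder for `gutenberg.words(...)`.

```python
from collections import Counter

# Hypothetical stand-in for gutenberg.words('chesterton-brown.txt')
sample = "the priest looked at the sky , and the sky was grey .".split()

counts = Counter(sample)               # plays the role of FreqDist

total_tokens = sum(counts.values())    # analogous to fd.N()
distinct_tokens = len(counts)          # analogous to fd.B()

# Most frequent first, like walking over fd.sorted()[:10]
for word, freq in counts.most_common(3):
    print(word, freq)
```

A plain `Counter` keeps the token and its count together, so the separate sort-then-look-up step from the slide is not needed here.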

Digression: Zipf's Law
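
Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so rank times frequency should stay in the same rough range. A minimal Python 3 sketch, using only the ten counts printed in the word frequency output above (far too few for a real test, and three of them are punctuation tokens):

```python
# Top-10 counts copied from the 'chesterton-brown.txt' output above
top10 = [("the", 4399), (",", 4251), (".", 2889), ("of", 2151), ("and", 2119),
         ("a", 2103), ('"', 1484), ("to", 1439), ("in", 1233), ("was", 1144)]

# Under Zipf's law, rank * frequency is roughly constant
products = [rank * freq for rank, (_, freq) in enumerate(top10, start=1)]

for (word, freq), product in zip(top10, products):
    print(word, freq, product)
```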

Task #3: Stemming

>>> import nltk
>>> stemmer = nltk.PorterStemmer()
>>> stemmer.stem('appearance')
'appear'
>>> verbs = ['appears', 'appear', 'appeared', 'appearing', 'appearance']
>>> map(stemmer.stem, verbs)
['appear', 'appear', 'appear', 'appear', 'appear']
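
To show the general idea of stemming without NLTK, here is a deliberately naive suffix stripper in Python 3. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the `SUFFIXES` list and `naive_stem` helper are hypothetical and were chosen only to fit the word list above.

```python
# Hypothetical suffix list chosen to fit this example only
SUFFIXES = ("ance", "ing", "ed", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

verbs = ['appears', 'appear', 'appeared', 'appearing', 'appearance']
stems = [naive_stem(v) for v in verbs]
print(stems)
```

A stripper this simple mangles many ordinary words (for example, it would cut "sing" down only if the stem stayed long enough, and it knows nothing about doubled consonants), which is why a rule-based stemmer such as Porter's is used in practice.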

Task #4: Part-of-speech Frequency Analysis

>>> from nltk.corpus import brown
>>> from nltk.probability import FreqDist, ConditionalFreqDist
>>> fd = FreqDist()
>>> cfd = ConditionalFreqDist()
>>> for text in brown.files():
...   for sent in brown.tagged_sents(text):
...     for (token, tag) in sent:
...       fd.inc(tag)
...       cfd[token].inc(tag)
...
>>> fd['NN']
152470
>>> for pos in cfd['light']:
...   print pos, cfd['light'][pos]
...
VB 9
JJ 60
NN 251
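
A conditional frequency distribution is essentially one counter per condition. A minimal Python 3 sketch with the standard library, assuming a tiny made-up `tagged_sents` list in place of `brown.tagged_sents()`; `defaultdict(Counter)` stands in for `ConditionalFreqDist` and is not the NLTK class:

```python
from collections import Counter, defaultdict

# Hypothetical tagged sentences standing in for brown.tagged_sents()
tagged_sents = [
    [("the", "AT"), ("light", "NN"), ("faded", "VBD")],
    [("a", "AT"), ("light", "JJ"), ("meal", "NN")],
    [("light", "VB"), ("the", "AT"), ("lamp", "NN")],
]

tag_counts = Counter()                 # plays the role of fd
tags_by_word = defaultdict(Counter)    # plays the role of cfd

for sent in tagged_sents:
    for token, tag in sent:
        tag_counts[tag] += 1           # overall tag frequency
        tags_by_word[token][tag] += 1  # tag frequency for this word

for tag, freq in sorted(tags_by_word["light"].items()):
    print(tag, freq)
```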

Task #5: Identifying Potential Collocations

Collocation code

import nltk
from operator import itemgetter

def collocations(words):
    # Count the individual words and the adjacent word pairs (bigrams)
    wfd = nltk.FreqDist(words)
    pfd = nltk.FreqDist(tuple(words[i:i+2]) for i in range(len(words)-1))

    # Score every bigram, then return the bigrams ordered best-first
    scored = [((w1,w2), score(w1, w2, wfd, pfd)) for w1, w2 in pfd]
    scored.sort(key=itemgetter(1), reverse=True)
    return map(itemgetter(0), scored)

def score(word1, word2, wfd, pfd, power=3):
    # Bigram frequency raised to `power`, divided by the product of the
    # two word frequencies: rewards pairs that mostly occur together
    freq1 = wfd[word1]
    freq2 = wfd[word2]
    freq12 = pfd[(word1, word2)]
    return freq12 ** power / float(freq1 * freq2)

Collocation Example

>>> file = 'chesterton-brown.txt'
>>> words = [word.lower() for word in gutenberg.words(file) if len(word) > 2]
>>> [w1+' '+w2 for w1, w2 in collocations(words)[:15]]
['father brown', 'project gutenberg', 'pilgrim pond', 'nigger ned', 'martin ward', 'sir claude', 'drugger davis', 'michael hart', 'sir wilson', 'calhoun kidd', 'http ://', 'literary archive', 'archive foundation', 'fund raising', 'thousand pounds']
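
The scoring idea does not depend on NLTK. A minimal Python 3 sketch of the same formula (bigram count cubed, divided by the product of the two word counts), with `collections.Counter` swapped in for `FreqDist` and a made-up toy word list in place of the corpus:

```python
from collections import Counter

def score(word1, word2, word_counts, pair_counts, power=3):
    """Bigram count raised to `power`, divided by the product of word counts."""
    freq12 = pair_counts[(word1, word2)]
    return freq12 ** power / (word_counts[word1] * word_counts[word2])

def collocations(words):
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))   # adjacent word pairs
    return sorted(
        pair_counts,
        key=lambda pair: score(pair[0], pair[1], word_counts, pair_counts),
        reverse=True,
    )

# Hypothetical toy input, not taken from the corpus
words = "father brown met father brown near the old church and the old mill".split()
ranked = collocations(words)
print(ranked[:2])
```

Raising the bigram count to a power above 1 favors pairs that recur, so a pair seen once between two rare words does not outrank a pair that keeps appearing together.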

Useful Utilities: nltk.probability

Useful Utilities: nltk.evaluate

Useful Utilities: nltk.evaluate (2)

c:\Python24\Lib\site-packages\nltk>evaluate.py
---------------------------------------------------------------------------
Reference = ['DET', 'NN', 'VB', 'DET', 'JJ', 'NN', 'NN', 'IN', 'DET', 'NN']
Test      = ['DET', 'VB', 'VB', 'DET', 'NN', 'NN', 'NN', 'IN', 'DET', 'NN']
Confusion matrix:
    | D         |
    | E I J N V |
    | T N J N B |
----+-----------+
DET | 3 0 0 0 0 |
 IN | 0 1 0 0 0 |
 JJ | 0 0 0 1 0 |
 NN | 0 0 0 3 1 |
 VB | 0 0 0 0 1 |
----+-----------+
(row = reference; col = test)
Accuracy: 0.8
---------------------------------------------------------------------------
Reference = set(['VB', 'DET', 'JJ', 'NN', 'IN'])
Test = set(['VB', 'DET', 'NN', 'IN'])
Precision: 1.0
Recall: 0.8
F-Measure: 0.888888888889
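
The figures in this output can be checked by hand. A minimal Python 3 sketch using the reference and test lists shown above; the helper functions are written for this illustration and are not NLTK's own implementations:

```python
ref_tags  = ['DET', 'NN', 'VB', 'DET', 'JJ', 'NN', 'NN', 'IN', 'DET', 'NN']
test_tags = ['DET', 'VB', 'VB', 'DET', 'NN', 'NN', 'NN', 'IN', 'DET', 'NN']

def accuracy(reference, predicted):
    """Fraction of positions where the two sequences agree."""
    matches = sum(1 for r, p in zip(reference, predicted) if r == p)
    return matches / len(reference)

def precision(reference_set, predicted_set):
    """Share of predicted items that are also in the reference."""
    return len(reference_set & predicted_set) / len(predicted_set)

def recall(reference_set, predicted_set):
    """Share of reference items that were predicted."""
    return len(reference_set & predicted_set) / len(reference_set)

def f_measure(reference_set, predicted_set):
    """Harmonic mean of precision and recall."""
    p = precision(reference_set, predicted_set)
    r = recall(reference_set, predicted_set)
    return 2 * p * r / (p + r)

ref_set, test_set = set(ref_tags), set(test_tags)
print(accuracy(ref_tags, test_tags))
print(precision(ref_set, test_set), recall(ref_set, test_set))
print(f_measure(ref_set, test_set))
```

Accuracy compares the two tag sequences position by position, while precision, recall, and F-measure here compare the sets of distinct tags, which is why 'JJ' missing from the test set lowers recall but not precision.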

Summary

Resources