- convert everything to lowercase
- remove punctuation at the start and end of words
- leave apostrophes and hyphenated words intact
This function has two while loops. The first drops the first character of the word (using string slice syntax) until the first character is valid or the word is empty. The second loop does the same, but works from the last character.
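To see the two loops at work, here is a small sketch that applies ``strip_non_chars`` (as it appears in the full listing) to a few sample words. The sample words are made up for illustration. Note that the words are already lowercase, because ``clean`` lowercases the text before this function is called.

```python
def strip_non_chars(word):
    # characters we accept at the start and end of a word
    valid = "abcdefghijklmnopqrstuvwxyz1234567890"
    # first loop: drop leading characters until one is valid
    while len(word) > 0 and word[0] not in valid:
        word = word[1:]
    # second loop: drop trailing characters until one is valid
    while len(word) > 0 and word[len(word) - 1] not in valid:
        word = word[0: len(word) - 1]
    return word


# made-up sample words, already lowercased
samples = ['"hello,"', "don't", "(well-known).", "..."]
stripped = [strip_non_chars(s) for s in samples]
print(stripped)
```

Apostrophes and hyphens inside a word survive because the loops only ever look at the first and last character. A word made up entirely of punctuation shrinks to the empty string, which is why both loops check ``len(word) > 0``.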
Counting word frequencies, the number of times each word occurs in the text, is the main tool this program offers for analyzing text. The algorithm in ``freq`` is familiar to us by now: iterate over a list of words in a for loop, use unique words as **dictionary keys**, and keep a **running total** for each word as the dictionary **value**. The word frequency map it returns allows quick and easy access to the frequency of every word in the text.

Dictionaries are great for keyed access to items (in our case, looking up word counts based on the word), but they are inherently **unsorted**. This means that we need to take extra steps to see the most common or least common words in a text, or even to display the words in alphabetical order. Previously we have seen how to display the contents of a dictionary by doing a **for loop** over ``sorted(map.keys())``. This makes sense for some data, but here we are more interested in sorting on the values, not the dict keys.

``freq_list`` solves this problem by sorting the dictionary items based on the word count. It expects a list of tuples with element 0 holding the word and element 1 holding the count. Luckily, this is the tuple list we get when we call ``items()`` on our word frequency map. Python offers many flexible ways of doing custom sorts on data. We choose a method here that builds on programming skills we have already learned:

- swap the tuples from (word, count) to (count, word) using a for loop
- sort them using the ``sort`` method, which is part of the list object
- use the optional parameter to ``sort`` so that we get the most popular words at the head of our list
- swap the items back into a list of (word, count) tuples and return that, pretending nothing ever happened (actually, this is a type of abstraction: we can later change the way we choose to sort the items without changing the interface to our function)
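The counting and swap-sort-swap steps can be sketched on a tiny word list. This is a minimal illustration, not the program's definitive code: the sample sentence is made up, and ``reverse=True`` is assumed to be the optional ``sort`` parameter the text refers to. The second function, ``freq_list_by_key``, is a hypothetical alternative added only to show that the sorting strategy can change while the interface stays the same.

```python
def freq(words):
    # running total of each unique word, keyed by the word
    freq_map = {}
    for word in words:
        if word in freq_map:
            freq_map[word] += 1
        else:
            freq_map[word] = 1
    return freq_map


def freq_list(items):
    # swap (word, count) -> (count, word) so sorting compares counts first
    swapped = []
    for word, count in items:
        swapped.append((count, word))
    # assumption: reverse=True puts the largest counts at the head
    swapped.sort(reverse=True)
    # swap back to (word, count) before returning
    result = []
    for count, word in swapped:
        result.append((word, count))
    return result


def freq_list_by_key(items):
    # hypothetical alternative: same interface, different sorting strategy
    return sorted(items, key=lambda item: item[1], reverse=True)


words = "the cat and the hat and the bat".split()  # made-up sample
counts = freq(words)
ranked = freq_list(counts.items())
print(ranked[:2])  # the two most frequent words
```

One difference between the two strategies is how ties are ordered. The swap-and-sort version breaks ties by reverse alphabetical order of the word, while the ``key`` version keeps tied items in their original order. Callers that only care about the counts see the same ranking either way.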
# schools_analysis.py
# by: mxc
"""
This example code shows how to use
some basic content analysis techniques
such as finding word frequencies
and counting neighboring words to
provide a basic analysis of differences
in New York Times articles from 1983
and 2013.

Key Concepts: functions, dictionaries,
lists, tuples, string operations
New Concepts: while loop
"""
def read_and_prep(fileName):
    """
    Open a file, read the content,
    and prepare it for analysis.
    Return a list of words from the
    text.
    """
    f = open(fileName, "r")
    text = f.read()
    # close the file once its content has been read
    f.close()
    words = clean(text)
    return words
def clean(text):
    """
    Prepare our text for analysis by
    removing non-alphanumeric
    characters like punctuation and
    formatting marks like em and en
    dashes. This function assumes
    ascii not UTF-8 or other
    encodings.
    """
    text = text.lower()
    # replace -- with a space
    text = text.replace("--", " ")
    words = text.split()
    cleanWords = []
    for word in words:
        word = strip_non_chars(word)
        # keep only words that still have characters left
        if len(word) > 0:
            cleanWords.append(word)
    return cleanWords
def strip_non_chars(word):
    """
    Use index notation to make sure that the first
    and last characters of the word are among our
    valid characters
    """
    valid = "abcdefghijklmnopqrstuvwxyz1234567890"
    while len(word) > 0 and word[0] not in valid:
        word = word[1:]
    while len(word) > 0 and word[len(word) - 1] not in valid:
        word = word[0: len(word) - 1]
    return word
def freq(words):
    """
    Take words--a list of words--and
    break it down into a word
    frequency map where each unique
    word in the text is a key
    with the number of occurrences
    as the value. Return the map.
    """
    freqMap = {}
    for word in words:
        if word in freqMap:
            freqMap[word] += 1
        else:
            # first time we have seen this word
            freqMap[word] = 1
    return freqMap
def freq_list(items):
    """
    Create a sorted list from map items
    from a word frequency dictionary.
    The items must be 2-tuples in the format
    (word, count). Return a list of 2-tuples
    in (word, frequency) order, sorted with
    the most frequent words at the start
    of the list.
    """
    freqList = []
    for word, count in items:
        freqList.append( (count, word) )
    # sort on the count, largest first
    freqList.sort(reverse=True)
    # now swap it back to word,count order in our tuples
    swapped = []
    for count,word in freqList:
        swapped.append( (word,count) )
    return swapped
def common_filter(items):
    """
    Filter out the most common
    English words, taken from a list
    of 500 common words. ``items``
    is a list of tuples where the
    first element is the word.
    Returns a new list of items,
    without the common words.
    """
    # 500 common English words, in frequency order
    # edited here for brevity
    commonWords = ['the', 'of', 'and', 'a', 'in'] #...
    filtered = []
    for item in items:
        word = item[0]
        if word not in commonWords:
            filtered.append( item )
    return filtered
def neighbors(words, targetWords, n):
    """
    This function takes a list of
    words and returns a dict with
    each target word as the key, and
    as the value a word frequency
    map (word -> count) of all of
    the words in the text within
    ``n`` positions either before
    or after the target word.
    """
    neighborMap = {}
    #initialize our dict with empty lists
    for target in targetWords:
        neighborMap[target] = []
    for i in range(len(words)):
        # current word
        word = words[i]
        if word in targetWords:
            # use max to make sure
            # we don't go below zero
            start = max(0,i-n)
            # use min to guard against
            # going beyond end of list
            end = min(len(words)-1, i+n)
            neighborList = neighborMap[word]
            neighborList.extend(words[start:i] + words[i+1:end+1])
            neighborMap[word] = neighborList
    # replace each list of neighbors with its frequency map
    for key in neighborMap.keys():
        neighborMap[key] = freq(neighborMap[key])
    return neighborMap
def compare(a, b):
    """
    Compare the word frequency dictionaries
    ``a`` and ``b`` by creating a new sorted list
    of 2-tuples in the format
    (word, difference in counts).
    """
    # include words that appear in either dictionary
    wordSet = set(a.keys()) | set(b.keys())
    freqDifList = []
    for word in wordSet:
        aCount = a.get(word, 0)
        bCount = b.get(word, 0)
        dif = abs(aCount - bCount)
        freqDifList.append( (word, dif) )
    sortedDifs = freq_list(freqDifList)
    return sortedDifs
def analyze(fName, targets):
    """
    Analyze the texts along several lines:
    - split into words
    - create a frequency map for the document
    - analyze neighbors for the
      list of ``target`` words
    Return the results as a tuple:
    (list of all words, freq map, freq map for neighbors)
    """
    words = read_and_prep(fName)
    frequencies = freq(words)
    neigh = neighbors(words, targets, 2)
    return words, frequencies, neigh
def print_counts(freqList):
    """
    Takes a list of tuples in the format (word, count)
    and prints the result in a table.
    """
    col = 20
    print("word".ljust(col) + " count")
    print("-" * col + " " + "---------")
    for word, count in freqList:
        print(str(word.rjust(col)) + " " + str(count))
def print_targets(neigh, count=10):
    for target in neigh.keys():
        com = freq_list(neigh[target].items())
        com = common_filter(com)
        show = min(count, len(com)-1 )
def report(header, words, freqList, neighborList):
    """
    Print a generic report with the analysis.
    """
    print("RUNNING ANALYSIS FOR:", header)
    print("Total words:", len(words))
    uncommon = common_filter(freqList)
def print_compare(compared, a, b):
    """
    Print out a table comparing two
    dictionaries. ``compared`` is the
    ordered list of (word, difference)
    tuples to compare. ``a`` and ``b``
    are the two word frequency
    dictionaries to compare.
    """
    col = 20
    print("word".ljust(col) + " a list b list")
    print("-" * col + " ------ ------")
    for word, dif in compared:
        print(word.rjust(col) + " " +
              str(a.get(word,0)).ljust(6) + " " +
              str(b.get(word,0)).ljust(6))
``neighbors`` is more complicated than other code we have looked at
because it calls one of our functions in a loop and it uses nested
dictionaries: a dictionary whose values are themselves dictionaries.
To unpack the function a little bit, consider this example text:
While previously well known in education circles, she gained a much broader audience after she publicly rejected almost everything she had once believed. In a surprise 2010 best seller, "The Death and Life of the Great American School System," she openly declared that she had been wrong to champion standardized testing, charter schools and vouchers. She says she is trying now to make up for past errors.
If we run ``neighbors`` with ``targetWords=["testing", "education"], n=2``,
we would create a dictionary that looks something like this:
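=========  ======================================================================
key        value
=========  ======================================================================
education  {"known": 1, "in": 1, "circles": 1, "she": 1}
testing    {"champion": 1, "standardized": 1, "charter": 1, "schools": 1}
=========  ======================================================================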
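The table above can be reproduced with a short sketch. The functions here are condensed versions of ``clean``, ``freq``, and ``neighbors`` from the listing, written compactly so the example is self-contained. They are meant as an illustration of the windowing logic, not as a replacement for the full program.

```python
VALID = "abcdefghijklmnopqrstuvwxyz1234567890"


def clean(text):
    # lowercase, split on whitespace, trim invalid leading/trailing characters
    words = []
    for word in text.lower().replace("--", " ").split():
        while word and word[0] not in VALID:
            word = word[1:]
        while word and word[-1] not in VALID:
            word = word[:-1]
        if word:
            words.append(word)
    return words


def freq(words):
    # word -> number of occurrences
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def neighbors(words, target_words, n):
    # collect the words within n positions of each target occurrence
    neighbor_map = {target: [] for target in target_words}
    for i, word in enumerate(words):
        if word in neighbor_map:
            start = max(0, i - n)             # don't go below index zero
            end = min(len(words) - 1, i + n)  # don't run past the last index
            neighbor_map[word].extend(words[start:i] + words[i + 1:end + 1])
    # nested dictionaries: each target maps to its own frequency map
    return {target: freq(found) for target, found in neighbor_map.items()}


text = (
    'While previously well known in education circles, she gained a much '
    'broader audience after she publicly rejected almost everything she had '
    'once believed. In a surprise 2010 best seller, "The Death and Life of '
    'the Great American School System," she openly declared that she had '
    'been wrong to champion standardized testing, charter schools and '
    'vouchers. She says she is trying now to make up for past errors.'
)

result = neighbors(clean(text), ["testing", "education"], 2)
print(result["education"])
print(result["testing"])
```

Each target word occurs only once in this passage, so every neighbor has a count of 1. In a longer text the inner dictionaries accumulate counts across all occurrences of the target.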
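Here are the actual results for this ``neighbors`` call against ``teachers_2013.txt``, showing the top 3 neighbors for each word:

=========  ======================================================================
key        value
=========  ======================================================================
testing    {"standardized": 30, "high-stakes": 15, "students": 8}
education  {"department": 169, "board": 54, "higher": 52}
=========  ======================================================================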