Category: Python

TF-IDF in Python

I am trying to process a bunch of text generated by human users to figure out what they are talking about, in the context of an experiment where there were robots a person could be controlling, a few target areas the robots could move to, and a few things they could move, like a crate. One thing that occurred to me was to use TF-IDF, which, given a text and a collection of texts, tells you which words in that text are relatively unusual compared to the rest of the collection.
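Concretely, with N documents in the collection, the score the code below computes for a word w in a text d is

tfidf(w, d) = count(w, d) * log(N / (1 + df(w)))

where count(w, d) is the number of times w appears in d, and df(w) is the number of documents in the collection containing w (the +1 keeps previously unseen words from causing a divide by zero).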

It turns out that’s not really what I wanted, because the “unusual” words are not really the ones the particular text is about.

select(1.836) all(3.628) red(2.529) robots(1.517) ((1.444) small(5.014) drag(2.375) near(3.915) lhs(4.321) of(1.276) screen(2.018) )(1.444)

This is a sentence from the collection, and the value after each word is the TF-IDF score for that word. The things I care about are that it’s about robots, and maybe a little that it’s about the left hand side of the screen. “Robots” actually got a pretty low score (barely more than “of”), but “small” got the highest score in the sentence.

At any rate, this is how I did my TF-IDF calculation. Add documents to the object with add_text, and get scores with get_tfidf. It uses NLTK for tokenization; you could also use t.split(" ") to break strings up on spaces.

import math
import nltk

class TFIDF(object):
  def __init__(self):
    #Count of docs containing each word
    self.doc_counts = {}
    self.docs = 0.0

  def add_text(self, t):
    #We're adding a new doc
    self.docs += 1.0
    #Get all the unique words in this text
    uniques = set(nltk.word_tokenize(t))
    for u in uniques:
      self.doc_counts[u] = self.doc_counts.get(u, 0) + 1

  def get_tfidf(self, t):
    #Count occurrences of each word in this text
    word_counts = {}
    words = nltk.word_tokenize(t)
    for w in words:
      word_counts[w] = word_counts.get(w, 0) + 1
    #Calculate the TF-IDF for each word
    tfidfs = []
    for w in words:
      #Number of docs containing this word, 0 if it was never seen
      w_docs = self.doc_counts.get(w, 0)
      #The +1 avoids division by zero for previously unseen words
      idf = math.log(self.docs / (1 + w_docs))
      tf = word_counts[w]
      tfidfs.append((w, tf * idf))
    return tfidfs
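A minimal usage sketch (the documents here are invented for illustration, and nltk.word_tokenize assumes the NLTK tokenizer data is installed):

tfidf = TFIDF()
#Build up the collection
tfidf.add_text("select all red robots")
tfidf.add_text("drag the crate near the lhs of the screen")
tfidf.add_text("move all robots to the target area")

#Score a sentence against the collection, in the word(score) format above
for word, score in tfidf.get_tfidf("select the red crate"):
  print "{0}({1:.3f})".format(word, score),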


Download ALL The Music

Given a file containing a list of songs, one per line, in the format “Artist – Song Title”, this script downloads the audio of the first YouTube video link in a Google search for that song. This is quite useful if you want the MP3 for every song you ever gave a thumbs up on Pandora. On my computer, it averages about 4 songs a minute.

The Requests library and BeautifulSoup make writing screen scrapers and automating the web really clean and easy.

#!/usr/bin/python

# Takes a list of titles of songs, in the format "artist - song" and searches for each
# song on google. The first youtube link is passed off to youtube-dl to download it and 
# get the MP3 out. This doesn't have any throttling because (in theory) the conversion step
# takes enough time to provide throttling. 

import requests
import re
from BeautifulSoup import BeautifulSoup
from subprocess import call

def queryConverter(videoURL):
	call(["youtube-dl", "--extract-audio",  "--audio-format", "mp3", videoURL])

def queryGoogle(songTitle):
	reqPreamble = "https://www.google.nl/search"
	reqData = {'q':songTitle}
	r = requests.get(reqPreamble, params=reqData)
	if r.status_code != 200:
		print "Failed to issue request to {0}".format(r.url)
	else:
		bs = BeautifulSoup(r.text)
		tubelinks = bs.findAll("a", attrs={'href':re.compile("watch")})
		if len(tubelinks) > 0:
			vidUrl = re.search("https[^&]*", tubelinks[0]['href'])
			vidUrl = requests.utils.unquote(vidUrl.group(0))
			return vidUrl
		else:
			print "No video for {0}".format(songTitle)

if __name__=="__main__":
	with open("./all_pandora_likes", 'r') as inFile:
		for line in inFile:
			videoURL = queryGoogle(line.strip())
			if videoURL is not None:
				queryConverter(videoURL)
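Note that this is Python 2 code (the print statements and the BeautifulSoup 3 import), and it assumes youtube-dl is installed and on your PATH. The input file is hardcoded as ./all_pandora_likes; its lines just follow the "artist - song" format, for example (titles invented for illustration):

Billy Artist - Some Song I Liked
The Examples - Another Song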

PDB for n00bs

PDB is the Python debugger, which is very handy for debugging scripts. I use it in two ways.

If I’m having a problem with the script, I’ll put in the line

import pdb; pdb.set_trace()

just before where the problem occurs. Once the pdb line is hit, I get the interactive debugger and can start stepping through the program, seeing where it blows up and what values the variables take on before that happens.
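The standard pdb commands cover most of what I need once I’m at the (Pdb) prompt:

(Pdb) l       # list the source around the current line
(Pdb) n       # execute the next line, stepping over function calls
(Pdb) s       # execute the next line, stepping into function calls
(Pdb) p x     # print the value of an expression (here, a variable x)
(Pdb) c       # continue running until the next breakpoint or crash
(Pdb) q       # quit the debugger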

However, I recently found a very handy second way. I was debugging a script with a curses interface, which cleans up when it exits. Unfortunately, that cleanup means that my terminal gets wiped when something crashes, so instead of a stack trace, I just get dumped back to the terminal when something goes wrong, with no information at all left on the screen.

Invoking the script with

python -m pdb ./my_script.py

runs the script under pdb from the start: it stops before the first line, and after entering c to continue, the program runs normally until something goes wrong, at which point it halts and I get the interactive post-mortem debugger and some amount of stack trace. It’s messy looking because of curses, but I can at least see what is going on.
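A third option, if you’d rather not run everything under pdb, is to catch the crash yourself and drop into the post-mortem debugger only when something actually goes wrong. A minimal sketch (main() here is a stand-in for your script’s entry point):

import pdb
import traceback

try:
    main()  # hypothetical entry point; substitute your script's own
except Exception:
    #For a curses program, end the session first (curses.endwin())
    #so the terminal is readable before the debugger starts
    traceback.print_exc()
    pdb.post_mortem()  # inspect the stack at the point of failure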