Category: Scripting

TF-IDF in Python

I am trying to process a bunch of text generated by human users to figure out what they are talking about. The context is an experiment where a person could control some robots, which could move to a few target areas and could move a few objects, like a crate. One thing that occurred to me was to use TF-IDF, which, given a text and a collection of texts, tells you which words are relatively unusual in that particular text compared to the rest of the collection.

It turns out that’s not really what I wanted, because the words that score as “unusual” are not necessarily the ones the particular text is about.

select(1.836) all(3.628) red(2.529) robots(1.517) ((1.444) small(5.014) drag(2.375) near(3.915) lhs(4.321) of(1.276) screen(2.018) )(1.444)

This is a sentence from the collection, and the value after each word is the TF-IDF score for that word. The things I care about are that it’s about robots, and maybe a little that it’s about the left hand side of the screen. “Robots” actually got a pretty low score (barely more than “of”), but “small” got the highest score in the sentence.

At any rate, this is how I did my TF-IDF calculation. Add documents to the object with add_text, and get scores with get_tfidf. It uses NLTK for tokenization; you could also use t.split(" ") to break strings up on spaces.

import math

import nltk


class TFIDF(object):
  def __init__(self):
    #Count of docs containing each word
    self.doc_counts = {}
    self.docs = 0.0

  def add_text(self, t):
    #We're adding a new doc
    self.docs += 1.0
    #Get all the unique words in this text
    uniques = list(set(nltk.word_tokenize(t)))
    for u in uniques:
      if u in self.doc_counts:
        self.doc_counts[u] += 1
      else:
        self.doc_counts[u] = 1

  def get_tfidf(self, t):
    word_counts = {}
    #Count occurrences of each word in this text
    words = nltk.word_tokenize(t)
    for w in words:
      if w in word_counts:
        word_counts[w] += 1
      else:
        word_counts[w] = 1
    #Calculate the TF-IDF for each word
    tfidfs = []
    for w in words:
      #Document count is 0 (it's in no docs) or the stored count
      w_docs = 0
      if w in self.doc_counts:
        w_docs = self.doc_counts[w]

      #The 1 is to avoid div/zero for previously unseen words
      idf = math.log(self.docs/(1+w_docs))
      tf = word_counts[w]
      tfidfs.append((w, tf * idf))
    return tfidfs
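
For reference, here is a minimal usage sketch (the corpus strings are made up, and NLTK's tokenizer data has to be installed for word_tokenize to work):

tfidf = TFIDF()
corpus = ["select all red robots",
          "drag the crate to the target area",
          "move the small robot near the left side of the screen"]
for doc in corpus:
  tfidf.add_text(doc)

#Higher scores mean the word is more specific to this text
for word, score in tfidf.get_tfidf("drag the red crate"):
  print("{0}: {1}".format(word, score))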


Download ALL The Music

Given a file containing a list of songs, one per line, in the format “Artist – Song Title”, download the audio of the first YouTube video link on a Google search for that song. This is quite useful if you want the MP3 for every song you ever gave a thumbs up on Pandora. On my computer, this averages about 4 songs a minute.

The Requests API and BeautifulSoup make writing screenscrapers and automating the web really clean and easy.

#!/usr/bin/python

# Takes a list of titles of songs, in the format "artist - song" and searches for each
# song on google. The first youtube link is passed off to youtube-dl to download it and 
# get the MP3 out. This doesn't have any throttling because (in theory) the conversion step
# takes enough time to provide throttling. 

import requests
import re
from BeautifulSoup import BeautifulSoup
from subprocess import call

def queryConverter(videoURL):
	call(["youtube-dl", "--extract-audio",  "--audio-format", "mp3", videoURL])

def queryGoogle(songTitle):
	reqPreamble = "https://www.google.nl/search"
	reqData = {'q':songTitle}
	r = requests.get(reqPreamble, params=reqData)
	if r.status_code != 200:
		print "Failed to issue request to {0}".format(r.url)
	else:
		bs = BeautifulSoup(r.text)
		tubelinks = bs.findAll("a", attrs={'href':re.compile("watch")})
		if len(tubelinks) > 0:
			vidUrl = re.search("https[^&]*", tubelinks[0]['href'])
			vidUrl = requests.utils.unquote(vidUrl.group(0))
			return vidUrl
		else:
			print "No video for {0}".format(songTitle)

if __name__=="__main__":
	with open("./all_pandora_likes", 'r') as inFile:
		for line in inFile:
			videoURL = queryGoogle(line)
			if videoURL is not None:
				queryConverter(videoURL)

PDB for n00bs

PDB is the Python debugger, which is very handy for debugging scripts. I use it in two ways.

If I’m having a problem with the script, I’ll put in the line

import pdb; pdb.set_trace()

just before where the problem occurs. Once the pdb line is hit, I get the interactive debugger and can start stepping through the program, seeing where it blows up and what values the variables get set to before that happens.
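
As a quick sketch of how that looks (the function here is made up purely to have something that crashes):

def divide(a, b):
  import pdb; pdb.set_trace()  #Execution pauses here with the (Pdb) prompt
  return a / b

divide(1, 0)

#Some useful commands at the (Pdb) prompt:
#  n      run the next line
#  s      step into a function call
#  p b    print the value of b
#  c      continue running (here, straight into the ZeroDivisionError)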

However, I recently found a very handy second way. I was debugging a script with a curses interface, which cleans up the screen when it exits. Unfortunately, that cleanup means my terminal gets wiped when the script crashes, so instead of a stack trace I just get dumped back to the prompt with no information at all left on the screen.

Invoking the script with

python -m pdb ./my_script.py

gets me the postmortem debugger, so when something goes wrong, the program halts and I get the interactive debugger and some amount of stack trace. It’s messy looking because of curses, but I can at least see what is going on.

Playatech started charging for their plans

Unfortunately for burners, you can no longer download Playatech’s plans for their furniture without paying them first. They used to offer the plans as free downloads, and then asked that you donate some small amount if you used them.

Unfortunately for Playatech, they left all the PDFs in a world-readable directory. The command line below gets the index of that directory, finds all the lines with “pdf” in them, gets the file names out using cut, and then downloads each file.

for file in `wget -qO- http://playatech.com/wp-content/uploads/2013/05/ | grep pdf | cut -d '>' -f 2 | cut -d '"' -f 2`; do wget http://playatech.com/wp-content/uploads/2013/05/$file; done

Flickr Downloadr that really works

Not my work. Get it here.

It does exactly what it says on the tin. This is letting me close a years-old open loop I had, which is that Flickr had a lot of my photos, but sucked so bad that I didn’t want to reward them with money in order to get my photos back.

As soon as the download is done, that Flickr account is toast.

Well THAT'S messy

for file in ../connections_2014-10-7-1*; do conn="-c ../connections_"`echo $file | cut -d "_" -f 2`; types="-t ../neuron_types_"`echo $file | cut -d "_" -f 2`; locs="-l ../locations_"`echo $file | cut -d "_" -f 2`; ./pickle_to_json.py $conn $types $locs; done

For all the connection files that were generated today, create three variables called “conn”, “types”, and “locs”, each holding a command line switch and a path built from a fixed prefix plus the date stamp cut out of the connection file name. Then invoke the script “pickle_to_json.py” with those variables as arguments.

Effectively, the connection, neuron type, and location files are all related by their date, so this makes a single JSON file out of the multiple files. I just didn’t want to run pickle_to_json.py a bunch of times by hand, as that seemed error-prone.
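
It works, but if I end up doing this regularly, a short Python version is easier to read and harder to typo. This is a sketch that assumes the same naming convention, with the shared date stamp being everything after the first underscore:

import glob
import subprocess

for conn_file in glob.glob("../connections_2014-10-7-1*"):
  #Everything after the first underscore is the shared date stamp
  stamp = conn_file.split("_", 1)[1]
  subprocess.call(["./pickle_to_json.py",
                   "-c", "../connections_" + stamp,
                   "-t", "../neuron_types_" + stamp,
                   "-l", "../locations_" + stamp])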

Splitting a CSV file into a bunch of columns

awk -F, '{for(i=1;i<=NF;i++){print $i > "sample"i".csv"}}' yourfile.csv

Does what it says on the tin. Splits your CSV file into a bunch of files, one for each column of the original file. Found here.
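
One caveat: if the file has quoted fields with commas inside them, the simple comma split will mangle them. A rough Python equivalent using the csv module handles that case (it keeps the same sample*.csv output names and assumes every row has the same number of fields):

import csv

with open("yourfile.csv") as infile:
  rows = list(csv.reader(infile))

#Write each column of the input to its own file
for i in range(len(rows[0])):
  with open("sample{0}.csv".format(i + 1), "w") as outfile:
    for row in rows:
      outfile.write(row[i] + "\n")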

I’m using this to pull single channels out of a 60 channel file full of recorded neuron voltages, which I’m then throwing through a little filter test program that I whipped up using this filter library. My main goal is getting rid of 60Hz line noise, but the fluorescent bulbs in the room apparently also make noise at 180Hz and 300Hz.
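
If you would rather do the filtering in Python instead of a separate test program, scipy's iirnotch can knock out those frequencies. This is not the filter library I actually used, just a sketch, and the sample rate and file name below are placeholders:

import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 25000.0  #Sample rate in Hz; placeholder, use your recording's actual rate
signal = np.loadtxt("sample1.csv")

#Notch out the 60Hz line noise and the fluorescent bulb harmonics
for freq in [60.0, 180.0, 300.0]:
  b, a = iirnotch(freq, Q=30.0, fs=fs)
  signal = filtfilt(b, a, signal)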

Useful mencoder invocation

mencoder -nosound mf://*.jpg -mf w=1280:h=800:type=jpg:fps=30 -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=2400:mbd=2:keyint=132:v4mv:vqmin=3:lumi_mask=0.07:dark_mask=0.2:mpeg_quant:scplx_mask=0.1:tcplx_mask=0.1:naq -o output_filename.avi

Turns all the JPEG files in the directory you are currently in into a nice quality MPEG-4/AVI file. The width and height in the options after -mf should be changed to match the images. This command line also works for PNG files if you replace both instances of “jpg” with “png”.

Naming things and "ImportError: No module named msg"

I’m using ROS at school for a project. Part of the project is to detect someone’s hand with a camera, so I’m just looking for a patch of “skin colored*” pixels. ROS organizes software as packages, with nodes in them, and messages that the nodes use to communicate with each other.

For my system, I had a package called “hand_detector” with a source file called “hand_detector.py” and a message type called “hand”. ROS generates the messages, which I then import into my python code with the line:

from hand_detector.msg import *

This gets me the error message: “ImportError: No module named msg”

The reason for this is that python searches the same directory as the executing script for imports before it goes looking anywhere else. Since hand_detector.py is the executing script, and is naturally in its own directory, python resolves “hand_detector” to the script itself rather than to the ROS package, and then tries to find a module called “msg” inside hand_detector.py. There’s no msg in there, so I get the error.
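
A quick way to see which “hand_detector” python actually found (just a diagnostic snippet, not part of the node) is to print where the import resolved:

import hand_detector
#With the badly named script, this prints the path to hand_detector.py itself
#rather than the ROS package's generated message code
print(hand_detector.__file__)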

The moral of the story here is don’t name your package and the script in it the same thing. Once I converted the script to just “detector.py”, the problem went away.

*I’m somewhat concerned that I wrote a “white people detector”, as it’s really just thresholding the H part of the HSV color space and counting pixels. Other color spaces may be better for this, but this doesn’t have to be perfect. I just don’t want the robot to be a dick to black people.

Command Line Audio Editing With Sox

For my ritual spoken word software piece, I recruited a bunch of my friends to say the text of the ritual. Each “stanza” of the ritual has a call and a response, so I broke each recording up into individual clips for each call and response. That gave me about 28 files per person, and over 100 clips total.

The different participants all recorded on different hardware, and at different volume levels. I also wasn’t super-precise about trimming the clips, so each file had silence at the beginning.

This left me with two problems: some participants were much softer than others, and some of the clips lagged each other, which made for bad chorus effects.

To trim the clips, I used sox, a Linux tool for manipulating sounds, with the command:

for file in *.wav; do sox $file $file.wav silence 1 0.1 2%; done

This results in a file named foo.wav.wav for each foo.wav file in the directory, so I cleaned up with:

rename -f "s/.wav././g" *

Note that this scribbles over the originals, so keep backups. I’m glad I did, because 2% turned out to be a little aggressive, and trimmed off the beginning of clips starting with an “ma-” sound, such as “make us a…”. This is likely because the sound faded in slowly, and so got counted as part of the noise rather than the beginning of a sound.

There is useful documentation for the sox silence filter here.

Turning the volume up on the files was done with:

for file in *.wav; do sox $file $file.wav gain -l 8; done

and another pass of rename, as above. Adjust the “8” up or down to suit your needs. Positive numbers make it louder, negative make it quieter.
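
If you want to skip the .wav.wav cleanup entirely, a small Python wrapper can write the processed clips into a separate directory instead. This is a sketch, assuming sox is on your PATH and the clips are in the current directory; the “processed” directory name is my own choice:

import glob
import os
import subprocess

if not os.path.isdir("processed"):
  os.mkdir("processed")

for infile in glob.glob("*.wav"):
  outfile = os.path.join("processed", infile)
  #Trim leading silence and boost the gain in a single sox call
  subprocess.call(["sox", infile, outfile,
                   "silence", "1", "0.1", "2%",
                   "gain", "-l", "8"])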

If you want to preview a sox effect, just replace “sox” in the command with “play”, and leave off the output file. For example,

play myfile.wav gain -l 8

will play myfile.wav with increased gain, but won’t change the file.