Chatbots as Translation

I got a translation program based on deep neural networks (http://opennmt.net/ if anyone wants to play along at home). I’m training it on a corpus of my previous text chats. The “source language” is everything that everyone has said to me, and the “target language” is what I said in return. The resulting model should end up translating from things that someone says to appropriate responses. My endgame is to hook it up to an instant messaging client, so people can chat with a bot that poses as me.

This has a couple of problems. The main one is that statistical translation generally works by converting the input language into some abstract form that represents a meaning (I’m handwaving so hard here that I’m about to fly away) and then re-representing that meaning in the output language. Essentially, the overall concept is that there is some mathematical function that converts a string in the input language to a string in the output language while preserving meaning, and that function is what is learned by the neural network. Since what someone says and what I respond with have different, but related meanings, this isn’t really a translation problem.

The other problem comes up when I do the second part of this plan, which is to train the network with a question and answer corpus. At its most abstract, a question is a request to provide knowledge which is absent, and an answer is an expression of the desired knowledge. “Knowledge”, in this case, refers to factual data. One could attempt an argument that by training on a Q&A corpus, the network is encoding the knowledge within itself, as the weights used in the transformation function. As a result, the network “knows” at least the things that it has been trained on. This “knowing” is very different from the subjective experience of knowing that humans have, but given the possibility that consciousness and subjective experience may very well be an epiphenomenon, maybe it has some similarities.

Unfortunately, this starts to fall apart when the deep network attempts to generalize. Generalization, in this case, is producing outputs in response to inputs that are not part of the training input data set. If one trains a neural network for a simple temperature control task, where the input is a temperature, and the output is how far to open a coolant valve, the training data might look like this:

Temperature Valve Position
0 0 (totally closed)
10 0.1
20 0.2
30 0.3
40 0.4
50 0.5 (half open)
60 0.6
70 0.7
80 0.8
90 0.9
100 1.0 (fully open)

So far, so good. This is a pretty simple function to learn, the valve position is 0.01 * Temperature. The generalization comes in when the system is presented with a temperature that isn’t in the training data set, like 43.67 degrees, which one would hope results in a valve setting of 0.4367 or thereabouts. There is a problem that temperatures less than zero or greater than 100 degrees result in asking the valve to be more than completely shut, or more than fully open, but we can just assume that the valve has end stops and just doesn’t do that, rather than trying to automatically add a second valve and open that too.

The problem comes when we start generalizing across questions and answers. Assume there is some question in the training data that asks “My husband did this or that awful thing, should I leave him?” and the answer is mostly along the lines of “Yes, bail on that loser!”, and another question that asks “My husband did some annoying but not really awful thing, should I leave him?” and the answer is “No, concentrate on the good in your relationship and talk to him to work through it.” These are reasonable things to ask, and reasonable responses. Now imagine that there is a new question. The deep network does its input mapping to the space of questions, and the result (handwaved down to a single value for explanation purposes) falls somewhere between the representations for the “awful thing” question and the “annoying thing” question. Clearly, the result should fall somewhere between “DTMFA” and “Stick together”, but “Hang out near him” isn’t really good advice and “Split custody of the kids and pets, but keep living together” seems like bizzaro-world nonsense. There isn’t really a mathematical mapping for the midrange here. Humans have knowledge about how human relationships work, and models of how people act in them, that we use to reason about relationships and offer advice. This kind of knowing is not something deep networks do (and it’s not even something that anyone is trying to claim that they do), so I expect that there will be a lot of hilarious failures in this range.

Ultimately, this is what I’m hoping for. I’m doing this for the entertainment value of having something that offers answers to questions, but doesn’t really have any idea what you’re asking for or about, and so comes up with sequences of words that seem statistically related to it. We (humans) ascribe meaning to words. The deep network doesn’t. It performs math on representations of sequences of bytes. That the sequences have meaning to us doesn’t even enter into the calculations. As a result, its output has flaws that our knowledge allows us to perceive and reflect on.

Plus, I’m planning to get my Q&A corpus from Yahoo Answers, so not only will the results be indicative of a lack of knowing (in the human sense), they’ll also be amazingly low quality and poorly spelled.

 

TF-IDF in Python

I am trying to process a bunch of text generated by human users to figure out what they are talking about, in the context of an experiment where there were robots a person could be controlling, a few target areas the robots could move to, and a few things they could move, like a crate. One thing that occurred to me was to use TF-IDF, which, given a text and a collection of texts, tells you what words in the text are relatively unusual to that particular text, compared to the rest of the collection.

It turns out that that’s not really what I wanted, because the words that are “unusual” are not really the ones that the particular text is about.

select(1.836) all(3.628) red(2.529) robots(1.517) ((1.444) small(5.014) drag(2.375) near(3.915) lhs(4.321) of(1.276) screen(2.018) )(1.444)

This is a sentence from the collection, and the value after each word is the TF-IDF score for that word. The things I care about are that it’s about robots, and maybe a little that it’s about the left hand side of the screen. “Robots” actually got a pretty low score (barely more than “of”), but “small” got the highest score in the sentence.

At any rate, this is how I did my TF-IDF calculation. Add documents to the object with add_text, get scores with get_tfidf. It uses NLTK for tokenization, you could also use t.split(” “) to break strings up on spaces.

class TFIDF(object):
  def __init__(self):
    #Count of docs containing a word
    self.doc_counts = {}
    self.docs = 0.0

  def add_text(self, t):
    #We're adding a new doc
    self.docs += 1.0
    #Get all the unique words in this text
    uniques = list(set(nltk.word_tokenize(t)))
    for u in uniques:
      if u in self.doc_counts.keys():
        self.doc_counts[u] += 1
      else:
        self.doc_counts[u] = 1

  def get_tfidif(self, t):
    word_counts = {}
    #Count occurances of each word in this text
    words = nltk.word_tokenize(t)
    for w in words:
      if w in word_counts.keys():
        word_counts[w] += 1
      else:
        word_counts[w] = 1
    #Calculate the TF-IDF for each word
    tfidfs = []
    for w in words:
      #Word count is either 0 (It's in no docs), or the count
      w_docs = 0
      if w in self.doc_counts.keys():
        w_docs = self.doc_counts[w]

      #the 1 is to avoid div/zero for previously unseen words
      idf = math.log(self.docs/(1+w_docs))
      tf = word_counts[w]
      tfidfs.append((w, tf * idf))
    return tfidfs

 

IoTiocy

I may be the first to come up with this terrible neologism, but the time for it is certainly at hand. There have been a number of high-profile incidents where the security of stuff that was connected to the Internet was neglected in the rush to get it connected, or highly dubious design decisions were made in order to get an IoT “thing” to market.

The term originally came up when I was discussing this widget with a friend of mine. The NodeMCU boards are pretty sweet, and are based on the ESP-8266, which is also pretty sweet. The motor driver chip they used is from the 1970s, and there are far better ones available. The VNH2SP30 is a beast, and can supply 10x the current of the L293D, although you’d need two of them for dual motors, so there may be a horses for courses argument to be made there.

The DDoS that took out Krebs On Security, though, doesn’t really have a similar argument going for it. IoT security is hard because the devices don’t have a great interface for users to tighten up their security settings (if indeed, they have any security settings), and users don’t expect to have to tighten up the security of their appliances. As a result, insecure IoT stuff just kind of hangs out in dubious parts of the web, waiting for someone to make it a questionable offer of employment.

To the people who make things incomplete, weak, or insecure, I dedicate this new word:

IoTiot noun, informal

  1. A person or company that creates IoT devices lacking security, solid design, or even a purpose, usually in order to make a quick grab for cash.
  2. Any such device, once it starts behaving as it was designed (which is to say “badly”).

Similarly: IoTiotic, IoTiots, IoTiocy

 

The ancient Roman computer

I was recently assembling a slide deck on assistive technology and the sort of gradient that exists between assistive devices and human enhancement. We already have the technology to build exoskeletons that can give their user increased strength, or allow them to work longer with less fatigue (albeit only in specific types of tasks). These are physical enhancements, the same way that e.g. reading glasses on a person with normal vision act as magnifiers, and give them better close-in vision. The existence of physical enhancements begs the question of whether we can also have cognitive enhancements.

Donald Norman describes the use of Arabic numerals as a “cognitive artifact”, a tool that we use because it makes mathematical operations easier. One alternative in Western civilization is the Roman numeral system, which uses letters to represent values. The Roman numerals are I (1),  V (5),  X(10), L(50), C (100), D (500), and M (1000). So far so good, but Roman numerals are not a place-based system. Instead, the value of a number is a summation based on rules:

  1. Read from left to right, summing up values.
  2. Iif the number you are looking at is bigger than the number to its right, add it to the total.
  3. If the number you are looking at is smaller than the number to its right, subtract the number you’re looking at from the one to its right and add that.

So VI is 6 (5 + 1), but IV is 4 (5 -1), and VC is 95. The current year is MMXVII (1000 + 1000 + 10 + 5 + 1 + 1), which is OK to decode, but 1998 was MCMXCVIII or 1000 + (1000 – 100) + (100 – 10) + 5 + 1 + 1 + 1.  Looking at the Arabic expansion, the -100 and +100 cancel out, so it could be reduced to 1000 + 1000 – 10 + 5 + 1 + 1 + 1, so MXMVIII. There’s a trick to make addition of Roman numerals easier, which Norman describes, but doing multiplication is something of a nightmare.

I recently got to thinking about writing a Roman numeral library for a programming language, so that you could put this sort of thing in a program. This has no doubt been done already, so I’m not going to redo it, but my first thought was that I’d just write a translation from Roman to Arabic numerals (which pretty much all programming languages use already) and back, and then I’d do the math in Arabic numerals. That’s not a Roman numeral library, though. That’s an Arabic library with a Roman presentation layer.

Unfortunately, on most computer architectures, there’s no other way to do it. The binary ALU in processors is a place-value based system. It’s essentially Arabic numerals for people with one finger, where you carry to the next higher place as soon as you have more than one of something. An implementation of Roman numerals on pretty much any CPU will be turned into an Arabic-like representation before it gets operated on.

This got me thinking about what it would be like to attempt to build an ALU, or a simple handheld four-function calculator, that operated in Roman numerals natively. That is to say, the voltages in the various “bits” would represent I, V, X, L, C, D, and M, instead of a place-based binary representation. This immediately runs into a LOT of problems. For starters, the cool thing about binary is that you operate the transistors in switching mode, either on or off, rather than linear mode. Current either flows with minimal resistance, or doesn’t flow. In linear mode, to get multiple voltages, the transistors act as resistors, dissipating some of the power as heat. After that, you have to deal with the context-based values, rather than place-based values. Imagine that we have an adder, with two inputs, A and B, and we feed it A = I, B = V. Clearly, the output has to be, uh… well, it has to be 4, but there isn’t a numeral for that. There’s no single-digit Roman numeral that is 4. But lets say that we don’t care about not having a representation, and that this is just an analog computer. And so that our heads don’t explode trying to keep things straight, lets say that the voltage for ‘4’ is a little less than the voltage for V, but 4 times the voltage for I. This is nice because the voltage for IIII is the same as the voltage for IV. So far so good, but now we give our adder A = V, B = I, and it has to put out ‘6’, which we also don’t have a representation for. We’re clever, it’s analog, the voltage that represents ‘6’ is the voltage for ‘V’ plus the voltage for ‘I’. So if A > B, the output is A + B, if A < B, the output is A-B, and if A = B, the output is A + B (VV is 10, I suppose, but so is X). This is all doable, I’m sure, probably with op-amps, some of which would be acting as comparators, but it’s complex. More complexity is more transistors and components, and so more waste heat and chances to get the wiring wrong.

Annoyingly, that’s not the end of our problems. In the last couple of paragraphs, there are some ambiguous representations of numbers. VV is 10, but so is X. In Arabic numerals, 9 is 9, and 9 is the only digit that’s 9. Arabic also has the nice property that you can guess about how big the representation of the result is, based on the size of the operands. There’s no way to add a 4 digit number and a four digit number (assuming they’re both positive), and get a 2 digit number, or a 12 digit number. Adding II to MXMVIII gets you XX. Multiply two Arabic numbers, and the result will be about the length of the two numbers stuck end to end, +/-1. Division will be the difference in length of the numbers (I’m handwaving here and ignoring fractions). Multiply two Roman numerals, and your guess is as good as mine (As Norman points out, the written length of an Arabic numeral is also a rough gauge of it’s value. 1999 (length 4) is bigger than 322 (length 3), but MXMVIII (length 7) is less than MM (length 2)). There may be some regularities, but I’m not about to sit down and work them out. If anyone does, the bound on the Roman computer word size would be whatever number is longest and less than whatever size of MACHINE_MAX_INT you feel like implementing in the silicon space and power-hungry representation that the machine uses. As a result, a natively Roman calculator would potentially have very long “words” or “bytes”, much of which would go unused most of the time.

Roman numerals also didn’t have a standard representation for fractions, although they apparently sometimes used a duodecimal system for them (yes, two different number systems, one for integers and one for fractions). Roman numerals also had no formal representation at all for zero, although the Romans clearly understood the idea of zero (You have ten dinars, and you owe me ten dinars. I collect your debt, how many tunics can you buy now?). They used “nulla” or “nihil” sometimes, and represented it with an N, but they wouldn’t have written 500 as VNN.  They also didn’t represent negative numbers at all, although there were probably cases where they understood a subtraction to result in something that wasn’t even zero (You have 10 dinars and owe me 15 dinars. I collect your debt. Clearly you don’t have -5 dinars, but I also clearly don’t consider myself fully repaid).

I’ve heard that the Roman conception of math and numbers was highly geometric, rather than arithmetical, so they did have things like the square root of 2, which is the length of the diagonal of a unit square (you know, a square with sides of length I). Pythagoras figured that one out, at roughly the same time people were using “nulla”. Instead of running on abstractions of the representations of numbers, a properly Native Roman Computer might have been a mechanical device, a calculator that operated by using finely crafted physical representations of circles and triangles to perform its operations. Interestingly, the closest thing we have to a calculator from that time period is exactly that.

More Webcams, Fewer Problems

I want to have two applications using the same webcam under Linux, in my case an app using Kivy to render the webcam view and let people interact with it, and ROS.

My first hope had been to have a Kivy app that subscribed to a ROS image topic and displayed the resulting image, but I’ve spent two days beating on it and only found a variety of ways to make a python program give me a segfault.

The new plan is to use v4l2loopback to create two virtual cameras, and have them both fed from /dev/video0 (the actual camera).

sudo modprobe v4l2loopback devices=2

That gets me the two devices. The thing is, these devices don’t actually output any video. They’re just fake video devices that a program can write to. There are a lot of instructions on feeding them with ffmpeg. For some reason, my computer says it has ffmpeg installed, but locate can’t locate it. Instead, I’ve used gstreamer to set up /dev/video1 as a video device that ROS’s usb_cam node can handle, and configured the launch file for usb_cam to read /dev/video1

gst-launch v4l2src device=/dev/video0 ! video/x-raw-yuv,width=640,height=480,framerate=15/1 ! v4l2sink device=/dev/video1

That solves half of my problem. The other half is having the second video device get fed, and getting Kivy to read /dev/video2.

gst-launch v4l2src device=/dev/video0 ! video/x-raw-yuv,width=640,height=480,framerate=15/1 ! tee name=t ! v4l2sink device=/dev/video1 t. ! queue ! v4l2sink device=/dev/video2

Adding the tee and queue gives me two video devices playing the same video, both fed by /dev/video0. The usb_cam ROS node can read either one of them. The basic Kivy camera app that I bashed together doesn’t work, though. It seems to default to trying to open /dev/video0, and then failing because the gst-launch invocation is using it.

AnchorLayout:
anchor_x:"center"
anchor_y:"center"
   Camera:
      id: camera
      resolution:(640,480)
      play: True
      index:2
   KeyboardListener:
      id:kbd_lstn

Adding index:2 to my Kivy app’s .kv file gets me a kivy app with my video in it. As long as my ROS nodes are looking at /dev/video1 and my Kivy app is looking at /dev/video2, no one steps on anyone’s toes, and they both can operate at the same time.

Lightlocker does not play well with NVida (or something)

If you have an Ubuntu system with an NVida video card, and after suspending and resuming you get a login screen, but after you log in you get a black screen, and using ctrl + alt + F1 gets you a text console, your problem is probably lightlocker. I’ve seen some sites claim that the problem is fixable with sudo apt-get remove --purge nvidia-* which would be great if you didn’t want video drivers or CUDA. I’d rather hang onto my video drivers, so sudo apt-get remove light-locker makes everything better.

ROS and OpenCV will fite u, m8

I recently wanted to do some computer vision stuff using OpenFace, which is a collection of face-processing computer vision algorithms and tools to use them. It uses OpenCV 3.0.something, which uses, among other things, vtk6, and friends libvtk6-dev and python-vtk6.

Normally, this wouldn’t be a problem, but I use ROS Indigo, as does the lab I work in. ROS Indigo uses some previous version of vtk, and so attempting to install OpenCV 3.0 blows away my ROS install, and makes apt freak out when I try to install it again. The actual error was something like “you have broken held packages”, only I didn’t actually have held packages OR broken packages.

Apt just gives up at this point. Aptitude, on the other hand, proposes removing the offending VTK packages and proceeding with the ROS install. Only time will tell if I’ve trashed my OpenCV install, but if I have, I can just go back to an older OpenCV version.

Download ALL The Music

Given a file containing a list of songs, one per line, in the format “Artist – Song Title”, download the audio of the first youtube video link on a Google search for that song. This is quite useful if you want to the MP3 for every song you ever gave a thumbs up on Pandora. On my computer, this averages about 4 songs a minute.

The Requests API and BeautifulSoup make writing screenscrapers and automating the web really clean and easy.

#!/usr/bin/python

# Takes a list of titles of songs, in the format "artist - song" and searches for each
# song on google. The first youtube link is passed off to youtube-dl to download it and 
# get the MP3 out. This doesn't have any throttling because (in theory) the conversion step
# takes enough time to provide throttling. 

import requests
import re
from BeautifulSoup import BeautifulSoup
from subprocess import call

def queryConverter(videoURL):
	call(["youtube-dl", "--extract-audio",  "--audio-format", "mp3", videoURL])

def queryGoogle(songTitle):
	reqPreamble = "https://www.google.nl/search"
	reqData = {'q':songTitle}
	r = requests.get(reqPreamble, params=reqData)
	if r.status_code != 200:
		print "Failed to issue request to {0}".format(r.url)
	else:
		bs = BeautifulSoup(r.text)
		tubelinks = bs.findAll("a", attrs={'href':re.compile("watch")})
		if len(tubelinks) > 0:
			vidUrl = re.search("https[^&]*", tubelinks[0]['href'])
			vidUrl = requests.utils.unquote(vidUrl.group(0))
			return vidUrl
		else:
			print "No video for {0}".format(songTitle)

if __name__=="__main__":
	with open("./all_pandora_likes", 'r') as inFile:
		for line in inFile:
			videoURL = queryGoogle(line)
			if videoURL is not None:
				queryConverter(videoURL)

Further Troubles with TinyRobos

The white version of the TinyRobo board that has the missing ground trace also doesn’t have a proper connection for the pullups on the I2C address lines for the motor drivers. The drivers are still there, and scanning the I2C bus with a Bus Pirate (Amazon) showed me that they were at 0x63 and 0x64 on the I2C bus, rather than where I expected them to be (at 0x66 and 0x68). The difference is consistent with a connection that should have been to Vcc being left open.

I’m not wild about the problem, but it did give me an opportunity to set up and use Pulseview/Sigrok, my cheap clone logic analyzer, and my Bus Pirate, so it’s not a total waste.

For my own future reference, as well as anyone else who’s interested, the way to set up the Bus Pirate on Ubuntu is this:

  1. Plug it into a USB port
  2. Open up a terminal and type screen /dev/buspirate 115200 8N1 , where /dev/buspirate is whatever device your bus pirate ended up on. Mine was /dev/ttyUSB0.
  3. The terminal will go blank. Hit enter, and you should get the “HiZ>” Bus Pirate terminal.

The I2C bus scan is run  by hitting “m” to get the menu, “4” to get I2C mode, “3” to set speed to 100kHz, and then “(1)” to run the scan macro.

The cheapo logic analyzer I got is a USBee & Saleae clone, which I got because I’m bad and should feel bad not rich. It has a switch to determine which device it claims to be. In Saleae mode, Sigrok loads an alternate firmware onto it, so I’m not really sure where that falls in the intellectual property/doing the right thing by small businesses framework, but if you can afford one, get a proper USBee or Saleae. They’re much better built (the Saleae Logics in particular are tanks) and have more and better features.

I’m doing all this stuff in Enschede, at U. Twente. I’ve been hanging out with the people in the HMI group, which “does things with stuff”. They do a lot of work with things like proxemics in interaction, socially aware robots and technology, and so forth. There’s something of a distinction here between technical stuff, which is what I do a lot of, and more abstract work with avatars and such. I’m a better fit with the RaM (Robotics and Mechatronics) group, which builds things like pipe-crawling robots and quadcopters.

Bugs in new boards

I seem to have left a ground connection off the PCB, which causes the 3.3v regulator to not work. I’ve fixed it in the PCB design in the repository, but the DirtyPCBs order link goes to a product that doesn’t work, so I still have to fix that.

Since I’m going to have to do a new version of the PCB anyway, I’ve added a blinky light on one of the IO pins so that I can have an additional channel for debug information.