Chatbots as Translation
I got a translation program based on deep neural networks (http://opennmt.net/ if anyone wants to play along at home). I’m training it on a corpus of my previous text chats. The “source language” is everything that everyone has said to me, and the “target language” is what I said in return. The resulting model should end up translating from things that someone says to appropriate responses. My endgame is to hook it up to an instant messaging client, so people can chat with a bot that poses as me.
This has a couple of problems. The main one is that statistical translation generally works by converting the input language into some abstract form that represents a meaning (I’m handwaving so hard here that I’m about to fly away) and then re-representing that meaning in the output language. Essentially, the overall concept is that there is some mathematical function that converts a string in the input language to a string in the output language while preserving meaning, and that function is what is learned by the neural network. Since what someone says and what I respond with have different, but related meanings, this isn’t really a translation problem.
The other problem comes up when I do the second part of this plan, which is to train the network with a question and answer corpus. At its most abstract, a question is a request to provide knowledge which is absent, and an answer is an expression of the desired knowledge. “Knowledge”, in this case, refers to factual data. One could attempt an argument that by training on a Q&A corpus, the network is encoding the knowledge within itself, as the weights used in the transformation function. As a result, the network “knows” at least the things that it has been trained on. This “knowing” is very different from the subjective experience of knowing that humans have, but given the possibility that consciousness and subjective experience may very well be an epiphenomenon, maybe it has some similarities.
Unfortunately, this starts to fall apart when the deep network attempts to generalize. Generalization, in this case, is producing outputs in response to inputs that are not part of the training input data set. If one trains a neural network for a simple temperature control task, where the input is a temperature, and the output is how far to open a coolant valve, the training data might look like this:
|Temperature|Valve position|
|0|0 (totally closed)|
|50|0.5 (half open)|
|100|1.0 (fully open)|
So far, so good. This is a pretty simple function to learn: the valve position is 0.01 * Temperature. The generalization comes in when the system is presented with a temperature that isn’t in the training data set, like 43.67 degrees, which one would hope results in a valve setting of 0.4367 or thereabouts. One wrinkle is that temperatures less than zero or greater than 100 degrees ask the valve to be more than completely shut, or more than fully open, but we can just assume that the valve has end stops and just doesn’t do that, rather than trying to automatically add a second valve and open that too.
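To make the target concrete, here is a minimal sketch of the function the network is supposed to approximate, with the end stops written in explicitly. This is not a neural network, just the rule from the table; the function name is made up for illustration.

```python
def valve_position(temperature: float) -> float:
    """Target function from the table: 0.01 * temperature, with end stops.

    The clamp stands in for the physical end stops on the valve, so inputs
    below 0 or above 100 degrees cannot ask for an impossible position.
    """
    raw = 0.01 * temperature
    return max(0.0, min(1.0, raw))


if __name__ == "__main__":
    # Includes one value inside the training range and two outside it.
    for temperature in (0, 50, 100, 43.67, -10, 120):
        print(temperature, valve_position(temperature))
```

A trained network would only ever approximate this, and nothing guarantees it behaves sensibly outside the 0 to 100 range unless something like the clamp is added around it.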
The problem comes when we start generalizing across questions and answers. Assume there is some question in the training data that asks “My husband did this or that awful thing, should I leave him?” and the answer is mostly along the lines of “Yes, bail on that loser!”, and another question that asks “My husband did some annoying but not really awful thing, should I leave him?” and the answer is “No, concentrate on the good in your relationship and talk to him to work through it.” These are reasonable things to ask, and reasonable responses. Now imagine that there is a new question. The deep network does its input mapping to the space of questions, and the result (handwaved down to a single value for explanation purposes) falls somewhere between the representations for the “awful thing” question and the “annoying thing” question. Clearly, the result should fall somewhere between “DTMFA” and “Stick together”, but “Hang out near him” isn’t really good advice and “Split custody of the kids and pets, but keep living together” seems like bizarro-world nonsense. There isn’t really a mathematical mapping for the midrange here. Humans have knowledge about how human relationships work, and models of how people act in them, that we use to reason about relationships and offer advice. This kind of knowing is not something deep networks do (and it’s not even something that anyone is trying to claim that they do), so I expect that there will be a lot of hilarious failures in this range.
Ultimately, this is what I’m hoping for. I’m doing this for the entertainment value of having something that offers answers to questions, but doesn’t really have any idea what you’re asking for or about, and so comes up with sequences of words that seem statistically related to it. We (humans) ascribe meaning to words. The deep network doesn’t. It performs math on representations of sequences of bytes. That the sequences have meaning to us doesn’t even enter into the calculations. As a result, its output has flaws that our knowledge allows us to perceive and reflect on.
Plus, I’m planning to get my Q&A corpus from Yahoo Answers, so not only will the results be indicative of a lack of knowing (in the human sense), they’ll also be amazingly low quality and poorly spelled.
Web Development with Flask and Python
I haven’t done anything that could properly be called web development since about 2002, when I took a college course in it. There have been a few developments in the field since then, and I’m a little rusty.
I chose Flask because I like Python and because Django seems like overkill for what I’m doing. There are literally dozens of frameworks out there, and I imagine some people know and care about the differences between them. If I had comments enabled, they’d be yelling at me to switch to Django right now. Hence, no comments.
Installing Flask is easy on Ubuntu:
sudo apt-get install python-flask
I copied the Flask “hello world” from the Flask page and ran it, which got me the expected result: a web server running on port 5000 with a “Hello World” message.
My plan for this web app is to have users be able to visit some page, and the page will contain an image of a snowflake, generated from the url they used to visit. That’s it, but over on Facebook, my post about generating snowflakes from people’s names made people go just about nuts asking for them. Rather than generate them myself, I figured I’d write a web app.
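I haven’t said how the snowflakes are actually drawn, so treat the following as a sketch of one way it could work, not the code behind the app. It assumes the URL path is hashed into a seed, one arm is generated at random from that seed, and the arm is copied six times around the center. The function names, the branch count, and the choice of SVG output are all assumptions made for illustration.

```python
import hashlib
import math
import random


def snowflake_segments(name: str, branches: int = 8):
    """Return line segments ((x1, y1), (x2, y2)) with six-fold symmetry.

    The same name always yields the same segments, because the random
    generator is seeded from a hash of the name.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    # One arm along the +x axis: a spine plus mirrored side branches.
    arm = [((0.0, 0.0), (1.0, 0.0))]
    for _ in range(branches):
        start = rng.uniform(0.2, 0.9)
        length = rng.uniform(0.05, 0.3)
        angle = rng.uniform(math.radians(30), math.radians(75))
        dx, dy = length * math.cos(angle), length * math.sin(angle)
        arm.append(((start, 0.0), (start + dx, dy)))
        arm.append(((start, 0.0), (start + dx, -dy)))

    # Rotate the arm by multiples of 60 degrees to get all six arms.
    segments = []
    for k in range(6):
        c, s = math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)
        for (x1, y1), (x2, y2) in arm:
            segments.append(
                ((x1 * c - y1 * s, x1 * s + y1 * c),
                 (x2 * c - y2 * s, x2 * s + y2 * c))
            )
    return segments


def snowflake_svg(name: str, size: int = 200) -> str:
    """Render the segments for a name as a small SVG document."""
    half = size / 2
    scale = half * 0.75
    lines = []
    for (x1, y1), (x2, y2) in snowflake_segments(name):
        lines.append(
            '<line x1="%.1f" y1="%.1f" x2="%.1f" y2="%.1f" stroke="black"/>'
            % (half + x1 * scale, half + y1 * scale,
               half + x2 * scale, half + y2 * scale)
        )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">%s</svg>'
        % (size, size, "".join(lines))
    )
```

Hashing the name means nothing has to be stored on the server: visiting the same URL regenerates the same snowflake every time.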
Flask is a joy to work with. In debug mode, it detects changes to the file that contains the currently running app, and restarts when that file changes. It also presents a traceback and interactive debugger if something goes wrong in your app (it goes without saying that this needs to NEVER reach production, since it’s an interactive Python shell on your server).
At this point, I have the core functionality of the app together, and I’m not even done with my beer. I can visit a URL, and a snowflake image gets generated from that URL. Everything else is details, and then deployment.
A couple of downloads later, the image now gets converted to PNG and served back to the user as an image in a page. Soon, deployment!
Lessons learned by being The Worst Game Developer
I’m writing a video game. It is called Pebble, and in Pebble, there is a pebble. You contemplate the pebble. I haven’t decided if there is going to be music or not, but there will be a pebble, in a featureless grey expanse, and you can contemplate it.
Just thinking about writing this game has brought me some interesting realizations. I doubt I’m the first one to have them, but it was neat to see how they all fit together.
The first realization is just a recap of things I already knew about developing software: “You’re going to throw the first one away” and “Do the simplest thing that could possibly work”.
When I first came up with the idea for Pebble, it was as a tech demo for Tree, which is similar (there is a tree, you contemplate it), but more complicated, in that a tree is larger. I was going to use level-of-detail (LoD) rendering to support real-time generative zoom from bird’s-eye to bug’s-eye views, store seeds so that the generated versions didn’t change between runs, etc. I read a bunch of papers on the topics, and saw that it was all very complex. I also hadn’t written anything, despite having read a lot of papers and learned a bit.
Eventually, I realized that if I had to load everything I needed to know into my head to write this game, first, I wouldn’t get around to writing it, and second, my head would explode.
Instead of either of those things, I’m writing the simplest bit of code that will draw something on my screen. The first version will draw a polygon, the second version will rotate it, and the third version will texture it. I’m going to have two code streams, one written using openFrameworks and one written using Polycode, so I can decide which of those libraries I’d rather use.
Once both libraries are through three versions, I’ll have the simplest thing that could possibly work, and I’ll throw the other one away.
Another revelation I had is that I don’t really know what pebbles look like. I mean, I have a general idea, but to render a pebble, a general idea doesn’t cut it. It doesn’t capture the variety of surface types that different kinds of weathering can cause, the colors of all the different rocks, and so forth. The reality of pebbles is way more complicated than the idea of pebbles.
My girlfriend and I went out on a beach on Cape Cod, Massachusetts, and looked at pebbles. Cape Cod is a terminal moraine, so the rocks there were pushed by glaciers from everywhere north of Cape Cod, and there are loads of different kinds of pebbles there.
This has two effects on my thinking about the design of Pebble, and of video games in general. The first is that the stone surface generation algorithm should be the simplest thing that could possibly work. The second is that AAA games in their current form are doomed.
AAA games have a huge amount of their budget dedicated to resources, such as the textures and designs of the characters. Because the current marketing push in video games is visual, each game is supposed to have better and better graphics than those before it, or people will mock it and it will lose sales. However, this is an infinite pit. Any game world is a map, a less-detailed representation that conveys an impression of a more detailed real world. With real maps, the real world is assumed to also exist, but in games it doesn’t. You run around in Liberty City in Grand Theft Auto, but “you” don’t “run” “around”. By pressing buttons, you cause the appearance of motion in a simulated person within a simulated, restricted world. The better the simulation gets, the more resources it requires. In real-world NYC, if you go to Battery Park, you can pick up gravel and throw it in the harbor. In the analogous unplaces in GTA, the ground is a perfect solid, smooth and impenetrable. In order to create a more perfect simulation, there would have to be simulated pebbles, and someone would have to create them.
All of these resources, the pebbles, clothes, guns, car tires, trees, buildings, and so forth in a video game are made by people. These people get paid, and so the more detail you want in a game, the more resources you need, and so the more people you have to pay. Taking longer to make the game doesn’t work, as the technology is constantly shifting, so “more people” is pretty much the only going solution at this point. Even licensing IP from other companies is just an abstraction of getting more people to work on the project.
As a result, the drive is now to make games more and more expensive to make, in order to get finer and finer quality of details that add nothing to the narrative, but make the finished package prettier. However, people are not going to pay hundreds of dollars for a game (except possibly that version of MechWarrior that came with a big robot control console), so either the game market has to grow without bound, or the industry has to start putting an upper bound on how much they can invest in making a game.
In a way, I’m hoping Pebble is a signpost on the path of excessive detail, a huge amount of clever rendering algorithms and generative textures in pursuit of the perfect simulation of the experience of contemplating a small stone. Whether the signpost says “Welcome!” or “Abandon Hope All Ye Who Enter Here” is an exercise for the reader.
Pebbles and antlers
I have a bunch of ideas for little applications.
One is a game-like interactive entertainment where you contemplate a pebble. You can turn it around, zoom into its surface, etc. That’s it, just examining a pebble. The trick is, this is actually technically somewhat complex. Pebbles have structure ranging from overall shape to near-microscopic scratches and dents. To faithfully render the pebble, all of that has to show up as you look around. To save processor time and storage space, I plan to generate all of that on the fly. I’ve been advised that I should either use the Unity game engine, or write my own engine to do this.
Pebble is a proof of concept, really. The original idea was to write a game where you examine a tree. You can fly all around it, zoom into the cracks in the bark until they tower over you like canyons, zoom out until the tree is a green spot below you. There is no way to sit down and hand-make all of that data, so it pretty much has to be generated. Pebble seems like something I can do to determine that the concept is possible.
(I’m not sure if my interest in generative content is an indication that I’m into elegant, intelligent solutions, or just that I’m profoundly lazy.)
Another application is a piece of software that generates antlers. You run it, and it spits out an STL file that you can use to get the antlers 3D printed. You can modify things like how the antlers branch and curve, to get things like ibex horns or deer antlers.
(Oddly, thinking about these programs, I realize that I don’t know what trees, pebbles, or antlers look like. I mean, I can identify them, and even sort of describe them, but not with anything like the sort of description required to produce them. This is, I think, the difference between looking at things like an artist and looking at them like someone who just doesn’t want to bump into stuff.)
Works from a “school” of art share some common elements. Looking at paintings by Dali, Magritte, and Breton, one can say that they share something that is not shared with a Monet. People not trained in the academic study of art might have a hard time naming or articulating that quality, but it is definitely present.
The artists named above are all painters. If one wants to get truly pedantic, it’s possible to claim that their works all have the common quality “flat surface covered by pigments mixed with a binder”. The actual common quality is more a matter of their treatment of form, especially in relation to the expected juxtaposition of forms in the real world, and their engagement with the representation of the unconscious world, that is to say, the realms of dream, delusion, and insanity, as well as direct handling of the duality of representation and reality.
From the fact that this common quality does not directly relate to the material used, we can infer that there can exist works that do not use the same material, and yet have the same quality. This inference is supported by the existence of surrealist sculpture.
However, some materials and creative processes force a certain common developmental aesthetic. Three cases of a unified aesthetic that is incidental to the product, but nonetheless shared, are: the textures used in 3D modeling, the debug output of computer vision systems, and the appearance of DIY/prototyped electromechanical devices from the current generation of hacker spaces.
These aesthetics are unified within themselves, but they are not of a piece with each other. Textures adopt the form that they do because the technology demands it. The technology is defined, and the aesthetic is fully constrained by it. Computer vision systems develop their aesthetic because they must map the world through the system’s understanding into a form that is understood by the human user. The technology is not fully defined, but the system is confined on three fronts: the input of the real world, the representation available in the system, and what users can “read” in realtime. Prototyped devices have the fewest constraints. The technology is incompletely defined, and the form of it is also undefined, so it is shaped by expedience and available tools. It is the most accidental aesthetic, because it is the one that forms when no other aesthetic is selected.
This is an example of a texture for a human head from here. The distortion would be corrected by remapping onto a model of a human head.
Textures are the most rigidly constrained accidental aesthetic. This description comes from a common modeling file format, but the technology is similar across many modeling processes. The model consists of three files. The first file, the model file, describes the 3D points that make up the surfaces of the model. It also includes a reference to the second file, which is a material file. The material file describes a set of materials that the object is made of, and how light interacts with them. Each material may refer to a third file, which is the texture. A texture is a flat image file. Regions of the flat image file are mapped onto surfaces of the model by a one-to-one (usually) mapping from vertices on the model to vertices on the texture. The vertices on the texture define a shape which is then “cut out” and “applied” to the corresponding shape on the model. Because of the way this works, and the tools used to create this mapping, the texture is frequently a flat representation of the 3D object, in much the way a map of the earth is a flat representation of the 3D world.
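The lookup at the heart of that mapping can be sketched in a few lines. This assumes conventions like those of the Wavefront OBJ format (texture coordinates in the range 0 to 1, with v = 0 at the bottom of the image); the function names are hypothetical, and a real renderer does this on the GPU with filtering rather than one pixel at a time.

```python
def interpolate_uv(weights, corner_uvs):
    """Blend the texture coordinates of a triangle's three corners.

    weights are barycentric weights for a point on the model's triangle
    (they should sum to 1); corner_uvs are the (u, v) pairs assigned to
    the triangle's three vertices.
    """
    u = sum(w * uv[0] for w, uv in zip(weights, corner_uvs))
    v = sum(w * uv[1] for w, uv in zip(weights, corner_uvs))
    return u, v


def uv_to_pixel(u, v, width, height):
    """Map a (u, v) texture coordinate to a (column, row) in the image.

    Assumes v = 0 is the bottom edge of the texture while row 0 is the
    top row of the image file, hence the flip.
    """
    column = min(int(u * width), width - 1)
    row = min(int((1.0 - v) * height), height - 1)
    return column, row
```

Each corner of a triangle on the model carries its own (u, v) pair, so the triangle “cuts out” the matching triangle from the flat image.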
Altering the texture would result in changes to its display on the model, so the texture is completely constrained by the model. Because it is a flat image file, the texture is also constrained in the ways that it can be displayed to the user. Because of this complete constraint, the textures display a very strong unity of aesthetic.
Robot readable world from Timo on Vimeo.
Robot Readable World is a compilation of the debugging output of computer vision algorithms. The computer system operates on the video stream to produce data streams which are not visible to humans. These video outputs are intended to allow human debuggers to determine what the system “sees”, that is, to map the data structures into human-readable form and present it mixed with the incoming images so that the person can relate from real objects to the system’s “perception”. Because these are merely explanations of the state of the system, rather than a key part of its functioning, they can be altered and rearranged to provide the maximally useful representation for human readers. The data underneath may not change, but the presentation can be altered.
As a result, these systems are unconstrained on at least one end, the presentation to the user. However, they are constrained at the other end to operate on images. The images are, in turn, constrained by the positions and relations of objects in the real world. A computer vision system that operates in a made-up or simulated environment would have no practical use to humans unless they also inhabited that environment. This is not to say that this is not done, as vision approaches could be used in video games, but it is less likely.
This dog treat dispenser is an example of the third accidental aesthetic: the design of DIY electronics. Some hallmarks of this aesthetic are the exposed circuit boards, the surface texturing of 3D printed or laser cut (in this case, 3D printed) parts, visible and accessible wiring, and the use of visible, commercially available screws and other connectors.
This project, a controller for a coffee roaster, has the same aesthetic, despite being constructed by a different person, unknown to the maker of the dog treat dispenser.
This is the least constrained of the three accidental aesthetics. The maker can choose the parts used to create the device, and the form of the finished device. However, the tools available to the maker will drive certain decisions in its eventual form. A 3D printer provides a way to quickly create certain forms, but has a distinct material, texture, and color for those forms. Laser cutting allows a form to be built from layers of flat materials, but again, some building techniques work better than others. Off the shelf commercial components have to be connected together, which leads to visible wires. All of these decisions, to print or not print, laser or not laser, wire or make PCBs, have a bias in them that each artist/creator navigates, and the sequence of the decisions leads to a particular aesthetic for the piece.
Panning An LED Spotlight Along a Trail
This describes something I’m planning to do, rather than something I’ve done, but I don’t think it is total hogwash.
I attend an event where it would be useful, for shock and awe reasons, to pan a spotlight along a segment of trail, stop, and pan back to the beginning of the segment.
Building a spotlight to do this is easy. You put a few big, powerful red LEDs behind collimating lenses, mount that on a servo pan-tilt unit, and tell an Arduino what positions to put the servos at to do the sweep.
That last bit is where it falls apart. At first, I was playing with ideas around converting coordinates from the space around the spotlight into rotations of the spotlight axes, and how to establish what the coordinates are, and how to do that transformation. Then I realized that I was overthinking it, and all I really need to do is this:
- Make a program that moves the light in accordance with the mouse. Up and down on the mouse are rotation in one axis, left and right are rotation in the other axis. Clicking memorizes a point.
- Pan the spotlight along the trail to generate the sequence of points that the spotlight has to hit. Record the positions of the two servos at each of those points with the “click to memorize” function.
- The positions of the servos are points in a 2-D space. Spline or otherwise interpolate between those points to generate an arbitrary number of points.
- Move the servo through the interpolated points, stop, move it back, repeat until the batteries die.
Figuring out the points this way takes care of any nonlinearity in the servos, curvature of the terrain, etc., because the device used to measure the points is the same device that later executes them.
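The interpolation and sweep steps could be sketched like this. It uses plain linear interpolation rather than a spline, treats each recorded point as a (pan, tilt) pair of servo angles, and stops short of talking to any hardware; the function names and the example angles are invented for illustration.

```python
def interpolate_path(waypoints, steps_per_segment=10):
    """Linearly interpolate between recorded (pan, tilt) waypoints.

    Returns every recorded point plus steps_per_segment - 1 evenly spaced
    points between each consecutive pair.
    """
    if len(waypoints) < 2:
        return list(waypoints)
    path = []
    for (pan0, tilt0), (pan1, tilt1) in zip(waypoints, waypoints[1:]):
        for i in range(steps_per_segment):
            f = i / steps_per_segment
            path.append((pan0 + f * (pan1 - pan0), tilt0 + f * (tilt1 - tilt0)))
    path.append(waypoints[-1])
    return path


def sweep_cycle(path):
    """One out-and-back sweep: forward, then back without repeating the end."""
    return path + path[-2::-1]


if __name__ == "__main__":
    # Hypothetical points recorded with the "click to memorize" step.
    recorded = [(90, 40), (100, 50), (120, 45)]
    for pan, tilt in sweep_cycle(interpolate_path(recorded, steps_per_segment=4)):
        print(round(pan, 1), round(tilt, 1))
```

Swapping in a spline would only change `interpolate_path`; the out-and-back logic stays the same.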
The idea of moving a system through the course of actions you want it to take isn’t a new one. Industrial robots frequently have “teaching pendants” that allow a human to put the robot through the sequence of actions it would take to perform an operation. Once the human is done, the robot starts repeating those actions, with great accuracy and reliability.
Command Line Audio Editing With Sox
For my ritual spoken word software piece, I recruited a bunch of my friends to say the text of the ritual. Each “stanza” of the ritual has a call and a response, so I broke each recording up into individual clips for each call and response. That gave me about 28 files per person, and over 100 clips total.
The different participants all recorded on different hardware, and at different volume levels. I also wasn’t super-precise about trimming the clips, so each file had silence at the beginning.
This left me with two problems: some participants were much softer than others, and some of the clips lagged each other, which made for bad chorus effects.
To trim the clips, I used sox, a Linux tool for manipulating sounds, with the command:
for file in *.wav; do sox "$file" "$file.wav" silence 1 0.1 2%; done
This results in a file named foo.wav.wav for each foo.wav file in the directory, so I cleaned up with:
rename -f 's/\.wav\.wav$/.wav/' *.wav.wav
Note that this scribbles over the originals, so keep backups. I’m glad I did, because 2% turned out to be a little aggressive, and trimmed off the beginning of clips starting with a “ma-” sound, such as “make us a…”. This is likely because the sound faded in slowly, and so got counted as part of the noise rather than the beginning of a sound.
There is useful documentation for the sox silence filter here.
Turning the volume up on the files was done with:
for file in *.wav; do sox "$file" "$file.wav" gain -l 8; done
and another pass of rename, as above. Adjust the “8” up or down to suit your needs. Positive numbers make it louder, negative make it quieter.
If you want to preview a sox effect, just replace “sox” in the command with “play”, and leave off the output file. For example,
play myfile.wav gain -l 8
will play myfile.wav with increased gain, but won’t change the file.
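If scribbling over the originals makes you nervous, the same batch jobs can be driven from a small script that writes into a separate directory. This is a sketch rather than what I actually ran: the function names and directory names are made up, it assumes sox is on the PATH, and the effect arguments are the same ones used in the shell loops above.

```python
import subprocess
from pathlib import Path

# Effect arguments copied from the shell commands in this post.
TRIM_LEADING_SILENCE = ["silence", "1", "0.1", "2%"]
BOOST_GAIN = ["gain", "-l", "8"]


def sox_command(src, dst, effects):
    """Build the argument list for: sox SRC DST EFFECT..."""
    return ["sox", str(src), str(dst), *effects]


def process_directory(in_dir, out_dir, effects, run=subprocess.run):
    """Apply one sox effect chain to every .wav in in_dir.

    Results go to out_dir under the same file names, so the originals are
    never touched. The run parameter exists so the commands can be
    inspected or tested without sox installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for src in sorted(Path(in_dir).glob("*.wav")):
        command = sox_command(src, out_dir / src.name, effects)
        run(command, check=True)
        commands.append(command)
    return commands
```

Calling `process_directory("raw", "trimmed", TRIM_LEADING_SILENCE)` and then `process_directory("trimmed", "loud", BOOST_GAIN)` would reproduce both passes without any renaming step.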
We Make Ritual Noise
For a festival that I attend, I’m writing a soundscape in Boodler to provide the vocal component for a ritual. Here, I’m going to annotate what I need to do to run Boodler on my laptop, which I’ll have at the festival.
The main thing is that Boodler seems to default to OSS, and I use PulseAudio, so to invoke the ritual, you need to run:
boodler -o pulse --external disturbingrelics com.gizmosmith.disturbingrelics/Example
The -o option tells it to use PulseAudio, --external makes it load from a directory instead of a .boop package for testing purposes, and the rest is the agent to run.
To organize all the sound clips I’m using, I have a boodler package for each person’s reading of the ritual script. The script is in a call and response format, with 14 calls and responses, so each package has 28 audio clips, one each for the call and response. I named all the clips “call_N_…” and “response_N_…” (for N in 1..14) so that the program can figure out the call/response pairs by name.
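The name matching amounts to something like the following. This is plain Python working on file names, not Boodler’s own resource API, and the function name and example file names are hypothetical; the only thing taken from the real setup is the “call_N_” and “response_N_” prefix convention.

```python
import re
from collections import defaultdict

# Matches the naming convention: call_N_... or response_N_...
CLIP_NAME = re.compile(r"^(call|response)_(\d+)_")


def pair_clips(filenames):
    """Group clips into (N, call, response) tuples, ordered by N.

    Names that don't follow the convention are ignored; a missing half of
    a pair shows up as None.
    """
    slots = defaultdict(dict)
    for name in filenames:
        match = CLIP_NAME.match(name)
        if match:
            kind, number = match.group(1), int(match.group(2))
            slots[number][kind] = name
    return [
        (n, slots[n].get("call"), slots[n].get("response"))
        for n in sorted(slots)
    ]


if __name__ == "__main__":
    clips = ["response_2_b.wav", "call_1_a.wav", "call_2_a.wav", "response_1_b.wav"]
    for number, call, response in pair_clips(clips):
        print(number, call, response)
```

Parsing N as an integer keeps stanza 10 after stanza 9, which a plain alphabetical sort of the file names would get wrong.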
Each package starts out as a directory with the 28 files and a metadata file in them. For the directory “sage”, I create the package with:
boodle-mgr --import create sage
and then install it with:
boodle-mgr install ./com.gizmosmith.sage.1.0.boop
For a long time, I have been thinking about making a set of pin badges that signal to each other using an IR pulse, and blink with a visible light. If a few of them were together, they would synchronize their lights, like fireflies.
To do that, I’d want to detect IR, and reject IR that was not from one of the badges (like light from a fire or the sun). The IR receivers that go into TVs do that, and there are super-small versions available for surface-mount construction.
Osram’s SFH5410 looks perfect, but it was discontinued in 2010, without a suggested replacement.
Sharp makes the GP1US30XP series in surface mount packages about 4x4mm. Their sensing direction is normal to the PCB, which is ideal for aiming them away from the wearer. This would be the ideal part, but it was also discontinued.
For purposes of prototyping, I can use some larger parts that I already have, but the ultimate design will be very small. Someday, perhaps I’ll have them fabricated professionally and give them away at Firefly.
Academic Problems in New Media Art
I’m trying to convert Deleuze & Guattari’s A Thousand Plateaus into a corpus for use by a program. Among other things, the program will break the text down into its component sentences. The text has a lot of notes, which connect the text to a lot of other sources, but are not always written in complete sentences, and so will result in odd output from the program when considered as sentences.
So I’m faced with a choice: lose the notes, and so lose the cultural context and references to stuff that went before, or keep them and suffer degraded output. Nobody warns you about the odd stuff you’ll have to decide when you start doing new media “art”.
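One middle path would be to keep the notes but flag the fragments that don’t look like sentences, so the program can treat them differently. The following is a naive sketch of that idea, not the program I’m writing: the function names, the regular expression, and the four-word threshold are all assumptions, and a real corpus would trip over abbreviations and citations.

```python
import re


def split_sentences(text):
    """Naively split text after ., ! or ? when followed by whitespace
    and then a capital letter or an opening quotation mark."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z\"“])", text.strip())
    return [part for part in parts if part]


def looks_like_sentence(fragment, min_words=4):
    """Rough test for a complete sentence: enough words, starts with a
    capital letter, and ends with sentence punctuation."""
    stripped = fragment.strip().rstrip('"”)')
    return (
        len(stripped.split()) >= min_words
        and stripped[:1].isupper()
        and stripped.endswith((".", "!", "?"))
    )
```

Fragments that fail `looks_like_sentence` could be set aside in their own pile instead of being thrown away, which keeps the references without feeding them through as sentences.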