Sunday, February 5, 2012

Perceptrons in Lisp (A simple machine learning exercise)

So having missed Stanford's Machine Learning course (mostly out of laziness - I'm sure it was great) I'm trying to learn this stuff on my own. I'm going through MIT's Machine Learning notes on OpenCourseWare. They're easy [for me] to digest without being insulting, and they help me avoid searching for "The right book" to learn from (a task that would delay my learning anything but make me feel busy).

After reading the first two lectures I decided I should stop and practice what I've learned: a simple perceptron learning algorithm.

What's a Perceptron anyway?

It sounds like a Transformer. It's actually a method of deciding whether an object belongs to a certain class based on a linear combination of that object's characteristics. Id est, say we have some data on atmospheric conditions, exempli gratia humidity H, wind speed W, et cetera and we want to classify those methods as to whether the conditions will produce a storm a not. We would have a linear term:

$a_1 H + a_2 W \ldots$

We want to choose the variables $a_i$ so that the above term is positive when we'll have a storm, and negative otherwise.

More generally, say we have a vector of characteristics $x$. We want to choose another vector $\boldsymbol\sigma$ so that the dot product $\textbf{x} \cdot \boldsymbol\sigma$ is positive when $x$ belongs to our class and negative otherwise. The perceptron is the function

$f (\textbf{x}) = \text{sign}(\boldsymbol\sigma \cdot \textbf{x})$

How do we find $\boldsymbol\sigma$?

Our learning algorithm will tell us how to choose that $\boldsymbol\sigma$. To do this, we need some training data. Id est, some vectors $\textbf{x}_1, \ldots, \textbf{x}_n$ along with labels $y_i$ where $y_i = 1$ if $\textbf{x}_i$ belongs to our class and $y_i = -1$ otherwise.

We're going to start with any ol' $\boldsymbol\sigma$, say just a vector with 1's in all positions. For $\textbf{x}_1$, we look at $f(\textbf{x}_1)$. If it equals $y_1$, we do nothing. If not, then we update $\boldsymbol\sigma$ with $\boldsymbol\sigma = \boldsymbol\sigma + y_1 \textbf{x}_1$. One can prove that this update rule will ensure that $f$ gets better at correctly classifying each $\textbf{x}_i$. We repeat this for all of our training data, and then we have a perceptron that's ready to classify some samples!

Let's see it in practice.

This is a pretty simple algorithm and won't take us long to implement, so let's try it. For our implementation, we're going to look at 25x25 pixel black and white images. We'll train a perceptron to identify which images are of the letter A and which aren't.

I [rather foolishly] created my own training data. I used GIMP to create 25x25 pixel images, saved them as an html table. Each cell of the table corresponded to a pixel, and GIMP saved them so that each td tag included a BGCOLOR attribute. I then used sed to convert each html file into a list of colors for each pixel. You can see the sed command on the github for this post.

I saved the resulting files with a ".img" extension, and it was pretty easy to write a racket function that would convert such a file into a vector:

 (define (img->vector filename)    (define filelist (with-input-from-file (string->path filename)              (λ ()               (define (iter file-list line)                (if (eof-object? line)                  file-list                  (iter (append file-list (list line)) (read-line))))               (iter '() (read-line)))))    (list->vector (map hexstr->number filelist)))  

The perceptron itself is not even interesting. It's just a function that takes the sign of a vector dot product:

 (define (linear-classifier img)    (sign (dot sigma img)))  

And the dot product function is just an application of the built-in vector map function:

 (define (dot v1 v2)    (apply + (vector->list (vector-map * v1 v2))))  

Of course, the real "machine learning" bit is in the learning algorithm. This too is just a straightforward description of the above into lisp:

 (define (train-perceptron img-name)    (define label (filename->label img-name))    (define vec (img->vector img-name))    (cond ((= (linear-classifier vec) label)        (display "good!\n"))       (else        (set! sigma (vector-add sigma (scalar-mult label vec)))        (display "bad! updated sigma, though.\n"))))  

Where vector-add and scalar-mult are more applications of the vector-map function in Racket's library:

 (define (vector-add v1 v2)    (vector-map +          v1          v2))   (define (scalar-mult s v)    (vector-map *          v          (build-vector (vector-length v) (λ (x) s))))  

One final detail, I defined sigma as 625 element vector initialized to 1

 (define sigma (build-vector 625 (λ (x) 1)))  

I had 25 training images for this perceptron, and it still wrongly identified my unseen images. I noticed that $\boldsymbol\sigma$ was still changing if I trained it repeatedly on the same data, and it didn't stabilize until around the tenth time. Maybe it needs more training data, maybe the perceptron is a bad algorithm for this task, I really don't know. In any case, after repeated training, it improved slightly, but still had a false negative. In any case, we have our first simple supervised machine learning algorithm. Code for this, including training data, is on my github.

There's to particular reason I used Racket/Lisp for this: it would have been just as straightforward to write it in Ruby, Python, Javascript, et cetera. I just happen to like writing in Racket.

Edit:

An intelligent commenter on Hacker News pointed out that there are some theoretical guarantees for the accuracy of your perceptron when it encounters new data. I probably should have covered this here, but because I'm out of time and eager to do other things, I'll just mention that one can prove that the learning algorithm will always settle on a specific $\boldsymbol\sigma$ after a finite number of training points $k$. Additionally, one can think of $\boldsymbol\sigma$ as a plane dividing points in space, where on one side of the plane are points in our class, and the other side are all the other points. Theoretically, there will always be a distance $d$ between the closest point in our class and the plane. $k$, of course, will depend on $d$; id est, the smaller your $d$, the harder the learning problem, and the more data you'll need to train. (To the best of my understanding!) The proof of all these things is pretty straightforward, so long as one is comfortable with basic geometry and vector-based arguments. No higher maths needed. In fact, it's covered in the second set of lecture notes!