This week my SICP Study Group is looking at functions as first-class objects. This means, among other things, that functions can be passed as arguments and returned as values of other functions. A few members of my group feel that the material covered is too theoretical and has little real-world value. I'm always a bit bemused by this, because I use ideas from SICP nearly every day, especially when it comes to higher-order functions. So I thought I'd use an example from my day job to show how this section of SICP is particularly useful.

Part of my job includes writing data-entry applications so that other team members can interact with my statistical models. I write these applications in Python, using a library called web.py. In this library, we can assign to each web request URI a class to handle it.

What's a Class?

It's something we haven't seen before in our study group! I don't want to explain it in detail (you can check Wikipedia), but I hope it'll suffice to say this:

A class can be thought of as a way to capture, isolate, and reuse state between a group of functions. "State" is another concept we haven't seen in my group! We've only written stateless, functional programs. "State" here means data that exists independently of the functions, but is sometimes used or modified by the functions to perform its computation (exempli gratia, the balance of a bank account might be state between a group of functions that affect one's bank account). For instance, say we have a web server handling requests from the internet. Each request needs to perform a series of computations related to that request, but the data for each request, id est, the state of the request, should be kept separate from each other. This is exactly why web.py uses classes for requests. We can define what data or state our groups of functions need to share, and we can make copies of those groups (called "instances" of the class) that won't interfere with each other's state.

What's a Class look like?

Classes in Python have a name, some data members, and some functions. Classes to handle web requests with web.py generally look something like this:

class RequestHandler:
    request_var1 = "data..."
    request_var2 = "more data..."
    def GET(self):
        i = web.input()
        #code to handle request...
        return html_output

Web.py will look for a function GET inside the class when the web request comes with GET headers. Each time the application gets a web request, it creates a new instance of the class, id est, a fresh copy of the data and functions, and calls the new GET method with the new data.

In this specific example, the class is really just a special "wrapping" around the GET function so web.py can use it. It won't hurt the reader who is unfamiliar with classes to regard them as functions for the remainder of this post.

Now, classes to handle web request are often a lot more complex than that. In my case, I have specific requests that should respond with data that contains the current parameters of my statistical model, and the data should be formatted for easy entry into a web application (exempli gratia, in JSON). I have about two dozen of these requests that ask for different aspects of the same object. Hence if I had to write them all out, each one would look something like this:

class YetAnotherRequestHandler:
    def GET(self):
        # code to process incoming request data and check for errors
        # 5-10 lines ***same for every request***.
        ...
        # code that's specific to this request, 2-3 lines.
        ...
        # code to transform output into JSON-ready format
        # another 5-10 lines that are **the same for every request***
        return json_text

I don't want to burden the reader with those 20 lines of code that are the same for each request (nor do I want to show my employer's source code or think of my own contrived example!) but I hope I can make the point clear: each of these two dozen request handlers, they share a significant amount of code. If I were to write them all out by hand, I'd have over 80% duplicated code. Boring. Hard to read. Messy to change all at once.

Higher-order functions...erm, Classes

What if, instead of writing two dozen classes that share the same code, we could somehow abstract the similarities and only write the difference? Say, we could write a series of functions that do the specific-to-request bit:

def processor_for_request1(args...)
    # 2 lines of code here
    return data
    
def processor_for_request2(args...)
    # another 2 lines of code here
    return data
    
# and so on...

And pass those functions into some machine that will turn them into classes? Well, we can! Luckily, in Python, functions and classes are first class citizens, so all the techniques in our section of SICP apply here. We can write a function that takes these processors as input and returns a class that uses them like so:

def request_class_creator(processing_function):
    class Abstract_Request:
        # code to process incoming request data and check for errors
        # 5-10 lines ***same for every request***.
        data = processing_function(args...)
        # code to transform output into JSON-ready format
        # another 5-10 lines that are **the same for every request***
        return json_text
   
Request_Handler1 = request_class_creator(processor_for_request1)
Request_Handler2 = request_class_creator(processor_for_request2)
# ...and so on

In this way, we only write once the code that is different, we can reuse the components that work, and we can easily see what the important parts of our program are. Additionally, we can easily write separate unit tests for each processing function and for the Abstract_Request handler. We obviously aren't covering unit testing in the SICP study group, but it's an important part of writing good software nevertheless.

While the language and the example is different, this is the exact same technique described in Sec 1.3 of SICP. In fact, it's where I learned it from. Python users call this technique a metaclass. Python includes a few constructs to expand how one might use this method, but I think it's safe to say that the basic ideas of it are well-explained in SICP.

So having missed Stanford's Machine Learning course (mostly out of laziness - I'm sure it was great) I'm trying to learn this stuff on my own. I'm going through MIT's Machine Learning notes on OpenCourseWare. They're easy [for me] to digest without being insulting, and they help me avoid searching for "The right book" to learn from (a task that would delay my learning anything but make me feel busy).

After reading the first two lectures I decided I should stop and practice what I've learned: a simple perceptron learning algorithm.

What's a Perceptron anyway?

It sounds like a Transformer. It's actually a method of deciding whether an object belongs to a certain class based on a linear combination of that object's characteristics. Id est, say we have some data on atmospheric conditions, exempli gratia humidity H, wind speed W, et cetera and we want to classify those methods as to whether the conditions will produce a storm a not. We would have a linear term:

$a_1 H + a_2 W \ldots $

We want to choose the variables $a_i$ so that the above term is positive when we'll have a storm, and negative otherwise.

More generally, say we have a vector of characteristics $x$. We want to choose another vector $\boldsymbol\sigma$ so that the dot product $\textbf{x} \cdot \boldsymbol\sigma$ is positive when $x$ belongs to our class and negative otherwise. The perceptron is the function

$ f (\textbf{x}) = \text{sign}(\boldsymbol\sigma \cdot \textbf{x}) $

How do we find $\boldsymbol\sigma$?

Our learning algorithm will tell us how to choose that $\boldsymbol\sigma$. To do this, we need some training data. Id est, some vectors $\textbf{x}_1, \ldots, \textbf{x}_n$ along with labels $y_i$ where $y_i = 1$ if $\textbf{x}_i$ belongs to our class and $y_i = -1$ otherwise.

We're going to start with any ol' $\boldsymbol\sigma$, say just a vector with 1's in all positions. For $\textbf{x}_1$, we look at $f(\textbf{x}_1)$. If it equals $y_1$, we do nothing. If not, then we update $\boldsymbol\sigma$ with $\boldsymbol\sigma = \boldsymbol\sigma + y_1 \textbf{x}_1$. One can prove that this update rule will ensure that $f$ gets better at correctly classifying each $\textbf{x}_i$. We repeat this for all of our training data, and then we have a perceptron that's ready to classify some samples!

Let's see it in practice.

This is a pretty simple algorithm and won't take us long to implement, so let's try it. For our implementation, we're going to look at 25x25 pixel black and white images. We'll train a perceptron to identify which images are of the letter A and which aren't.

I [rather foolishly] created my own training data. I used GIMP to create 25x25 pixel images, saved them as an html table. Each cell of the table corresponded to a pixel, and GIMP saved them so that each td tag included a BGCOLOR attribute. I then used sed to convert each html file into a list of colors for each pixel. You can see the sed command on the github for this post.

I saved the resulting files with a ".img" extension, and it was pretty easy to write a racket function that would convert such a file into a vector:

 (define (img->vector filename)  
  (define filelist (with-input-from-file (string->path filename)  
            (λ ()  
             (define (iter file-list line)  
              (if (eof-object? line)  
                file-list  
                (iter (append file-list (list line)) (read-line))))  
             (iter '() (read-line)))))  
  (list->vector (map hexstr->number filelist)))

The perceptron itself is not even interesting. It's just a function that takes the sign of a vector dot product:

 (define (linear-classifier img)  
  (sign (dot sigma img)))

And the dot product function is just an application of the built-in vector map function:

 (define (dot v1 v2)  
  (apply + (vector->list (vector-map * v1 v2))))

Of course, the real "machine learning" bit is in the learning algorithm. This too is just a straightforward description of the above into lisp:

 (define (train-perceptron img-name)  
  (define label (filename->label img-name))  
  (define vec (img->vector img-name))  
  (cond ((= (linear-classifier vec) label)  
      (display "good!\n"))  
     (else  
      (set! sigma (vector-add sigma (scalar-mult label vec)))  
      (display "bad! updated sigma, though.\n"))))

Where vector-add and scalar-mult are more applications of the vector-map function in Racket's library:

 (define (vector-add v1 v2)  
  (vector-map +  
        v1  
        v2))  
 (define (scalar-mult s v)  
  (vector-map *  
        v  
        (build-vector (vector-length v) (λ (x) s))))

One final detail, I defined sigma as 625 element vector initialized to 1

 (define sigma (build-vector 625 (λ (x) 1)))

I had 25 training images for this perceptron, and it still wrongly identified my unseen images. I noticed that $\boldsymbol\sigma$ was still changing if I trained it repeatedly on the same data, and it didn't stabilize until around the tenth time. Maybe it needs more training data, maybe the perceptron is a bad algorithm for this task, I really don't know. In any case, after repeated training, it improved slightly, but still had a false negative. In any case, we have our first simple supervised machine learning algorithm. Code for this, including training data, is on my github.

There's to particular reason I used Racket/Lisp for this: it would have been just as straightforward to write it in Ruby, Python, Javascript, et cetera. I just happen to like writing in Racket.

Edit:

An intelligent commenter on Hacker News pointed out that there are some theoretical guarantees for the accuracy of your perceptron when it encounters new data. I probably should have covered this here, but because I'm out of time and eager to do other things, I'll just mention that one can prove that the learning algorithm will always settle on a specific $\boldsymbol\sigma$ after a finite number of training points $k$. Additionally, one can think of $\boldsymbol\sigma$ as a plane dividing points in space, where on one side of the plane are points in our class, and the other side are all the other points. Theoretically, there will always be a distance $d$ between the closest point in our class and the plane. $k$, of course, will depend on $d$; id est, the smaller your $d$, the harder the learning problem, and the more data you'll need to train. (To the best of my understanding!) The proof of all these things is pretty straightforward, so long as one is comfortable with basic geometry and vector-based arguments. No higher maths needed. In fact, it's covered in the second set of lecture notes!

mebassett programming

Wednesday, February 8, 2012

Metaclasses, first class objects, and a lesson from SICP.

What's a Class?

What's a Class look like?

Higher-order functions...erm, Classes

Sunday, February 5, 2012

Perceptrons in Lisp (A simple machine learning exercise)

What's a Perceptron anyway?

How do we find $\boldsymbol\sigma$?

Let's see it in practice.

Edit:

About Me

Blog Archive