Tuesday, October 23, 2012

Talk in Dublin: Adventures in Functional Big Data

I had an opportunity to speak at the IE group's Big Data Innovation Summit in Dublin last Friday. There was a good mix of people from several different industries - I think I learned more from the audience than they did from me. In any case, I thought I'd post my notes and slides here, in case someone finds them useful/interesting. These are my raw notes, however, so I apologize for any grammatical or spelling errors.

Adventures in Functional Big Data:

How Data Science drives non-technology companies to make some cutting-edge technology decisions.

Main idea: Functional programming (when combined with research experience and core business knowledge) can extract value from big data and resolve the technical challenges by turning the data into an interactive model.

The massive amounts of data available to today's businesses afford us many opportunities to optimize our business processes and reveal new ways to create value. But doing so presents a number of research and technology challenges. I want to tell a story to explain these problems, and how functional programming, research experience, and core business knowledge can be combined to resolve them and make one's business more effective.

Hi everyone. I'm Matthew. I'm the Senior Data Scientist with Universal Pictures International. I make machine-learning models to help our business run better.

I want to talk to you today about our adventures learning from big data. In particular, I want to tell you about the problems we have in getting the company on board with big data, and how I've used some unorthodox technologies (functional programming, e.g., Lisp and Racket) to bring business managers closer to the data.

To start off, I want to explain the relationship between "big data" and "the business", and then tell you a story to explain the problems these two have in getting along, and how functional programming helped bring them together.

"Big data" and "data science" are still new, vaguely defined terms. In fact, without some context of your business, "big data" and "data science" are just interesting academic ideas involving statistics and computing. It's important to keep in mind that it's the business domain that makes it come alive.

Now, if you're a technology company, like Google or Zynga, your "big data" likely comes from web and mobile statistics, A/B testing, et cetera. Your "data scientists" are combing through it, looking for usage patterns and trends to exploit, so you can improve your existing technology products or create new ones.

But not all of us are technology companies.

Our company isn't quite like that. We're not trying to make a better product, we're trying to learn which of our products the market wants and when the market wants them. We don't have web and mobile statistics. Instead, we have a variety of different data coming from hundreds of [dirty] sources. Once we have it all cleaned up, we're hoping it can tell us how to make better business decisions. I expect that some of your businesses are similar.

But there's a problem here! There isn't a lot of "experiential overlap" between the folks who know and run the business, and the folks [like me] who know and can interpret the data.

Indeed, the directors and senior execs at our company are, of course, intelligent people who know the business inside and out. But when it comes to the statistics and mathematics behind "big data", well, they think "the standard deviation" could be the title of an upcoming film.

Even so, they're aware that this data is out there, and that it has these useful nuggets of value. It's a vast dark forest, like Mirkwood from The Hobbit. It's too big, mysterious, and dangerous for them to venture into.

But they know that there are these wizards, these Data Scientists, who can help them navigate the Data Mirkwood and find the treasure. The challenge here is for the Data Science Wizards to use all their sorcery, Hadoop and machine learning algorithms and the like, to shed some light on the Data Mirkwood and present it in a way that makes sense to the business. We'll see soon how one extra bit of magic, functional programming, can help them do that job. But before that, how does this relationship usually play out?

Well, not like an adventure at all. It's a mess!

On the one side of the equation we've got these execs and directors who have been hearing about all this "big data" and "machine learning" stuff, and reading articles about how "data scientists" are "sexy". They're smelling ways to maximize value and want to put it to use.

Our best example of this is our Pointy-Haired Boss, let's call him Paul. Paul practically lives on data. But he doesn't go into the Data Mirkwood himself. He just asks for reports from it. And he's always asking for reports. He wants his Excel spreadsheets on this and his PDFs for that. There's a lot of redundancy, and none of these bits of data are very intelligent. He doesn't want to get too deep into the Data Mirkwood. He just wants something relevant that makes sense.

On the other hand, you've got the analysts trying to prepare all these reports. Just when they have a process for generating Excel spreadsheets on one thing, Paul the Pointy-Haired Boss asks for a PDF on another. They're busy jumping around like data bunnies in our big data, trying to clean up this data source and that data source to keep up with it all.

It's all fairly superficial stuff - the data bunnies are cleaning up and reporting pieces of the large set of data, but they don't want to see the whole thing. And they certainly can't see large scale patterns and trends that are emerging from it. Paul the Pointy-Haired Boss doesn't even ask; he doesn't know or care how Bayesian clustering algorithms would help him make better decisions, and the bunnies don't know how to do that anyway.

This is where our Data Scientist comes in. We'll call him Duncan.

Duncan the Data Scientist is this mystical wizard. He knows the Data Mirkwood well, and he's got deep sorcery to make it come alive. He can Hadoop and MapReduce and machine-learn the data backwards and forwards. But he also knows the business, and can direct Paul the Pointy-Haired Boss to the right questions.

Duncan the Data Scientist is tasked with guiding Paul the Pointy-Haired Boss through Data Mirkwood. Paul needs to know the trends and patterns hiding in Data Mirkwood, but he lacks the computational and statistical know-how to even ask the right questions. Paul knows the business, but Duncan knows the statistics.

What does Duncan need to be successful at this?

With a CV like this, I have to say that Duncan is sounding less like a scientist and more like a unicorn! In reality, Duncan has the maths, stats, and programming background to dig deeper into Data Mirkwood than do the data bunnies, but his knowledge of the business isn't nearly as deep as Paul's.

So how can Duncan get Paul into Data Mirkwood? How can he get Paul into the data in a way that he can understand?

Duncan needs to create a usable interface over the data for Paul to use. By "usable interface" I don't mean "data visualization". I mean something that Paul can interact with, a method he can use to pose questions within the context of his business. That is, Duncan needs to use the data to create a place where Paul can ask, "what does the data say about this aspect of my business?" Duncan needs to create something like a "business-domain-specific language" for querying the data.

At my company, the interface we built was a simulation. We cleaned up our big data, and we continually feed it through a variety of regression and classification algorithms. These algorithms then tune various parameters of our simulation, and enable us to ask questions like "what happens if this product flops?" or "when will there be room in the market for this product to survive?". The nature of your usable interface would be specific to your business, naturally, and would come from a collaboration between your Duncans and your Pauls.
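To make the "business-domain-specific language" idea a bit more concrete, here is a toy sketch in Racket. The names (`predicted-demand`, `what-if`) and the linear model are invented purely for illustration; in our real system the simulation's parameters are tuned by the regression and classification algorithms mentioned above, not hard-coded like this.

```racket
#lang racket
;; Toy sketch of a business-domain query layer.
;; All names and numbers here are invented for illustration.

;; Pretend this was fit by regression over historical data:
;; predicted weekly demand decays linearly with weeks since release.
(define (predicted-demand weeks-since-release)
  (max 0 (- 100 (* 5 weeks-since-release))))

;; A query form Paul can use without ever touching the raw data.
(define (what-if #:weeks-out weeks)
  (predicted-demand weeks))

(what-if #:weeks-out 4)  ; evaluates to 80
```

The point isn't the model; it's that the query reads in Paul's vocabulary, while all the data wrangling stays on Duncan's side of the function boundary.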

But building such an interface is no walk in the park either. Say Paul and Duncan have a good working relationship. They understand each other, and Duncan is able to translate Paul's concerns into technical questions to ask about the data. Duncan still has to turn these questions into programming experiments, and then into a usable interface for Paul. He's got to do his "data science/machine learning" programming, and somehow feed that into his "application programming".

Typically, when one is programming with big data, the languages at one's disposal fall into a sort of grid:

On the horizontal axis we have expressive languages that allow Duncan to express his statistical and mathematical analysis of the data. But these don't scale well. R is fine for small data sets, but it chokes a bit when you try to feed it petabytes of data.

On the vertical axis you have your high-powered technologies. These are the sorts of things you'd use for MapReduce over your Hadoop cluster. These are the languages you use to take Paul's questions and turn them into massively parallel computations that find patterns in the data.

There's a gap: few languages can do both. Often, Duncan finds himself prototyping and experimenting with solutions on the expressive axis, but has to re-write them on the high-powered axis.

And this is where functional programming comes in. Functional programming languages like Lisp/Racket fill that void:

Functional languages, like Racket, Clojure, Haskell, Scala, et cetera, are incredibly expressive. They differ from imperative languages in that you don't tell the computer what to do, step by step.

For instance, in a functional language, you don't do things like "set the variable x to 5. Add 3 to x and store it in y".

Rather, in functional languages, everything is an expression that evaluates to a value. You describe programs by writing down a mathematical transformation from your initial state to your desired state.
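A minimal contrast, in Racket: the imperative "set x to 5, add 3, store it in y" collapses into expressions whose values are bound once and never mutated.

```racket
#lang racket
;; Imperative pseudocode: set x to 5; add 3 to x; store the result in y.
;; Functional Racket: y simply *is* the value of the expression (+ x 3).
(define x 5)
(define y (+ x 3))  ; y is 8; neither x nor y is ever mutated
```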

To do your data analysis, you'd describe the mathematics and statistics behind it, defining your input data and how the output should look.

To do Paul's interface, you'd describe the information you want from him, and how to turn that into an answer from the data, and return it to him.

Racket is a modern functional programming language, and it is incredibly expressive. It allows Duncan to express his machine-learning algorithms and his statistical analysis, and it's powerful enough for him to launch his parallel computations over a large cluster (e.g., Amazon EC2).

But functional programming itself, and languages like Racket, aren't actually that new. Even the holy grail of "big data", MapReduce, the programming model that makes much of our analysis possible, is based on ideas from functional programming. The name comes from a common idiom in functional languages: to map over a data set and reduce it to something simpler.
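In Racket, the idiom reads exactly as the name suggests: map a function over a data set, then reduce (fold) the results to something simpler. A small, self-contained example:

```racket
#lang racket
;; The map/reduce idiom behind the name "MapReduce", on a small list.
(define words '("big" "data" "mirkwood"))

;; map: transform each element (here, to its length).
;; foldl: reduce the transformed list with + down to a single number.
(define total-length
  (foldl + 0 (map string-length words)))

total-length  ; evaluates to 15
```

MapReduce systems like Hadoop apply this same two-step shape, only with the map and reduce phases distributed across a cluster.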

The benefit in using Racket is that Duncan can do everything in the same stack: he can experiment and find solutions in Racket. He can do his large-scale computations over the data in Racket. And he can build an interface in Racket. Paul gets access to Duncan's work more quickly.

Gone are the data bunnies. Paul is working much more closely inside Data Mirkwood. By using Racket, Duncan has built an interface on top of the data and given Paul access to much more sophisticated insights, and everyone's happy.

We use functional programming, especially Racket, on a day-to-day basis at my company. Our simulation and machine learning algorithms are running on it now. I'd be happy to answer any questions you may have about it.

Post-talk thoughts

The above talk is slightly incoherent, and doesn't do a good job delivering on its promises. In particular, I could have done the following better:

  1. It doesn't explain how functional programming is useful for building these interactive models. I just state that it's better, without evidence or examples. I could have talked more about our specific setup, and highlighted where judicious use of Racket made our lives easier. But maybe that's best left for another talk or blog post.
  2. Additionally, the audience of the talk (and the panel discussion afterwards) was very interested in the talent-shortage problem that I alluded to. Another gentleman, Jean-Christophe Desplat, Director at ICHEC, spoke about a solution in his talk: hire a team of talented people who can work together. I fully agree. That might make a good talk/blog, too.
  3. Finally, I should have discussed the analogy between a "usable interface on the data" and a "business-domain-specific language for querying the data". That's really the core bit of the talk, and I barely mentioned it!