The Soul of R

Here’s a post on R that I started writing some time ago but never got round to finishing or posting! I’ve done some good stuff in R that I’ll post as and when I get round to it, to follow up on this…

———

So I have a new love in my life: I’m going to sell the Gibson Explorer, give up guitar playing and spend all my time and money working with R (actually, maybe not; it was a present from the wife and she’ll kill me if I do).

About 10 years ago, I did an Open University course called “Artificial Intelligence Programming using Common LISP” – just for the hell of it really. I liked the title and, more to the point, once I got going I loved the LISP language. It’s a bit like human languages: different ones make you think in different ways… Latin and similar languages like Spanish and Italian have an elegant brevity about them; the fact you can get subject, object, verb, mood and tense all in one or two words is so different to English. Or take something like German, where a nod of the head and the word “Pen” (obviously use the German word “Kugelschreiber”!) will mean someone will pass you a pen, as opposed to the English equivalent, which goes something like this:

“Ah-em” <clears throat> “Sorry, excuse me, if it’s no problem, would you mind possibly passing me that pen on the table over there… Sorry” <stranger passes pen> followed by a “thank you very much” and at least one more “Sorry”

Not saying that’s wrong; in fact, if you’re British it is of course the only way to do it. Anything else would be extremely rude and could end in a fight :-)

I wrote a program to play noughts and crosses in LISP which consisted of one line of code, about 20 words and at least 80 brackets. It was beautiful and very hard to beat – must dig out the source code. Anyway, R is very nice and very much like LISP: why use 1 bracket when you can use 20 and include a variety of types { [ ( ? It allows you to think in a different way, hides a lot of the complexity of data structures and has some killer built-in functionality. To get back to the point of this post: functional programming, for a long time an unfashionable backwater compared to the deluge of object-oriented languages, is really powerful in a MapReduce situation. In fact, if you have ever read the original and follow-up papers from Google, you will know that the Map and Reduce functions are based on the LISP (and other) equivalents.
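For anyone who hasn’t met map and reduce before, here’s a rough sketch of the two ideas in Python. It only illustrates the functional style; it is not how Hadoop or Google’s MapReduce is implemented, and the word list is made up.

```python
from functools import reduce

words = ["lisp", "r", "python", "hadoop"]  # made-up sample data

# map: apply a function to every element independently
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths)
print(total)
```

Because each map call only looks at its own element, the work can be split across as many machines as you like before the reduce step pulls it back together.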

My only problem with R now is that there are so many functions built by so many diverse groups that it’s quite daunting to get a grip on just what is possible, and it needs plenty of time to study them. For now I’m sticking with 3 books covering most of the basic functionality: “R in a Nutshell”, “Statistics: An Introduction Using R” and “Graphing in R”.

 

And now for something completely different

So, here goes for my monthly catch-up on the posts I’ve been writing and not posting, or mostly not even writing. Looks like this is going to be a regular pattern, and it fits what I’m thinking of as a kind of slow-buildup technical tourettes… Basically I read stuff, research stuff, code stuff, learn stuff and it gradually builds up until I start randomly saying things like “Continuous Wavelet Transformation” or “long live functional programming”, which scares the children and earns disapproving tuts from the wife – especially if she was talking about where No. 1 son is going to University in September at the time!

Anyway, I’ve been working through the excellent O’Reilly books on social data mining and there’s a lot of code written in Python. I’ve never used it before, so I’ve been pleasantly surprised at how useful it is, and I love the way it forces you to indent code properly rather than use curly brackets (like C++ or Java). Of course you have to get over the whole Monty Python thing… so get on YouTube, look up all the old favourites about dead parrots, comfy sofas, african or european, it’s a mere fleshwound and I didn’t expect the Spanish Inquisition – get it all out of your system and CONCENTRATE!

I don’t intend to become a fully-fledged snake charmer and learn the ins and outs of Python in depth, but the code recipes in the O’Reilly books are extremely useful… As I’ve now started working at Teradata (did I mention that before?? Subject for another post, but believe me, there’s been a lot of champagne-drinking and celebrating going on) I’ve been rewriting some of the code I had in SQL Server. I had a lovely little CLR function that you could call in a SELECT statement to get data out of Twitter… this Python code does the same job in about 1/4 of the text and with no buggering about compiling DLLs:

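import twitter  # the twitter package needs installing first

# Point the client at the Twitter search API
twitter_search = twitter.Twitter(domain="search.twitter.com")

# Fetch pages 1 to 9 of results, up to 100 tweets per page
search_results = []
for page in range(1, 10):
    search_results.append(twitter_search.search(q="#tduniv", rpp=100, page=page))

# Print every tweet from every page
for page in search_results:
    for tweet in page['results']:
        print(tweet)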

The list data structure in Python is very flexible and makes a lot of the kind of things you’d do with Twitter data nice and easy, such as word counts, lexical diversity and even some of the advanced graphing. You need to install the Twitter API functions first but that’s simple enough. I’ve set this code up to run in a Windows scheduled task, dumping the results to text every few hours to harvest tweets about the Teradata Universe conference I’m at over the next few days – let’s see what happens.
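To give a flavour of the word counts and lexical diversity, here’s a minimal sketch. The tweet texts are made up, the tokenising is deliberately naive (lowercase and split on whitespace), and lexical diversity is taken to mean unique words divided by total words.

```python
from collections import Counter

# Made-up tweet texts standing in for the harvested results
tweets = [
    "Big data at #tduniv",
    "Loving the big data talks at #tduniv",
]

# Naive tokenisation: lowercase and split on whitespace
words = [w.lower() for text in tweets for w in text.split()]

# Word counts: how often each word appears across all tweets
word_counts = Counter(words)

# Lexical diversity: unique words divided by total words
lexical_diversity = len(set(words)) / float(len(words))

print(word_counts.most_common(3))
print(lexical_diversity)
```

A value close to 1 means hardly any repetition; a low value means the same few words keep coming round.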

 

 

Hadoop on Azure

The Microsoft CAT guys have given me access to the beta test of the Hadoop on Azure system to check it with the huge Proteomics data sets that the Life Sciences lab at the University of Dundee are kicking out daily. Here’s a quick run down of how it works, big thanks to the ever enthusiastic Denny Lee for helping out!

On first logging in with my Windows Live ID, I was amazed at how simple the interface is: just pick your cluster size and off you go…

Then you wait a few minutes while the cluster is spun-up and hey presto you’re off and running with this console.

The interactive console gives you JavaScript access through the webpage and is actually pretty powerful once you get the hang of it.

All of the above works from my iPad and Windows Phone… the amount of technology and sheer brain power that’s gone into me being able to bring up a Hadoop cluster and run a MapReduce job on data from proteomic experiments, while walking the dog over the park is quite staggering!! Imagine the other mere mortal park-goers giving a cheery “Morning” as they go past, with no idea I have the power of 256 CPUs and 16 TB of data in my hand <Evil Genius> Mwah Ha Ha </Evil Genius>

The remote desktop function is exactly what I wanted: you can RDP into the namenode, open up a command window and off you go, with all the freedom to create and run jobs in Pig, Hive or MR that you’d have in a local cluster… here’s our favourite wordcount, just for old times’ sake :-)
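For anyone who hasn’t seen it, wordcount boils down to something like the following. This is a single-process Python sketch of the map, shuffle and reduce steps, not the Java job that ships with Hadoop; the function names and input lines are made up for illustration.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, counts):
    """Sum the counts for one word."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "The end"]  # made-up input

# Map phase: every line is processed independently
pairs = [pair for line in lines for pair in mapper(line)]

# Shuffle then reduce: one reducer call per distinct word
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())

print(result["the"])
```

On a real cluster the mapper and reducer calls run on different nodes and the shuffle happens over the network, but the shape of the job is the same.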

The icing on the cake is the Hive ODBC driver for Excel. Creating a Hive table on the data set I had was a painless process with just a single command; then download and install the Excel plug-in… check it out.

Suddenly I’m back in the “real world” of my PC but connected up to a 40-million-row data set in the cloud (yes, I know PowerPivot can do that locally but this is just a test!!!)

Overall I’ve been very impressed with the ease of use and stability of the setup. I need to create a decent MR job to run a proper test but that’s going to take a few weeks.

 

London Hadoop Users Group

Man, I’ve been slack posting this up; it’s been almost a month since I attended the London Hadoop Users Group in Oracle’s offices. It’s been a busy time though: I’ve spent a week up at the University of Dundee sorting out my PhD enrolment, picking up a load of test data and doing some guest lecturing to the current MSc students.

Oracle were good hosts – they say you should know your audience and they were spot on with plenty of complimentary pizza and beer.

David Rajan from Oracle kicked off. I’ve seen him present a couple of times before and he’s a very natural speaker. You could tell he was trying desperately hard not to slip into sales mode and, to be fair, he did pretty well. The presentation (with some odd ppt slides that looked like they’d been made on an ’80s Atari ST) was about Oracle’s “Big Data Appliance”, a complete package with Cloudera’s Hadoop on Linux and, interestingly, an R distribution.

It sounds like Oracle have put a lot of work into making something that can be understood at a “Corporate Level” then wheeled in and plugged in – obviously we all know that’s when the real work actually starts! The usual high availability and scalability are on offer, with some interesting adaptations allowing updates and also a choice of consistency levels from Eventual to ACID.

Dan Harvey, the meetup organiser, presented an Intro to Hadoop. Considering the huge range of experience in the room, from complete novice to battle-hardened MapReducer, he did well to keep everyone’s attention while explaining the Hadoop ecosystem.

It was interesting to hear about the roots of Hadoop and the work done by Jim Gray on sequential disk reads. My fellow-student Tony Rogerson of the sqlserverfaq fame referenced this work in his Masters thesis… There were a few things in the eco-system I’ve not heard of that I must follow up on such as Chukwa and Whirr.

Finishing off was Matt Wood from Amazon. His presentation style could be described as “Chipper”, with a very British “Jolly Good!” at the end :-) He described the Amazon Elastic MapReduce service, which is bleedin’ complicated, and the new DynamoDB NoSQL managed database, concentrating on performance and value for money over an in-house solution. TBH I did get a bit lost with the way the Amazon charging structure works. Matt has a PhD in Bioinformatics and has worked on the Human Genome project; I must try to catch up with him some time over the proteomics work I’m starting.

I couldn’t make the pub meetup yesterday, so the next meeting is on the 22nd March with a Hive Theme… I’ll be there

You absolutely HAVE to read this post

Since listening to Gary Short’s presentation at the DunDDD conference, I’ve been playing around with analysing Twitter data on my home setup. I can get the data, load it into Hadoop and have some MapReduce jobs to clean it up (basically remove duplicates)
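The duplicate removal is simple enough to sketch outside Hadoop. Here’s a rough single-machine Python version, assuming each tweet carries a unique "id" field (the records below are made up); a MapReduce job would do the same thing by using the id as the key and keeping one record per key.

```python
# Made-up records; "id" as the unique key is an assumption
tweets = [
    {"id": 1, "text": "Hadoop rocks"},
    {"id": 2, "text": "R is great"},
    {"id": 1, "text": "Hadoop rocks"},
]

seen = {}
for tweet in tweets:
    # Keep only the first record seen for each id
    seen.setdefault(tweet["id"], tweet)

unique_tweets = list(seen.values())
print(len(unique_tweets))
```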

Most of the metrics were pretty simple to reproduce e.g. number of tweets per hour, biggest retweeters even lexical diversity over time but the sentiment or opinion side has been eluding me – until now that is…

Read this.

http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

Learn a bit of R (I need to as well!)

and get your sentiment scoring working. Jeffrey Breen has done an excellent job of describing this complex area. Right, I just need to find a spare few days when I can get stuck into it!
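As I read the slides, the basic idea is a simple word-list score: count the positive words in a tweet, count the negative ones, and subtract. The slides do this in R; below is a rough Python sketch of the same idea to match the other code on this blog. The tiny word lists and the function name are made up for illustration, and a real run would load a full opinion lexicon instead.

```python
import re

# Tiny made-up word lists; a real run would load a full opinion lexicon
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}

def sentiment_score(text):
    """Score = number of positive words minus number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos - neg

print(sentiment_score("I love this excellent conference"))
print(sentiment_score("Awful wifi, terrible coffee, good talks"))
```

A score above zero leans positive, below zero leans negative, and zero means either neutral or a tweet the word lists simply don’t cover.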
