The New Virtual Cluster

I ‘ve decided to go virtual, it’s not just that the old cluster takes up a load of room, makes a noise like a jet engine and makes the lights dim in the house. It’s more because I want more flexibility, there are so many interesting things happening at the moment and so many different versions and options that I want to be able to fire up a cluster, shut it down and bring up something else all on the same hardware for benchmarking (and space!) reasons. I got the excellent book “7 databases in 7 weeks” at christmas, some of my classmates at Dundee have run clusters of Mongo and Cassandra which I want to look at and I also want to have a play with Neo4J for the graphing side of protein interactions

So with that in mind I gave the credit card a good workout at overclockers.co.uk and ended up with this lot.

8-core AMD piledriver 4*2Tb disks, 256Gb SSD, 990FX motherboard, 32Gb RAM, a cool case and a fan with blue LEDs (couldn’t resist a bit of bling). I already had a 1Tb drive on another PC that I’ve recycled into this setup too.

Here it all is, just as the fun bit begins – I love building PCs, it looks like you’re a genius at work whereas in fact components are so easy to work with now it’s a doddle to do…

So we’re all up an running, I’m using Windows 7 on the host and I’ve gone for VMware to run the virtual images. The only reason for picking VMWare is that I know how to configure it and I’ve got a Teradata Aster (for work) setup running in VMWare so it made sense.

first of all get a nice Ubuntu template VM up and running, I followed my own instructions of this site to get a good base image running. There are a couple of changes as follows. I’ve used the 64bit version of Ubuntu as I’ve got plenty of RAM to play with

JAVA: I’ve decided to go for the open JDK this time, mainly because it’s a simple $sudo apt-get install openjdk-6-jre from the command line, doesn’t seem to be any compatibility issues so far

NETWORK: I added a second NAT network adapter, the idea is to have one dynamic ip address which will allow the VM to connect to he internet and one with a hardcoded IP address on the 192.168.100.xxx range which I can add to the hosts files in linux which will let the VMs communicate with each other. It also means I can run the Hadoop monitoring tools in IE on the host machine

So armed with my pre-configured template-VM, I copied it onto each of the 4 2Tb drives, setup one with 1*core and 4Gb ram for the namenode and three with 2*cores and 4Gb ram for the datanodes. I fired them all up and spent a bit of time getting the network connectivity all sorted, my /etc/hosts/ file looks like this

#127.0.0.1 localhost
#127.0.1.1 ubuntu
192.168.100.11  hadoop1 localhost
192.168.100.12  hadoop2
192.168.100.13  hadoop3
192.168.100.14  hadoop4

Note that the loopback 127.0.0.1 address are commented, this saves you a load of problems when you come to firing up the cluster later.

I downloaded the Apache version of Hadoop 1.0.4 (latest stable release) and followed the instruction on the Install Hadoop tab, there are a few tools around to help automate this process but I wanted to get stuck in and do it all manually, it’s been a while since I did this so I wanted to get back in to what all the config. steps are. A few things have changed since the last version I installed 0.23 – such as it’s no longer HADOOP_HOME but HADOOP_PREFIX but it’s more or less the same. I did make the effort to set up rcp this time and it made things easier copying the config files around.

The install all went pretty smoothly, I did have one new error that I’ve not seen before… when trying to run my first MapReduce job, it failed with this error.

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 137.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

it turns out this was caused by me changing the amount of memory available to each task in the mapred-site.xml file, I deleted this bit and we’re off it’s all running and looks pretty cool too!

I checked out Michael Noll’s excellent site for some advice on Benchmarking and ran the tesdfsio and terasort benchmarks.

http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

this is the result from the read benchmark

13/02/23 13:27:38 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/02/23 13:27:38 INFO fs.TestDFSIO:            Date & time: Sat Feb 23 13:27:38 PST 2013
13/02/23 13:27:38 INFO fs.TestDFSIO:        Number of files: 10
13/02/23 13:27:38 INFO fs.TestDFSIO: Total MBytes processed: 9212
13/02/23 13:27:38 INFO fs.TestDFSIO:      Throughput mb/sec: 73.95932510644002
13/02/23 13:27:38 INFO fs.TestDFSIO: Average IO rate mb/sec: 92.9056167602539
13/02/23 13:27:38 INFO fs.TestDFSIO:  IO rate std deviation: 47.10222517657025
13/02/23 13:27:38 INFO fs.TestDFSIO:     Test exec time sec: 61.645

 

looks OK to me but if anyone has any info. on what’s good/bad/indifferent in terms of IO – I’d love to know.

Next Steps:

  • Install the rest of the eco-system (pig hive etc. etc.)
  • set up Eclipse and the hadoop plugin (apparently it’s quite a challenge with 1.0.4)
  • make sure all my current MR jobs still run as expected.
  • Crack on with some pratical work on the PhD

 

 

 

 

 

 

The Data Scientists guide to buying a Glastonbury ticket

So the tickets sold out in a record time of just over 1 hour 40 minutes this year. For the uninitiated every year (except for the 4 yearly fallow years like 2012) the ordeal of trying to get tickets to the Glastonbury Festival gets more stressful, this year over 2 million people were trying to buy the 150,000 tickets at the same time… just how do you maximise your chances of success?

1) Get the right kit

There’s no use getting up and 8:45 on Sunday morning, switching on the PC and hoping to get straight to the booking page. Oh no there is a lot of planning involved, just remember if you do the same thing as everyone else you will have the same chance of success as they do and I don’t like those odds….

I first thought I would use my hadoop cluster and have 12 pcs hammering away at the f5 key hoping to get the elusive booking page to show (instead of the dreaded 404 page not found because the server is hopelessly overloaded) but instead I went for speaking to my neighbours and asking if I could “borrow” their wireless internet for the morning, now you have to get on and be trusted by people to ask this sort of thing but I think they took pity on an ageing rocker and so on Sunday morning I had 3 laptops each attached to a different internet connection plus the iPad on a 3g card. Now that’s 4 Internet connection through 3 seperate ISPs, 4 seperate IP addresses – that has to be better than however many tabs on however many browsers on a single or multiple PCs through one Internet connection, doesn’t it?

Next what browser… a bit of research and Chrome and Firefox seem to have had better success than IE but people would say that wouldn’t they! anyway I went with a fresh install of chrome, so no spyware, add-ons etc. to slow down my pageloading and for the hell of it no Virus protection – I figure I’m only going to be hitting one page and only for a couple of hours so what could possibly go wrong :-)

2) Don’t play by the same rules as everyone else – have some “inside” knowledge

So if you think about it, there’s no way that whoever is selling tickets for Glastonbury sets up a system to do so and then leaves it like that… the things are only on sale once a year for a couple of hours but in that time over 2 million people hit the system. This will have to be a special setup just for this purpose ( which if you work in IT, you might think I wonder if it was tested properly!) Now you are going to need some load balancing, one server isn’t going to handle this lot. It turns out that some plum who set the system up made a mistake in the DNS entries. Instead of typing 194.168.xxx.xxx he’d put 192.168.xxx.xxx for one of the servers- actually an easy mistake to make because many internal networks do start 192 in fact you’re home network almost definitly does.

This meant that the entire load of browser requests was going to only one of the servers, the other one was sitting there doing nothing, now some bright spark either worked this out or got some inside info. it was then a simple matter of editing the host file on your PC to point to the 194.168.xxx.xxx missing from the agencies erroneous DNS entries and voila you had a big fat server all to yourself ( and a few 1000 others who’d got the info.) and the tickets were as good as in your hand!

3) Know where people in the Know hang out and use their knowledge

Now I wish I could claim that I worked out the above myself, maybe by writing an awesome MapReduce job and running it on my home cluster but I didn’t. However I did know where the sort of people that would work this stuff out would be and how to get the info. There are forums out there on the internet where incredibly knowledgeable people post about all sorts of things. If you turn up as a newbie and ask a question your likely to get a “did you use search?” or “let me google that for you” type of answer. If you ask the right question to the right person in the right way then there’s some awesome info. around.

A forum exists for festivals and Glastonbury is one of them, if somebody works out how to get in through a backdoor that’s where the info. will be. It was this year and I expect will be next year too.

4) Take the risk

Actually changing your hosts file to point at an IP address provided by someone on the internet that you don’t know and then using that site to handover £400 (8 deposits of £50) is an insane thing to do, pretty stupid.

No Really don’t do it, unless you are happy to risk losing your money (and maybe identity etc.) but there are those who knew of the backdoor and either wouldn’t take the risk or didn’t have the tech know-how to change their hosts file – they may well be sitting at home watching the TV next June instead of standing in the rain in a muddy field :-)

5) The Devil take the hindmost

After the backdoor had been discovered there was around 20 minutes of activity before the ticket agency sorted their IP addresses and opened up all the servers to outside world. This meant if you had edited your hostfile just before or after the addresses were sorted you were now at a disadvantage, being directed to only one server instead of being load balanced and if you don’t remember to take the entry out of your hostfile then next year the chances are you’ll be trying to hit a server that doesn’t exist anymore!!

Big Data?

So what has all this got to do with Big Data? not much maybe but I like the story, I like the fact that finding a small amount of information out of a huge mass (as in the internet) can greatly increase your odds of success. Information which you can gain through an un-orthodox method, either working it out from other information that is not directly related to your problem or information which you can get from a community of experts. That information may lead to a risky path that could fail but one of the Big Data stories is the “fail fast” method.

To paraphrase some comments from Glastonbury chat forum

The days of one person logging on and getting a ticket for themselves are long gone. You need to get into groups and you need to have Facebook, Twitter, MSN and forums open so you can get information on how to get tickets. (now that sounds more like Big Data!)

There are 10 types of people in the world, those with Glastonbury tickets and those without (a nice twist on an old geek joke!)

Like I said at the beginning if you do what everyone else does your chances of success are going to be the same as theirs.

See you at the Cider Bus!

PS there has been talk about whether the Host file backdoor hack was possible because of an error on the part of the IT setup or a side effect of the way the ticket agency had tried to limit traffic… not sure which is true but saying it was an error makes a better story :-)

Setting up a development environment

Back to hadoop programming for a bit. Once you’ve had your fill of wordcount examples and messing around with notepad or even notepad++ you’re going to want to set up a decent development environment to write, debug and test MapReduce code. I actually think it’s worth going through the manual steps of writing code, compiling jars and running them in a linux command shell for a while until you’re sure you know what’s going on but it quickly becomes a PITA!

So I like eclipse for java development and I use it for work as well so it doesn’t seem worth learning to use another IDE. There’s an excellent tutorial on setting up an hadoop environment under windows using cygwin here.

http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html

I’ve gone for a full windows setup because again that’s what I use at work at the moment. I’m after some more RAM in my laptop, once that’s sorted I might look at having a permanent linux VM running with the dev. environment for Hadoop, aster and splunk all in one place but for now Windows it is…

The above tutorial also describes how to set up eclipse to connect to the hadoop instance. I had some issues with the setup (as you’d expect!) first off make sure you go for Hadoop 0.20.2 or less. I tried it first with hadoop 0.20.203 and got the dreaded taskTracker failure when changing permissions on the tmp/mapred directory – it’s a bug/incompatibility with cygwin and there’s no way round it apart from going for an earlier version of Hadoop (http://herdagdelen.com/hadoop-on-windows-with-cygwin/) I also had some issues configuring ssh to do with user permissions etc. in the end I had to uninstall cygwin and start again (following the instructions properly this time!) to get it to work.

Once Hadoop is running, eclipse integration is nice and easy using the plugin supplied in the Hadoop distribution, developing a MapReduce program you’ll get a load of errors about deprecated MapReduceBase classes and methods, this is because the plugin hasn’t been updated and to be honest neither have most of the tutorials on the internet! considering the change to the api was a couple of years ago you should be able to find decent examples of the new API.

Here’s some info. http://sonerbalkir.blogspot.co.uk/2010/01/new-hadoop-api-020x.html and here’s a simple example using our old friend wordcount http://stackoverflow.com/questions/8603788/hadoop-jobconf-class-is-deprecated-need-updated-example

OK, so it’s just a case of installing Cygwin (don’t forget to include ssh in the install) setting up ssh, setting up hadoop (pretty much the same process as described in the install Hadoop page for linux) installing eclipse, configuring the plug-in, connecting to the local cluster, writing a job and running it :-)

mind you it is well worth the effort, I’ve managed to get Maven and Mahout configured as well so now I can write and run all sorts of jobs direct on my laptop.

Thanks again to http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html it must have been a lot of effort to put those pages together with all the screenshots

Some Links and more on Sqoop installation

what with the PhD work and starting a new job Teradata, I’ve been so busy learning lots of new exciting stuff that I haven’t had my writing head on this month. So here’s a bit of a copout and a load of links!!

Nice use case for Hadoop with some actual code showing a conversion process

http://hadoopinku.wordpress.com/2012/02/19/image-to-pdf/

it’s based on the famous New York Times job that used the Amazon platform

http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/

I’ve just joined a new group on LinkedIN

Hadoop User Group (No SPAM, Vendor Neutral, Actively Moderated)

it’s well worth joining if you’re interested in Hadoop, the old one was just too full of adverts for ipads, iphones and russian dating agencies to be useful!

and seeing as there is still a lot of searches coming in from Google with “Install Sqoop” Here’s some links to some other great blogs – if any of them work better than my instructions let me know :-)

http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/
https://ccp.cloudera.com/display/CDH4B2/Sqoop+Installation
http://michael.otacoo.com/linux-2/install-hadoop-and-sqoop-in-lucid/
http://shout.setfive.com/2011/09/14/getting-started-with-hadoop-hive-and-sqoop/

Good Luck with it!

At the moment I’m ploughing through Mahout in action which is an excellent book, hopefully I’ll be able to post up some example soon.

 

 

 

Plug for Karlsblog

One of my classmates has some interesting posts and reviews on his blog

http://www.karlsblog.org.uk/

the latest on web scraping using Excel is a cracker. Looks like a really useful technique to scrape data off web pages without getting down and dirty with Python or something like imacro

Next Page »