The New Virtual Cluster
I ‘ve decided to go virtual, it’s not just that the old cluster takes up a load of room, makes a noise like a jet engine and makes the lights dim in the house. It’s more because I want more flexibility, there are so many interesting things happening at the moment and so many different versions and options that I want to be able to fire up a cluster, shut it down and bring up something else all on the same hardware for benchmarking (and space!) reasons. I got the excellent book “7 databases in 7 weeks” at christmas, some of my classmates at Dundee have run clusters of Mongo and Cassandra which I want to look at and I also want to have a play with Neo4J for the graphing side of protein interactions
So with that in mind I gave the credit card a good workout at overclockers.co.uk and ended up with this lot.
8-core AMD piledriver 4*2Tb disks, 256Gb SSD, 990FX motherboard, 32Gb RAM, a cool case and a fan with blue LEDs (couldn’t resist a bit of bling). I already had a 1Tb drive on another PC that I’ve recycled into this setup too.
Here it all is, just as the fun bit begins – I love building PCs, it looks like you’re a genius at work whereas in fact components are so easy to work with now it’s a doddle to do…
So we’re all up an running, I’m using Windows 7 on the host and I’ve gone for VMware to run the virtual images. The only reason for picking VMWare is that I know how to configure it and I’ve got a Teradata Aster (for work) setup running in VMWare so it made sense.
first of all get a nice Ubuntu template VM up and running, I followed my own instructions of this site to get a good base image running. There are a couple of changes as follows. I’ve used the 64bit version of Ubuntu as I’ve got plenty of RAM to play with
JAVA: I’ve decided to go for the open JDK this time, mainly because it’s a simple $sudo apt-get install openjdk-6-jre from the command line, doesn’t seem to be any compatibility issues so far
NETWORK: I added a second NAT network adapter, the idea is to have one dynamic ip address which will allow the VM to connect to he internet and one with a hardcoded IP address on the 192.168.100.xxx range which I can add to the hosts files in linux which will let the VMs communicate with each other. It also means I can run the Hadoop monitoring tools in IE on the host machine
So armed with my pre-configured template-VM, I copied it onto each of the 4 2Tb drives, setup one with 1*core and 4Gb ram for the namenode and three with 2*cores and 4Gb ram for the datanodes. I fired them all up and spent a bit of time getting the network connectivity all sorted, my /etc/hosts/ file looks like this
192.168.100.11 hadoop1 localhost
Note that the loopback 127.0.0.1 address are commented, this saves you a load of problems when you come to firing up the cluster later.
I downloaded the Apache version of Hadoop 1.0.4 (latest stable release) and followed the instruction on the Install Hadoop tab, there are a few tools around to help automate this process but I wanted to get stuck in and do it all manually, it’s been a while since I did this so I wanted to get back in to what all the config. steps are. A few things have changed since the last version I installed 0.23 – such as it’s no longer HADOOP_HOME but HADOOP_PREFIX but it’s more or less the same. I did make the effort to set up rcp this time and it made things easier copying the config files around.
The install all went pretty smoothly, I did have one new error that I’ve not seen before… when trying to run my first MapReduce job, it failed with this error.
java.lang.Throwable: Child Error
Caused by: java.io.IOException: Task process exit with nonzero status of 137.
it turns out this was caused by me changing the amount of memory available to each task in the mapred-site.xml file, I deleted this bit and we’re off it’s all running and looks pretty cool too!
I checked out Michael Noll’s excellent site for some advice on Benchmarking and ran the tesdfsio and terasort benchmarks.
this is the result from the read benchmark
13/02/23 13:27:38 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/02/23 13:27:38 INFO fs.TestDFSIO: Date & time: Sat Feb 23 13:27:38 PST 2013
13/02/23 13:27:38 INFO fs.TestDFSIO: Number of files: 10
13/02/23 13:27:38 INFO fs.TestDFSIO: Total MBytes processed: 9212
13/02/23 13:27:38 INFO fs.TestDFSIO: Throughput mb/sec: 73.95932510644002
13/02/23 13:27:38 INFO fs.TestDFSIO: Average IO rate mb/sec: 92.9056167602539
13/02/23 13:27:38 INFO fs.TestDFSIO: IO rate std deviation: 47.10222517657025
13/02/23 13:27:38 INFO fs.TestDFSIO: Test exec time sec: 61.645
looks OK to me but if anyone has any info. on what’s good/bad/indifferent in terms of IO – I’d love to know.
- Install the rest of the eco-system (pig hive etc. etc.)
- set up Eclipse and the hadoop plugin (apparently it’s quite a challenge with 1.0.4)
- make sure all my current MR jobs still run as expected.
- Crack on with some pratical work on the PhD