So, I’ve finally given in to peer pressure/marketing/fanboi-ism/greed, or whatever you want to call it, and ordered a MacBook Pro. It seems that an Apple laptop is the mark of a Data Scientist and, having resisted and ridiculed Mac owners for many years, I’m swallowing my pride (and killing my bank balance). Having used a few Macs recently I have to admit that they are useful for more than watching films and looking cool while hanging around in Costa… The Unix-based OS is what finally convinced me: developing in Hadoop is nice and simple when it’s all running natively on your machine. Well, that’s my excuse anyway :-) The projected delivery date is the 20th December so it should be a very merry Christmas for me!
We’ll see how I get on. First thing is to change the way the mouse scroll goes the wrong way (I’m sure they designed that specifically with the intention of annoying me), then find the shortcuts for things like End and Home… I’ll have to look for a PC-to-Mac dummy’s guide.
Last week was an eventful one. I started the week at the Strata conference in London, working on the Teradata stand, and even had a presenting slot (more on that later). The rest of the week was spent in Holland and the UK out at client sites. Strata is a great conference: I really like the atmosphere, and the attendees are always full of enthusiasm and great questions. It was good fun working on the stand, and I picked up a great bit of swag from the guys at WANdisco, who have a nice product for introducing disaster recovery and replication to Hadoop. They also had those portable mobile phone chargers (a battery you charge up and take with you for when your phone is low). One saved my life at Amsterdam airport when my phone was about to give up and I needed it to show my electronic boarding pass!
So it was a real privilege presenting on the same agenda as the likes of Doug Cutting, Tom White and James Burke. I mean James Burke! He was a real hero of mine from “Tomorrow’s World” and “Connections”. His speech was really inspirational, about a future containing “fabbers”: devices that are like 3D printers but can print anything (phones, TVs, cars, food, drink, absolutely anything). I’ll have to look after my health so I can live long enough to see the day they become reality; he reckoned 40-odd years… My presentation was on “7 fun things to do with MapReduce”. The idea was to introduce the concepts of MapReduce programming with some ideas of what you can do and how it fits into the programming structure. It seemed to be well received and I had some good feedback on Twitter, including the usual reference to me looking like Ricky Gervais. As in “Interesting talk on MapReduce but slightly distracted by the presenters likeness to David Brent”. Thank you!
I’ve submitted the same talk to the Hadoop World conference in Amsterdam next March. If it gets accepted, come and see it… and no, I’m not “doing the dance”.
I’ve just got back from a great week up at St. Andrews where they were running a summer school for PhD students on Big Data Vis. As I’m a part-timer and mostly study on my own at home (or in airport lounges or hotels!), these kinds of events are just the thing I need to remember what interaction with other students is all about. I actually met a guy from Glasgow Uni working in the same field as me; that alone could have made the whole week worthwhile.
Here’s a link to the details and a picture of all the participants
It’s been a long time since I’ve been to St. Andrews and it’s just as impressive as I remember. Even better, we had a whole week of sun, and there really is no better place than Scotland when the weather is like that. We stayed in the David Russell apartments, which were perfect for the week, and the catering was good. Funny that just 3 weeks ago I drove up to Essex Uni to move my 19-year-old son out of a flat almost exactly the same as the one I stayed in, although I think I did as much work in one week as he did in a whole term!
The week kicked off with Peter Triantafillou (University of Glasgow) giving an overview of Big Data. It was nice to hear a talk on this without any marketing spin for a change, and I will definitely be making full use of Peter’s formula: Big Data - Relevant Data = The Crap Gap. Nice, and so true. John Stasko was visiting from Georgia Tech in the States and spoke on the value of vis for exploring and understanding data. This was an interesting, well-paced lecture, and he introduced us to a tool called Jigsaw http://www.cc.gatech.edu/gvu/ii/jigsaw/ which his team is working on. This is something else I will definitely be using: it’s a really nice text analytics tool, very well documented and supported by publications. I can already think of a lot of uses for it.
Next up was the one and only Sean Owen (I forgot how much he looks like Tobey Maguire of Spider-Man fame), who co-wrote Mahout in Action. I checked with him and I was right, Mahout does rhyme with trout!!! I must confess that when I read Mahout in Action it was Sean’s bit on parallelizing the matrix manipulation that was beyond me and where I gave up on reading it. Since then I’ve completed Andrew Ng’s MOOC on Machine Learning and my Linear Algebra has come a long way… So seeing Sean talk has motivated me to get back to it and re-read the book properly. I love his comment “It’s not for faint-hearted developers”; now that sounds like a challenge. I wish he could have had more time, a lot more time, to speak. He ended up skipping half the slides and they looked really interesting.
During the first day we were split up into groups depending on which data set we were interested in. I chose Social Media and went into Team Birch with Aminu, Anil, Nut (great name!) and Ruth. We spent some time with Alex Voss from St. Andrews to discuss our data set (tweets on the day of the Boston bombings) and ideas for the project work. Sheelagh Carpendale from the University of Calgary gave us an overview of her own and her students’ research and introduced a lot of new ideas; I like the Phyllotactic patterns… I’m particularly interested in network vis., as the interactions between proteins can be thought of as a graph with proteins as nodes connected by edges and, as is usually the case, it quickly gets extremely large. We finished the first day with a talk from Ronan McAteer from IBM on the Watson project. This was really impressive and I can’t believe how quickly IBM have gone from a stage show with a backroom full of kit to a marketable solution that comes as a single server. I wonder who the first customer will be??
Dinner at Zizzis was all paid for by SICSA (thank you!) and we finally got back to the apartments at around 10:30. A long old first day.
Adam Barker ran the first lecture on Cloud Computing. He’s a good, enthusiastic presenter and the whole thing was very clear; you can now test me on the differences between IaaS, PaaS and SaaS. We had Stratis Viglas from Edinburgh talk on Big Data Programming Models. This was an excerpt from his Extreme Computing course, and it seemed that he had enough slides to teach the entire term’s course in one hour! We really could have done with 2 or more sessions from him because again the slides he skipped looked very interesting.
Mid-morning we headed into the lab for some hands-on exercises with EC2, Hadoop and MapReduce. I’ve not used the Amazon services before but I want to in the future, so this was a great introduction for me. I had 2 gripes. First, I spent half my time looking for the right symbol or shortcut on those bleedin’ Mac keyboards, where everything is different enough to annoy a long-time PC user (OK, so that’s my problem not the course’s!). Second, we didn’t get the solutions to the exercises (although Richard did say he was going to email them through next week). I’ve always found with Java coding that I can make something work, but when someone shows me a better design it always uses about a quarter of the amount of code that I did. Also there were a lot of students who weren’t Java-literate, so a guided example would have been useful. We made good use of the MapReduce environment in our project work after this session but I’m not sure whether the other groups did or not.
After lunch Sheelagh talked about Visual Verbs, which was something new to me but proved very useful with the project work. We also had a session to sketch out on paper what we would do for our project. This was fine as far as it goes, but as we didn’t have strong (or in fact any!) front-end programming skills in our group we were never going to be able to code up a custom solution from scratch, so our sketched designs were far too ambitious. Aaron Quigley finished up the day by talking about different Information Visualisation toolkits. It was a great session as most of these were new to me. I’ve got a big long list of things to follow up on after the course and a lot of these are on it. After dinner we spent the evening working on our projects and got most of the data cleansing and planning done.
At lunch we went out to have fish and chips in the town, and I ordered a small portion. Now I may have been the oldest student by some 15 years (yes, yes, older than most of the lecturers as well) but there was no need for this… when I got my receipt it said “Pensioners Portion” on it. Bloody cheek! The others didn’t let me forget that!!
With the very full schedule the time was whizzing past. Wednesday started with Iadh Ounis (University of Glasgow) talking about IR and real-time analysis. The MapReduce section was a bit of repetition from earlier lectures but it’s good to hear about it from another perspective. The section on Storm was very interesting as I’ve not used it or looked at it in detail before. It would have been nice to have been able to use it in our project work and build some real-time Twitter analytics. Miguel Nacenta and Uta Hinrichs from St. Andrews talked about current research on visualization and interaction. It’s good to see so much focus on touch screens. When I first started the MSc at Dundee they had one of the big old Microsoft Surface cabinets there. It was a great toy but there weren’t really any good applications for it apart from demo stuff. I’m still waiting for my Minority Report interface but the work that is going on is bringing it closer. The highlight for me was the Transmogrifier that Miguel demo’d. Wow, that is really impressive.
We had a session delivered by students where we presented our current research. I like these sessions; it’s good to hear what other people are up to and whether you can gain some useful info. or contacts. Apart from that the rest of the day was spent on project work. We came up with the 5 W’s of Social Media analysis: Who, What, Where, When and Why… I really like this, as it meant we could split the work up efficiently and work on different areas. Ruth came up with the idea of a web portal that let us plug the different sections into it to create an integrated interface, and we also finished off most of the rest of the data processing. I made some charts using Tableau Public, which is a nice tool to use and looks good. I was particularly impressed that it had a specific colour-blind palette for charts so that red-green colour-blind people like me can actually read them! Shame Microsoft didn’t think of that when they made the SSIS interface. The joke where I used to work was that I thought all of my code always worked because I couldn’t tell which tasks had failed and gone red…
After a quick catch up with the groups presenting their work so far it was dinner and then the rest of the evening working on the projects.
Only 2 lectures today, with the rest of the day on project work. Aaron kicked off with network visualization and next-generation vis. For my part, I would gladly have listened to Aaron all day and not done any of the project work. He had enough material for at least 4 hours of detailed work, there were some complicated concepts in there, and after all I was there to learn from the experts. Maybe he can be persuaded to record some lectures MOOC-style :-) Per Ola from St. Andrews finished the lecturing with a presentation on crowd-sourcing. This was interesting and mostly new to me, and I’m now wondering if I can sign my sons up to Mechanical Turk and get them earning some money rather than just spending mine!
We had a fantastic dinner at the Swilcan restaurant overlooking the Old Course at St. Andrews golf club (I like the fact that the New Course was opened in 1895!!). Dinner was great, the wine was great and so was the atmosphere.
The last day was all about time to finish off the projects and then present our work to the rest of the teams and judges. Our effort is still online at this URL if you want to take a look
All of the teams had produced a visualisation specific to their chosen data set and the work was all very impressive, with some really high-quality stuff there. I’m quite proud of what we achieved in the time, particularly that we got everything looking perfect on the laptop, projector, iPad and iPhone screens. Nothing earth-shatteringly innovative, but as a “Real Time Twitter Analytics Portal” it was a good job.
Overall a great experience: a mix of hard work, excellent lectures, good location and a great bunch of people. The whole week was organised really well, from the accommodation, dinners and lecture room bookings to the speakers and equipment, and went off without a hitch. Not a single minute was wasted from 8:30am to 11pm each day; even the 15-minute walk to and from the apartments was spent discussing project work or the day’s lectures… Can I book a place for next year already?
Finally here’s my twitter social graph using the metaphor of an alien bacteria…
I’ve decided to go virtual. It’s not just that the old cluster takes up a load of room, makes a noise like a jet engine and makes the lights dim in the house. It’s more because I want more flexibility: there are so many interesting things happening at the moment, and so many different versions and options, that I want to be able to fire up a cluster, shut it down and bring up something else, all on the same hardware, for benchmarking (and space!) reasons. I got the excellent book “Seven Databases in Seven Weeks” at Christmas, some of my classmates at Dundee have run clusters of Mongo and Cassandra which I want to look at, and I also want to have a play with Neo4j for the graphing side of protein interactions.
So with that in mind I gave the credit card a good workout at overclockers.co.uk and ended up with this lot.
8-core AMD Piledriver, 4 x 2TB disks, 256GB SSD, 990FX motherboard, 32GB RAM, a cool case and a fan with blue LEDs (couldn’t resist a bit of bling). I already had a 1TB drive on another PC that I’ve recycled into this setup too.
Here it all is, just as the fun bit begins – I love building PCs, it looks like you’re a genius at work whereas in fact components are so easy to work with now it’s a doddle to do…
So we’re all up and running. I’m using Windows 7 on the host and I’ve gone for VMware to run the virtual images. The only reason for picking VMware is that I know how to configure it and I’ve got a Teradata Aster setup (for work) running in VMware, so it made sense.
First of all, get a nice Ubuntu template VM up and running. I followed my own instructions on this site to get a good base image running. There are a couple of changes, as follows. I’ve used the 64-bit version of Ubuntu as I’ve got plenty of RAM to play with.
JAVA: I’ve decided to go for OpenJDK this time, mainly because it’s a simple "sudo apt-get install openjdk-6-jre" from the command line. There don’t seem to be any compatibility issues so far.
NETWORK: I added a second NAT network adapter. The idea is to have one dynamic IP address, which will allow the VM to connect to the internet, and one with a hardcoded IP address on the 192.168.100.xxx range, which I can add to the hosts files in Linux to let the VMs communicate with each other. It also means I can run the Hadoop monitoring tools in IE on the host machine.
So, armed with my pre-configured template VM, I copied it onto each of the 4 x 2TB drives and set up one with 1 core and 4GB RAM for the namenode and three with 2 cores and 4GB RAM for the datanodes. I fired them all up and spent a bit of time getting the network connectivity all sorted. My /etc/hosts file looks like this:
192.168.100.11 hadoop1 localhost
Note that the loopback 127.0.0.1 addresses are commented out; this saves you a load of problems when you come to firing up the cluster later.
I downloaded the Apache version of Hadoop 1.0.4 (latest stable release) and followed the instructions on the Install Hadoop tab. There are a few tools around to help automate this process but I wanted to get stuck in and do it all manually; it’s been a while since I did this so I wanted to get back into what all the config. steps are. A few things have changed since the last version I installed (0.23), such as it’s no longer HADOOP_HOME but HADOOP_PREFIX, but it’s more or less the same. I did make the effort to set up rcp this time and it made things easier copying the config files around.
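As a rough illustration of the layout described above, here is a minimal Python sketch that builds the hosts file text for a four-node cluster. Only the hadoop1 entry comes from my actual file; the names hadoop2 to hadoop4, their addresses and the build_hosts helper are assumptions made up for the example.

```python
def build_hosts(nodes, local_name):
    """Return /etc/hosts text for a small cluster.

    nodes: list of (ip, hostname) pairs; local_name: hostname of this VM.
    """
    lines = [
        "# 127.0.0.1 localhost",        # loopback entries left commented out
        "# 127.0.1.1 " + local_name,    # Ubuntu's usual second loopback line
    ]
    for ip, name in nodes:
        # Only the local node also answers to "localhost".
        alias = " localhost" if name == local_name else ""
        lines.append(f"{ip} {name}{alias}")
    return "\n".join(lines) + "\n"


# hadoop1 is from the post; the other three names and addresses are assumed.
NODES = [("192.168.100.%d" % (10 + i), "hadoop%d" % i) for i in range(1, 5)]

if __name__ == "__main__":
    print(build_hosts(NODES, "hadoop1"), end="")
```

Each VM would get the same list, with its own name passed as local_name.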
The install all went pretty smoothly, I did have one new error that I’ve not seen before… when trying to run my first MapReduce job, it failed with this error.
java.lang.Throwable: Child Error
Caused by: java.io.IOException: Task process exit with nonzero status of 137.
It turns out this was caused by me changing the amount of memory available to each task in the mapred-site.xml file. I deleted this bit and we’re off: it’s all running and looks pretty cool too!
I checked out Michael Noll’s excellent site for some advice on benchmarking and ran the TestDFSIO and TeraSort benchmarks.
This is the result from the read benchmark:
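A note on that 137: by the usual shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 is signal 9, SIGKILL. That fits a task being killed from outside, which a bad memory setting could plausibly cause, though I haven’t confirmed the exact mechanism. A small Python sketch of the decoding (the function name is mine, not anything from Hadoop):

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status such as the 137 in the error above."""
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "signal %d" % signum
        return "killed by " + name
    return "exited with error code %d" % status


if __name__ == "__main__":
    print(describe_exit_status(137))
```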
13/02/23 13:27:38 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/02/23 13:27:38 INFO fs.TestDFSIO: Date & time: Sat Feb 23 13:27:38 PST 2013
13/02/23 13:27:38 INFO fs.TestDFSIO: Number of files: 10
13/02/23 13:27:38 INFO fs.TestDFSIO: Total MBytes processed: 9212
13/02/23 13:27:38 INFO fs.TestDFSIO: Throughput mb/sec: 73.95932510644002
13/02/23 13:27:38 INFO fs.TestDFSIO: Average IO rate mb/sec: 92.9056167602539
13/02/23 13:27:38 INFO fs.TestDFSIO: IO rate std deviation: 47.10222517657025
13/02/23 13:27:38 INFO fs.TestDFSIO: Test exec time sec: 61.645
Looks OK to me, but if anyone has any info. on what’s good/bad/indifferent in terms of IO, I’d love to know.
- Install the rest of the eco-system (Pig, Hive etc.)
- Set up Eclipse and the Hadoop plugin (apparently it’s quite a challenge with 1.0.4)
- Make sure all my current MR jobs still run as expected.
- Crack on with some practical work on the PhD
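For anyone wanting to compare runs, here is a minimal Python sketch that pulls the numbers out of the log above and derives two extra figures. As I understand it, TestDFSIO’s "Throughput" is total MB divided by the summed time of the individual tasks, so dividing the other way gives that summed time; dividing total MB by the wall-clock "Test exec time" gives a rough aggregate rate that includes job start-up overhead. Treat both interpretations as assumptions rather than gospel; parse_testdfsio is just a helper I made up.

```python
def parse_testdfsio(log_text):
    """Collect the 'key: value' pairs that follow the fs.TestDFSIO marker."""
    marker = "fs.TestDFSIO: "
    results = {}
    for line in log_text.splitlines():
        if marker not in line:
            continue
        body = line.split(marker, 1)[1]
        key, sep, value = body.partition(": ")
        if sep:
            results[key.strip()] = value.strip()
    return results


# A subset of the log lines from the read benchmark above.
LOG = """\
13/02/23 13:27:38 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/02/23 13:27:38 INFO fs.TestDFSIO: Number of files: 10
13/02/23 13:27:38 INFO fs.TestDFSIO: Total MBytes processed: 9212
13/02/23 13:27:38 INFO fs.TestDFSIO: Throughput mb/sec: 73.95932510644002
13/02/23 13:27:38 INFO fs.TestDFSIO: Test exec time sec: 61.645
"""

results = parse_testdfsio(LOG)
total_mb = float(results["Total MBytes processed"])
throughput = float(results["Throughput mb/sec"])
wall_seconds = float(results["Test exec time sec"])

# Assumed interpretation: throughput = total MB / summed task time.
summed_task_seconds = total_mb / throughput
# Rough cluster-wide rate over the whole job, overhead included.
aggregate_mb_per_sec = total_mb / wall_seconds

if __name__ == "__main__":
    print("summed task time (s): %.1f" % summed_task_seconds)
    print("aggregate MB/s: %.1f" % aggregate_mb_per_sec)
```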
So the tickets sold out in a record time of just over 1 hour 40 minutes this year. For the uninitiated: every year (except for the 4 yearly fallow years like 2012) the ordeal of trying to get tickets to the Glastonbury Festival gets more stressful. This year over 2 million people were trying to buy the 150,000 tickets at the same time… so just how do you maximise your chances of success?
1) Get the right kit
There’s no use getting up at 8:45 on Sunday morning, switching on the PC and hoping to get straight to the booking page. Oh no, there is a lot of planning involved. Just remember, if you do the same thing as everyone else you will have the same chance of success as they do, and I don’t like those odds…
I first thought I would use my Hadoop cluster and have 12 PCs hammering away at the F5 key, hoping to get the elusive booking page to show (instead of the dreaded 404 page not found because the server is hopelessly overloaded). Instead I went for speaking to my neighbours and asking if I could “borrow” their wireless internet for the morning. Now you have to get on with people and be trusted by them to ask this sort of thing, but I think they took pity on an ageing rocker, and so on Sunday morning I had 3 laptops each attached to a different internet connection plus the iPad on a 3G card. That’s 4 internet connections through 3 separate ISPs and 4 separate IP addresses. That has to be better than however many tabs on however many browsers on a single or multiple PCs through one internet connection, doesn’t it?
Next, what browser? A bit of research suggested Chrome and Firefox have had better success than IE, but people would say that, wouldn’t they! Anyway I went with a fresh install of Chrome, so no spyware, add-ons etc. to slow down my page loading, and for the hell of it no virus protection. I figure I’m only going to be hitting one page and only for a couple of hours, so what could possibly go wrong?
2) Don’t play by the same rules as everyone else – have some “inside” knowledge
So if you think about it, there’s no way that whoever is selling tickets for Glastonbury sets up a system to do so and then leaves it like that… the things are only on sale once a year for a couple of hours, but in that time over 2 million people hit the system. This will have to be a special setup just for this purpose (which, if you work in IT, might make you wonder whether it was tested properly!). Now you are going to need some load balancing; one server isn’t going to handle this lot. It turns out that some plum who set the system up made a mistake in the DNS entries. Instead of typing 194.168.xxx.xxx he’d put 192.168.xxx.xxx for one of the servers. It’s actually an easy mistake to make, because many internal networks do start 192; in fact your home network almost definitely does.
This meant that the entire load of browser requests was going to only one of the servers while the other one was sitting there doing nothing. Now some bright spark either worked this out or got some inside info. It was then a simple matter of editing the hosts file on your PC to point to the 194.168.xxx.xxx address missing from the agency’s erroneous DNS entries, and voila, you had a big fat server all to yourself (and a few 1000 others who’d got the info.) and the tickets were as good as in your hand!
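If you’re wondering why a hosts file entry beats DNS: on most machines the resolver checks the local hosts file first and only asks DNS when there is no match. Here is a minimal Python sketch of that lookup order. Everything in it is made up for illustration: the hostname tickets.example, the 203.0.113.x addresses (a range reserved for documentation) and the fake_dns stand-in; none of it is the real ticket site.

```python
def parse_hosts(text):
    """Turn hosts-file text into a {hostname: ip} table, ignoring comments."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # First entry for a name wins, as in a typical resolver.
            table.setdefault(name.lower(), ip)
    return table


def resolve(name, hosts_table, dns_lookup):
    """Check the hosts table first, then fall back to DNS."""
    ip = hosts_table.get(name.lower())
    return ip if ip is not None else dns_lookup(name)


# Hypothetical override pointing at the "idle" server.
HOSTS = """
# manual override
203.0.113.20 tickets.example
"""


def fake_dns(name):
    """Stand-in for public DNS, which only knows the overloaded server."""
    return "203.0.113.10"


if __name__ == "__main__":
    table = parse_hosts(HOSTS)
    print(resolve("tickets.example", table, fake_dns))
    print(resolve("other.example", table, fake_dns))
```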
3) Know where people in the Know hang out and use their knowledge
Now I wish I could claim that I worked out the above myself, maybe by writing an awesome MapReduce job and running it on my home cluster, but I didn’t. However I did know where the sort of people that would work this stuff out would be and how to get the info. There are forums out there on the internet where incredibly knowledgeable people post about all sorts of things. If you turn up as a newbie and ask a question you’re likely to get a “did you use search?” or “let me google that for you” type of answer. If you ask the right question to the right person in the right way then there’s some awesome info. around.
There is a forum for festivals, Glastonbury among them, and if somebody works out how to get in through a backdoor that’s where the info. will be. It was this year and I expect it will be next year too.
4) Take the risk
Actually, changing your hosts file to point at an IP address provided by someone on the internet that you don’t know, and then using that site to hand over £400 (8 deposits of £50), is an insane thing to do. Pretty stupid.
No, really, don’t do it unless you are happy to risk losing your money (and maybe identity etc.). But there are those who knew of the backdoor and either wouldn’t take the risk or didn’t have the tech know-how to change their hosts file, and they may well be sitting at home watching the TV next June instead of standing in the rain in a muddy field.
5) The Devil take the hindmost
After the backdoor had been discovered there was around 20 minutes of activity before the ticket agency sorted their IP addresses and opened up all the servers to the outside world. This meant that if you had edited your hosts file just before or after the addresses were sorted you were now at a disadvantage, being directed to only one server instead of being load balanced. And if you don’t remember to take the entry out of your hosts file, then next year the chances are you’ll be trying to hit a server that doesn’t exist anymore!!
So what has all this got to do with Big Data? Not much maybe, but I like the story. I like the fact that finding a small amount of information in a huge mass (as in the internet) can greatly increase your odds of success. That information can be gained through an unorthodox method: either by working it out from other information that is not directly related to your problem, or by getting it from a community of experts. It may lead to a risky path that could fail, but one of the Big Data stories is the “fail fast” method.
To paraphrase some comments from the Glastonbury chat forum:
The days of one person logging on and getting a ticket for themselves are long gone. You need to get into groups and you need to have Facebook, Twitter, MSN and forums open so you can get information on how to get tickets. (now that sounds more like Big Data!)
There are 10 types of people in the world, those with Glastonbury tickets and those without (a nice twist on an old geek joke!)
Like I said at the beginning, if you do what everyone else does your chances of success are going to be the same as theirs.
See you at the Cider Bus!
PS: there has been talk about whether the hosts file backdoor hack was possible because of an error on the part of the IT setup or as a side effect of the way the ticket agency had tried to limit traffic… not sure which is true, but saying it was an error makes a better story.