Friday, July 17, 2009

Friday's session, a finale -- "Clouds and Grids"

So the Friday session was an interesting one, to say the least. The topic shifted sharply from the previous set of topics to a future-looking commercial one: virtualization of hardware and commoditization of the grid. Now, if you participated and thought I brought away the wrong concept from this, feel free to let me know.

Here's the gist of Friday's single session as I boiled it down: virtualization is a key component in rapid deployment of technologies on the grid, and virtualization technologies let cluster administrators and grid operators move resources to where they can be most effective, in a logical fashion, while insulating against failures of the underlying physical machines as much as possible. The focus was on technologies like KVM, Xen, and VMware and organizations like Amazon (EC2) and Google (App Engine).

It is not fair of me to say that these two types of cloud or grid computing are disparate entities, because they're not; they're closely interwoven. However, it's also not fair of anyone else to say that they are directly tied together, except in the sense that all things seem to be directly tied together. Grid as it exists now is a means to solve large problems, and Cloud as it exists is a means to provide the solvers for those large problems. They're complementary. I just don't personally see where it had anything to do with the rest of the school, except that both are about huge numbers of computers under the control of a few operators. This is a subjective opinion, and if you have your own, feel free to voice it below, or write yer own post ;)

At the moment I'm posting this as complete, but expect a followup post before too long where I revisit part of this. Friday's presentation obviously can't be compressed into three paragraphs, but then, neither could any other day's session, and I've made those just as short. It's just that Friday's session was the most pertinent to a "moving forward" group of students.

Thursday, July 16, 2009

Thursday - Group Collaboration

Thursday's events were primarily focused on a team exercise intended to showcase our ability to use the technologies we had learned about during the previous two weeks. The opportunity was given not so much for our own benefit as for the staff's, to see what the individuals had learned from the presentations, and perhaps to gather feedback for the next cycle of the ISSGC.

Having said that, I'll mention briefly that the students had already been organized into teams for the scavenger hunt, but that due to some unintentional "weighting" of my original team, the teams were restructured. Our original team was solely OSG participants, and as a team we thought this was unfair to the spirit of the competition, so we made sure to mention it to the staff. The new teams were therefore given to us almost just-in-time for this competition. Nevertheless, we banded together as teams and set off to work when the assignment was given to us.

So let me set up the "what happened here" so everyone can get a glimpse. On the morning of the competition we gathered in the main auditorium and David set the stage for what we were about to be given. We had approximately 24 hours to complete the task, which was to find each of six "pillars", each focused on a given technology. Each pillar could only be found with the technology in question; for instance, the gLite pillar could not be found with the Condor toolkit.

Each pillar was placed in a 2D space on a Cartesian grid running from -10,000 to 10,000 on each axis. Each pillar contained a plaque bearing a word, and once the word was found, it was to be keyed into the scoring system along with the coordinates where it was found. The executable used to find each word was a pre-compiled jar given to us by the project leads and listed on the appropriate technology's page. Some pretense was made that the experiment was in a 3D space, but I've done the exercise; it was 2D only.

Initially the students were given a database and asked to write an app to retrieve values from it; those values were clues to help find some initial pillars. Most of the teams got those clues pretty quickly and were off to the next part: finding the pillars themselves.
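(For flavor, that first step was about as simple as it sounds. Here's a hypothetical sketch of it; the school's actual database host, schema, and credentials aren't something I'm reproducing here, so the JDBC URL and the "clues"/"hint" names below are my own placeholders, and you'd need the appropriate driver jar on your classpath.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ClueFetcher {
        public static void main(String[] args) throws Exception {
            // Assumed connection details -- substitute whatever the school handed out.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://dbhost.example.org/hunt", "student", "secret");
            try {
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT hint FROM clues");
                while (rs.next()) {
                    System.out.println(rs.getString("hint")); // one clue per row
                }
            } finally {
                conn.close();
            }
        }
    }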

Ok, so now I'm going to drop out of narrative and I'm going to get into "Cole's point of view" so this will be totally subjective and might even upset people. Guess what, it's a blog. If you don't like it, I don't know what to tell you. But consider yourself warned. I was not impressed, so this is likely to be somewhat inflammatory.

The point of the project is to find the pillars by searching a region of "space", and once a pillar is found, dig down till you find the plaque. The way it was supposed to work was like this:

java -jar <pillar-jar> x1 y1 x2 y2 scale

This generates a region (think of a piece of graph paper) with a number of cells equal to (x2-x1)*(y2-y1)*(1/scale), so if you chose

-10 -10 10 10 1, you would get a region with 400 cells,

and then you get back one of the three results:
These are not the droids you are looking for, move along (ok, humor aside: no pillar here)
Found something interesting, can't see it
Hey, writing! (and then one or more letters)

However, the results were stored in one of five files (one per pillar), each holding the location of its pillar (top-left corner), the pillar's height and width, and lastly the text.

To examine a pillar, the jar would read that data (every time, and as I'll get to in a moment, that's a lot of reads, so a lot of disk I/O) and then compute some random noise to fill a grid of the size specified in the inputs. The problem is that it would also sometimes fill the cell in question with noise, so given the same inputs to the same technology, it was possible to get different results (trust me, I repeated it, and I got different results for the same inputs). Now, do I expect every program to always give me perfect results? No. But if I'm doing statistical analysis against a rowset returned from a database, or if I'm running a grep for a string, I don't expect to get different results every time, provided the inputs are consistent and there's no room for network errors. I'll accept that those may be flawed, and that that could be the problem. But for something this simple (I decompiled the JAR; I know what the code did), I would've expected consistent results, which I didn't get.

In a perfect world, a person could take an example like the above, break the space into reasonably sized chunks, and then check whether a pillar was found in a given region. If not, expand the region slightly and search again. If so, narrow in on the difference between the two regions, since that's where the pillar must be, and test again. This is basic geometry.
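To make that concrete, here's a rough sketch of the idea in Java. The probe() method is a stand-in for shelling out to the jar and classifying its answer as one of the three results listed above; the real invocation and output parsing were technology-specific, so treat this as the shape of the approach, not working tooling.

    // Three answers, mirroring the "no pillar" / "something" / "writing" results.
    enum Result { NOTHING, SOMETHING, WRITING }

    public class PillarSearch {

        // Placeholder: wrap the real jar invocation and output parsing here.
        static Result probe(int x1, int y1, int x2, int y2) {
            throw new UnsupportedOperationException("wrap the real jar here");
        }

        // Quarter any region that reports a hit, until it's small enough
        // to dig for the plaque.
        static void search(int x1, int y1, int x2, int y2) {
            if (probe(x1, y1, x2, y2) == Result.NOTHING) return; // nothing here, prune
            if (x2 - x1 <= 10 && y2 - y1 <= 10) {
                System.out.println("pillar near (" + x1 + "," + y1 + ")");
                return;
            }
            int mx = (x1 + x2) / 2, my = (y1 + y2) / 2; // split into quadrants
            search(x1, y1, mx, my);
            search(mx, y1, x2, my);
            search(x1, my, mx, y2);
            search(mx, my, x2, y2);
        }

        public static void main(String[] args) {
            search(-10000, -10000, 10000, 10000); // the full Cartesian grid
        }
    }

Quartering like this means refining from the full 20,000-unit span down to a 10-unit cell takes about eleven levels, at no more than four probes per level per live region - call it a few dozen submissions per pillar, which matters for what comes next.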

However, and this is where I got disillusioned: the students would form one large region with one really small scale, and they would then send a batch job with those inputs off to be farmed against the data. And they would do that with 800 batch jobs or more at one time. For each team, on each technology.

Let me repeat that, if I may. Five teams submitted five technologies' worth of jobs at 800+ runs per batch against a server, creating hundreds of thousands of jobs, and each job had a file apiece for stdin, stdout, and stderr. At three files per job, that's hundreds of thousands of output files, plus the inputs to run them in the first place, so something like 400,000 very small files were created on systems that don't do well with lots of small files in one directory. So many jobs were submitted individually that the submission system effectively turned into a fork bomb of threads, and it opened so many connections to the test server that the server had to be rebooted.

This was not what we were taught to do at the school, and it was not a reasonable way to use the system. No one in the real world would submit 30,000 live job submissions into a queue, creating over 100,000 files in a single directory, just to look for a bit of info in a database. Of course, databases don't normally return garbage data on multiple runs of the same inputs. They may occasionally, and that's fine, but that's where you test three times, look for two consistent answers, and then accept that answer.
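Spelled out, that discipline looks something like this. This is my own sketch, not anything the school handed out; runOnce() stands in for whatever flaky call you're guarding (say, probing the same region with the same inputs), and the answers need sensible equals()/hashCode(), like Strings.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.Callable;

    public class MajorityOfThree {
        static <T> T acceptTwoOfThree(Callable<T> runOnce) throws Exception {
            Map<T, Integer> counts = new HashMap<T, Integer>();
            for (int attempt = 0; attempt < 3; attempt++) {
                T answer = runOnce.call();
                int seen = counts.containsKey(answer) ? counts.get(answer) + 1 : 1;
                if (seen == 2) return answer; // two runs agree: accept that answer
                counts.put(answer, seen);
            }
            throw new IllegalStateException("three runs, three different answers");
        }
    }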

So, let's say that I'm the one out of line. Let's consider: the test systems were set up with no limitations (nobody anticipated 30k+ threads active at one time), even though live systems would have those limitations; the students didn't consider what the problem was, they went for brute force; and then there's the matter of the queues.

The system was set up so that gLite, Globus, and Unicore (IIRC) all used a single PBS scheduler. Three toolkits were competing for one hardware resource, with thousands of submissions against each queue. In a real-world environment, those queues wouldn't have been squashed by that number of requests: gatekeeper software would've prevented that kind of rapid-fire submission, and the meta-schedulers would've sent different requests off to different PBS queues in the first place.

So even if it were a valid real-world exercise, nobody would submit a few hundred thousand jobs against two simple queues on two small clusters; that load would've been spread across the grid.

But also remember that each job had to read from the same set of data files, and there were only five of them, so I'm sure either the files got cached in memory by the filesystem layer (plausible, but unlikely) or the disks got thrashed. I'll never know for sure. But triggering a few hundred thousand reads against five files is not a sane activity; triggering a few thousand reads against a few hundred files is. Does anyone see what I mean?

So, while I was quite pleased with the assignment, and while it could've been fun as a group collaboration, most of the participants did not seem to understand what the goal was, nor how to attack it.

I personally did it the way it should've been done and saw that it took just a few submissions to get results, and I watched Ben Clifford (one of the moderators) find his first pillar in about three minutes without any brute-force approach. But when I spoke with many of the students, they didn't see why an elegant bounding approach was the best way to do it. They felt that submitting hundreds of thousands of brute-force attempts was sufficient.

Because of this, because there weren't sufficient technological safeguards put in place, and because the code returned inconsistent results on the same inputs (and no, don't ask me now; it's been too many days since I did it, but I did bitch at the time and nobody asked me for the verification inputs) - because of those reasons, I was disillusioned with the experiment.

Now, having said ALL of that, let me finish my post with this. I learned a lot more that day than you might think. I am very very grateful to the organizers for setting that up. I had a lot of fun working with the advisers to solve the problems. And I'm glad I got the chance to play with real tech on a reasonable problem.

Ok, where do I need to clarify my points? Feedback people, feedback!

Wednesday, July 15, 2009

Wednesday's session, a placeholder

Wednesday's AM sessions were on the PGRADE system, and then we talked about the semantic web.

PGRADE is a Hungarian-based project that is similar to UNICORE. PGRADE helps provide a more graphically oriented approach to monitoring an application on the grid. It also collects trace information as programs run, to help tweak future runs and to understand what the project is doing on the grid at the time. The system also gives the user some pre-formed templates to assist in writing more coherent code, and code that should perform better in a parallel environment.

PGRADE is not a meta-scheduler or a resource handler; instead, it works "on top of" Globus or Condor, for example. It uses the same mechanisms for job submission and reconciliation as any of the command-line-only apps would use; PGRADE just helps the user drive those utilities. In an environment where grid specialists can't help develop every application, this is a really handy utility, and the only improvement I can suggest is a standard reporting mechanism built into the Globus or Condor models (again, for instance) that would allow program tracing to be more effective. That is not a light endeavour - especially given that neither the standard Linux nor Windows kernels offer such tracing by default - so I'd say they're doing the best they can to offer tracing.

As for the semantic web and its relation to the Grid: as a computing professional, I tend to get all choked up when people talk about "the semantic web" because it seems so pretentious. I just thought I'd toss that out there to catch the attention of my readers. (And because some of the presenters and attendees felt the same way.)

However, in this regard, I firmly believe that the OGSA ontological semantic model is doing exactly what the semantic web was designed and intended to do. I'm not going to go into specifics on what OGSA or the semantic web are, because both are easily googlable. I just wanted to mention that they were the topics of discussion.

I will say now that the concept of semantic data is best summed up like this: all data in modern systems is inherently binary, and as such, a binary blob has no distinct characteristics. Sure, a certain binary blob may correspond to ASCII text, and we may be able to read it, but the fact that it's ASCII does us no good until we can associate, for instance, an author, or even a title, with the file. That may sound too simple to the casual reader, but that's exactly what we're talking about. The trick is: how does one go about attaching the metadata to the file? That's what the OGSA semantic ontological model is all about - giving us a defined format and model.
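To make that less abstract, here's a toy illustration of the concept: metadata as statements about an otherwise-opaque blob. This is emphatically not the OGSA model's actual format - the "dc:" names are just Dublin Core-style examples, used loosely - but it's the shape of the idea.

    public class TripleDemo {
        // One subject/predicate/object statement about some resource.
        static void state(String subject, String predicate, String object) {
            System.out.println("<" + subject + "> <" + predicate + "> \"" + object + "\"");
        }

        public static void main(String[] args) {
            String doc = "file:///archive/report.txt"; // the blob itself tells us none of this
            state(doc, "dc:title",   "Site Availability Report");
            state(doc, "dc:creator", "A. Student");
            state(doc, "dc:format",  "text/plain");
        }
    }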

Further details are best left to specific questions, in my mind, but perhaps that's because I also feel comfortable answering the question "what's the difference between a database and a filesystem?". - Cheers!

Tuesday, July 14, 2009

Tuesday - Were YOU paying attention?

So Tuesday was the day we went to the beach for fireworks, but before that, we had lectures and lessons. We learned about GridSAM and many of the other technologies coming from OMII in the UK. I really like this tech, and I'm interested in deploying it on a test grid so I can learn for myself.

Some of the highlights I took away for myself include (as I understand them):
Common-VFS hooks, so for the user it's seamless, and
Workflow support, and
Common support for all the major middlewares and submission engines.

I don't want to detract from GridSAM; it's just that by this point in the school, we were all a little overwhelmed by what we were learning.

So, look, everyone just wants to see pictures, and these are the pictures! "The Bastille Day fireworks show", as I understand it, although I'm sure someone will correct me. These are the shots I think were fairly representative of the ones I took, but others got some much better ones, so you'll just have to check with them to see what else there was. I'll also mention that this show was one of the shorter ones I've seen in a while.

Monday, July 13, 2009

Monday, July 13, 2009 Session Notes

So today is a bit of an odd schedule. First we started with a clip of The Daily Show from http://www.thedailyshow.com/watch/thu-april-30-2009/large-hadron-collider - let's see if I can embed it...

[Embedded clip: The Daily Show with Jon Stewart - "Large Hadron Collider" - www.thedailyshow.com]

Yay, now you can see how we started our day. Otherwise, I'd say today is going pretty well. We've started with a discussion of grids and their common characteristics, talking about the infrastructure that's necessary for an effective and efficient system. I firmly believe we're seeing and discussing this today because we're the ones who are going to maintain and build this system, and to improve it. However, I'm not seeing any sort of non-political discussion, so once again I come back to something I've tried to keep off the blog, which is that the grid seems to have a nearly felonious association of productivity and politics. For now, I'll leave that discussion out, as it can be very subjective.

So the afternoon was a set of Q&A on topics regarding the grid. Some of the topics were:
Filesystems on the grid,
The Cloud vs. the Grid,
How do we do performance metrics on the Grid?, and
What are the key differences between HTC and HPC?
The answers wavered between the technical and the requirements-driven, but without the context of the school it's not really fair to examine the aspects of the Q&A. I'm just blogging this part in the context of "this was part of the school".

Cheers, C

Sunday, July 12, 2009

Sunday outing - Cole's perspective

Well, today I finally passed 3 GB of data collected with my camera. Yeah, at this point it's going to be difficult, short of hosting it myself, to get everyone a copy. Stay tuned in case I upload everything; for those who want their own snapshot, bring me a laptop or a USB drive of over 8 GB by the end of the show. Also, to all other participants: I'm collecting photos!

Saturday, July 11, 2009

Saturday at the base camp

Ok, so Saturday was a split day, where we went to lecture/practical in the AM, and we went to Nice in the evening. Let's start with the lecture/practical now, and I'll do the trip to Nice after a break. And yeah, I started with a pic of me, it's my blog post ;)

From my point of view, as a grid virgin, I thought that the presentation Michael gave really coalesced everything into understanding, so kudos to him. Though it could be that I've just been listening to it for a week and it finally kicked in...

So anyways, Saturday was about ARC, which was previously NorduGrid, IIRC, and this was mostly a catch-up for those who are familiar with what has been happening with ARC and how they're trying to get to a 1.0 release. It's an incomplete toolkit, but from what I've seen of the middlewares, it seems to be the closest of the EC toolkits to my way of thinking. (I'm still partial to Condor for some reason, but I digress by mentioning Condor here.)

And of course, after the lecture, nearly everyone was ready to go to Nice. - So let's talk about Nice. What's to say? We all went in different directions. The guys in my group went to the beach, and the girls I talked to went shopping. Later we all met at the restaurant for dinner. Now we have pics.

Haha, now you've got three men at the beach, so I'm not doing any more beach pics here. C'mon, this is a family affair, no? Anyways, the beach was fun, dinner was better, and I'll be uploading those with the actual picture archives we're assembling later. Cheers.