- UCLA M.S. Computer Science
- Android Developer
- Freelance IT
or, thoughts from a random graduate student
In short, I'm back at school, where the objective is to read lots, but not to write as much.
Someone mentioned to me while I was working at Google that it's more interesting to build things than measure things, which was one of the most notable things I took away from the internship.
As for the MS thesis/project, the hardest part is still getting started: I have yet to find a good idea in the partitioned global address space area to build upon.
I've found that in the short time I've been in graduate school, I've had to listen to a lot of presentations. I'm not saying I can give a good presentation, but I can surely tell you what makes a bad one. Here's a foolproof way to make sure your presentation is just as bad as many I've listened to already.
Step 1: Read off the slides
This one is pretty obvious, but it's amazing how many presenters still do it. Everyone can read for themselves; the audience doesn't need you to read for them.
Step 2: Speak very quickly
This one is bad too, especially if you are giving a talk in a language that is not your native language. It is difficult to keep pace if you can't figure out what is being said.
Step 3: Go over every possible detail
I've seen this happen quite a lot when presenters go over code and feel the need to go line by line, variable by variable. I think this is very similar to the COMMENT EVERYTHING mentality people have when they are learning to program. It suffices to go over the tricky bits only, or just to ask the audience for questions on the code. I've rarely seen an audience fail to ask questions about code.
Step 4: Talk about how unprepared you are
Just don't do it. So very unprofessional.
Step 5: Argue/chat with your co-presenters
Save it for after the presentation.
I'd like to take a moment to talk about doing research. I'm currently working towards my MS degree, which entails either completing a thesis or passing a (difficult) test. I think the thesis route is more useful overall, even though it's so much scarier, which is pretty much why I'm headed in that direction. I'm theoretically doing research in multicore systems/parallel languages, although it really is difficult to come up with a testable, unique idea. Of course, if that weren't the case, then everyone would be doing it.
In any case, I've recently been reading a lot about PGAS languages, particularly UPC, and the runtimes associated with them. The main component that enables performance on distributed systems is RDMA functionality, which in the Berkeley UPC case is supplied by GASNet, a communication library that abstracts the underlying communication fabric (InfiniBand, Myrinet, Ethernet, etc.) for high performance remote memory access. Now, where things get interesting is that the HPC community seems to have agreed to some extent that the SPMD/PGAS model is an interesting candidate for future computation, most prominently due to CUDA, as NVIDIA's graphics cards have enabled some incredible numbers on the latest TOP500 listings. Since today's commodity chips actually derive quite a bit (more than you'd expect) from yesterday's supercomputers, any sort of interesting functionality enabled by the communication network of today's supercomputers should be applicable to a future many-core (32+) chip. The key point here is that although the systems look quite different (think Jaguar vs TILE64), the architecture is really quite similar: as the number of cores on a chip increases, the need for a sophisticated communication network, very much like those found in today's massive supercomputers, becomes much more apparent. It is here that my research efforts are currently directed.
While all this is well and good, my problem is that I'm still lacking "the idea", although I suspect that is because I'm searching for some grand unifying theory of many-core systems, as it were, instead of concentrating on finding some salient point of interest and digging from there.
I've been reading articles about Partitioned Global Address Space (PGAS) for a week or two now, and I have to say, the programming model is almost exactly the same as CUDA's. The PGAS languages (the most prominent of which seems to be Unified Parallel C, or UPC) were originally developed for use on supercomputer class machines such as the Cray T3D and SGI Origin 3000 (I love reading about these machines!) and on Beowulf clusters. The main idea of these languages is to take advantage of the copious parallel resources of these machines by splitting the memory into several levels local to each processor. CUDA does this by dividing the memory and cache available to the stream processors on the GPU into thread local, block shared, and program global. These pretty much correspond to the private, shared, and global memory areas of Unified Parallel C, with the same thread restrictions. As I mentioned earlier, threads can be grouped into blocks. A block is basically a logical bundle of threads meant for use on a particular task, and both UPC and CUDA provide shared memory to this end. The typical use of this block shared memory is to allocate each thread a small slice of shared memory to write results into. As a result, to obtain maximum performance in both UPC and CUDA, intelligent layout of data in block shared memory is paramount.
In terms of language features, UPC and CUDA differ only very slightly, the primary differences being the more flexible memory layout of UPC and the more restricted language of CUDA. Underneath the language runtime, though, the architecture is also strikingly similar. On one hand you have machines like the Cray T3D, SGI Origin 3000, etc., which can arguably be considered ancestors of today's GPUs in a sense. The Crays and Origins of the past are typically massively parallel shared memory systems. Each compute board contains a few processing units, associated cache, and shared memory, as well as interconnects to many other similar compute boards. The memory structure of these machines can be mapped to the PGAS programming model fairly easily. Thread private data can reside in the caches present on the compute boards, while each block can be represented by a single compute board, or even a compute cabinet, with the aggregate memory used as block shared memory. Global memory can be thought of as the control/loader machine's memory. Interestingly, while the model described above is relatively intuitive, the truth is that due to the hardware implementation of machines like the T3D and Origin 3000, partitioned global address space languages were frequently no faster than a traditional message passing approach, and may even have been implemented using message passing. This was especially true in the case of the T3D, as its specialized "shell" circuitry actually made the implementation of a global address space language quite difficult. In the papers I've read, strict comparisons between different parallel programming models are not as common as I would have liked. But let's move on to CUDA for now.
CUDA, on the other hand, also relies on a control/loader machine, the PC, to launch CUDA programs. Global memory can be thought of as this control machine's memory as well. From there it can load up and distribute sections of work to each logical thread block and its associated shared memory. The processing elements of the GPU (NVIDIA would have you call them streaming multiprocessors, I believe) are also very similar to a compute board in that there are many of them, and each element has several processors with associated cache and access to some of the global memory. (The memory model is more complex than I'm going into here, but IIRC, for some of the later GPUs, the number of processing elements is directly tied to memory bandwidth in a NUMA architecture of sorts.) In many ways, then, CUDA is the essence of late 80s/early 90s supercomputers distilled into a single add-in board for our modern PCs. Who says supercomputer technology is useless for the average consumer?