Saturday, October 11, 2008

How to guarantee you will not be considered for a student internship

Since I founded the Collaborative Software Development Laboratory in 1991, I have provided research positions and internships to students from across the world, including Germany, Italy, India, China, Japan, Australia, Iceland, and Indonesia. Providing these opportunities, and learning from the students' differing cultural backgrounds, is one of the great pleasures of being a professor.

Every semester I receive dozens of emails from students around the world who are requesting consideration for a research position of some sort in my lab. Unfortunately, most of them are quite similar to the one I just received this morning:

Dear Prof.,
To introduce myself, I am a 3rd year student of the Department of [Deleted] Majoring in Statistics and Informatics(5 yr.Integrated) at [Deleted], [Deleted]'s premier research organization, looking at the possibility of obtaining a position for Summer Internship, gelling with my academic background. I am aware of the superior quality of research at your institute ,I have decided that your current research work matches my interests to a remarkable degree.

Enclosed please find a copy of my resume. A number of details about my profile appear in the same. Yet no resume can comprehensively spell out everything.If my profile, prima facie matches with your requirements for a Summer Intern, please revert back, so that I could furnish any more relevant information.May I please also enquire whether some funding may be available for this internship in the form of a grant or scholarship?
Looking forward to a reply in the affirmative.

If selected by your consideration I promise to complete my assignments with utmost sincerity.

Thanking you
Your Sincerely....

There are, of course, a number of grammatical errors in this email, but since this student is clearly a non-native speaker, those errors would not deter me in the slightest from considering him for a research internship.

What makes this request a non-starter is the fact that this student sent me a form letter: there is not a single detail in this letter that provides evidence that the student has any clue about my research interests. Indeed, this student does not even take the time to address the email to me personally!

So, with the goal of helping other students who might be interested in academic experiences outside of their current environment, here is a simple guideline:

Send 10 personal, carefully written emails to professors whose research interests really do match your own, with concrete details about their research and how it intersects with your academic interests. These 10 emails have higher odds of success than (for example) 1000 generic emails sent to every professor in the country of your choice.

What might a "personal, carefully written email" include? Here are some ideas:
  • Include the professor's actual name.
  • Include the professor's actual laboratory name.
  • Include references to at least two recent publications of the professor, with questions and/or comments about the papers that indicate that you have read the papers and reflected on them.
  • Include some simple, but concrete ideas on how you might contribute to the research. These don't have to be feasible or groundbreaking; just demonstrate that you're trying.
  • Optionally, include other areas of interest of yours, which might have some interdisciplinary connection.
This, of course, takes time: perhaps five or six hours per email.

When I receive this kind of email from a student, I consider it carefully, and even if their interests don't really match mine, I reply out of respect for the energy they clearly put into their request and try to provide pointers to colleagues with better matching interests.

And there you see the key to why this approach is more effective: if you, as a student, devote this kind of energy up front on a small number of letters to a small number of professors, you can enlist our help in the search process. In terms of searching the space of appropriate research institutes, having professors guide your search is significantly more effective than spewing out 1000 generic emails.

Have I ever received such a letter from a student? Yes, several times, and in fact the most recent student who introduced himself this way is arriving in my lab on Monday to start his internship.

Tuesday, August 26, 2008

Reflections on Google Summer of Code 2008


Back in March, we applied to the 2008 Google Summer of Code program on behalf of Hackystat. We didn't know too much about it, other than that it provided a chance for students to be funded by Google to work on open source projects for the summer.

With great glee, we learned in March that Hackystat was accepted as one of the 140 projects sponsored by Google. The next step was to solicit student applications, which we did by sending email to the Hackystat discussion lists. We ended up with around 20 applications. There were a few that were totally off the wall from people who had no clue what Hackystat was, and a few others that were disorganized, incomplete, or otherwise indicative of a student who would probably not be successful. But, a good dozen of the 20 applications appeared quite promising and deserving of funding.

Google then posted the number of "slots" for each project--the maximum number of students that they would support. Hackystat got 4 slots. The number of slots is apparently based partially on the number of applications received by the project, and partially on the organization's past track record with GSoC. Hackystat had no prior track record, and couldn't compete with the number of applications for, say, the Apache Foundation. The GSoC Program Administrator answered the anguished pleas of new organizations who got fewer slots than they wanted by basically saying, "Look, we don't want to give you a zillion slots and then have half a zillion projects fail. Do a good job this year with the slots you were given and reapply next year." Sound advice, actually.

We then started ranking the applications to figure out which four students should be funded. It was difficult and frustrating, because there were many good applications. At the end, we came up with four students who we felt had a combination of interesting project ideas and a good chance of success based on their skills and situations.

We were right. Three out of four of the students successfully completed their projects, and the fourth student had to drop out of the program due to sudden illness, which no one could have foreseen.

GSoC requires each student to have a mentor. This summer, Greg Wilson of the University of Toronto and I each took two students. Greg's students were physically sited at the University of Toronto, so he was able to have face-to-face interactions. My students were in China.

Student support took several forms over the summer. First, there was email and the Hackystat developer mailing lists. At the beginning of the summer, I received a few emails from students that I redirected to the mailing list, so that other project developers could respond, and also because the question asked was of general Hackystat interest. Fairly quickly, the students caught on, and started posting most of their general-interest questions to the list. I think this was one conceptual hurdle for the students to get over: they were not in a relationship just with me or Greg, but also with the entire Hackystat developer and user community. While there were certainly issues pertaining to the GSoC program that they discussed privately with their mentors, they were also "real" Hackystat developers and needed to learn how to interact with the wider community. All of the students acclimated to their new role.

We also requested that the students maintain a blog and post an entry at least once a week that would summarize what they'd been working on, what problems they'd run into, and what they were planning to do next. This was also pretty successful. You can see Shaoxuan's, Eva's, and Matthew's blogs for details. Interestingly, the Chinese students found they could not access their (U.S. created) blogs once they were in China, and so had to use Wiki pages.

Finally, I also set up weekly teleconferences via Skype with the two students I was mentoring in China. This was a miserable failure, probably due to my own lameness. Despite the fact that I live in a timezone (HST) shared by very few of my software engineering colleagues, and thus have lots of experience with multi-timezone teleconferencing, the Hawaii-China difference just totally threw me. The international dateline did not help matters. At any rate, we simply fell back to asynchronous communication via blogs and email and that worked fine.

For source code and documentation hosting, we used two mechanisms. The Hackystat project uses Google Project Hosting, and so the students I mentored used this service. Greg is the force behind Dr. Project, and so the students he mentored used that service. As part of the wrapup activities, his students ported their work to Google Project Hosting to conform to the Hackystat conventions.


So, what did they actually accomplish? Matthew Bassett created a sensor for Microsoft Team Foundation server. Here's a screen shot of one page where the user can customize the events the sensor collects:

The sensor itself is available at:

Eva Wong worked on a data visualization system for Hackystat based on Flare.

Her project is available at:

Finally, Shaoxuan Zhang worked on multi-project analysis mechanisms for Hackystat using Wicket. Here is a screen shot of the Portfolio page:

His project is available at:


So, what makes for a successful GSoC?

First, and most obviously, it's important to have good students. "Good", with respect to GSoC, seems to boil down to two essential attributes: a passion for the problem, and the ability to be self-starting. (As long as the student "starts", the mentors and other developers can help them "finish"). It was delightful to read Matthew's blog entries about Team Foundation Server: he obviously likes the technology and enjoyed digging into its internals. At one point in the summer, Shaoxuan sent me an email in which he apologized that he had not been working much for the past week because he just got married, but he'd work extra hard the next week to catch up! We clearly had passionate students.

It also helps to have good mentors. In the Hackystat project, we have an embarrassment of riches on this front, since the project includes a large number of academics who mentor as part of their day jobs. In the end, we only needed two active mentors for the four students, but we easily had mentoring capacity for a couple dozen students.

Establishing effective communication systems is critical. Part of this is technological. We found that email and blogs worked well. Skype did not work well for me, but that was probably operator error on my part. Greg had the additional opportunity to use face-to-face communication, which is certainly helpful but not at all necessary for success. The other part is social. Most of our students needed to learn over the summer to: (a) request help quickly when they ran into problems, and (b) direct their questions to the appropriate forum: either the Hackystat developer mailing list or privately to a mentor via email. This wasn't particularly difficult; it was just part of the process of understanding the Hackystat project culture.

I think I would have more insightful "lessons learned" had any of the student projects crashed and burned, but fortunately for the students (and unfortunately for this blog posting), that simply didn't happen.

For the Hackystat project, participation in GSoC this summer has had many benefits. Clearly, we'll benefit from the code that the students have created and which is now publicly visible in the Hackystat Component Directory. We are crossing our fingers that the students will continue to remain active members of the Hackystat community.

GSoC has also helped to create a new "center" of Hackystat expertise at the University of Toronto. We hope to build upon that in the future.

GSoC also catalyzed a number of discussions within the Hackystat developer community about the direction of the project and how students could most effectively participate. These insights will have long term value to the project.

I believe we are now significantly more skillful at mentoring students. I hope we get a chance to participate in GSoC 2009, and that we can build upon our experiences this summer next year.

Saturday, August 23, 2008

Clean code

There is a persistent, subtle perception that "coding is for the young": that as you progress as a technical professional, you "outgrow" coding. Perhaps this is because most organizations pay senior managers way better than their technical staff. Perhaps this is because many software developers hit a glass ceiling and decide they've learned all there is to know about code. Perhaps this is because coding is seen as a low-level skill and vulnerable to outsourcing.

Other disciplines lack this perception: no one would ever want Itzhak Perlman to give up playing the violin for a management position, or believe that his musical development ended while he was in his 20s, or that a symphony would outsource his position to a young virtuoso, no matter how talented, on the basis that it could get equivalent quality for less money.

I think part of the reason for this difference in perception is a difference in visibility: one can immediately hear the quality of a great violinist, even if one does not play the violin oneself. The quality of the work produced by a great coder is, unfortunately, almost invisible: how do you "hear" code that is simultaneously flexible, maintainable, understandable, and efficient? How do you hear it if you are a senior manager who doesn't even know how to code?

Clean Code: A Handbook of Agile Software Craftsmanship is a nice attempt to make the quality of great code visible, and in so doing makes some other points as well: that great code is very, very difficult to write; that even apparently well-written code can still be significantly improved; and that the ability to consistently write great code is a goal that will take most of us decades of consistent practice to achieve.

Clean Code is written by Robert Martin and his colleagues at Object Mentor. It begins with a chapter in which Bjarne Stroustrup, Grady Booch, Dave Thomas, Michael Feathers, Ron Jeffries, and Ward Cunningham are all asked to define "clean code". Their responses, and Martin's synthesis, would make a stunning Wikipedia entry for "clean code".

However, as he points out, knowing what clean code is, and even recognizing it when you see it, is far different (and far easier) than being able to actually create it yourself. The most interesting parts of the book are case studies where code from various open source systems (Tomcat, Apache Commons, Fitnesse, JUnit) is reviewed and improved.

Along the way, the "Object Mentor School of Clean Code" emerges. I found much to agree with, along with some controversial points. For example, I am a great believer in the use of Checkstyle to ensure that all public methods and classes have "fully qualified" JavaDoc comments (i.e., that all parameters and, if present, return values are documented). The OMSCC actually has a fairly low opinion of comments: they believe comments should be eliminated wherever possible in favor of code that is so well written that comments are redundant, even for public methods. As a result, I don't think they could use an automated tool such as Checkstyle for QA on their JavaDocs.

In some cases, they create "straw man" examples, such as using three lines for a JavaDoc comment that could easily fit on one line, and then complaining that the JavaDoc takes up too much vertical space.
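To make the disagreement concrete, here is my own illustrative example (not one from the book or from Hackystat) of what I mean by a "fully qualified" JavaDoc comment, the kind of thing Checkstyle can verify mechanically:

```java
/** Illustrative example of a "fully qualified" JavaDoc comment; not from Clean Code. */
public class Coverage {
  /**
   * Returns the percentage of unit tests that passed.
   *
   * @param passed The number of passing tests.
   * @param total The total number of tests run; must be positive.
   * @return The pass percentage, between 0 and 100.
   */
  public static double passPercentage(int passed, int total) {
    return (100.0 * passed) / total;
  }

  public static void main(String[] args) {
    System.out.println(passPercentage(45, 50)); // prints 90.0
  }
}
```

The OMSCC position, as I read it, is that a name like `passPercentage(int passed, int total)` already says all of this, so the comment is redundant; my position is that the `@param` and `@return` tags document the contract (such as "total must be positive") and give a tool something to check.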

From a truth in advertising standpoint, the book should have the word "Java" in its title. All of the examples are in Java, and while the authors attempt to generalize whenever possible, it is clear that many aspects of cleanliness are ultimately language-specific. While the book is extremely useful to Java programmers, I am not sure how well its lessons would translate to Perl, Scala, or C.

One final nit: some of the chapters are quite short (dare I say too short), such as the five-page chapter on Emergence. On the other hand, the chapter on concurrency gets a 30-page appendix with additional guidelines. If there is a second edition (and I hope there will be), I expect the topics will get more balanced treatment.

Despite these minor shortcomings, I found this book to be well worth reading. I was humbled to see just how much better the authors could make code that already seemed perfectly "clean" to me. And I am happy that someone has made an eloquent and passionate argument for remaining in the trenches writing code for 10, 20, or 30 years, and the maturity and beauty that such discipline and persistence can yield.

Tuesday, March 18, 2008

From Telemetry to Trajectory

Cam Moore stopped by my office to chat yesterday, and in 15 minutes he managed to completely revolutionize my thinking about how to visualize software project information over time. (Not bad, most visitors usually need at least 30 :).

One outcome of our prior research in Hackystat was the idea of Software Project Telemetry, which Cedric Zhang explored for his Ph.D. thesis. Software Project Telemetry defines a nifty domain-specific language for specifying the kinds of software measures you want to consider, how to combine them, and how to display them as trend lines. One thing that's neat about telemetry is that it enables you to discover covariances among measures. For example: "Hey, when my percentage of TDD-compliant episodes drops from 85% to 40%, my coverage drops precipitously too!" (That really happened.) It also serves as a way to create an "early warning system" for projects in trouble. For example, if coverage is steadily dropping, complexity is steadily increasing, and coupling is steadily increasing too, then it looks like your design or architecture is in trouble, because the trends for three orthogonal static measures of structural quality are all deteriorating. Here's an example of a telemetry chart:

As the above chart illustrates, the X axis is always time, and there can be multiple Y axes for each kind of trend line.

Telemetry turns out to work really well for in-process monitoring of a single project, and we definitely want to continue supporting and enhancing software project telemetry.
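To make the "early warning system" idea concrete, here is a little sketch in Java (emphatically not the actual telemetry DSL, just the underlying idea): fit a least-squares slope to each weekly measure, and raise a flag when all three orthogonal quality measures are trending the wrong way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of telemetry-style trend checking; not Hackystat's actual DSL. */
public class TelemetryTrends {

  /** Returns the least-squares slope of equally spaced samples (one per week). */
  static double slope(double[] y) {
    int n = y.length;
    double meanX = (n - 1) / 2.0, meanY = 0;
    for (double v : y) meanY += v;
    meanY /= n;
    double num = 0, den = 0;
    for (int i = 0; i < n; i++) {
      num += (i - meanX) * (y[i] - meanY);
      den += (i - meanX) * (i - meanX);
    }
    return num / den;
  }

  public static void main(String[] args) {
    // Weekly values for three orthogonal structural measures (made-up numbers).
    Map<String, double[]> streams = new LinkedHashMap<>();
    streams.put("coverage", new double[] {85, 80, 72, 61, 55});   // dropping
    streams.put("complexity", new double[] {10, 12, 15, 19, 24}); // rising
    streams.put("coupling", new double[] {5, 6, 8, 11, 14});      // rising
    boolean warning = slope(streams.get("coverage")) < 0
        && slope(streams.get("complexity")) > 0
        && slope(streams.get("coupling")) > 0;
    System.out.println(warning ? "early warning: design may be deteriorating" : "ok");
  }
}
```

The real telemetry language operates at a much higher level than this, but the essential move is the same: reduce each stream to a trend and reason about the trends jointly.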

Recently, we have started to think about software project "portfolio management": what happens when an organization has 100 or more projects under development and wants insight not only into individual projects, but into the "portfolio" as a whole? For example, what projects are similar to each other? What projects constitute "outliers"? What kind of management or organizational changes appear to make broad impacts across multiple projects?

These questions pose difficulties for a conventional telemetry oriented viewpoint. For example, how would you compare the telemetry for a project that is six months in duration with the telemetry for a project that is 12 months in duration? What if one project started in January and another project started in June? The notion of having multiple X axes in addition to multiple Y axes seems problematic at best.

Enter Cam. His idea is to think about "trajectory" instead of "telemetry". To make it simple, let's consider a situation where we are collecting three measures for a project: coupling, complexity, and coverage. Instead of using a 2D plot with the X axis being time, we use a 3D plot, where the three axes are coupling, complexity, and coverage. Now time is implicit in the "trajectory" of the plots through space.

Here's an exceedingly lame mockup using 3D VOPlot with a little PowerPoint post-processing:

I actually took some astronomical data to make this image, so you should ignore the axis values and pretty much everything else about this image except the basic notion that we are now focusing on trajectory, rather than telemetry, and this makes certain things much easier.

First, since the "time" dimension is now implicit, we can much more easily compare projects that start/end at different times and/or have different durations. Just plot their trajectories using different color points for different projects, and the scaling/displacing comes "for free". My mockup illustrates two projects, one with a sequence of blue dots, and one with a sequence of green dots. The visualization makes a couple of things pretty obvious: (a) for a while, Project Blue and Project Green have pretty similar trajectories, except that Project Green's complexity is below Project Blue's, and (b) something weird happened to Project Blue near the end that didn't happen to Project Green.

Note that we have no idea about the relative durations of Project Green or Blue, nor about their start/end dates. And in many cases abstracting away those details might be exactly the right thing to do.

Second, we can now ask ourselves a whole new set of interesting questions about the trajectories associated with different projects, such as:
  • What set of measures create interesting trajectories?
  • Which projects have similar trajectories?
  • Which projects have anomalous trajectories?
  • What about higher dimensionalities, where we want to compare trajectories involving more than three measures?
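In the spirit of the second question above, here is a toy sketch (my own invention, not an implemented Hackystat analysis) of how trajectory similarity might be computed once time is abstracted away: represent each project as a sequence of points in measure space, resample both trajectories to a common number of points, and take the mean pointwise distance.

```java
/** Hypothetical sketch of comparing project trajectories in measure space. */
public class Trajectory {

  /** Euclidean distance between two points in measure space. */
  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }

  /** Mean pointwise distance between two trajectories resampled to k points (k >= 2). */
  static double meanDistance(double[][] p, double[][] q, int k) {
    double total = 0;
    for (int i = 0; i < k; i++) {
      // Nearest-sample resampling abstracts away duration and start date.
      double[] a = p[i * (p.length - 1) / (k - 1)];
      double[] b = q[i * (q.length - 1) / (k - 1)];
      total += dist(a, b);
    }
    return total / k;
  }

  public static void main(String[] args) {
    // Each row is one snapshot: {coupling, complexity, coverage} (made-up data).
    double[][] blue = {{5, 10, 85}, {6, 12, 80}, {8, 15, 72}, {20, 40, 30}};
    double[][] green = {{5, 8, 85}, {6, 10, 81}, {7, 12, 76}};
    System.out.printf("mean trajectory distance: %.1f%n", meanDistance(blue, green, 3));
  }
}
```

Note that Project Blue has four snapshots and Project Green only three, and the comparison works anyway; the "something weird" at the end of Blue's trajectory is exactly what inflates the distance. Real trajectory comparison would surely want something smarter (dynamic time warping, say), but the point is that the question becomes askable at all.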
I look forward with great anticipation to the next time Cam drops by for a little chat.

Thursday, January 17, 2008

Hackystat Version 8

Hackystat Version 8 is now in public release. Hackystat is an open source framework for collection, analysis, visualization, interpretation, annotation, and dissemination of software development process and product data. This eighth major redesign of the system is intended to retain the advantages of previous versions while incorporating significant new capabilities:

RESTful web service architecture. The most significant change in Version 8 is its re-implementation as a set of web services communicating using REST principles. This architecture facilitates several of the features noted below, including scalability, openness, and platform/language neutrality.

Sensor-based. A primary means of data collection in Hackystat is through "sensors": small software plugins to development tools that unobtrusively collect and transmit low-level data to the Hackystat sensor data repository service called the "SensorBase".

Extensible. Hackystat can be extended to support new development tools by the creation of new sensors. It can also be extended to support new analyses by the creation of new services.

Open. All sensors and services communicate via HTTP PUT, GET, POST, and DELETE, according to RESTful web service principles. This "open" API has two advantages: (1) it makes it easy to extend the Hackystat Framework with new sensors and services; and (2) it makes it easy to integrate Hackystat sensors and services with other information services. Hackystat can participate as just one part of an "ecosystem" of information services for an organization.
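As a sketch of what this open API means in practice, here is roughly what one sensor data transmission looks like from a client's point of view. The URI scheme, XML element names, and port are illustrative assumptions, not the documented Hackystat API; a real sensor would send this payload with an HTTP PUT.

```java
/**
 * Hypothetical sketch of a RESTful sensor data PUT. The URI scheme and XML
 * structure here are illustrative, not the actual Hackystat API.
 */
public class SensorPut {

  /** Builds the resource URI for one sensor data instance. */
  static String resourceUri(String host, String user, String timestamp) {
    return host + "/sensorbase/sensordata/" + user + "/" + timestamp;
  }

  /** Builds an XML payload describing one low-level development event. */
  static String payload(String tool, String type, String timestamp) {
    return "<SensorData>"
        + "<Timestamp>" + timestamp + "</Timestamp>"
        + "<Tool>" + tool + "</Tool>"
        + "<SensorDataType>" + type + "</SensorDataType>"
        + "</SensorData>";
  }

  public static void main(String[] args) {
    String uri = resourceUri("http://localhost:9876", "joe@example.com",
        "2008-08-26T10:00:00");
    String xml = payload("Eclipse", "DevEvent", "2008-08-26T10:00:00");
    System.out.println("PUT " + uri);
    System.out.println(xml);
    // A real sensor would now open an HTTP connection, set the method to PUT,
    // and write the XML to the request body.
  }
}
```

Because the whole exchange is just HTTP plus XML, the same PUT could as easily come from a shell script, a .NET plugin, or a Ruby service, which is the point of the open API.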

High performance. The default version of the SensorBase uses an embedded Derby RDBMS for its back-end data store. Initial performance evaluation of this repository, in combination with our multi-threaded client-side SensorShell, has been quite encouraging: we have achieved sustained transmission rates of approximately 1.2 million sensor data instances per hour. The SensorBase is designed to allow "pluggable" back-end data stores. One organization, for example, is using Microsoft SQL Server as the back-end data store.
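A large part of that throughput comes from client-side buffering: rather than one HTTP round trip per instance, the client accumulates instances and ships them in bulk. Here is a minimal sketch of that batching idea (the real SensorShell's API and internals differ; the class and method names here are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of SensorShell-style batching; the real client differs. */
public class BatchingShell {
  private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
  private final int batchSize;
  int sent = 0; // instances actually transmitted

  BatchingShell(int batchSize) {
    this.batchSize = batchSize;
  }

  /** Queues one sensor data instance; flushes when a full batch accumulates. */
  void add(String instance) {
    buffer.add(instance);
    if (buffer.size() >= batchSize) flush();
  }

  /** Drains the buffer and "sends" it as one bulk request. */
  void flush() {
    List<String> batch = new ArrayList<>();
    buffer.drainTo(batch);
    sent += batch.size(); // a real shell would PUT the whole batch here
  }

  public static void main(String[] args) {
    BatchingShell shell = new BatchingShell(100);
    for (int i = 0; i < 250; i++) shell.add("DevEvent-" + i);
    shell.flush(); // send the final partial batch
    System.out.println("sent " + shell.sent); // prints "sent 250"
  }
}
```

Amortizing connection overhead across a batch, plus multiple sender threads, is what makes sustained rates in the millions of instances per hour plausible.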

Scalable. A natural outcome of a web service architecture is scalability: one can distribute services across multiple servers or aggregate them on a single server depending upon load and resource availability. Hackystat is also scalable because each organization can run its own local SensorBase, or even multiple SensorBases if required. Finally, Hackystat can exploit HTTP caching as yet another scalability mechanism.

Secure. While Hackystat maintains a "public" SensorBase and associated services for use by the community, we expect that most organizations adopting Hackystat will choose to install and run the SensorBase and associated services locally and internally. This facilitates data security and privacy for organizations who do not wish sensitive product or process information to go beyond their corporate firewalls.

Platform and language neutrality. Hackystat's implementation as a set of RESTful web services makes it language and platform neutral. For example, a sensor implemented in .NET and running on Windows might send information to a SensorBase written in Java running on a Macintosh, which is queried by a user interface written as a Ruby on Rails web application hosted on a Linux machine.

Open Source. Hackystat is hosted at Google Project Hosting, and distributed among approximately a dozen individual projects. The "umbrella" Hackystat project includes a Component Directory page with links to all of the related subprojects. Since most subprojects correspond to independent Hackystat services, they are typically free to choose their own open source license, though most have chosen the GNU GPL v2.

Out-of-the-box support for process and product data collection and analysis. The standard Hackystat distribution includes a variety of process and product data collection and analysis capabilities, including: editor events and developer editing time, coverage, unit test invocations, build invocations, code issues discovered through static analysis tools, size metrics, complexity metrics, churn, and commits. Of course, the open API makes it possible to extend this list further.

When we began work on Hackystat in 2001, we thought of it primarily as a software metrics framework. Seven years later, we find that vision limiting, because it tends to focus one on the collection and display of numbers. Our vision for Hackystat now is broader: we believe that the collection and display of numbers is just the first step in an ongoing process of collaborative sense-making within a software development organization. An organization needs numbers, but it also needs ways to get those numbers to the right people at the right time. More importantly, it needs ways to incrementally interpret, re-interpret, and annotate those numbers over time to build up a collective consensus as to their meaning and implications for the organization. Our goal for Hackystat Version 8 is to be an effective infrastructure for participation in the broader knowledge gathering and refinement processes of an organization, or even the software development community as a whole. If successful, it can play a role in creating new mechanisms for improving the collective intelligence of a software development group.