Friday, December 14, 2007

If Collective Intelligence is the question, then dashboards are not the answer

I've been thinking recently about collective intelligence and how it applies to software engineering in general and software metrics in particular.

My initial perspective on collective intelligence is that it provides an organizing principle for a process in which a small "nugget" of information is somehow made available to one or more people, who then refine it, communicate it to others, or discard it as the case may be. Collective Intelligence results when mechanisms exist to prevent these nuggets from being viewed as spam (and the people communicating them from being viewed as spammers), along with mechanisms to support the refinement and elaboration of the initial nugget of information into an actual "chunk" of insight or knowledge. Such chunks, as they grow over time, enable long-term organizational learning from collective intelligence processes.

Or something like that. What strikes me about the software metrics community and literature is that when it comes to how "measures become knowledge", the most common approach seems to be:
  1. Management hires some people to be the "Software Process Group"
  2. The SPG goes out and measures developers and artifacts somehow.
  3. The SPG returns to their office and generates some statistics and regression lines.
  4. The SPG reports to management about the "best practices" they have discovered, such as "The optimal code inspection review rate is 200 LOC per hour".
  5. Management issues a decree to developers that, from now on, they are to review code at the rate of 200 LOC per hour.
I'm not sure what to call this, but collective intelligence does not come to mind.

When we started work on Hackystat 8, it became clear that there were new opportunities to integrate our system with technologies like Google Gadgets, Twitter, Facebook, and so forth. I suspect that some of these integrations, like Google Gadgets, will turn out to be little more than neat hacks with only minor impact on the usability of the system. My conjecture is that the Hackystat killer app has nothing to do with "dashboards"; most modern metrics collection systems provide dashboards and people can ignore them just as easily as they ignore their web app predecessors.

On the other hand, integrating Hackystat with a communication facility like Twitter or with a social networking application has much more profound implications, because these integrations have the potential to create brand new ways to coordinate, communicate, and annotate an initial "nugget" generated by Hackystat into a "chunk" of knowledge of wider use to developers. It could also work the other way: an anecdotal "nugget" generated by a developer ("Hey folks, I think that we should all run verify before committing our code to reduce continuous integration build failure") could be refined into institutional knowledge (a telemetry graph showing the relationship between verify-before-commit and build success), or discarded (if the telemetry graph shows no relationship).
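The verify-before-commit conjecture could be checked with a very simple analysis. What follows is a minimal sketch, not Hackystat code: it assumes each commit has already been reduced to a hypothetical pair (ran verify first?, integration build passed?), and the sample data is invented purely for illustration.

```python
from collections import defaultdict


def build_success_rates(commits):
    """Return {verified_flag: fraction of successful builds} for each flag seen.

    `commits` is an iterable of (verified, build_ok) boolean pairs. This input
    format is an assumption made for the sketch, not a Hackystat data type.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for verified, build_ok in commits:
        totals[verified] += 1
        if build_ok:
            successes[verified] += 1
    return {flag: successes[flag] / totals[flag] for flag in totals}


# Invented sample: (ran verify before commit?, did the integration build pass?)
sample_commits = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]
rates = build_success_rates(sample_commits)
print(rates)
```

A large gap between the two rates would be a reason to refine the nugget into institutional knowledge; similar rates would be a reason to discard it. With only a handful of commits, of course, neither conclusion would be trustworthy.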

Thursday, December 6, 2007

Social Networks for Software Engineers

I've been thinking lately about social networks, and what kind of social network infrastructure would attract me as a software engineer. Let's assume, of course, that my development processes and products can be captured via Hackystat and made available in some form to the social network. Why would this be cool?

The first reason would be because the social network could enable improved communication and coordination by providing greater transparency into the software development process. For example:
  • Software project telemetry would reveal the "trajectory" of development with respect to various measures, helping to reveal potential bottlenecks and problems earlier in development.
  • Integration with Twitter could support automated "tweets" informing the developers when events of interest occur.
  • An annotated Simile/Timeline representation of the project history could help developers understand and reflect upon a project and what could be done to improve it.

I'm not sure, however, that this is enough for the average developer. Where things get more interesting is when you realize that Hackystat is capable of developing a fairly rich representation of an individual developer's skills and knowledge areas.

As a simple example, when a Java programmer edits a class file, the set of import statements reveals the libraries being used in that file, and thus the libraries that this developer has some familiarity with, because he or she is using those libraries to implement the class in question. When a Java programmer edits a class file, they are also using some kind of editor (Emacs, Eclipse, Idea, NetBeans), and thus revealing some level of expertise with that environment. Indeed, Hackystat sensors can not only capture knowledge like "I've used the Eclipse IDE over 500 hours during the past year", but even things like "I know how to invoke the inspector and trace through functions in Eclipse", or "I've never once used the refactoring capabilities." Of course, Hackystat sensors can also capture information about what languages you write programs in, what operating systems you are familiar with, what other development tools you know about, and so forth. Shoots, Hackystat could even keep a record of the kinds of exceptions your code has generated.
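To make the import-statement idea concrete, here is a minimal sketch of how library familiarity might be inferred from a Java source file. It is not how any Hackystat sensor is implemented: the function name, the regular expression, and the choice to group by the first two package segments are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches "import a.b.C;", "import a.b.*;" and "import static a.b.C.m;".
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+?)(?:\.\*)?\s*;", re.MULTILINE
)


def imported_packages(java_source, depth=2):
    """Count imports grouped by their first `depth` package segments."""
    counts = Counter()
    for name in IMPORT_RE.findall(java_source):
        counts[".".join(name.split(".")[:depth])] += 1
    return counts


# Invented sample file contents.
sample_source = """
package edu.example;

import java.util.List;
import java.util.Map;
import java.awt.geom.*;
import static org.junit.Assert.assertEquals;

public class Demo {}
"""
usage = imported_packages(sample_source)
print(usage.most_common())
```

Accumulating these counts over every file a developer edits, weighted by editing time, would give a rough picture of which libraries that developer actually works with.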

Let's assume that all of this information can be processed and made available to you as, say, a Facebook Application. And, you can edit the automatically generated profile to remove any skills you don't want revealed. You might also be able to annotate the information to provide explanatory information. You can provide details about yourself, such as "Student" or "Professional", and also your level of "openness" to the community. After all that's done, you press "Publish" and this becomes part of your Facebook or OpenSocial profile.

So what?

Well, how about the following scenarios:

[1] I'm a student and just encountered a weird Exception. I search the community for others with experience with this Exception. I find three people, send them an IM, and shortly thereafter one of them gives me a tip on how to debug it.

[2] I'm interested in developing a Fortress mode for Emacs, but don't want to do it alone. I search the community for developers with both expertise in Fortress and Emacs, and contact them to see if they want to work with me on such a mode.

[3] I'm an employer and am interested in developers with a demonstrated experience with compiler development for a short-term, well paying consulting position. I need people who don't require any time to come up to speed on my problem; I don't want to hire someone who took compilers 10 years ago in college and hasn't thought about it since. I search the community, and find a set of candidates who have extensive, recent experience using Lex, YACC, and JavaCC. I contact them to see if they would be interested in a consulting arrangement.

[4] I'm a student who has been participating in open source projects and making extensive contributions, but has never had a "real" job. I need a way to convince employers that I have significant experience. I provide a pointer in my resume to my profile, showing that I have thousands of hours of contributions to the Apache web server and Python language projects.

Hackystat is often thought of as a measurement system, and indeed all the above capabilities result from measurement. However, the above doesn't feel like measurement; it feels like social coordination and communication of a relatively sophisticated and efficient nature.

Monday, November 5, 2007

Measurement as Mashup, Ambient Devices, Social Networks, and Hackystat

The new architecture of Hackystat has me thinking about new metaphors for software engineering measurement. Indeed, it has me wondering if where we are heading is even characterized best as "measurement" or even "software engineering". Alistair Cockburn, for example, has written an article on The End Of Software Engineering in which he challenges the use of the term "software engineering" as an appropriate description for what people do (or should do) when developing software.

Similarly, when we began work on Hackystat six years ago, I thought in fairly conventional terms about this system: it was basically a way to make it simpler to gather more accurate measures that could be used for traditional software engineering measurement activities: baselines, prediction, control, quality assessment.

One interesting and unintended side effect of the Hackystat 8 architecture, in which the system becomes a seamless component of the overall internet information ecosystem via a set of RESTful web services, is a re-examination of my fundamental conceptions of what the system could and should be. In particular, two Web 2.0 concepts: "mashup", and "social network", provide interesting metaphors.

Measurement as Mashup

Hackystat has always embraced the idea of "mashup". From the earliest days, we have pursued the hypothesis that there is a "network effect" in software process and product metrics; that the more orthogonal measures you could gather about a system, the more potential you would gain for insight as you obtained the ability to compare and contrast them. Thus, we created a system that was easily extensible with sensors for different tools that could gather data of different types.

Software Project Telemetry is an early result of our search for ways to obtain meaning within this network effect. In Software Project Telemetry, we created a language that enables users to easily create "measurement mashups" consisting of metrics and their trends over time. The following screen image shows an example mashup, in which we opportunistically discovered a covariance between build failures and code churn over time for a specific project:
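The kind of covariance described above can also be quantified once it has been spotted visually. The sketch below does not use the telemetry language; it computes a plain Pearson correlation coefficient (a normalized covariance) over two hypothetical daily series. The function name and the numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented daily values for one project.
daily_churn = [120, 300, 80, 450, 500, 60, 220]
daily_build_failures = [1, 3, 0, 4, 6, 0, 2]
r = pearson(daily_churn, daily_build_failures)
print(round(r, 3))
```

A coefficient near 1 would support the visual impression that churn and build failures move together, though it says nothing about which one drives the other.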

Hackystat 8 creates new opportunities for mashups, because we can now integrate this kind of data with other data collection and visualization systems. As one example, we are exploring the use of Twitter as a data collection and communication mechanism. Several members of the Hackystat 8 development group "follow" each other with Twitter and post information about their software development activities (among other things) as a way to increase awareness of project state. Here's a recent screen image showing some of our posts:

There are at least two interesting directions for Twitter/Hackystat mashups. Assuming that members of a project team are twitter-enabled, we can provide a Hackystat service that monitors the data being collected from sensors and sends periodic "tweets" that answer the question "What are you doing now?" for individual developers and/or the project as a whole. Going the other direction, we can gather "tweets" as data that we can display on a Simile/Timeline with our metrics values, which provides an interesting approach to integrating qualitative and quantitative data.
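As a rough illustration of the first direction, the sketch below turns a batch of recent sensor events into a short "What are you doing now?" message. It makes no Twitter or Hackystat API calls; the (kind, resource) event tuples and the function name are hypothetical, and the only real constraint modeled is the 140-character message limit.

```python
from collections import Counter

TWEET_LIMIT = 140  # Twitter's message length limit at the time of writing


def summarize_events(developer, events, limit=TWEET_LIMIT):
    """Build a short status line from (kind, resource) event tuples.

    The event format is an assumption for this sketch, not a sensor data type.
    """
    if not events:
        return f"{developer}: no recent sensor data"[:limit]
    counts = Counter(kind for kind, _ in events)
    top_kind, top_count = counts.most_common(1)[0]
    last_resource = events[-1][1]
    message = (
        f"{developer}: mostly {top_kind} ({top_count} of {len(events)} events), "
        f"last touched {last_resource}"
    )
    # Truncate rather than fail if a long file name pushes past the limit.
    return message if len(message) <= limit else message[: limit - 3] + "..."


recent = [("edit", "Foo.java"), ("edit", "Foo.java"), ("test", "FooTest.java")]
status = summarize_events("joe", recent)
print(status)
```

The interesting design question is the level of abstraction: a message like this one is readable by teammates, while raw sensor events would simply be noise in the feed.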

A second form of mashup is the use of internet-enabled ambient devices such as Ambient Orbs or Nabaztag. The idea here is to get away from the use of the browser (or even the monitor) as the sole interface to Hackystat information and analyses. Instead, we could move toward Mark Weiser's vision of calm technology, or "that which informs but doesn't demand our focus or attention".

The net of all this is that Hackystat is evolving from a kind of "local" capability for mashups represented by software project telemetry to a new "global" capability for mashups in which Hackystat can act as a first class citizen of the internet information infrastructure.

Software development as social network

Google is releasing an API for social networking called OpenSocial. This API essentially enables you to (a) create profiles of users; (b) publish and subscribe to events, and (c) maintain persistent data. You can use Google Gears to maintain data on client computers, and thus create more scalable systems. Google intends this as a way for developers to create third party applications that can run within multiple social networks (MySpace, Orkut), as well as enable users to maintain, transfer, and/or integrate data across these networks.

So. What would Hackystat look like, and what would it do, if it was implemented using OpenSocial?

First, I think that in contrast to the current analysis focus of Hackystat, in which the concept of a "project" as an organizing principle is very important, in an OpenSocial world you might not be so interested in a project-based orientation for analyses. Instead, I think the emphasis would be much more on the individual and their behaviors across, and independent of, projects.

For example, your Hackystat OpenSocial "profile" might include analysis results like: "I worked for three hours hacking Java code yesterday", or "I have a lot of experience with the Java 2D API", or "I use test driven design practices 80% of the time". All of these might be interesting to others in your social network as a representation of what you are doing currently and/or are capable of doing in future. The process/product characteristics of the projects that you work on might be less important in an OpenSocial profile for, I think, two reasons: (a) it is harder to understand the individual's contributions in the context of project-level analyses; and (b) project data might "give away" information that the employer of the developer does not want published.

Which brings me to a second conjecture: issues of data privacy or "sanitization" will become much more important for social network software engineering using a system like OpenSocial. To make the example analyses I listed above, it must be possible to collect detailed data about your activities as a developer (sufficient, for example, to infer TDD behaviors), yet publish them at an abstract enough level that no proprietary information is being revealed. That is a fascinating trade-off that will require a great deal of study and research. The implications are both technical and social.

Monday, October 29, 2007

The Mismeasurement of Science

Peter Lawrence has written an interesting article on the (mis)use of measurement to assess "quality" and/or "impact" of scientists. It's called The Mismeasurement of Science, and appeared in Current Biology, August 7, 2007: 17 (15), r583. You can download it here.

Highly recommended reading, not only for scientists, but also as another interesting example of how a simple-minded approach to measuring "quality" or "productivity" has a wide range of dysfunctional implications. I particularly liked the following:

The measures seemed, at first rather harmless, but, like cuckoos in a nest, they have grown into monsters that threaten science itself. Already, they have produced an “audit society” in which scientists aim, and indeed are forced, to put meeting the measures above trying to understand nature and disease.

I suspect that a similarly simple-minded application of software engineering measures (such as Active Time in Hackystat) would have similarly disastrous consequences were anyone to actually take them seriously.

Sunday, October 7, 2007

Hackystat and Crap4J

The folks at Agitar, who clearly have a sense of humor in addition to being excellent hackers, have recently produced a plug-in for Eclipse called Crap4J that calculates a measure of your code's "crappiness".

From their web page:

There is no fool-proof, 100% objective and accurate way to determine if a particular piece of code is crappy or not. However, our intuition – backed by research and empirical evidence – is that unnecessarily complex and convoluted code, written by someone else, is the code most likely to elicit a “This is crap!” response. If the person looking at the code is also responsible for maintaining it going forward, the response typically changes into “Oh crap!”

Since writing automated tests (e.g., using JUnit) for complex code is particularly hard to do, crappy code usually comes with few, if any, automated tests. The presence of automated tests implies not only some degree of testability (which in turn seems to be associated with better, or more thoughtful, design), but it also means that the developers cared enough and had enough time to write tests – which is a good sign for the people inheriting the code.

Because the combination of complexity and lack of tests appear to be good indicators of code that is potentially crappy – and a maintenance challenge – my Agitar Labs colleague Bob Evans and I have been experimenting with a metric based on those two measurements. The Change Risk Analysis and Prediction (CRAP) score uses cyclomatic complexity and code coverage from automated tests to help estimate the effort and risk associated with maintaining legacy code. We started working on an open-source experimental tool called “crap4j” that calculates the CRAP score for Java code. We need more experience and time to fine tune it, but the initial results are encouraging and we have started to experiment with it in-house.

Here's a screenshot of Crap4J after a run over the SensorShell service code:

Immediately after sending this link to the Hackystat Hackers, a few of us started playing with it. While the metric seems intuitively appealing (and requires one to use lots of bad puns when reporting on the results), its implementation as an Eclipse plugin is quite limiting. We have found, for example, that the plugin fails on the SensorBase code, not through any fault of the SensorBase code (whose unit tests run quite happily within Eclipse and Ant) but seemingly because of some interaction with Agitar's auto-test invocation or coverage mechanism.

Thus, this seems like an opportunity for Hackystat. If we implement the CRAP metric as a higher level analysis (for example, at the Daily Project Data level), then any combination of tools that send Coverage data and FileMetric data (that provides cyclomatic complexity) can produce CRAP. Hackystat can thus measure CRAP independently of Eclipse or even Java.
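For reference, the CRAP formula as published by the Crap4J authors is, to the best of my knowledge, CRAP(m) = comp(m)^2 * (1 - cov(m)/100)^3 + comp(m), where comp is a method's cyclomatic complexity and cov is its test coverage as a percentage. The sketch below computes it from those two inputs alone, which is the point: nothing in it depends on Eclipse or Java. The function name is an assumption, and a real Daily Project Data analysis would obtain the inputs from Coverage and FileMetric sensor data rather than from literals.

```python
def crap_score(complexity, coverage_percent):
    """CRAP score for one method from cyclomatic complexity and coverage (0-100)."""
    if complexity < 1:
        raise ValueError("cyclomatic complexity must be at least 1")
    if not 0 <= coverage_percent <= 100:
        raise ValueError("coverage must be a percentage between 0 and 100")
    uncovered = 1.0 - coverage_percent / 100.0
    return complexity ** 2 * uncovered ** 3 + complexity


# Fully covered code scores its own complexity; untested code is penalized heavily.
print(crap_score(5, 100))  # 5.0
print(crap_score(5, 0))    # 30.0
print(crap_score(10, 50))  # 22.5
```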

The Agitar folks go on to say:

We are also aware that the CRAP formula doesn’t currently take into account higher-order, more design-oriented metrics that are relevant to maintainability (such as cohesion and coupling).

Here is another opportunity for Hackystat: it would be trivial, once we have a DPD analysis that produces CRAP, to provide a variant CRAP calculation that factors in Dependency sensor data (which provides measures of coupling and cohesion).

Then we could do a simple case study in which we run these two measures of CRAP over a code base, order the classes in the code base by their level of crappiness according to the two measures, and ask experts to assess which ordering appears to be more consistent with the code's "True" crappiness.

I think such a study could form the basis for a really crappy B.S. or M.S. Thesis.

Friday, October 5, 2007

Fast Fourier Telemetry Transforms

Dan Port, who has a real talent for coming up with interesting Hackystat research projects, sent me the following in an email today:

Some while back I had started thinking about automated ways analyze telemetry data and it occurred to me that maybe we should be looking at independent variables other than time. That is, we seem to look at some metric vs. time for the most part. This time series view is very tough to analyze and especially automate the recognition of interesting patterns. While dozing off at a conference last week something hit me (no, not my neighbor waking me up). What if we transformed the time-series telemetry data stream into.... a frequency-series. That is, do a FFT on the data. This would be like looking at the frequency spectrum of an audio stream (although MUCH simpler).

I am extremely naive about FFT, but my sense is that this approach basically converts a telemetry stream into a 'fingerprint' which is based upon oscillations. My hypothesis is that this fingerprint could represent, in some sense, a development 'process'.

If that's so, then the next set of questions might be:
  • Is this 'process' representation stable? Would we get the same/similar FFT later or from a different group?
  • Is this process representation meaningful? Are the oscillations indicative of anything useful/important about the way people work? Does this depend upon the actual kind of telemetry data that is collected?
It would be nice to come up with some plausible ideas for telemetry streams that might exhibit 'meaningful' oscillations as a next step in investigating this idea.
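One way to start playing with this idea is sketched below. For clarity it uses a naive discrete Fourier transform written with the standard library rather than a true FFT (the result is the same, only slower), and it reports the period of the strongest oscillation in a stream. The function names and the sample stream, a hypothetical daily "active hours" series with quiet weekends, are invented for illustration.

```python
import cmath


def magnitude_spectrum(samples):
    """Magnitudes of the DFT for frequencies 0..n/2, after removing the mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff))
    return spectrum


def dominant_period(samples):
    """Period (in samples) of the strongest non-constant oscillation."""
    spectrum = magnitude_spectrum(samples)
    k = max(range(1, len(spectrum)), key=lambda i: spectrum[i])
    return len(samples) / k


# Invented stream: five working days followed by a two-day weekend, four weeks.
active_hours = [5, 5, 5, 5, 5, 0, 0] * 4
print(dominant_period(active_hours))  # 7.0
```

A weekly rhythm is the obvious and uninteresting case; the open question raised above is whether less obvious periods (say, an iteration length or a release cycle) show up in other streams and whether they are stable across groups.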

Monday, October 1, 2007

The Perils Of Emma

I just had my software engineering class do reviews of each other's code where they analyzed the quality of testing from both a white box and a black box perspective. (Recall that in black box testing, you look at the specifications for the system and write tests to verify it obeys the spec. In white box testing, you test based on the code itself and focus on getting high coverage. That's a bit simplified but you get the drift.)

After they did that analysis, the next thing I asked them to do was "break da buggah"---write a test case or two that would crash the system.

Finally, they had to post their reviews to their engineering logs.

Many experienced a mini-epiphany when they discovered how easy it was to break the system---even when it had high coverage. The point that this drove home (reinforcing their readings for the week that included How To Misuse Code Coverage) is that if you just write test cases to exercise your code and improve your coverage, those test cases aren't likely to result in "strong" code that works correctly under both normal and (slightly) abnormal conditions. Some programs with high coverage didn't even implement all of the required command line parameters!

Code coverage tools like Emma are dangerously addictive: they produce a number that appears to be related to code and testing quality. The reality is that writing tests merely to improve coverage can potentially be a waste of time and even counterproductive: it makes the code look like it's well tested when in fact it's not.

Write tests primarily from a black box perspective. (Test-Driven Design gets you this "for free", since you're writing the tests before the code exists for which you will be computing coverage.)

When you think you've tested enough, run Emma and see what she thinks of your test suite. If the numbers aren't satisfying, you might want to close your browser window immediately and think harder about your program and the behaviors it can exhibit of which you're not yet aware. Resist the temptation to "cheat" and navigate down to find out which classes, methods, and lines of code weren't exercised. As soon as you do that, you've left the realm of black box testing and are now simply writing tests for Emma's sake.

And that's not a good place to be.

Thursday, September 27, 2007

Representation of Developer Expertise

Developer expertise is generally represented by "years of experience", which is generally useless. Does someone have 10 years of experience, or 1 year of experience 10 times over?

The process and product data collected by Hackystat has the potential to provide a much richer and more meaningful representation of developer expertise. Let's restrict ourselves to the domain of Java software development, for the sake of discussion. First, let's consider the kinds of data we could potentially collect as a developer works:
  • When they are working on a Java system.
  • The packages imported by the class that they are editing.
  • The IDE they are using.
  • The IDE features (debugger, build system, refactoring, configuration management) they are using.
  • Their behaviors within the IDE (writing tests, writing production code, running the system, running the debugger, running tests, invoking the build system, etc.)
  • Their configuration management behavior: frequency of commits, degree of churn; conflicts.
  • The number of developers associated with their current project.
  • The level of interdependency between developers: are files worked on by multiple developers? Has the developer ever worked on code written by someone else? Has someone else ever worked on code written by this developer?
I believe that capturing this kind of information can provide a much richer and more useful representation of developer expertise. It can provide information useful to determining:
  • Who has experience with a particular tool/technique?
  • Who has complementary skills to my own?
  • Who has recent experience with a given tool/technique?
At a higher level, this information could also be useful in forming groups, by helping identify developers with similar and/or complementary skills.

Important research issues include providing the developer with "control" over their profile, how to display this information, and how to perform queries.
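As a sketch of what such queries might look like, the code below answers "who has (recent) experience with a given tool?" over a toy profile store. The profile structure, mapping each developer to per-tool (hours, last used) records, is a hypothetical simplification, as are the names and numbers; a real profile would be derived from sensor data and filtered by whatever the developer chose to publish.

```python
from datetime import date


def find_developers(profiles, tool, min_hours=0.0, since=None):
    """Return developers with experience in `tool`, most hours first.

    `profiles` maps developer -> {tool: (hours, last_used_date)}. If `since`
    is given, only developers who used the tool on or after that date match.
    """
    matches = []
    for developer, usage in profiles.items():
        record = usage.get(tool)
        if record is None:
            continue
        hours, last_used = record
        if hours >= min_hours and (since is None or last_used >= since):
            matches.append((developer, hours))
    matches.sort(key=lambda m: m[1], reverse=True)
    return [developer for developer, _ in matches]


# Invented profiles.
profiles = {
    "alice": {"JavaCC": (120.0, date(2007, 9, 1)), "Emacs": (500.0, date(2007, 9, 20))},
    "bob": {"JavaCC": (300.0, date(1997, 5, 1))},
    "carol": {"JavaCC": (40.0, date(2007, 8, 15))},
}
print(find_developers(profiles, "JavaCC"))
print(find_developers(profiles, "JavaCC", since=date(2007, 1, 1)))
```

The second query is the one that "years of experience" cannot answer: bob has the most hours, but none of them recent.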

Tuesday, September 18, 2007

Twitter, Hackystat, and solving the context switching problem

A recent conversation with Lorin Hochstein at ISERN 2007 has led me to wonder if Twitter + Hackystat solves what we found in prior research to be a fundamental show stopper with manual metrics collection techniques like the Personal Software Process: the "context switching problem".

Back in the day when we were building automated support for PSP, a basic problem we couldn't solve was developer hatred for having to constantly stop what they were doing in order to tell the tool what they were doing. We called this the "context switching problem", and we came to believe that no amount of tool support for the kind of data that the PSP wants to collect can overcome it, because PSP data intrinsically requires ongoing developer interruption.

I believe a central design win in Twitter is that it does not force a context switch: the messages that you write are so constrained in size and form that they do not interrupt the "flow" of whatever else you're doing. This is fundamentally different from a phone call, a blog entry, an email, or a PSP log entry.

What makes Twitter + Hackystat so synergistic (and, IMHO, so compelling) is that Hackystat sensors can provide a significant amount of useful context to a Twitter message. For example, suppose Developer Joe twitters:

"Argh, I'm so frustrated with JUnit!"

Joe's recent Hackystat event stream could reveal, for example, that he is struggling to resolve a compiler error involving annotations. Developer Jane could see that combination of Twitter and Hackystat information, realize she could help Joe in a couple of minutes, and IM to the rescue.

A second very cool hypothesis is that this combination overcomes the need to "ask questions the smart way". Indeed, Developer Joe is not asking a question or even asking anyone explicitly for help: he is merely expressing an emotion. The additional Hackystat data turns it into a "smart question" that Developer Jane decides to volunteer to answer.

So, how do we create this mashup? I can see at least two different user interfaces:

(a) Hackystat-centric. Under this model, developers in a group use Twitter in the normal way, but Hackystat will also be a member of that community and subscribe to the feed. Then, we create a widget in, say, NetVibes that displays the Twitter messages augmented with (possibly abstractions of) the raw sensor data, which provides the additional interesting context. Developers then use this UI to monitor the HackyTwitter conversation.

(b) Twitter-centric. In this case, Hackystat abstractions of events are posted to Twitter, and thus become part of the normal Twitter feed. People use Twitter in the normal way, but now they are getting an additional stream of Twitter messages representing info from Hackystat.

How might we test these hypotheses? As an initial step, I propose a simple pilot study in which a software engineering class works on a group project in the "normal" way for a month, then installs Hackystat+Twitter and continues on. After the second month, a questionnaire is administered to get feedback from the students on how their communication and coordination changed from month one to month two, and what benefits and problems the introduction of Hackystat + Twitter created.

If this experience is successful, then we refine our experimental method and move on to an industrial case study with more longitudinal data collection. For example, we could build upon the case study by Ko, Deline, and Venolia to see if Hackystat+Twitter reduces the activities engaged in by developers in order to know what their co-workers on a project are currently doing.

What would we call this? I still like the term Project Proprioception.

Sunday, September 16, 2007


I came across the Panopticode project today. It is an interesting approach to metrics aggregation. They motivate their approach by listing the following limitations of current Java metrics tools:
  1. Installation and configuration can be difficult and time consuming
  2. Most only measure the status at a point in time and have no concept of historical data
  3. Each tool represents its data in a proprietary format
  4. It is very difficult to correlate data from different tools
  5. It can be difficult to switch between competing tools that provide similar functions
  6. Most have extremely limited reporting and visualization capabilities
Of course, I agree absolutely. Panopticode provides a way to simplify the download, installation, and invocation of the following tools so far: Emma, JDepend, Checkstyle, and JavaNCSS. It also provides an interesting visualization of the results called TreeMaps.

There are some substantial differences between their approach and ours in Hackystat:
  • Panopticode limits itself to the world of Java code built using Ant. This is the simplifying assumption they use to achieve (1). Hackystat is agnostic with respect to language and build technology.
  • Current reports do not appear to include history, so I don't know how they plan to provide (2). Hackystat includes a domain specific language for analysis and visualization of project history called Software Project Telemetry. This also provides a general solution to problem (4) of correlating data from different tools. Panopticode does not appear to provide a solution to (4), at least from perusing the gallery and documentation pages. I will also be interested to see how they create a scalable solution as the toolset grows to, say, 30 or 40. This is a hard problem that the Telemetry DSL addresses.
  • While I agree with statement (6) that current reporting tools have an extremely limited reporting and visualization capability, Panopticode seems to currently suffer from that same problem :-) Hackystat, at least with Version 8, will break out of the Java JSP WebApp prison with an architecture well suited to a variety of reporting and visualization approaches, including Ambient devices, Ruby on Rails, GWT, and so forth. Finally, while TreeMaps are certainly sexy, I don't really see how they are fundamentally better than the unsexy HTML reports of JavaNCSS, Emma, etc. (at least, I don't see it given the way Panopticode uses TreeMaps at present). If I am trying to find low coverage, Emma's HTML interface gets me there just about as easily as the TreeMap does. TreeMaps are cute and all, but they feel more like syntactic sugar than some fundamental interface paradigm shift.
The project is in its bootstrapping phases, so in some sense it's not fair to compare it to Hackystat, which is in its 6th year of public release. I also think it's an interesting decision to limit oneself to Java/Ant, which I think can really simplify certain problems that Hackystat faces in order to appeal to a broader audience. I look forward to seeing how this project progresses in the future.

Thursday, September 6, 2007

CodeRuler Tournament Setup

I keep having to re-learn how to hack the CodeRuler package each semester for the in-class tournament, so here are the steps:

  1. Expand the file in the eclipse/plugins directory. Move the actual jar file out of the way, so that Eclipse will load plugins/
  2. You will need to update two kinds of files in this directory and restart Eclipse to see the changes. The first file is META-INF/ Second, in the directory, there are pairs of .class and .xml files that implement the sample rulers.
  3. Go back into Eclipse, create a Games project, and create a MyRuler. Copy the student MyRuler implementation into your template MyRuler. Then, refactor this MyRuler to be in the package (still in your Games project directory), and with the class name <StudentName>. Save the file to compile the code and generate the <StudentName>.class file in the bin directory.
  4. Finally, copy the <StudentName>.class file into the eclipse/plugins/ directory, create the corresponding .xml file, and update the file.
  5. Restart Eclipse and the student code ruler should now be present as a Sample.
When I do the tournament, I first have the students go through their code in front of the class to explain their strategy. This approach makes it easy to control the tournament and also have the code right at hand for review.

Tuesday, August 28, 2007

Grid Computing, Java, and Hackystat

I just finished watching a really interesting screencast called "Grid Application in 15 minutes" that features GridGain, a new open source grid computing framework in Java. See their home page for the link to the screencast.

Things I found interesting while watching the screencast:
  • It uses some advanced Java features (Generics, Annotations, AOP) to dramatically reduce the number of lines of code required to grid-enable a conventional application.
  • It is a nice example of how to use the Eclipse framework to maximize the amount of code that Eclipse writes for you and minimize the amount that you have to type yourself.
I think there are some really interesting opportunities in Hackystat for grid computing. Many computations related to DailyProjectData and Telemetry (for example) are "embarrassingly parallel" and GridGain seems like the shortest path to exploiting this domain attribute.
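To make the "embarrassingly parallel" point concrete, here is a minimal sketch in plain Java. It uses the JDK's parallel streams as a stand-in for a grid and does not use the GridGain API at all; the DailySummary class, its methods, and the idea of summing event counts are assumptions for illustration, not Hackystat code.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a per-day analysis such as DailyProjectData.
final class DailySummary {

  // Summarizes one day in isolation. No day depends on any other day,
  // which is what makes the overall computation embarrassingly parallel.
  static int summarize(List<Integer> eventCountsForDay) {
    return eventCountsForDay.stream().mapToInt(Integer::intValue).sum();
  }

  // Computes every day's summary independently. parallelStream() lets the JDK
  // spread the per-day work across local cores; a grid would spread it across nodes.
  static Map<LocalDate, Integer> summarizeAll(Map<LocalDate, List<Integer>> rawData) {
    return rawData.entrySet().parallelStream()
        .collect(Collectors.toMap(Map.Entry::getKey, e -> summarize(e.getValue())));
  }
}
```

Because each call to summarize touches only its own day's data, the work can be split up without any coordination between workers, which is the property a grid framework would exploit.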

Thursday, August 23, 2007

Project Proprioception

In the latest issue of Wired Magazine, there is an interesting article in defense of Twitter. One thing the author says is that you can't really understand Twitter unless you actually do it (which might explain why I don't really understand Twitter.)

He goes on to say that the benefit of Twitter is "social proprioception":

When I see that my friend Misha is "waiting at Genius Bar to send my MacBook to the shop," that's not much information. But when I get such granular updates every day for a month, I know a lot more about her. And when my four closest friends and worldmates send me dozens of updates a week for five months, I begin to develop an almost telepathic awareness of the people most important to me.

It's like proprioception, your body's ability to know where your limbs are. That subliminal sense of orientation is crucial for coordination: It keeps you from accidentally bumping into objects, and it makes possible amazing feats of balance and dexterity.

Twitter and other constant-contact media create social proprioception. They give a group of people a sense of itself, making possible weird, fascinating feats of coordination.

Aha! That makes a lot more sense to me, and also suggests the following hypothesis:

Hackystat's fine-grained data collection capabilities can support "project proprioception": the ability for a group of developers to have "a sense of themselves" within a given software development project.

I think that DevEvents and Builds and so forth support a certain level of project proprioception without any further interaction with the developer. But, what if Hackystat had a kind of "Twitter sensor", in which developers could post small nuggets of information about what they were thinking about or struggling with that could be combined with the DevEvents:

  • "Trying to figure out the JAXB newInstance API"
  • "WTF is with this RunTime Exception?"
  • "General housecleaning for the Milestone Release"
  • "Pair Programming With Pavel"
  • "Reviewing the Ant Sensor"
  • "Upgrading Tomcat"
Now imagine these messages being combined with the other Hackystat DevEvents and being visualized using something like Simile/Timeline. Further, imagine the timeline being integrated into a widget with a near-real-time nature like the Sensor Data Viewer, such that you could see the HackyTwitter information along with occurrences of builds, tests, and commits scrolling by on a little window in the corner of your screen. Would this enable "weird, fascinating feats of coordination" within a software development project?
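As a rough sketch of what "combining" the two streams could mean, the following Java interleaves developer posts with sensor events by timestamp so that a timeline widget could display them in order. The TimelineEvent and ProjectTimeline names, and the fields they carry, are hypothetical; none of this is existing Hackystat or Simile/Timeline code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical event type: a timestamp in milliseconds, a source label, and a short message.
final class TimelineEvent {
  final long timestamp;
  final String source;
  final String message;

  TimelineEvent(long timestamp, String source, String message) {
    this.timestamp = timestamp;
    this.source = source;
    this.message = message;
  }
}

final class ProjectTimeline {

  // Interleaves developer status posts with sensor events, oldest first.
  // The input lists are left unmodified.
  static List<TimelineEvent> merge(List<TimelineEvent> posts, List<TimelineEvent> sensorEvents) {
    List<TimelineEvent> merged = new ArrayList<>(posts);
    merged.addAll(sensorEvents);
    merged.sort(Comparator.comparingLong((TimelineEvent e) -> e.timestamp));
    return merged;
  }
}
```

A near-real-time viewer would presumably append to such a list as events arrive rather than re-sorting, but the merged, time-ordered view is the same idea.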

Sounds cool to me.

Wednesday, August 8, 2007

Web application development

Here's a really nice "screen cast" that compares web development in several different languages/frameworks (J2EE, Ruby on Rails, Zope/Plone, TurboGears, etc.)


A few of the things I found interesting:
  • Presentation style is quite different from standard Powerpoint "Title plus bullet list". I would love to evolve to his style for my lectures.
  • Provides evidence that we made the right choice for the new ICS website. :-)
  • One of the more compelling illustrations I've seen of the differences between Ruby on Rails and Java/J2EE for web development. RoR beats J2EE by a mile, but doesn't win overall.
It's fairly long but held my interest all of the way through.

If you want to learn how he puts these presentations together, see here.

Wednesday, July 11, 2007

Empirical Software Engineering and Web 3.0

I came across two interesting web pages today that started me thinking about empirical software engineering in general and Hackystat in particular with respect to the future of web technology.

The first page contains an interview with Tim Berners-Lee on the Semantic Web. In his response to the request to describe the Semantic Web in simple terms, he talks about the lack of interoperability between the data in your mailer, PDA calendar, phone, etc. and pages on the web. The idea of the Semantic Web, I guess, is to add sufficient semantic tagging to the Web to provide seamlessness between your "internal" data and the web's "external" data. So, for example, any web page containing a description of an event would contain enough tagging that you could, say, right click on the page and have the event added to your calendar.

There is a link on that page to another article by Nova Spivack on Web 3.0. It contains the following visualization of the web's evolution:

To me, what's interesting about this is the transition we're now in between Web 2.0, which is primarily focused on user-generated, manual "tagging" of pages, and Web 3.0, where this kind of "external" tagging will be augmented by "internal" tagging that provides additional semantics about the content of the web document.

It seems to me that the combination of internal and external tagging can provide interesting new opportunities for empirical software engineering. Let's go back to Tim Berners-Lee's analogy for a second: it's easy to transpose this analogy to the domain of software development. Currently, a development project produces a wide range of artifacts: requirements documents, source code, documentation, test plans, test results, coverage, defect reports, and so forth. All of these evolve over time, all are related to each other, and I would claim that all use (if anything) a form of "external" tagging to show relationships. For example, a configuration management system enables a certain kind of "tagging" between artifacts which is temporal in nature. Some issue management systems, like Jira, will parse CVS commit messages looking for Issue IDs and use that to generate linkages between Issues and the modifications to the code base related to them.
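The commit-message style of "external" tagging can be sketched in a few lines of Java. This is only an illustration of the idea, not how Jira actually implements it; the IssueLinker class is hypothetical, and the assumed key format is an uppercase project key, a hyphen, and a number (for example HACK-12).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that finds issue keys mentioned in a commit message.
final class IssueLinker {

  // Assumed key format: uppercase project key, hyphen, number (e.g. HACK-12).
  private static final Pattern ISSUE_KEY = Pattern.compile("\\b([A-Z][A-Z0-9]+-\\d+)\\b");

  // Returns the issue keys in the order they appear in the message.
  static List<String> issueKeys(String commitMessage) {
    List<String> keys = new ArrayList<>();
    Matcher matcher = ISSUE_KEY.matcher(commitMessage);
    while (matcher.find()) {
      keys.add(matcher.group(1));
    }
    return keys;
  }
}
```

Each key found this way becomes a link between an issue and a change to the code base, which is exactly the kind of relationship that lives outside the artifacts themselves.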

Nova Spivack adds a few other technologies to the Web 3.0 mix besides the Semantic Web and its "internal" tagging:
  • Ubiquitous connectivity
  • Software as service
  • Distributed computing
  • Open APIs
  • Open Data
  • Open Identity
  • Open Reputation
  • Autonomous Agents
The "Open" stuff is especially interesting to me in light of the recent emergence of "evaluation" sites for open source software such as Ohloh, SQO-OSS, and Coverity/Scan. Each of these sites is trying to address the issue of how to evaluate open source quality. Each of them is more-or-less trying to do it within the confines of Web 2.0.

Evaluation of open source software is an interesting focus for the application of Web 3.0 to empirical software engineering, because open source development is already fairly transparent and accessible to the research community, and also because a growing number of open source systems are becoming mission-critical to organizational and governmental infrastructure. The Coverity/Scan effort was financed by the Department of Homeland Security, for example.

Back to Hackystat. It seems to me that Hackystat sensors are, in some sense, an attempt to take a software engineering artifact (the sensor data "Resource" field, in Hackystat 8 terminology), and retrofit Web 3.0 semantics on top of it (the SensorDataType field being a simple example). The RESTful Hackystat 8 services are then a way to "republish" aspects of these artifacts in a Web 3.0 format (i.e. as Resources with a unique URI and an XML representation). What is currently lacking in Hackystat 8 is the ability to obtain a resource in RDF representation rather than our home-grown XML, but that is a very small step from where we are now.

There is a lot more thinking I'd like to do on this topic (probably enough for an NSF proposal), but I need to stop this post now. So, I'll conclude with three research questions at the intersection of Web 3.0 and empirical software engineering:

  • Can Web 3.0 improve our ability to evaluate the quality/security/etc. of open source software development projects?
  • Can Web 3.0 improve our ability to create a credible representation of an open source programmer's skills?
  • Can Web 3.0 improve our ability to create autonomous agents that can provide more help in supporting the software development process?

Tuesday, June 19, 2007

Bile Blog, Google Project Hosting, and Download Deletion

First off, a pretty hilarious Bile Blog posting on Google Project Hosting.

Even better, one of the comments describes how to delete download files:
  1. Click on the "Summary + Labels" link for the file you wish to delete.
  2. Click on the "Click to edit download" link
  3. A "Delete" link will now appear in the toolbar.

Monday, May 28, 2007

Restful Resources

This past week, I've come across a couple of really useful resources for REST style architectural development that I can recommend:

The first is the O'Reilly book "Restful Web Services". I'm about halfway through and it has already illuminated some dark corners of Restful web service design, such as:
  • When to use POST vs. PUT. (Use POST when the server is responsible for generating the URI of the associated resource; use PUT when the client is responsible for generating the URI).
  • Authentication.
The authors make a point of distinguishing between REST in general and REST when applied to web service design, for which they describe a set of concrete best practices they call "Resource Oriented Architecture". This is very nice, and reminds me of Effective Java, which provides a set of best practices for Java software development.
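The POST versus PUT rule of thumb above can be sketched with a tiny in-memory store. This is an illustration only, not Restlet or Hackystat code: the ResourceStore class, its method names, and the way it numbers new URIs are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory resource store used to contrast POST and PUT.
final class ResourceStore {
  private final Map<String, String> resources = new HashMap<>();
  private int nextId = 1;

  // POST: the client supplies only the content; the server chooses the URI and returns it.
  String post(String collectionUri, String content) {
    String uri = collectionUri + "/" + nextId++;
    resources.put(uri, content);
    return uri;
  }

  // PUT: the client names the URI; the server creates or replaces whatever lives there.
  void put(String uri, String content) {
    resources.put(uri, content);
  }

  // GET: returns the stored content, or null if nothing exists at that URI.
  String get(String uri) {
    return resources.get(uri);
  }
}
```

Note that repeating the same PUT leaves the store in the same state, while repeating a POST creates a second resource with a new URI.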

The second REST resource I would like to recommend is the "Poster" plugin for FireFox. Poster enables you to make GET, PUT, POST, and DELETE http calls from within FireFox and see the results. It is a nice way to obtain a sanity check on what your web service is doing when you don't quite understand why your unit tests are failing.

Tuesday, May 22, 2007

Hackystat on Ohloh

I came across Ohloh recently, and decided to create Ohloh projects for Hackystat-6, Hackystat-7, and Hackystat-8. Ohloh is a kind of directory/evaluation service for Open Source projects that generates statistics by crawling the configuration management repository associated with the project. It also generates some pretty interesting data about individual committers.

There are a lot of things I found interesting about the Hackystat-7 Ohloh project:
  • The Hackystat development history is quite truncated and only goes back a year and a half (basically when we switched to Subversion). I consulted the FAQ, where I learned that if I also point Ohloh at our old CVS repository for Hackystat 6, it will end up double counting the code. Oh well. That's why there are three unconnected projects for the last three versions of Hackystat.
  • They calculate that the Hackystat-7 code base represented 65 person-years of effort and about $3.5M investment. I think that's rather low, but then again, they only had 18 months of data to look at. :-)
  • There is more XML than Java in Hackystat-7. That's a rather interesting insight into the documentation burden associated with that architecture. I hope we can reduce this in Hackystat-8.
  • The contributor analyses are very interesting as well; here's mine. This combines the stuff from all three Hackystat projects. I find the Simile/Timeline representation of my commit history particularly cool.
There are a number of interesting collaborative possibilities between Hackystat and Ohloh, which I will post about later. If you have your own ideas, I'm all ears.

Finally, it seems pretty clear from their URLs that they are using a RESTful web service architecture.

There are several other active CSDL open source projects that we could add to Ohloh: Jupiter, LOCC, SCLC.

Friday, May 11, 2007

Sample Restlet Application for Hackystat

I decided to get my feet wet with Restlet by building a small little server with the following API:


The idea is that it will retrieve and display {filename}, which is an instance of the "file" resource.

This was a nice way to wade into the Restlet framework; more than a Hello World app, lacking stuff we don't need (like Virtual Hosts), and requiring stuff we do (URL-based dispatching).

To see what I did, check out the following 'samplerestlet' module from SVN:


To build and run it:
  1. Download Restlet-1.0, unzip, and point a RESTLET_HOME env variable at it.
  2. Build the system with "ant jar"
  3. Run the result with "java -jar samplerestlet.jar"
This starts up an HTTP server that listens on port 9876.

Try retrieving the following in your browser: http://localhost:9876/samplerestlet/file/build.xml

Now look through the following three files, each only about 50 LOC:
  1. build.xml, which shows what's needed to build the executable jar file.
  2., which creates the server and dispatches to a FileResource instance to handle URLs satisfying the file/{filename} pattern.
  3., which handles those GET requests by reading the file from disk and returning a string representation of it.
If this looks confusing, the Restlet Tutorial is a reasonable introduction. There's also a pretty good Powerpoint presentation that introduces both REST architectural design and the Restlet Framework at the Restlet Wiki Home Page. It comes with some decent sample code as well.
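For readers who just want the gist of URL-based dispatching without checking out the module, here is a minimal sketch in plain Java. It uses a regular expression rather than the Restlet router, so it is not the code in the samplerestlet module; the FileRoute class is hypothetical, and the path prefix is taken from the example URL above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical matcher for the file/{filename} URL pattern.
final class FileRoute {

  // Mirrors /samplerestlet/file/{filename}; the {filename} segment may not contain a slash.
  private static final Pattern ROUTE = Pattern.compile("^/samplerestlet/file/([^/]+)$");

  // Returns the {filename} part when the path matches the pattern, otherwise empty.
  static Optional<String> filename(String requestPath) {
    Matcher matcher = ROUTE.matcher(requestPath);
    return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
  }
}
```

A real dispatcher would hand the extracted filename to a resource handler, which then decides how to represent the file in the response.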

Next step is to add this kind of URL processing to SensorBase.

Wednesday, May 9, 2007

How to browse HTML correctly from SVN repositories

I just committed a bunch of HTML files to SVN, then realized that they don't display as HTML when you browse the repository. After painfully reconstructing the solution, I figured it would be good to jot a note on how to deal with this.

First, you need to fix the svn:mime-type property on all of your committed HTML files. To do this, use your SVN client to select all of the HTML files under SVN control, then set their svn:mime-type property to text/html, then commit these changes. That fixes the current files.

To ensure that all of your current and future HTML files are committed from the get-go with the svn:mime-type property set to text/html, you have to use the SVN auto-props feature. What that basically means is that you have to edit the file called "config" in your local SVN installation, and uncomment the following line in its [miscellany] section:

enable-auto-props = yes

Then you have to add a new line to the [auto-props] section of the same file that looks like this:

*.html = svn:mime-type=text/html

Finally, you (potentially) need to instruct your SVN client to consider auto-props when doing its commits. For example, in SmartSVN, you have to go to Projects | Default Settings | Working Copy and check "Apply Auto-Props from SVN 'config' file to added files".

In TortoiseSVN, there is a "Settings" menu that allows you to edit the 'config' file in a similar manner.

Tuesday, May 8, 2007

JAXB for Hackystat for Dummies

I spent today working through the XML/Java conversion process for SensorBase resources, and it occurred to me near the end that my struggles could significantly shorten the learning curve for others writing higher level services that consume SensorBase data (such as the UI services being built by Alexey, Pavel, and David.)

So, I did a quick writeup on the approach, in which I refer to a library jar file I have made available as the first SensorBase download.

After so many years using JDOM, which was nice in its own way, it is great to move onward to an even faster, simpler, and easier approach.

Saturday, May 5, 2007

How to start a new software development project

Alexey made an engineering log post in which he wonders how to get started with a summer job in which he will be asked to develop a "simple client-server system" for decision process simulation. Here's my advice:

1. Create a Google project to host all of your code and documentation. You're going to need to put stuff somewhere. Putting it in a public repository is good for at least two reasons: (a) you get a lot of useful infrastructure (svn, mailing lists, issue tracking, wiki) for free, and (b) your sponsors will feel better about you by having open access to what you're doing. Such transparency is a good thing: it will encourage more effective communication between them and you about the status of the project. If you just show up each week for a meeting and say, "Things are going good", it's easy for the project progress to stall for quite a while before that's apparent. If your project is hosted, then you can show up each week, review with them the issues that you're working on, and show them demos or code or requirements or whatever. The more they understand what you're doing at all points in the process, the happier they will be with you and the more likely you are to succeed.

2. Once you have your project repository infrastructure set up, create a wiki page containing the design of a REST API for client-server communication. Having just created a REST API for the SensorBase, which is itself a "simple client-server system", I can heartily recommend this approach to exploring the requirements for your system. Basically, start asking yourself what the "resources" are in your application, and how the clients will manipulate these resources on the server. At this point, you don't worry too much about the specific look-and-feel of the interface; you're more focussed on the essential information instances and their structure. Of course, you can and should get feedback from your sponsors about your proposed set of resources and the operations available upon them. Having this API specification available makes getting into coding a breeze, as I discovered yesterday. I previously posted a few links that I found useful in learning about REST.

3. Once you feel comfortable with your API and thus understand what resources exist and how they are manipulated, create a mockup of the user interface. This helps you figure out what user interface you need, and what technology you might want to use---GWT, Ruby on Rails, plain old Java, or even .NET. Since REST is an architectural "style", not a technology, your work defining the API will not be wasted regardless of what user interface infrastructure you choose.

4. Apply the project management and quality assurance skills you learned in 613. Create unit tests and monitor your coverage. Associate each SVN commit with an Issue so that your issue list also constitutes a complete Change Log. Create intermediate milestones that you review with your sponsor. Request code reviews from fellow CSDL members. Maintain an Engineering Log with regular entries, and encourage your sponsors to read it so that they know what you're thinking about and the problems you are encountering. Use static and dynamic analysis tools to automate quality assurance whenever possible; for example, if you are programming in Java, use Checkstyle, PMD, FindBugs, and the Eclipse warnings utility.

5. Finally, be confident in your ability to learn what is required to successfully complete this project. A common project risk factor is developers who feel insecure about not knowing everything they need to know about the application domain in advance, and as a result try to "hide" that fact from their sponsors. In any interesting development project, there are going to be things you don't know in advance, and paths you follow that turn out to be wrong and require backtracking. The goal of software engineering practices is to make those risks obvious, and put in place mechanisms to discover the wrong paths as fast as possible. You will make mistakes along the way. Everyone does; we're only human.

I hope that you will have the experience Pavel had when instituting these same kinds of practices for his bioinformatics RAship. He told me that his sponsor was delighted by his use of Google Projects and an Engineering Log to make his work more visible and accessible. I would love to see a similar outcome for you in this project.

Friday, May 4, 2007

SensorBase coding has begun!

To my great delight (given that the May 15 milestone is rapidly approaching) I have committed my first bit of SensorBase code today.

Some interesting tidbits:

First, I am continuing to observe the Hackystat tradition of always including a reference to an Issue in the SVN commit message. In this case, the reference looks like:

Second, to my surprise, I am coding 100% in a TDD style, not out of any philosophical commitment or moral imperative, but simply out of the sense that this is just the most natural way to start to get some leverage on the SensorBase implementation. The REST API specification turns out to form a very nice specification of the target behavior, and so I just picked the first URI in the table (GET host/hackystat/sensordatatypes) which is supposed to return a list of sensordatatype resource references, and wrote a unit test which tries that call on a server. Of course, the test fails, because I haven't written the server yet.

Third, to my relief, the Restlet framework makes that test case wicked easy to write. In fact, here it is:

@Test public void getSdtIndex () {
  // Set up the call.
  Method method = Method.GET;
  Reference reference = new Reference("http://localhost:9090/hackystat/sensordatatypes");
  Request request = new Request(method, reference);

  // Make the call.
  Client client = new Client(Protocol.HTTP);
  Response response = client.handle(request);

  // Test that the request was received and processed by the server OK.
  assertTrue("Testing for successful status", response.getStatus().isSuccess());

  // Now test that the response is OK.
  XmlRepresentation data = response.getEntityAsSax();
  assertEquals("Checking SDT", "SampleSDT", data.getText("SensorDataTypes/SensorDataType/@Name"));
}

There are a couple of rough edges (I can't hard code the server URI, and my XPath is probably bogus), but the code runs and does the right thing (i.e. fails at the getStatus call with a connection not found error.)
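One way to smooth the first rough edge would be to read the server location from a system property and fall back to the local default. The following is only a sketch under that assumption: the TestServerConfig class and the "sensorbase.host" property name are hypothetical, and the default simply repeats the URI hard-coded in the test above.

```java
// Hypothetical helper that keeps the server URI out of individual test methods.
final class TestServerConfig {

  // Assumed property name; set it with -Dsensorbase.host=... when running the tests.
  static final String HOST_PROPERTY = "sensorbase.host";

  // Default matches the URI used in the hard-coded test.
  static final String DEFAULT_HOST = "http://localhost:9090/hackystat/";

  // Builds the sensordatatypes URI from the configured (or default) host.
  static String sensorDataTypesUri() {
    String host = System.getProperty(HOST_PROPERTY, DEFAULT_HOST);
    // Tolerate a host given with or without a trailing slash.
    if (!host.endsWith("/")) {
      host = host + "/";
    }
    return host + "sensordatatypes";
  }
}
```

The test would then pass TestServerConfig.sensorDataTypesUri() to the Reference constructor instead of a string literal.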

I'm sure things won't be this easy forever, but it's nice to get off to a smooth start.

Thursday, May 3, 2007

Minimize library imports to your Google Projects

As we transition to Google Project Hosting, one thing we need to be particularly careful about is uploading of third party libraries into SVN. In general, try to avoid doing this. There are two reasons for this. First, there is limited disk space in Google Project Hosting, and it's easy to burn up your space with libraries (remember that since SVN never deletes, libraries that need updating frequently will burn through your space quickly.) Second, different services will often share the same library. For example, most of our Java-based services will probably want to use the Restlet framework. It is generally better to install that in one place as a developer.

To avoid uploading libraries to SVN, you can generally do one of the following alternatives:

  1. Instruct your developers in the installation guide to download the library to a local directory, create an environment variable called {LIBRARY}_HOME, and point to those jar files from your Eclipse classpath or Ant environment variable.
  2. For files that need to be in a specific location in your project, such as GWT, download the GWT to a local directory, then copy the relevant subdirectories into your project.
Binary distributions of releases are a different situation. In that case, we will typically want to bundle the libraries into the binary distribution. That will cause its own difficulties, since Google Project Hosting limits us to 10MB files for the download section, but we'll cross that bridge when we come to it.

H4 with Robert

I had an entertaining and enjoyable H4 with Robert yesterday. We spent the time fooling around with the sample Restlet framework applications. I was a bit worried about whether we would be able to make progress, since the samples were almost totally undocumented (the documentation points you to a directory containing sample code, which turns out to be sample code from an as-yet-unpublished O'Reilly book on the Restlet framework.)

Presumably with the book in our hands, we would have had complete instructions on how to run the examples.

Without the book in our hands, we just blasted ahead, created an Eclipse project, imported the code, looked for public static void main() methods, and ran them.

To my surprise and delight, after some traipsing around the lib directory and guessing at jar files to add to the classpath, we eventually got all of the sample code to run.

So, the bad news is that the Restlet Framework examples are poorly documented, so buy the book when it comes out. The good news is that, given a few lucky guesses, you really don't need any documentation, at least to get them up and running.

Monday, April 30, 2007

Xml Schema definition for dummies

Today I defined my first batch of Xml Schemas for Version 8. The results of my labors are now available at

For each XSD file, I also provide a couple of "example" XML files, available in

To test that the XSD validates the XML, I used the online DecisionSoft Xml Validator. Provide it with an XSD schema definition file, and an XML file to validate against it, and away it goes. The error messages were a little obtuse, but good enough for my purposes.

It's possible to include a reference to the XSD file within the XML instance, which is probably what we want to do in practice.
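If you would rather run the same check locally than through an online validator, the JDK's javax.xml.validation package can do it. This is a minimal sketch, not part of the Version 8 code: the XsdCheck class is hypothetical, and it takes the schema and the instance as strings only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper that validates an XML document against an XSD schema.
final class XsdCheck {

  // Returns true when the XML text validates against the XSD text.
  static boolean isValid(String xsd, String xml) {
    try {
      SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
      Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
      schema.newValidator().validate(new StreamSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      // Thrown both for a malformed schema and for an instance that fails validation.
      return false;
    } catch (IOException e) {
      // Not expected when reading from in-memory strings.
      throw new UncheckedIOException(e);
    }
  }
}
```

In practice you would probably pass File-based sources instead of strings, and log the SAXException message, which tends to say which element or attribute failed.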

The next step is to parse the XML. Here's a nice example of using JAXB 2.0 to parse XML efficiently (both in terms of code size and execution time).

Saturday, April 28, 2007

Version 8 Design Progress

Lots of progress this week on the design of Version 8. There is a milestone on May 15, just over two weeks away, and I've been fleshing out a bunch of pages in order to add some details and direction as to what we might want to try to accomplish by then.

Finally, I gave a talk on REST in ICS 414 yesterday, and noticed the following blog entry by Josh about REST in general and the implications for Hackystat understandability and usability in particular. This gives me hope that we're heading in the right direction!

Wednesday, April 25, 2007

H4 with David

Had an excellent H4 session with David last week, but unfortunately spaced out blogging about it until now. We spent the time looking over his JEdit sensor and trying to figure out how to get it to be correctly noticed as a plugin by JEdit at startup time. Found one significant problem during the session (the plugin main class was not named correctly), and one more significant problem after the session (the build directory was being used to store source code in the SVN repository.)

What's the moral of this story? Basically the obvious one: two heads are better than one, and the process of explaining your code to someone else has the potential to be an excellent way to reveal issues and problems in a very cost-effective manner.

Friday, April 20, 2007

Near real time communication using Restlet

Some services (such as a UI service to watch the arrival of sensor data at a SensorBase) want "near real time" communication, using something like Jabber. There is a new project that integrates XMPP and Restlet which might be quite useful for this:

David might want to check this out.

Wednesday, April 18, 2007

Why REST for Hackystat?

Cedric asked a really good question on the hackystat mailing list today, and I thought it was worth posting to this blog:

> Probably my question is too late since you have already decide use REST, but I want to
> know the rationale behind it.
> Since you are still returning data in xml format, what makes you decide not to publish
> a collection of WSDL and go along with more industrial standard web service calls?

Excellent question! No, it's not too late at all. This is exactly the right time to be discussing this kind of thing.

It turns out that when I started the Version 8 design process, I was still thinking in terms of a monolithic server and was heading down the SOAP/WSDL route. I was, for example, investigating Glassfish as an alternative to Tomcat due to its purportedly better support for web services.

Then the Version 8 design process took an unexpected turn, and the monolithic server fragmented into a set of communicating services: SensorBase services for raw sensor data, Analysis services that would request data from SensorBases and provide higher level abstractions, and UI services that would request data from SensorBases and Analyses and display it with a user interface.

What worried me about this design initially was that every Analysis service would have to be able to both produce and consume data (kind of like being a web server and a web browser at the same time), and that Glassfish might be overkill for this situation. So, I started looking for a lightweight Java-based framework for producing/consuming web services, and came upon the Restlet Framework, which then got me thinking more deeply about REST.

It's hard to quickly sum up the differences between REST and WSDL, but here are a few thoughts to get you started. WSDL is basically based upon the remote procedure call architectural style, with HTTP used as a "tunnel". As a result, you generally have a single "endpoint", or URL, such as /soap/servlet/messagerouter, that is used for all communication. Every single communication with the service, whether it is to "get" data from the service, "put" data to the service, or modify existing data, is always implemented (from an HTTP perspective) in exactly the same way: an HTTP POST to a single URL. From the perspective of HTTP, the "meaning" of the request is completely opaque.

In REST, in contrast, you design your system so that your URLs actually "mean" something: they name a "resource". Furthermore, the type of HTTP method also "means" something: GET means "get" a representation of the resource named by the URL, POST means "create" a new resource which will have a unique URL as its name, DELETE means "delete" the resource named by the URL, and so forth.

For example, in Hackystat Version 7, to send sensor data to the server, we use Axis, SOAP, and WSDL to send an HTTP POST to a single server URL, and the content of the message indicates that we want to create some sensor data. All sensor data, of all types, for all users, is sent to the same URL in the same way. If we wanted to enable programmatic access to sensor data in Version 7, we would tell clients to continue to use HTTP POST to that same URL, but tell them that the content of the POST could now invoke a method in the server to obtain data.

A RESTful interface does it differently: to request data, you use GET with an URL that identifies the data you want. To put data, you use POST with an URL that identifies the resource you are creating on the server. For example:


might return the Commit sensor data with timestamp 1176759070170 for user x3fhU784vcEW. Similarly,


would contain a payload with the actual Commit data contents that should be created on the server. And


would delete that resource. (There are authentication issues, of course.)
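
To make the three requests above concrete, here is a minimal Java sketch. The host, port, and path layout are assumptions made up for illustration (they are not the actual SensorBase URLs), and the sketch uses the java.net.http classes from Java 11, which postdate this post. It only builds the requests; nothing is sent over the network.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class SensorDataRequests {
  // Hypothetical base URL; the real SensorBase address is not given here.
  static final String BASE = "http://localhost:9876/sensorbase";

  /** Names one sensor data resource. The /sensordata/user/type/timestamp layout is an assumption. */
  static URI resource(String user, String type, long timestamp) {
    return URI.create(BASE + "/sensordata/" + user + "/" + type + "/" + timestamp);
  }

  /** GET: ask for a representation of the resource named by the URL. */
  static HttpRequest read(URI resource) {
    return HttpRequest.newBuilder(resource).GET().build();
  }

  /** POST: create the resource, with the sensor data as the payload. */
  static HttpRequest create(URI resource, String payload) {
    return HttpRequest.newBuilder(resource)
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();
  }

  /** DELETE: remove the resource named by the URL. */
  static HttpRequest delete(URI resource) {
    return HttpRequest.newBuilder(resource).DELETE().build();
  }

  public static void main(String[] args) {
    URI commit = resource("x3fhU784vcEW", "Commit", 1176759070170L);
    HttpRequest[] requests = {
      read(commit), create(commit, "<SensorData/>"), delete(commit)
    };
    for (HttpRequest request : requests) {
      // The method and the URL together carry the "meaning" of each request.
      System.out.println(request.method() + " " + request.uri());
    }
  }
}
```

The point of the sketch is that the same URL appears in all three requests, and only the HTTP method changes.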

In fact, REST asserts a direct correspondence between the CRUD (create/read/update/delete) DB operations and the POST, GET, PUT, and DELETE methods for resources named by URLs.

Now, why do we care? What's so good about REST anyway? In the case of Hackystat, I think there are two really significant advantages of a RESTfully designed system over an RPC/SOAP/WSDL designed system:

(1) Caching can be done by the Internet. If you obey a few more principles when designing your system, then you can use HTTP techniques as a way to cache data rather than build in your own caching system. It's exactly the same way that your browser avoids going back to Amazon to get the logo files and so forth when you move between pages. In the case of Hackystat, when someone invokes a GET on the SensorBase with a specific URL, the results can be transparently cached to speed up future GETs of the same URL, since that represents the same resource. (There are cache expiration issues, which I'm pretty sure we can deal with.)

In Hackystat Version 7, there is a huge amount of code that is devoted to caching, and this code is also a huge source of bugs and concurrency issues. With a REST architecture, it is possible that most, perhaps all, of this code can be completely eliminated without a performance hit. Indeed, performance might actually be significantly better in Version 8.
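
One standard HTTP technique that fits this description is validation with entity tags: the server labels each representation with an ETag, and a client or intermediate cache that already holds a copy sends that tag back in If-None-Match. The sketch below shows only the server-side decision, heavily simplified (real If-None-Match headers can carry lists and weak validators). The class name and the hash-based tag scheme are assumptions for illustration, not anything Hackystat is known to implement.

```java
/** Simplified sketch of HTTP validation caching for a GET on one URL. */
class ConditionalGetSketch {
  static final int OK = 200;
  static final int NOT_MODIFIED = 304;

  /** Derives a quoted entity tag from a representation (hypothetical scheme: hash of the body). */
  static String etagFor(String representation) {
    return "\"" + Integer.toHexString(representation.hashCode()) + "\"";
  }

  /**
   * Returns the status a server would send for a GET.
   * ifNoneMatch is the tag the client or cache sent, or null if it sent none.
   */
  static int statusFor(String currentRepresentation, String ifNoneMatch) {
    // Same tag means the cached copy is still good, so no body needs to be resent.
    return etagFor(currentRepresentation).equals(ifNoneMatch) ? NOT_MODIFIED : OK;
  }

  public static void main(String[] args) {
    String data = "<SensorData type=\"Commit\"/>";
    System.out.println("first request: " + statusFor(data, null));
    System.out.println("repeat request: " + statusFor(data, etagFor(data)));
  }
}
```

Because the decision depends only on the URL's current representation and the tag in the request, any cache between the client and the SensorBase can participate without application-specific caching code.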

(2) A REST API is substantially more "accessible" than a WSDL API. One thing I want from Hackystat Version 8 is a substantially simpler, more accessible interface, that enables outsiders to quickly learn how to extend Hackystat for their own purposes with new services and/or extract low-level or high-level data from Hackystat for their own analyses. To do this with a RESTful API, it's straightforward: here are some URLs, here's how they translate into resources, invoke GET and you are on your way. Pretty much every programming language has library support for invoking an HTTP GET with an URL. One could expect a first semester programming student to be able to write a program to do that. Shoots, you can do it in a browser. The "barrier to entry" for this kind of API is really, really low.

Now consider a WSDL API. All of a sudden, you need to learn about SOAP, and you need to find out how to do Web Services in your chosen programming language, and you have to study the remote procedure calls that are available, and so forth. The "barrier to entry" is suddenly much higher: there are incompatible versions of SOAP, there's way more to learn, and I bet more than a few people will quickly decide to just bail and request direct access to the database, which cuts them out of 90% of the cool stuff in Hackystat.

So, from my reckoning, if we decided to use Axis/SOAP/WSDL in Version 8, we'd (1) continue to need to do all our own caching with all of the headaches that entails, and (2) we'd be stuck with a relatively complex interface to the data.

I want to emphasize that a RESTful architecture is more subtle than simply using GET, POST, PUT, and DELETE. For example, the following is probably not RESTful:

GET http://foo/bar/baz&action=delete

For more details, has a good intro with pointers to other readings.

Your email made another interesting assertion:

> what makes you decide not to publish
> a collection of WSDL and go along with more industrial standard web service calls?

Although I agree that WSDL is an "industry standard", this doesn't mean that REST isn't one as well. Indeed, my sense after a few weeks of research on the topic is that most significant industrial players have already moved to REST or offer REST as an alternative to WSDL: eBay, Google, Yahoo, Flickr, and Amazon all have REST-based services. I recall reading that the REST API gets far more traffic than the corresponding WSDL API for at least some of these services.

Finally, no architecture is a silver bullet, and REST is no exception. For example, if you can't effectively model your domain as a set of resources, or if the CRUD operations aren't a good fit with the kinds of manipulations you want to do, then REST isn't right. Another REST requirement is statelessness, which can be a problem for some applications. So far in my design process, however, I haven't run into any showstoppers for the case of Hackystat.

Version 8 is still in the early stages, and the advantages of REST are still hypothetical, so I'm really happy to have this conversation. There are no hard commitments to anything yet, and if there turns out to be a showstopping problem with REST, then we can of course make a change. The more we talk about it, the greater the odds we'll figure out the right thing.


Monday, April 16, 2007

Version 8 now appearing on Google Projects

I am happy to announce that the Hackystat Version 8 repository is starting to take shape as a set of related projects in Google Project Hosting.

The "hub" project is hackystat which does not contain any code, but does contain wiki pages with high level planning and design documents.

The hub project also contains links to individual Google Project Hosting projects that have been set up to manage implementation of some of the initial Version 8 services. These projects are: hackystat-sensorbase-uh, hackystat-sensor-shell, hackystat-sensor-xmldata, and hackystat-ui-sensordataviewer. These projects don't actually contain any code yet, either.

Note that there are conventions for naming the Hackystat Version 8 projects.

My first focus of attention is on the SensorBase, and I am currently trying to design the REST API for that service. Details at 11.

Friday, April 13, 2007

CSDL members, please read this immediately!

Now that we're two weeks into our blog experiment, I want to do a test of how fast and effectively information is disseminated through the group through this mechanism.

If you are in CSDL, and you are reading this, please reply to this blog posting immediately with the comment "I read it.". I want to find out the following:
  • Will everyone in CSDL reply to this comment? Who is actually reading other members' blogs?
  • How long does it take for a new blog comment to be read by other members of the group?
For this test, please do not verbally inform other CSDL members of this experiment.

I will display this blog entry and discuss the results at next Wednesday's meeting.

Tuesday, April 3, 2007

Java 5 Conversion Week

Having gotten through almost all of the Core subsystem, I think it's time to try allowing all the CSDL Hackystat Hackers to have some fun with Java 5 conversion. So, I'm declaring this week "Java 5 Conversion Week", and the goal is to update the last module in the Core subsystem and all of the code in the SDT subsystem so that no warnings remain in Eclipse (using the default setting for warnings generation, which also corresponds to the Java 5 compiler warnings setting).

The chart at left lists all of the modules to be worked on, the number of warnings currently present in each, and the developer assigned to eliminate the warnings from them. As you can see, there are about 900 warnings total, and six developers available to work on them, so that results in about 150 warnings per developer. Because it's simplest to assign the work on a module basis, I ended up giving Hongbing and me a little more to do than everyone else, but at the end of the day I don't think this will make much difference in the amount of time spent on the work.

I will create Jira issues for each developer listing the modules they are assigned and referencing this blog entry for more details. Also, please be sure to review my other blog entry on Java 5 Conversion for additional hints on how to carry out this process.

Good luck, have fun, send email if you run into problems, and feel free to send out an email when you've committed your last batch of changes!

Monday, April 2, 2007

Hackystat 8 and Net Option Values

In The Option Value of Modularity in Design, Carliss Baldwin and Kim Clark argue that appropriate modularity can favor the adoption of a system over other alternatives due to increases in the "net option value" of the design.

It occurs to me that Hackystat 8, by decomposing the current "black box" of the server into a set of loosely coupled, independently evolvable components, will enable new degrees of freedom in the design evolution of the system both within the CSDL research group and externally in the software engineering community.

Sunday, April 1, 2007

Archive TimeZone problem and its workaround

The problem we have been experiencing with the Archive listing being off from the blog entry listing is well known (and particularly prevalent in Hawaii!). See the thread here to follow the discussion and find out when the blogger folks implement the fix.

The temporary workaround is to use Pacific time, which of course makes the posting time off by a few hours. I've changed my blog to Pacific time and, voila, the problem disappears.

Thursday, March 29, 2007

Hackystat UI: Swivel Google Gadgets

Swivel is a site where users can upload data sets and combine contributed data sets in various ways.

What I discovered today is their interface to the Google Home Page via Google Gadgets. I think this is a nice example of how simple it could be to provide a Version 8 Hackystat user interface via Google Gadgets.

Check it out here.

Wednesday, March 28, 2007

REST and web services

Some useful links to understand Representational State Transfer:

REST seems like an appropriate architectural style for the Version 8 web service component.

The framework I am most interested in evaluating for Java-based REST components is Restlet.

Tuesday, March 27, 2007

Hackystat UI: Wesabe and Social Software Metrics

Robert recently pointed me to Wesabe, which is a social networking site focusing on personal finances. This is an interesting site to compare/contrast with Hackystat, since it:
  • Deals with numbers and "metrics".
  • Requires members to share aspects of very personal information (finances) in order to exploit the potential of social networks.
Their help guide is in the form of YouTube videos, which is a little weird (or maybe the wave of the future). I will show some blurry screen shots to illustrate some of the interesting aspects of this tool. This first one shows the top-level organization of your Wesabe account, which has three tags: Accounts, Tips, and Goals.

Accounts basically corresponds to your "raw sensor data" in Hackystat. In Wesabe, you are expected to upload your bank and credit card information.

Tips correspond to information supplied by other users based upon analysis of your account data. The idea is that the raw financial data is parsed to find out what you spend money on.

For example, if you have gas charges, then you will be hooked up with Tips on how to save money on gas. They use a keyword-based mechanism to hook together account data with tip data.

The tips could be generic (don't buy premium gas) or more specific (don't buy gas from the gas station you're going to; they are a rip-off).

Here's a screen shot of a drill-down into an account, along with the tips and keywords associated with it. Often, you will need to manually annotate your raw financial data in order for Wesabe to start to work its magic on it. You can also see from this screen shot that individual financial items can be rated, and you can also see whether other Wesabeans have recorded a similar kind of purchase.

While tips are a kind of "bottom up" mechanism for producing "actionable" information from your raw account data, goals are more of a top-down approach, in which you first specify your high level goal, and then you get hooked up with other users interested in the same approach.

In Wesabe, it seems that the main focus is to direct you into existing discussion forums rather than explicitly connect you to your financial data. For example, a goal would be something like "Start a College Savings fund for my kids."

So, how does this all relate to Hackystat? I think there are some intriguing possibilities. First, Hackystat currently allows data to be "shared" only within the context of a Project---if there are multiple members of a Project, then they can potentially see each other's data. Wesabe illustrates how you might think about "sharing" on a more global level. The idea is that you don't share the actual financial information: no one knows where you shopped or how much you spent, but via the social bookmarking mechanism, the system can hook you up with "tips" (specific actionable items) or "goals" (a community of people with the same intentions).

To explore how this might work, let's imagine some possible "Tips" from the realm of Java software development:
  • How to convert from Java 1.4 to Generics in Java 5 (See my previous blog posting on this.)
  • Proper use of concurrency mechanisms.
  • Diagnosing a null pointer problem.
Hmm. These tips all seem to require more context than is typically provided by Hackystat data. One could imagine a sensor data type that provides data on the import statements associated with a file you are editing. That would give some insight into the kinds of libraries you are using, which might enable you to be hooked up with helpful tips. Another sensor data type might provide the stack trace and error message associated with a thrown exception.

Now let's think of possible "Goals":
  • Reduce the number of daily build failures.
  • Reduce the time required for running all unit tests.
  • Improve the quality of code.
  • Improve the scalability of the database.
Some of these might be inferable from the kinds of telemetry charts you are monitoring, for example.

In any case, Wesabe indicates an interesting research direction for Hackystat: create the capability for users to add keywords to their data, and then process these keywords as a way to hook up users who have common interests and mutually useful skills.
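
As a rough sketch of what "processing these keywords" could mean, here is a tiny keyword index in Java. Everything in it (the class name, the methods, the idea that a shared keyword is enough to connect two users) is an assumption for illustration; it says nothing about how Wesabe works or how Hackystat would actually do it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical index from keyword to the users who attached it to their data. */
class KeywordMatcher {
  private final Map<String, Set<String>> usersByKeyword = new HashMap<String, Set<String>>();

  /** Records that the user tagged some of their data with the keyword. */
  void tag(String user, String keyword) {
    Set<String> users = usersByKeyword.get(keyword);
    if (users == null) {
      users = new TreeSet<String>();
      usersByKeyword.put(keyword, users);
    }
    users.add(user);
  }

  /** Returns the other users who share at least one keyword with the given user. */
  Set<String> usersSharingKeywordWith(String user) {
    Set<String> result = new TreeSet<String>();
    for (Set<String> users : usersByKeyword.values()) {
      if (users.contains(user)) {
        result.addAll(users);
      }
    }
    result.remove(user);
    return result;
  }

  public static void main(String[] args) {
    KeywordMatcher matcher = new KeywordMatcher();
    matcher.tag("alice", "generics");
    matcher.tag("bob", "generics");
    matcher.tag("carol", "concurrency");
    System.out.println(matcher.usersSharingKeywordWith("alice"));
  }
}
```

Note that only keywords are compared; the underlying sensor data never leaves its owner, which is the property that makes the Wesabe-style sharing attractive.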

Hackystat UI: Telemetry and Alexa charts

Alexa is a site that provides information relating to site traffic. Hongbing sent out a link recently to this site with a query as to how they could produce PNG charts so quickly. I assume that with enough CPU and network bandwidth, anything is possible. What I was personally struck by is their user interface, which rather elegantly supports a lot of the features we want from telemetry. Consider the following screen image from their site, which I have annotated with a 1, 2, 3, and 4.

UI Feature (1) is the tabs, which provide various perspectives on the set of sites. From the Telemetry perspective, this is analogous to a set of related Charts. Thus, what they've done is provided the equivalent of the Telemetry Report interface, but in a much nicer package. Instead of scrolling down through an endless series of charts, you click on a tab to see the related chart. Much, much nicer.

UI Feature (2) is the "Range" selection. This is analogous to our "Day", "Week", "Month" interval selection mechanism. While it is not as flexible as ours, it provides easier access to common interval requests: Last 7 days, Last 1 Month, Last 3 months, etc.

UI Feature (3) is the "See Traffic Details". This is analogous to our "Daily Project Summary" drilldown (or maybe a "Project Status To Date" analysis).

UI Feature (4) is the ability to easily add and subtract different trend lines. This is interesting when translated to the Telemetry domain, because it could be interpreted in one of two different ways: (a) add/subtract one or more telemetry streams, or (b) add/subtract one or more Projects. Indeed, we might want to think of providing both abilities: you could specify what telemetry streams you want to display, and then this set of telemetry streams would be displayed for each Project you specify. Thus the number of lines appearing on the chart would be the number of streams times the number of projects. In most cases, you will probably want to have one stream and multiple Projects, or multiple streams for one Project.
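
The "streams times projects" arithmetic can be sketched as a simple cross product. The class, method, and label format below are hypothetical, as are the project and stream names in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: one chart line per (project, stream) pair. */
class ChartLines {
  static List<String> lineLabels(List<String> projects, List<String> streams) {
    List<String> labels = new ArrayList<String>();
    for (String project : projects) {
      for (String stream : streams) {
        labels.add(project + ": " + stream);
      }
    }
    return labels;
  }

  public static void main(String[] args) {
    List<String> projects = Arrays.asList("ProjectA", "ProjectB");
    List<String> streams = Arrays.asList("Coverage", "Build failures", "Active time");
    List<String> labels = lineLabels(projects, streams);
    // Two projects times three streams gives six lines on the chart.
    System.out.println(labels.size() + " lines: " + labels);
  }
}
```

Six lines is already a busy chart, which is why the one-stream-many-Projects and many-streams-one-Project cases are probably the common ones.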

Monday, March 26, 2007

Software Development Communication Media

As I have maintained this online engineering log over the past couple of weeks, I have started to think more generally about the types of media used in a software development team to communicate and coordinate activities:

(1) Requirements and Design documents. These are relatively static documents, providing a high-level perspective. Relatively non-interactive.

(2) Mailing lists. Provide a forum for threaded communication. Generally ASCII oriented. Generally interactive.

(3) JavaDocs. Generated from the code itself. Communicates API-level information. Non-interactive. Context-sensitive with the code.

(4) Engineering logs. Semi-interactive way for an individual developer to document design issues, questions, and so forth. Others could potentially comment on log entries.

(5) CM Commit messages. Provides a record of the changes made to a set of sources.

(6) Issues. Provides a decomposition of the high level requirements/design documents into a set of tasks. Also records bugs found.

(7) IM. Generally non-persistent, highly interactive means for developers to get immediate, synchronous feedback and help.

(8) Face to face meetings.

Recently, I was thinking about an idea that I wanted to discuss with the Hackystat software development team, and I was unsure of how to do it:
  • Send it as an email to the developer's mailing list?
  • Write about it in my online engineering log?
  • Write about it in my online engineering log, then send a link to that entry in an email to the developer's mailing list?

Wednesday, March 21, 2007

Java 5 Conversion Notes

Having finished hackyCore_Kernel, my basic strategy for updating Hackystat code to 1.5 is currently the following:

(0) SVN update, then 'ant -q freshStart all.junit'. Make sure the system isn't busted before you start busting on it. (Be sure to configure the file to include the modules you will be working on.)

(1) Use Eclipse to identify a class containing at least one warning.

(2) Fix instance variables. Navigate to that class, then go to the top of the file and check for any collection classes as instance variables. If present, add type information. For example, change
private static TreeMap numOfPeriodsMap = null;
to
private static TreeMap<String, String> numOfPeriodsMap = null;
Note that this often requires some hunting through the file to determine the kinds of objects being added to the collection.

(3) Fix collection references. Continue through source code, updating references to collection classes to include type information. For example, change
numOfPeriodsMap = new TreeMap(new StringIntegerComparator());
to
numOfPeriodsMap = new TreeMap<String, String>(new StringIntegerComparator());

(4) Update comparators to include type information. For example, change
public class StringIntegerComparator implements Comparator {
  public int compare(Object o1, Object o2) {
to
public class StringIntegerComparator implements Comparator<String> {
  public int compare(String o1, String o2) {

(5) Update method signatures to include type information. For example, change
public static TreeMap getYearOptions() {
to
public static TreeMap<String, String> getYearOptions() {

(6) Remove occurrences of "old style" for loops. For example, change
Set analysisNames = manager.getAnalysisNames();
for (Iterator i = analysisNames.iterator(); i.hasNext();) {
  String analysisName = (String) i.next();
  String enabled = request.getParameter(analysisName);
to
Set<String> analysisNames = manager.getAnalysisNames();
for (String analysisName : analysisNames) {
  String enabled = request.getParameter(analysisName);
In some situations, it doesn't make sense to update them. For example, I've seen loops where the next() method was called twice in each body (the list contained "pairs" of objects that were operated on two at a time). In this case, you must leave it as an "old style" loop.

(7) Implement Iterable<T> if necessary. In some cases, to accomplish (6) you must have a class implement "Iterable". For example, the SdtManager class should enable you to iterate across all instances of SensorDataTypes using the following for/in loop:
for (SensorDataType sdt : SdtManager.getInstance())
To accomplish that, the SdtManager class had to be changed from
public class SdtManager {
  public Iterator iterator() {
to
public class SdtManager implements Iterable<SensorDataType> {
  public Iterator<SensorDataType> iterator() {

(8) Remove vestigial casts. After adding in the type information, you will hopefully be able to remove casts.

(9) Remove vestigial imports. When you're all done with a class, you will hopefully be able to remove imports. For example:
import java.util.Iterator;

(10) Use @SuppressWarnings for the serialVersionUID warning. There's just no reason to add this instance variable for Hackystat code; it will not be serialized.

(11) Continue until Eclipse reports no warnings for this class.

Note that sometimes this strategy requires working on several classes at once when they are interdependent.

(12) 'ant -q freshStart all.junit', then SVN commit. Be sure the system is OK, then commit your changes du jour. I found that my Java 5 updates would sometimes create Checkstyle errors, so be sure to do a 'freshStart'. (Double check that the file includes the modules you have been working on.)

Also, I'm cleaning up documentation, removing the @version tag that is a relic of the CVS days, etc. as I work on the class. As long as I'm touching it, I might as well make the JavaDocs better and fix any coding bogosities I encounter.
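
To pull the steps together, here is a small self-contained sketch of what converted code looks like. None of these classes are the real Hackystat ones: SensorDataType, SdtRegistry, NumericStringComparator, and ConversionException are hypothetical stand-ins, and the numeric ordering in the comparator is an assumed behavior chosen just to make the example do something visible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-in for a sensor data type; only a name is modeled. */
class SensorDataType {
  private final String name;

  SensorDataType(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }
}

/** Step (7): implementing Iterable<T> is what makes the for/in loop legal. */
class SdtRegistry implements Iterable<SensorDataType> {
  private final List<SensorDataType> sdts = new ArrayList<SensorDataType>();

  void add(SensorDataType sdt) {
    sdts.add(sdt);
  }

  public Iterator<SensorDataType> iterator() {
    return sdts.iterator();
  }
}

/** Step (4): typed comparator, no casts. Numeric ordering is an assumed behavior. */
class NumericStringComparator implements Comparator<String> {
  public int compare(String o1, String o2) {
    return Integer.valueOf(o1).compareTo(Integer.valueOf(o2));
  }
}

/** Step (10): suppress the serialVersionUID warning on a class that is never serialized. */
@SuppressWarnings("serial")
class ConversionException extends RuntimeException {
  ConversionException(String message) {
    super(message);
  }
}

class Java5ConversionSketch {
  /** Steps (2), (3), (5): typed declaration, typed constructor call, typed return type. */
  static TreeMap<String, String> buildPeriods() {
    TreeMap<String, String> periods =
        new TreeMap<String, String>(new NumericStringComparator());
    periods.put("10", "ten");
    periods.put("2", "two");
    periods.put("7", "seven");
    return periods;
  }

  /** Step (6): for/in loop over an Iterable, with no Iterator variable and no cast. */
  static List<String> names(SdtRegistry registry) {
    List<String> result = new ArrayList<String>();
    for (SensorDataType sdt : registry) {
      result.add(sdt.getName());
    }
    return result;
  }

  public static void main(String[] args) {
    SdtRegistry registry = new SdtRegistry();
    registry.add(new SensorDataType("Commit"));
    registry.add(new SensorDataType("UnitTest"));
    System.out.println(names(registry));
    System.out.println(buildPeriods().keySet());
  }
}
```

The token "serial" is the one both javac and Eclipse recognize for the missing serialVersionUID warning. The sketch sticks to Java 5 syntax on purpose (explicit type arguments on the right-hand side, no later shorthand).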