Monday, October 29, 2007

The Mismeasurement of Science

Peter Lawrence has written an interesting article on the (mis)use of measurement to assess "quality" and/or "impact" of scientists. It's called The Mismeasurement of Science, and appeared in Current Biology, August 7, 2007: 17 (15), r583. You can download it here.

Highly recommended reading, not only for scientists, but also as another interesting example of how a simple-minded approach to measuring "quality" or "productivity" has a wide range of dysfunctional implications. I particularly liked the following:

The measures seemed, at first rather harmless, but, like cuckoos in a nest, they have grown into monsters that threaten science itself. Already, they have produced an “audit society” in which scientists aim, and indeed are forced, to put meeting the measures above trying to understand nature and disease.

I suspect that similarly simple minded application of software engineering measures (such as Active Time in Hackystat) would have similarly disastrous consequences were anyone to actually take them seriously.

Sunday, October 7, 2007

Hackystat and Crap4J

The folks at Agitar, who clearly have a sense of humor in addition to being excellent hackers, have recently produced a plug-in for Eclipse called Crap4J that calculates a measure of your code's "crappiness".

From their web page:

There is no fool-proof, 100% objective and accurate way to determine if a particular piece of code is crappy or not. However, our intuition – backed by research and empirical evidence – is that unnecessarily complex and convoluted code, written by someone else, is the code most likely to elicit a “This is crap!” response. If the person looking at the code is also responsible for maintaining it going forward, the response typically changes into “Oh crap!”

Since writing automated tests (e.g., using JUnit) for complex code is particularly hard to do, crappy code usually comes with few, if any, automated tests. The presence of automated tests implies not only some degree of testability (which in turn seems to be associated with better, or more thoughtful, design), but it also means that the developers cared enough and had enough time to write tests – which is a good sign for the people inheriting the code.

Because the combination of complexity and lack of tests appear to be good indicators of code that is potentially crappy – and a maintenance challenge – my Agitar Labs colleague Bob Evans and I have been experimenting with a metric based on those two measurements. The Change Risk Analysis and Prediction (CRAP) score uses cyclomatic complexity and code coverage from automated tests to help estimate the effort and risk associated with maintaining legacy code. We started working on an open-source experimental tool called “crap4j” that calculates the CRAP score for Java code. We need more experience and time to fine tune it, but the initial results are encouraging and we have started to experiment with it in-house.

Here's a screenshot of Crap4J after a run over the SensorShell service code:

Immediately after sending this link to the Hackystat Hackers, a few of us started playing with it. While the metric seems intuitively appealing (and requires one to use lots of bad puns when reporting on the results), its implementation as an Eclipse plugin is quite limiting. We have found, for example, that the plugin fails on the SensorBase code, not through any fault of the SensorBase code (whose unit tests run quite happily within Eclipse and Ant) but seemingly because of some interaction with Agitar's auto-test invocation or coverage mechanism.

Thus, this seems like an opportunity for Hackystat. If we implement the CRAP metric as a higher level analysis (for example, at the Daily Project Data level), then any combination of tools that send Coverage data and FileMetric data (that provides cyclomatic complexity) can produce CRAP. Hackystat can thus measure CRAP independently of Eclipse or even Java.

The Agitar folks go on to say:

We are also aware that the CRAP formula doesn’t currently take into account higher-order, more design-oriented metrics that are relevant to maintainability (such as cohesion and coupling).

Here is another opportunity for Hackystat: it would be trivial, once we have a DPD analysis that produces CRAP, to provide a variant CRAP calculation that factors in Dependency sensor data (which provides measures of coupling and cohesion).

Then we could do a simple case study in which we run these two measures of CRAP over a code base, order the classes in the code base by their level of crappiness according to the two measures, and ask experts to assess which ordering appears to be more consistent with the code's "True" crappiness.

I think such a study could form the basis for a really crappy B.S. or M.S. Thesis.

Friday, October 5, 2007

Fast Fourier Telemetry Transforms

Dan Port, who has a real talent for coming up with interesting Hackystat research projects, sent me the following in an email today:

Some while back I had started thinking about automated ways analyze telemetry data and it occurred to me that maybe we should be looking at independent variables other than time. That is, we seem to look at some metric vs. time for the most part. This time series view is very tough to analyze and especially automate the recognition of interesting patterns. While dozing off at a conference last week something hit me (no, not my neighbor waking me up). What if we transformed the time-series telemetry data stream into.... a frequency-series. That is, do a FFT on the data. This would be like looking at the frequency spectrum of an audio stream (although MUCH simpler).

I am extremely naive about FFT, but my sense is that this approach basically converts a telemetry stream into a 'fingerprint' which is based upon oscillations. My hypothesis is that this fingerprint could represent, in some sense, a development 'process'.

If that's so, then the next set of questions might be:
  • Is this 'process' representation stable? Would we get the same/similar FFT later or from a different group?
  • Is this process representation meaningful? Are the oscillations indicative of anything useful/important about the way people work? Does this depend upon the actual kind of telemetry data that is collected?
It would be nice to come up with some plausible ideas for telemetry streams that might exhibit 'meaningful' oscillations as a next step in investigating this idea.

Monday, October 1, 2007

The Perils Of Emma

I just had my software engineering class do reviews of each others code where they analyzed the quality of testing from both a white box and a black box perspective. (Recall that in black box testing, you look at the specifications for the system and write tests to verify it obeys the spec. In white box testing, you test based on the code itself and focus on getting high coverage. That's a bit simplified but you get the drift.)

After they did that analysis, the next thing I asked them to do was "break da buggah"---write a test case or two that would crash the system.

Finally, they had to post their reviews to their engineering logs.

Many experienced a mini-epiphany when they discovered how easy it was to break the system---even when it had high coverage. The point that this drove home (reinforcing their readings for the week that included How To Misuse Code Coverage) is that if you just write test cases to exercise your code and improve your coverage, those test cases aren't likely to result in "strong" code that works correctly under both normal and (slightly) abnormal conditions. Some programs with high coverage didn't even implement all of the required command line parameters!

Code coverage tools like Emma are dangerously addictive: they produce a number that appears to be related to code and testing quality. The reality is that writing tests merely to improve coverage can potentially be a waste of time and even counterproductive: it makes the code look like it's well tested when in fact it's not.

Write tests primarily from a black box perspective. (Test-Driven Design gets you this "for free", since you're writing the tests before the code exists for which you be computing coverage.)

When you think you've tested enough, run Emma and see what she thinks of your test suite. If the numbers aren't satisfying, you might want to close your browser window immediately and think harder about your program and the behaviors it can exhibit of which you're not yet aware. Resist the temptation to "cheat" and navigate down to find out which classes, methods, and lines of code weren't exercised. As soon as you do that, you've left the realm of black box testing and are now simply writing tests for Emma's sake.

And that's not a good place to be.