Friday, December 14, 2007

If Collective Intelligence is the question, then dashboards are not the answer

I've been thinking recently about collective intelligence and how it applies to software engineering in general and software metrics in particular.

My initial perspective on collective intelligence is that it provides an organizing principle for a process in which a small "nugget" of information is somehow made available to one or more people, who then refine it, communicate it to others, or discard it as the case may be. Collective Intelligence results when mechanisms exist to prevent these nuggets from being viewed as spam (and the people communicating them from being viewed as spammers), along with mechanisms to support the refinement and elaboration of the initial nugget of information into an actual "chunk" of insight or knowledge. Such chunks, as they grow over time, enable long-term organizational learning from collective intelligence processes.

Or something like that. What strikes me about the software metrics community and literature is that when it comes to how "measures become knowledge", the most common approach seems to be:
  1. Management hires some people to be the "Software Process Group"
  2. The SPG goes out and measures developers and artifacts somehow.
  3. The SPG returns to their office and generates some statistics and regression lines.
  4. The SPG reports to management about the "best practices" they have discovered, such as "The optimal code inspection review rate is 200 LOC per hour".
  5. Management issues a decree to developers that, from now on, they are to review code at the rate of 200 LOC per hour.
I'm not sure what to call this, but collective intelligence does not come to mind.

When we started work on Hackystat 8, it became clear that there were new opportunities to integrate our system with technologies like Google Gadgets, Twitter, Facebook, and so forth. I suspect that some of these integrations, like Google Gadgets, will turn out to be little more than neat hacks with only minor impact on the usability of the system. My conjecture is that the Hackystat killer app has nothing to do with "dashboards"; most modern metrics collection systems provide dashboards and people can ignore them just as easily as they ignore their web app predecessors.

On the other hand, integrating Hackystat with a communication facility like Twitter or with a social networking application has much more profound implications, because these integrations have the potential to create brand new ways to coordinate, communicate, and annotate an initial "nugget" generated by Hackystat into a "chunk" of knowledge of wider use to developers. It could also work the other way: an anecdotal "nugget" generated by a developer ("Hey folks, I think that we should all run verify before committing our code to reduce continuous integration build failures") could be refined into institutional knowledge (a telemetry graph showing the relationship between verify-before-commit and build success), or discarded (if the telemetry graph shows no relationship).

Thursday, December 6, 2007

Social Networks for Software Engineers

I've been thinking lately about social networks, and what kind of social network infrastructure would attract me as a software engineer. Let's assume, of course, that my development processes and products can be captured via Hackystat and made available in some form to the social network. Why would this be cool?

The first reason would be because the social network could enable improved communication and coordination by providing greater transparency into the software development process. For example:
  • Software project telemetry would reveal the "trajectory" of development with respect to various measures, helping to reveal potential bottlenecks and problems earlier in development.
  • Integration with Twitter could support automated "tweets" informing the developers when events of interest occur.
  • An annotated Simile/Timeline representation of the project history could help developers understand and reflect upon a project and what could be done to improve it.

I'm not sure, however, that this is enough for the average developer. Where things get more interesting is when you realize that Hackystat is capable of developing a fairly rich representation of an individual developer's skills and knowledge areas.

As a simple example, when a Java programmer edits a class file, the set of import statements reveals the libraries being used in that file, and thus the libraries this developer has some familiarity with, because he or she is using those libraries to implement the class in question. When a Java programmer edits a class file, they are also using some kind of editor---Emacs, Eclipse, Idea, NetBeans---and thus revealing some level of expertise with that environment. Indeed, Hackystat sensors can not only capture knowledge like "I've used the Eclipse IDE over 500 hours during the past year", but even things like "I know how to invoke the inspector and trace through functions in Eclipse", or "I've never once used the refactoring capabilities." Of course, Hackystat sensors can also capture information about what languages you write programs in, what operating systems you are familiar with, what other development tools you know about, and so forth. Shoots, Hackystat could even keep a record of the kinds of exceptions your code has generated.
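
To make this concrete, here is a minimal sketch (in Java, since that is what Hackystat itself is written in) of how a sensor-side analysis might tally library familiarity from import statements. None of the class or method names below come from the actual Hackystat sensor API; they are purely illustrative.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Tallies the package prefixes imported by the Java files a developer edits. */
    public class ImportFamiliarity {

      /** Returns a map from package prefix (e.g. "java.util") to occurrence count. */
      public static Map<String, Integer> tally(List<String> editedFiles) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String fileName : editedFiles) {
          BufferedReader reader = new BufferedReader(new FileReader(fileName));
          try {
            String line;
            while ((line = reader.readLine()) != null) {
              line = line.trim();
              if (line.startsWith("import ") && line.endsWith(";")) {
                // "import java.util.List;" -> "java.util.List"
                String imported = line.substring("import ".length(), line.length() - 1).trim();
                if (imported.startsWith("static ")) {
                  imported = imported.substring("static ".length()).trim();
                }
                int lastDot = imported.lastIndexOf('.');
                String prefix = (lastDot > 0) ? imported.substring(0, lastDot) : imported;
                Integer old = counts.get(prefix);
                counts.put(prefix, (old == null) ? 1 : old + 1);
              }
            }
          }
          finally {
            reader.close();
          }
        }
        return counts;
      }
    }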

Let's assume that all of this information can be processed and made available to you as, say, a FaceBook Application. And, you can edit the automatically generated profile to remove any skills you don't want revealed. You might also be able to annotate the entries to provide explanatory details. You can provide details about yourself, such as "Student" or "Professional", and also your level of "openness" to the community. After all that's done, you press "Publish" and this becomes part of your FaceBook or OpenSocial profile.
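
For illustration, here is a hypothetical sketch of what such an editable, publishable profile might look like as a data structure. The class names, the "openness" field, and the publish/hide flag are all assumptions of mine, not anything Hackystat currently provides.

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical editable, publishable skill profile (not an actual Hackystat class). */
    public class DeveloperProfile {

      /** One automatically inferred skill, e.g. "Eclipse debugger: 120 hours this year". */
      public static class Skill {
        public final String description;
        public String annotation = "";    // developer-supplied explanation
        public boolean published = true;  // the developer can hide any entry

        public Skill(String description) {
          this.description = description;
        }
      }

      private final List<Skill> skills = new ArrayList<Skill>();
      public String role = "Student";        // e.g. "Student" or "Professional"
      public String openness = "Community";  // how widely the profile is shared

      public void add(Skill skill) {
        skills.add(skill);
      }

      /** Returns only the entries the developer has chosen to reveal. */
      public List<Skill> publishedSkills() {
        List<Skill> visible = new ArrayList<Skill>();
        for (Skill skill : skills) {
          if (skill.published) {
            visible.add(skill);
          }
        }
        return visible;
      }
    }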

So what?

Well, how about the following scenarios:

[1] I'm a student and just encountered a weird Exception. I search the community for others with experience with this Exception. I find three people, send them an IM, and shortly thereafter one of them gives me a tip on how to debug it.

[2] I'm interested in developing a Fortress mode for Emacs, but don't want to do it alone. I search the community for developers with both expertise in Fortress and Emacs, and contact them to see if they want to work with me on such a mode.

[3] I'm an employer and am interested in developers with a demonstrated experience with compiler development for a short-term, well paying consulting position. I need people who don't require any time to come up to speed on my problem; I don't want to hire someone who took compilers 10 years ago in college and hasn't thought about it since. I search the community, and find a set of candidates who have extensive, recent experience using Lex, YACC, and JavaCC. I contact them to see if they would be interested in a consulting arrangement.

[4] I'm a student who has been participating in open source projects and making extensive contributions, but has never had a "real" job. I need a way to convince employers that I have significant experience. I provide a pointer in my resume to my profile, showing that I have thousands of hours of contributions to the Apache web server and Python language projects.

Hackystat is often thought of as a measurement system, and indeed all of the above capabilities result from measurement. However, the above doesn't feel like measurement; it feels like social coordination and communication of a relatively sophisticated and efficient nature.

Monday, November 5, 2007

Measurement as Mashup, Ambient Devices, Social Networks, and Hackystat

The new architecture of Hackystat has me thinking about new metaphors for software engineering measurement. Indeed, it has me wondering whether where we are heading is even best characterized as "measurement" or even as "software engineering". Alistair Cockburn, for example, has written an article on The End Of Software Engineering in which he challenges the use of the term "software engineering" as an appropriate description for what people do (or should do) when developing software.

Similarly, when we began work on Hackystat six years ago, I thought about this system in fairly conventional terms: it was basically a way to make it simpler to gather more accurate measures that could be used for traditional software engineering measurement activities such as baselines, prediction, control, and quality assessment.

One interesting and unintended side effect of the Hackystat 8 architecture, in which the system becomes a seamless component of the overall internet information ecosystem via a set of RESTful web services, is a re-examination of my fundamental conceptions of what the system could and should be. In particular, two Web 2.0 concepts, "mashup" and "social network", provide interesting metaphors.

Measurement as Mashup

Hackystat has always embraced the idea of "mashup". From the earliest days, we have pursued the hypothesis that there is a "network effect" in software process and product metrics; that the more orthogonal measures you could gather about a system, the more potential you would gain for insight as you obtained the ability to compare and contrast them. Thus, we created a system that was easily extensible with sensors for different tools that could gather data of different types.

Software Project Telemetry is an early result of our search for ways to obtain meaning within this network effect. In Software Project Telemetry, we created a language that enables users to easily create "measurement mashups" consisting of metrics and their trends over time. The following screen image shows an example mashup, in which we opportunistically discovered a covariance between build failures and code churn over time for a specific project:



Hackystat 8 creates new opportunities for mashups, because we can now integrate this kind of data with other data collection and visualization systems. As one example, we are exploring the use of Twitter as a data collection and communication mechanism. Several members of the Hackystat 8 development group "follow" each other with Twitter and post information about their software development activities (among other things) as a way to increase awareness of project state. Here's a recent screen image showing some of our posts:



There are at least two interesting directions for Twitter/Hackystat mashups. Assuming that members of a project team are twitter-enabled, we can provide a Hackystat service that monitors the data being collected from sensors and sends periodic "tweets" that answer the question "What are you doing now?" for individual developers and/or the project as a whole. Going the other direction, we can gather "tweets" as data that we can display on a Simile/Timeline with our metrics values, which provides an interesting approach to integrating qualitative and quantitative data.
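
As a rough sketch of the first direction, the fragment below summarizes recently collected sensor data into a tweet-sized status message. The SensorEvent and TwitterClient types are placeholders I invented for this example, not actual Hackystat or Twitter library classes, and a real service would also need scheduling and authentication.

    import java.util.List;

    /** Summarizes recent sensor data into a tweet-sized status message. */
    public class DevEventTweeter {

      /** Placeholder for one unit of recently collected sensor data. */
      public static class SensorEvent {
        public final String tool;      // e.g. "Eclipse", "Ant"
        public final String resource;  // e.g. "SensorShell.java"

        public SensorEvent(String tool, String resource) {
          this.tool = tool;
          this.resource = resource;
        }
      }

      /** Placeholder for whatever client actually posts the status update. */
      public interface TwitterClient {
        void update(String status);
      }

      /** Posts one "What are you doing now?" summary for the given developer. */
      public static void tweetSummary(String developer, List<SensorEvent> recent, TwitterClient twitter) {
        if (recent.isEmpty()) {
          return;  // nothing to report this period
        }
        SensorEvent latest = recent.get(recent.size() - 1);
        String status = developer + " is working on " + latest.resource + " in " + latest.tool
            + " (" + recent.size() + " sensor events this hour)";
        // Twitter statuses are limited to 140 characters.
        twitter.update(status.length() <= 140 ? status : status.substring(0, 140));
      }
    }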

A second form of mashup is the use of internet-enabled ambient devices such as Ambient Orbs or Nabaztag. The idea here is to get away from the use of the browser (or even the monitor) as the sole interface to Hackystat information and analyses. Instead, we could move toward Mark Weiser's vision of calm technology, or "that which informs but doesn't demand our focus or attention".

The net of all this is that Hackystat is evolving from a kind of "local" capability for mashups represented by software project telemetry to a new "global" capability for mashups in which Hackystat can act as a first class citizen of the internet information infrastructure.

Software development as social network

Google is releasing an API for social networking called OpenSocial. This API essentially enables you to (a) create profiles of users, (b) publish and subscribe to events, and (c) maintain persistent data. You can use Google Gears to maintain data on client computers, and thus create more scalable systems. Google intends this as a way for developers to create third-party applications that can run within multiple social networks (MySpace, Orkut), as well as enable users to maintain, transfer, and/or integrate data across these networks.

So. What would Hackystat look like, and what would it do, if it was implemented using OpenSocial?

First, in contrast to the current analysis focus of Hackystat, in which the concept of a "project" is a very important organizing principle, I think that in an OpenSocial world you might not be so interested in a project-based orientation for analyses. Instead, the emphasis would be much more on the individual and their behaviors across, and independent of, projects.

For example, your Hackystat OpenSocial "profile" might include analysis results like: "I worked for three hours hacking Java code yesterday", or "I have a lot of experience with the Java 2D API", or "I use test driven design practices 80% of the time". All of these might be interesting to others in your social network as a representation of what you are doing currently and/or are capable of doing in the future. The process/product characteristics of the projects that you work on might be less important in an OpenSocial profile for, I think, two reasons: (a) it is harder to understand the individual's contributions in the context of project-level analyses; and (b) project data might "give away" information that the employer of the developer does not want published.
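
As a hedged sketch of how a statement like the TDD percentage might be computed, suppose each development episode reconstructed from sensor data is classified by whether the test class was edited before the corresponding production class. The Episode type and that classification rule are simplifying assumptions of mine, not the actual Hackystat TDD analysis.

    import java.util.List;

    /** Computes a "percent of episodes that look test-driven" profile statistic. */
    public class TddProfileStatistic {

      /** Placeholder for one development episode reconstructed from sensor data. */
      public static class Episode {
        public final boolean testEditedFirst;  // did editing start in the test class?

        public Episode(boolean testEditedFirst) {
          this.testEditedFirst = testEditedFirst;
        }
      }

      /** Returns the percentage (0-100) of episodes classified as test-driven. */
      public static int tddPercentage(List<Episode> episodes) {
        if (episodes.isEmpty()) {
          return 0;
        }
        int tdd = 0;
        for (Episode episode : episodes) {
          if (episode.testEditedFirst) {
            tdd++;
          }
        }
        return (100 * tdd) / episodes.size();
      }
    }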

Which brings me to a second conjecture: issues of data privacy or "sanitization" will become much more important for social network software engineering using a system like OpenSocial. To make the example analyses I listed above, it must be possible to collect detailed data about your activities as a developer (sufficient, for example, to infer TDD behaviors), yet publish them at an abstract enough level that no proprietary information is being revealed. That is a fascinating trade-off that will require a great deal of study and research. The implications are both technical and social.

Monday, October 29, 2007

The Mismeasurement of Science

Peter Lawrence has written an interesting article on the (mis)use of measurement to assess the "quality" and/or "impact" of scientists. It's called The Mismeasurement of Science, and appeared in Current Biology 17(15): R583, August 7, 2007. You can download it here.

Highly recommended reading, not only for scientists, but also as another interesting example of how a simple-minded approach to measuring "quality" or "productivity" has a wide range of dysfunctional implications. I particularly liked the following:

The measures seemed, at first, rather harmless, but, like cuckoos in a nest, they have grown into monsters that threaten science itself. Already, they have produced an “audit society” in which scientists aim, and indeed are forced, to put meeting the measures above trying to understand nature and disease.

I suspect that a similarly simple-minded application of software engineering measures (such as Active Time in Hackystat) would have similarly disastrous consequences were anyone to actually take them seriously.

Sunday, October 7, 2007

Hackystat and Crap4J

The folks at Agitar, who clearly have a sense of humor in addition to being excellent hackers, have recently produced a plug-in for Eclipse called Crap4J that calculates a measure of your code's "crappiness".

From their web page:

There is no fool-proof, 100% objective and accurate way to determine if a particular piece of code is crappy or not. However, our intuition – backed by research and empirical evidence – is that unnecessarily complex and convoluted code, written by someone else, is the code most likely to elicit a “This is crap!” response. If the person looking at the code is also responsible for maintaining it going forward, the response typically changes into “Oh crap!”

Since writing automated tests (e.g., using JUnit) for complex code is particularly hard to do, crappy code usually comes with few, if any, automated tests. The presence of automated tests implies not only some degree of testability (which in turn seems to be associated with better, or more thoughtful, design), but it also means that the developers cared enough and had enough time to write tests – which is a good sign for the people inheriting the code.

Because the combination of complexity and lack of tests appear to be good indicators of code that is potentially crappy – and a maintenance challenge – my Agitar Labs colleague Bob Evans and I have been experimenting with a metric based on those two measurements. The Change Risk Analysis and Prediction (CRAP) score uses cyclomatic complexity and code coverage from automated tests to help estimate the effort and risk associated with maintaining legacy code. We started working on an open-source experimental tool called “crap4j” that calculates the CRAP score for Java code. We need more experience and time to fine tune it, but the initial results are encouraging and we have started to experiment with it in-house.

Here's a screenshot of Crap4J after a run over the SensorShell service code:


Immediately after sending this link to the Hackystat Hackers, a few of us started playing with it. While the metric seems intuitively appealing (and requires one to use lots of bad puns when reporting on the results), its implementation as an Eclipse plugin is quite limiting. We have found, for example, that the plugin fails on the SensorBase code, not through any fault of the SensorBase code (whose unit tests run quite happily within Eclipse and Ant) but seemingly because of some interaction with Agitar's auto-test invocation or coverage mechanism.

Thus, this seems like an opportunity for Hackystat. If we implement the CRAP metric as a higher level analysis (for example, at the Daily Project Data level), then any combination of tools that send Coverage data and FileMetric data (that provides cyclomatic complexity) can produce CRAP. Hackystat can thus measure CRAP independently of Eclipse or even Java.
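
A minimal sketch of such an analysis appears below. It uses the published CRAP formula as I understand it, comp(m)^2 * (1 - cov(m))^3 + comp(m), where comp(m) is the cyclomatic complexity of method m and cov(m) is its test coverage as a fraction between 0 and 1; the MethodData type is a hypothetical stand-in for whatever the Coverage and FileMetric DPD resources would actually supply.

    /** Tool-independent CRAP computation over per-method complexity and coverage data. */
    public class CrapAnalysis {

      /** Hypothetical per-method data assembled from Coverage and FileMetric sensor data. */
      public static class MethodData {
        public final String name;
        public final int complexity;   // cyclomatic complexity, from a FileMetric-style sensor
        public final double coverage;  // fraction of the method covered by tests, 0.0 to 1.0

        public MethodData(String name, int complexity, double coverage) {
          this.name = name;
          this.complexity = complexity;
          this.coverage = coverage;
        }
      }

      /** Returns the CRAP score: comp^2 * (1 - cov)^3 + comp. */
      public static double crap(MethodData method) {
        double uncovered = 1.0 - method.coverage;
        return Math.pow(method.complexity, 2) * Math.pow(uncovered, 3) + method.complexity;
      }
    }

The variant discussed below, which folds in coupling and cohesion from Dependency sensor data, would amount to adding another term to crap().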

The Agitar folks go on to say:

We are also aware that the CRAP formula doesn’t currently take into account higher-order, more design-oriented metrics that are relevant to maintainability (such as cohesion and coupling).

Here is another opportunity for Hackystat: it would be trivial, once we have a DPD analysis that produces CRAP, to provide a variant CRAP calculation that factors in Dependency sensor data (which provides measures of coupling and cohesion).

Then we could do a simple case study in which we run these two measures of CRAP over a code base, order the classes in the code base by their level of crappiness according to the two measures, and ask experts to assess which ordering appears to be more consistent with the code's "True" crappiness.

I think such a study could form the basis for a really crappy B.S. or M.S. Thesis.

Friday, October 5, 2007

Fast Fourier Telemetry Transforms

Dan Port, who has a real talent for coming up with interesting Hackystat research projects, sent me the following in an email today:

Some while back I had started thinking about automated ways to analyze telemetry data and it occurred to me that maybe we should be looking at independent variables other than time. That is, we seem to look at some metric vs. time for the most part. This time-series view is very tough to analyze, and it is especially hard to automate the recognition of interesting patterns. While dozing off at a conference last week something hit me (no, not my neighbor waking me up). What if we transformed the time-series telemetry data stream into.... a frequency-series. That is, do an FFT on the data. This would be like looking at the frequency spectrum of an audio stream (although MUCH simpler).

I am extremely naive about FFT, but my sense is that this approach basically converts a telemetry stream into a 'fingerprint' which is based upon oscillations. My hypothesis is that this fingerprint could represent, in some sense, a development 'process'.

If that's so, then the next set of questions might be:
  • Is this 'process' representation stable? Would we get the same/similar FFT later or from a different group?
  • Is this process representation meaningful? Are the oscillations indicative of anything useful/important about the way people work? Does this depend upon the actual kind of telemetry data that is collected?
It would be nice to come up with some plausible ideas for telemetry streams that might exhibit 'meaningful' oscillations as a next step in investigating this idea.
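
To make the idea slightly more concrete, here is a naive sketch that treats a telemetry stream (one value per day, evenly sampled) as a signal and computes its magnitude spectrum. For clarity it uses a direct O(n^2) discrete Fourier transform rather than a true FFT; a real implementation would use an FFT library, and how to window and detrend the stream is an open question.

    /** Magnitude spectrum of an evenly sampled (e.g. daily) telemetry stream. */
    public class TelemetrySpectrum {

      /** Returns the magnitude of each frequency bin, using a direct O(n^2) DFT. */
      public static double[] magnitudeSpectrum(double[] dailyValues) {
        int n = dailyValues.length;
        double[] magnitudes = new double[n];
        for (int k = 0; k < n; k++) {
          double real = 0.0;
          double imag = 0.0;
          for (int t = 0; t < n; t++) {
            double angle = 2.0 * Math.PI * k * t / n;
            real += dailyValues[t] * Math.cos(angle);
            imag -= dailyValues[t] * Math.sin(angle);
          }
          magnitudes[k] = Math.sqrt(real * real + imag * imag);
        }
        return magnitudes;
      }
    }

A peak in such a spectrum at, say, the seven-day period would suggest a weekly rhythm in the underlying measure, which is the kind of "meaningful oscillation" to look for.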

Monday, October 1, 2007

The Perils Of Emma

I just had my software engineering class do reviews of each other's code where they analyzed the quality of testing from both a white box and a black box perspective. (Recall that in black box testing, you look at the specifications for the system and write tests to verify it obeys the spec. In white box testing, you test based on the code itself and focus on getting high coverage. That's a bit simplified, but you get the drift.)

After they did that analysis, the next thing I asked them to do was "break da buggah"---write a test case or two that would crash the system.

Finally, they had to post their reviews to their engineering logs.

Many experienced a mini-epiphany when they discovered how easy it was to break the system---even when it had high coverage. The point that this drove home (reinforcing their readings for the week that included How To Misuse Code Coverage) is that if you just write test cases to exercise your code and improve your coverage, those test cases aren't likely to result in "strong" code that works correctly under both normal and (slightly) abnormal conditions. Some programs with high coverage didn't even implement all of the required command line parameters!

Code coverage tools like Emma are dangerously addictive: they produce a number that appears to be related to code and testing quality. The reality is that writing tests merely to improve coverage can potentially be a waste of time and even counterproductive: it makes the code look like it's well tested when in fact it's not.
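
Here is a contrived illustration of the trap (hypothetical code, not from the class projects). The single test executes every line of parsePercentage(), so a coverage tool reports 100%, yet the method still misbehaves on inputs like "abc" or "150" because no test ever asks about them.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    /** A method with 100% line coverage from its one test, yet obviously under-tested. */
    public class CoverageTrapExample {

      /** Intended to accept only "0" through "100", but never checks its input. */
      public static int parsePercentage(String arg) {
        return Integer.parseInt(arg);  // throws on "abc"; happily returns 150 for "150"
      }

      @Test
      public void testParsePercentage() {
        // This single happy-path case is enough for 100% line coverage.
        assertEquals(42, parsePercentage("42"));
      }
    }

A black box perspective, starting from the spec ("accept only 0 through 100"), would immediately suggest the missing tests.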

Write tests primarily from a black box perspective. (Test-Driven Design gets you this "for free", since you're writing the tests before the code for which you'd be computing coverage even exists.)

When you think you've tested enough, run Emma and see what she thinks of your test suite. If the numbers aren't satisfying, you might want to close your browser window immediately and think harder about your program and the behaviors it can exhibit of which you're not yet aware. Resist the temptation to "cheat" and navigate down to find out which classes, methods, and lines of code weren't exercised. As soon as you do that, you've left the realm of black box testing and are now simply writing tests for Emma's sake.

And that's not a good place to be.