Showing posts with label Version8. Show all posts

Tuesday, May 8, 2007

JAXB for Hackystat for Dummies

I spent today working through the XML/Java conversion process for SensorBase resources, and it occurred to me near the end that my struggles could significantly shorten the learning curve for others writing higher level services that consume SensorBase data (such as the UI services being built by Alexey, Pavel, and David.)

So, I did a quick writeup on the approach, in which I refer to a library jar file I have made available as the first SensorBase download.

After so many years using JDOM, which was nice in its own way, it is great to move onward to an even faster, simpler, and easier approach.
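To give a flavor of the approach, here is a minimal JAXB sketch. The class name and field are my own invention for illustration; the real SensorBase classes are generated from the XSDs rather than written by hand. An annotated class round-trips between Java and XML with no hand-written parsing code:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical resource class, not the actual SensorBase code.
@XmlRootElement(name = "SensorDataType")
public class SensorDataTypeExample {
  @XmlAttribute(name = "Name")
  public String name;

  public static void main(String[] args) throws Exception {
    SensorDataTypeExample sdt = new SensorDataTypeExample();
    sdt.name = "SampleSDT";

    // Java -> XML: marshal the object into an XML string.
    JAXBContext context = JAXBContext.newInstance(SensorDataTypeExample.class);
    StringWriter writer = new StringWriter();
    context.createMarshaller().marshal(sdt, writer);
    String xml = writer.toString();

    // XML -> Java: unmarshal the string back into an object.
    SensorDataTypeExample roundTrip = (SensorDataTypeExample)
        context.createUnmarshaller().unmarshal(new StringReader(xml));
    System.out.println(roundTrip.name);  // prints "SampleSDT"
  }
}
```

Compare this to JDOM, where you would walk the element tree by hand for every resource type.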

Friday, May 4, 2007

SensorBase coding has begun!

To my great delight (given that the May 15 milestone is rapidly approaching) I have committed my first bit of SensorBase code today.

Some interesting tidbits:

First, I am continuing to observe the Hackystat tradition of always including a reference to an Issue in the SVN commit message. In this case, the reference looks like:

http://code.google.com/p/hackystat-sensorbase-uh/issues/detail?id=3

Second, to my surprise, I am coding 100% in a TDD style, not out of any philosophical commitment or moral imperative, but simply out of the sense that this is the most natural way to start to get some leverage on the SensorBase implementation. The REST API specification turns out to form a very nice specification of the target behavior, so I just picked the first URI in the table (GET host/hackystat/sensordatatypes), which is supposed to return a list of sensordatatype resource references, and wrote a unit test that tries that call on a server. Of course, the test fails, because I haven't written the server yet.

Third, to my relief, the Restlet framework makes that test case wicked easy to write. In fact, here it is:

// Restlet 1.x imports (package locations may differ in later Restlet releases).
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import org.restlet.Client;
import org.restlet.data.Method;
import org.restlet.data.Protocol;
import org.restlet.data.Reference;
import org.restlet.data.Request;
import org.restlet.data.Response;
import org.restlet.resource.XmlRepresentation;

@Test
public void getSdtIndex() {
  // Set up the call.
  Method method = Method.GET;
  Reference reference = new Reference("http://localhost:9090/hackystat/sensordatatypes");
  Request request = new Request(method, reference);

  // Make the call.
  Client client = new Client(Protocol.HTTP);
  Response response = client.handle(request);

  // Test that the request was received and processed by the server OK.
  assertTrue("Testing for successful status", response.getStatus().isSuccess());

  // Now test that the response contains the expected SDT.
  XmlRepresentation data = response.getEntityAsSax();
  assertEquals("Checking SDT", "SampleSDT",
      data.getText("SensorDataTypes/SensorDataType/@Name"));
}
There are a couple of rough edges (I shouldn't hard-code the server URI, and my XPath is probably bogus), but the code runs and does the right thing (i.e., it fails at the getStatus call with a connection error, since I haven't written the server yet).
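One plausible fix for the hard-coded URI (a sketch, not an established Hackystat convention; the property name is my invention) is to read the host from a system property with a localhost default:

```java
// Hypothetical helper for tests: the "sensorbase.host" property name
// is illustrative, not part of any actual Hackystat convention.
public class TestHost {
  public static String getHost() {
    return System.getProperty("sensorbase.host",
        "http://localhost:9090/hackystat/");
  }

  public static void main(String[] args) {
    System.out.println(TestHost.getHost());
  }
}
```

The test would then build its Reference from TestHost.getHost() + "sensordatatypes", and a continuous integration server could point the tests elsewhere with -Dsensorbase.host=... on the command line.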

I'm sure things won't be this easy forever, but it's nice to get off to a smooth start.

Thursday, May 3, 2007

Minimize library imports to your Google Projects

As we transition to Google Project Hosting, one thing we need to be particularly careful about is uploading third party libraries into SVN. In general, try to avoid doing this, for two reasons. First, disk space in Google Project Hosting is limited, and it's easy to burn up your space with libraries (remember that since SVN never deletes, a library that needs frequent updating will burn through your space quickly). Second, different services will often share the same library. For example, most of our Java-based services will probably want to use the Restlet framework, and it is generally better for a developer to install that in one place.

To avoid uploading libraries to SVN, you can generally do one of the following alternatives:

  1. Instruct your developers in the installation guide to download the library to a local directory, create an environment variable called {LIBRARY}_HOME pointing to it, and reference those jar files from your Eclipse classpath or your Ant build file.
  2. For libraries that need to be in a specific location in your project, such as GWT, download the distribution to a local directory, then copy the relevant subdirectories into your project.
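For option 1, an Ant build file can pick up such an environment variable like this (RESTLET_HOME is just an example name; substitute whatever convention your project documents):

```xml
<!-- Expose environment variables with an "env." prefix. -->
<property environment="env"/>

<!-- Build a classpath from jars in the externally installed library. -->
<path id="restlet.classpath">
  <fileset dir="${env.RESTLET_HOME}" includes="*.jar"/>
</path>
```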
Binary distributions of releases are a different situation. In that case, we will typically want to bundle the libraries into the binary distribution. That will cause its own difficulties, since Google Project Hosting limits us to 10MB files in the download section, but we'll cross that bridge when we come to it.

Monday, April 30, 2007

Xml Schema definition for dummies

Today I defined my first batch of Xml Schemas for Version 8. The results of my labors are now available at http://hackystat-sensorbase-uh.googlecode.com/svn/trunk/xml/schema/

For each XSD file, I also provide a couple of "example" XML files, available in http://hackystat-sensorbase-uh.googlecode.com/svn/trunk/xml/examples/

To test that the XSD validates the XML, I used the online DecisionSoft XML Validator. Provide it with an XSD schema definition file and an XML file to validate against it, and away it goes. The error messages were a little cryptic, but good enough for my purposes.
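You can also do the same validation locally with the JDK's javax.xml.validation API. Here's a self-contained sketch with a toy inline schema (the real SensorBase XSDs live in SVN at the URL above):

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateExample {
  public static void main(String[] args) throws Exception {
    // Toy schema: a SensorDataType element with a required Name attribute.
    String xsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "  <xs:element name='SensorDataType'>"
      + "    <xs:complexType>"
      + "      <xs:attribute name='Name' type='xs:string' use='required'/>"
      + "    </xs:complexType>"
      + "  </xs:element>"
      + "</xs:schema>";
    String xml = "<SensorDataType Name='SampleSDT'/>";

    SchemaFactory factory =
        SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
    Validator validator = schema.newValidator();
    // Throws SAXException if the instance does not conform to the schema.
    validator.validate(new StreamSource(new StringReader(xml)));
    System.out.println("valid");
  }
}
```

In practice you would pass the XSD and XML files as StreamSources rather than inline strings.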

It's possible to include a reference to the XSD file within the XML instance, which is probably what we want to do in practice.
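For example (assuming no target namespace; the schema filename here is hypothetical), the instance document can point at its XSD like this:

```xml
<?xml version="1.0"?>
<SensorDataType Name="SampleSDT"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://hackystat-sensorbase-uh.googlecode.com/svn/trunk/xml/schema/sensordatatype.xsd"/>
```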

The next step is to parse the XML. Here's a nice example of using JAXB 2.0 to parse XML efficiently (both in terms of code size and execution time).

Saturday, April 28, 2007

Version 8 Design Progress

Lots of progress this week on the design of Version 8. There is a milestone on May 15, just over two weeks away, and I've been fleshing out a bunch of pages in order to add some details and direction as to what we might want to try to accomplish by then.

Finally, I gave a talk on REST in ICS 414 yesterday, and noticed the following blog entry by Josh about REST in general and the implications for Hackystat understandability and usability in particular. This gives me hope that we're heading in the right direction!

Friday, April 20, 2007

Near real time communication using Restlet

Some services (such as a UI service to watch the arrival of sensor data at a SensorBase) want "near real time" communication, using something like Jabber. There is a new project that integrates XMPP and Restlet which might be quite useful for this:

http://permalink.gmane.org/gmane.comp.java.restlet/1944

David might want to check this out.

Wednesday, April 18, 2007

Why REST for Hackystat?

Cedric asked a really good question on the hackystat mailing list today, and I thought it was worth posting to this blog:

> Probably my question is too late since you have already decide use REST, but I want to
> know the rationale behind it.
>
> Since you are still returning data in xml format, what makes you decide not to publish
> a collection of WSDL and go along with more industrial standard web service calls?

Excellent question! No, it's not too late at all. This is exactly the right time to be discussing this kind of thing.

It turns out that when I started the Version 8 design process, I was still thinking in terms of a monolithic server and was heading down the SOAP/WSDL route. I was, for example, investigating Glassfish as an alternative to Tomcat due to its purportedly better support for web services.

Then the Version 8 design process took an unexpected turn, and the monolithic server fragmented into a set of communicating services: SensorBase services for raw sensor data, Analysis services that would request data from SensorBases and provide higher level abstractions, and UI services that would request data from SensorBases and Analyses and display it with a user interface.

What worried me about this design initially was that every Analysis service would have to be able to both produce and consume data (kind of like being a web server and a web browser at the same time), and that Glassfish might be overkill for this situation. So, I started looking for a lightweight Java-based framework for producing/consuming web services, and came upon the Restlet Framework (http://www.restlet.org/), which then got me thinking more deeply about REST.

It's hard to quickly sum up the differences between REST and WSDL, but here are a few thoughts to get you started. WSDL is basically based upon the remote procedure call architectural style, with HTTP used as a "tunnel". As a result, you generally have a single "endpoint", or URL, such as /soap/servlet/messagerouter, that is used for all communication. Every single communication with the service, whether it is to "get" data from the service, "put" data to the service, or modify existing data, is always implemented (from an HTTP perspective) in exactly the same way: an HTTP POST to a single URL. From the perspective of HTTP, the "meaning" of the request is completely opaque.

In REST, in contrast, you design your system so that your URLs actually "mean" something: they name a "resource". Furthermore, the type of HTTP method also "means" something: GET means "get" a representation of the resource named by the URL, "POST" means create a new resource which will have a unique URL as its name, DELETE means "delete" the resource named by the URL, and so forth.

For example, in Hackystat Version 7, to send sensor data to the server, we use Axis, SOAP, and WSDL to send an HTTP POST to http://hackystat.ics.hawaii.edu/hackystat/soap/rpcrouter, and the content of the message indicates that we want to create some sensor data. All sensor data, of all types, for all users, is sent to the same URL in the same way. If we wanted to enable programmatic access to sensor data in Version 7, we would tell clients to continue to use HTTP POST to http://hackystat.ics.hawaii.edu/hackystat/soap/rpcrouter, but tell them that the content of the POST could now invoke a method in the server to obtain data.

A RESTful interface does it differently: to request data, you use GET with an URL that identifies the data you want. To put data, you use POST with an URL that identifies the resource you are creating on the server. For example:

GET http://hackystat.ics.hawaii.edu/hackystat/sensordata/x3fhU784vcEW/Commit/1176759070170

might return the Commit sensor data with timestamp 1176759070170 for user x3fhU784vcEW. Similarly,

POST http://hackystat.ics.hawaii.edu/hackystat/sensordata/x3fhU784vcEW/Commit/1176759070170

would contain a payload with the actual Commit data contents that should be created on the server. And

DELETE http://hackystat.ics.hawaii.edu/hackystat/sensordata/x3fhU784vcEW/Commit/1176759070170

would delete that resource. (There are authentication issues, of course.)

In fact, REST asserts a direct correspondence between the CRUD (create/read/update/delete) DB operations and the POST, GET, PUT, and DELETE methods for resources named by URLs.
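The correspondence is simple enough to write down as code. Here's a toy sketch (the class, method names, and property mapping are mine for illustration, not part of any actual Hackystat API):

```java
// Toy illustration of the CRUD <-> HTTP method correspondence for
// resources named by URLs. Not part of any actual Hackystat API.
public class CrudToHttp {
  public static String methodFor(String crudOperation) {
    if ("create".equals(crudOperation)) { return "POST"; }
    if ("read".equals(crudOperation))   { return "GET"; }
    if ("update".equals(crudOperation)) { return "PUT"; }
    if ("delete".equals(crudOperation)) { return "DELETE"; }
    throw new IllegalArgumentException("Unknown operation: " + crudOperation);
  }

  // Builds the resource URL from the user, sensor data type, and timestamp.
  public static String sensorDataUri(String user, String sdt, String timestamp) {
    return "http://hackystat.ics.hawaii.edu/hackystat/sensordata/"
        + user + "/" + sdt + "/" + timestamp;
  }

  public static void main(String[] args) {
    System.out.println(methodFor("read") + " "
        + sensorDataUri("x3fhU784vcEW", "Commit", "1176759070170"));
  }
}
```

Note how the URL never changes across the four operations; only the HTTP method does.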

Now, why do we care? What's so good about REST anyway? In the case of Hackystat, I think there are two really significant advantages of a RESTfully designed system over an RPC/SOAP/WSDL designed system:

(1) Caching can be done by the Internet. If you obey a few more principles when designing your system, then you can use HTTP techniques as a way to cache data rather than build in your own caching system. It's exactly the same way that your browser avoids going back to Amazon to get the logo files and so forth when you move between pages. In the case of Hackystat, when someone invokes a GET on the SensorBase with a specific URL, the results can be transparently cached to speed up future GETs of the same URL, since that represents the same resource. (There are cache expiration issues, which I'm pretty sure we can deal with.)
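An illustrative HTTP exchange (hand-written here, not captured from a real server) shows the mechanism: the client revalidates a cached resource with a conditional GET, and the server answers 304 so the body is never re-sent:

```
GET /hackystat/sensordata/x3fhU784vcEW/Commit/1176759070170 HTTP/1.1
Host: hackystat.ics.hawaii.edu
If-None-Match: "v1"

HTTP/1.1 304 Not Modified
ETag: "v1"
```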

In Hackystat Version 7, there is a huge amount of code that is devoted to caching, and this code is also a huge source of bugs and concurrency issues. With a REST architecture, it is possible that most, perhaps all, of this code can be completely eliminated without a performance hit. Indeed, performance might actually be significantly better in Version 8.

(2) A REST API is substantially more "accessible" than a WSDL API. One thing I want from Hackystat Version 8 is a substantially simpler, more accessible interface that enables outsiders to quickly learn how to extend Hackystat for their own purposes with new services and/or extract low-level or high-level data from Hackystat for their own analyses. To do this with a RESTful API, it's straightforward: here are some URLs, here's how they translate into resources, invoke GET and you are on your way. Pretty much every programming language has library support for invoking an HTTP GET with an URL. One could expect a first semester programming student to be able to write a program to do that. Shoots, you can do it in a browser. The "barrier to entry" for this kind of API is really, really low.

Now consider a WSDL API. All of a sudden, you need to learn about SOAP, and you need to find out how to do Web Services in your chosen programming language, and you have to study the remote procedure calls that are available, and so forth. The "barrier to entry" is suddenly much higher: there are incompatible versions of SOAP, there's way more to learn, and I bet more than a few people will quickly decide to just bail and request direct access to the database, which cuts them out of 90% of the cool stuff in Hackystat.

So, by my reckoning, if we decided to use Axis/SOAP/WSDL in Version 8, we'd (1) continue to need to do all of our own caching, with all of the headaches that entails, and (2) be stuck with a relatively complex interface to the data.

I want to emphasize that a RESTful architecture is more subtle than simply using GET, POST, PUT, and DELETE. For example, the following is probably not RESTful, because it tunnels a destructive action through a GET:

GET http://foo/bar/baz?action=delete

For more details, http://en.wikipedia.org/wiki/Representational_State_Transfer has a good intro with pointers to other readings.

Your email made another interesting assertion:

> what makes you decide not to publish
> a collection of WSDL and go along with more industrial standard web service calls?

Although I agree that WSDL is an "industry standard", this doesn't mean that REST isn't one as well. Indeed, my sense after a few weeks of research on the topic is that most significant industrial players have already moved to REST or offer REST as an alternative to WSDL: eBay, Google, Yahoo, Flickr, and Amazon all have REST-based services. I recall reading that the REST API gets far more traffic than the corresponding WSDL API for at least some of these services.

Finally, no architecture is a silver bullet, and REST is no exception. For example, if you can't effectively model your domain as a set of resources, or if the CRUD operations aren't a good fit with the kinds of manipulations you want to do, then REST isn't right. Another REST requirement is statelessness, which can be a problem for some applications. So far in my design process, however, I haven't run into any showstoppers for the case of Hackystat.

Version 8 is still in the early stages, and the advantages of REST are still hypothetical, so I'm really happy to have this conversation. There are no hard commitments to anything yet, and if there turns out to be a showstopping problem with REST, then we can of course make a change. The more we talk about it, the greater the odds we'll figure out the right thing.

Cheers,
Philip

Monday, April 16, 2007

Version 8 now appearing on Google Projects

I am happy to announce that the Hackystat Version 8 repository is starting to take shape as a set of related projects in Google Project Hosting.

The "hub" project is hackystat which does not contain any code, but does contain wiki pages with high level planning and design documents.

The hub project also contains links to individual Google Project Hosting projects that have been set up to manage implementation of some of the initial Version 8 services. These projects are: hackystat-sensorbase-uh, hackystat-sensor-shell, hackystat-sensor-xmldata, and hackystat-ui-sensordataviewer. These projects don't actually contain any code yet, either.

Note that there are conventions for naming the Hackystat Version 8 projects.

My first focus of attention is on the SensorBase, and I am currently trying to design the REST API for that service. Details at 11.

Monday, April 2, 2007

Hackystat 8 and Net Option Values

In The Option Value of Modularity in Design, Carliss Baldwin and Kim Clark argue that appropriate modularity can increase adoption of a system over alternatives by increasing the "net option value" of the design.

It occurs to me that Hackystat 8, by decomposing the current "black box" of the server into a set of loosely coupled, independently evolvable components, will enable new degrees of freedom in the design evolution of the system both within the CSDL research group and externally in the software engineering community.

Thursday, March 29, 2007

Hackystat UI: Swivel Google Gadgets


Swivel is a site where users can upload data sets and combine contributed data sets in various ways.

What I discovered today is their interface to the Google Home Page via Google Gadgets. I think this is a nice example of how simple it could be to provide a Version 8 Hackystat user interface via Google Gadgets.

Check it out here.

Wednesday, March 28, 2007

REST and web services

Some useful links to understand Representational State Transfer:

REST seems like an appropriate architectural style for the Version 8 web service component.

The framework I am most interested in evaluating for Java-based REST components is Restlet.

Tuesday, March 27, 2007

Hackystat UI: Wesabe and Social Software Metrics

Robert recently pointed me to Wesabe, which is a social networking site focusing on personal finances. This is an interesting site to compare/contrast with Hackystat, since it:
  • Deals with numbers and "metrics".
  • Requires members to share aspects of very personal information (finances) in order to exploit the potential of social networks.
Their help guide is in the form of YouTube videos, which is a little weird (or maybe the wave of the future). I will show some blurry screen shots to illustrate some of the interesting aspects of this tool. This first one shows the top-level organization of your Wesabe account, which has three tabs: Accounts, Tips, and Goals.

Accounts basically corresponds to your "raw sensor data" in Hackystat. In Wesabe, you are expected to upload your bank and credit card information.

Tips correspond to information supplied by other users based upon analysis of your account data. The idea is that the raw financial data is parsed to find out what you spend money on.

For example, if you have gas charges, then you will be hooked up with Tips on how to save money on gas. They use a keyword-based mechanism to hook together account data with tip data.

The tips could be generic (don't buy premium gas) or more specific (Don't buy gas from the gas station you're going to; they are a rip-off).

Here's a screen shot of a drill-down into an account, along with the tips and keywords associated with it. Often, you will need to manually annotate your raw financial data in order for Wesabe to start to work its magic on it. You can also see from this screen shot that individual financial items can be rated, and you can also see whether other Wesabeans have recorded a similar kind of purchase.

While tips are a kind of "bottom up" mechanism for producing "actionable" information from your raw account data, goals are more of a top-down approach, in which you first specify your high level goal, and then you get hooked up with other users interested in the same approach.

In Wesabe, it seems that the main focus is to direct you into existing discussion forums rather than explicitly connect you to your financial data. For example, a goal would be something like "Start a College Savings fund for my kids."

So, how does this all relate to Hackystat? I think there are some intriguing possibilities. First, Hackystat currently allows data to be "shared" only within the context of a Project---if there are multiple members of a Project, then they can potentially see each other's data. Wesabe illustrates how you might think about "sharing" on a more global level. The idea is that you don't share the actual financial information: no one knows where you shopped or how much you spent, but via the social bookmarking mechanism, the system can hook you up with "tips" (specific actionable items) or "goals" (a community of people with the same intentions).

To explore how this might work, let's imagine some possible "Tips" from the realm of Java software development:
  • How to convert from Java 1.4 to Generics in Java 5 (See my previous blog posting on this.)
  • Proper use of concurrency mechanisms.
  • Diagnosing a null pointer problem.
Hmm. These tips all seem to require more context than is typically provided by Hackystat data. One could imagine a sensor data type that provides data on the import statements associated with a file you are editing. That would give some insight into the kinds of libraries you are using, which might enable you to be hooked up with helpful tips. Another sensor data type might provide the stack trace and error message associated with a thrown exception.

Now let's think of possible "Goals"
  • Reduce the number of daily build failures.
  • Reduce the time required for running all unit tests.
  • Improve the quality of code.
  • Improve the scalability of the database.
Some of these might be inferable from the kinds of telemetry charts you are monitoring, for example.

In any case, Wesabe indicates an interesting research direction for Hackystat: create the capability for users to add keywords to their data, and then process these keywords as a way to hook users with common interests and mutually useful skills with each other.

Hackystat UI: Telemetry and Alexa charts

Alexa is a site that provides information relating to site traffic. Hongbing sent out a link recently to this site with a query as to how they could produce PNG charts so quickly. I assume that with enough CPU and network bandwidth, anything is possible. What I was personally struck by is their user interface, which rather elegantly supports a lot of the features we want from telemetry. Consider the following screen image from their site, which I have annotated with a 1, 2, 3, and 4.

UI Feature (1) is the tabs, which provide various perspectives on the set of sites. From the Telemetry perspective, this is analogous to a set of related Charts. Thus, what they've done is provided the equivalent of the Telemetry Report interface, but in a much nicer package. Instead of scrolling down through an endless series of charts, you click on a tab to see the related chart. Much, much nicer.

UI Feature (2) is the "Range" selection. This is analogous to our "Day", "Week", "Month" interval selection mechanism. While it is not as flexible as ours, it provides easier access to common interval requests: Last 7 days, Last 1 Month, Last 3 months, etc.

UI Feature (3) is the "See Traffic Details" link. This is analogous to our "Daily Project Summary" drilldown (or maybe a "Project Status To Date" analysis).

UI Feature (4) is the ability to easily add and subtract different trend lines. This is interesting when translated to the Telemetry domain, because it could be interpreted in one of two different ways: (a) add/subtract one or more telemetry streams, or (b) add/subtract one or more Projects. Indeed, we might want to think of providing both abilities: you could specify what telemetry streams you want to display, and then this set of telemetry streams would be specified for each Project you specify. Thus the number of lines appearing on the chart would be the number of streams times the number of projects. In most cases, you will probably want to have one stream and multiple Projects, or multiple streams for one Project.