Monday, April 30, 2007

Xml Schema definition for dummies

Today I defined my first batch of Xml Schemas for Version 8. The results of my labors are now available at http://hackystat-sensorbase-uh.googlecode.com/svn/trunk/xml/schema/

For each XSD file, I also provide a couple of "example" XML files, available in http://hackystat-sensorbase-uh.googlecode.com/svn/trunk/xml/examples/

To test that the XSD validates the XML, I used the online DecisionSoft Xml Validator. Provide it with an XSD schema definition file and an XML file to validate against it, and away it goes. The error messages were a little cryptic, but good enough for my purposes.
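The same check can also be done locally with the javax.xml.validation package that ships with Java. What follows is only a minimal sketch: the tiny SensorData schema and the class name are made up for illustration and are not the actual Version 8 schemas.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class XsdValidationSketch {

  // Hypothetical schema: a SensorData element with a required Timestamp attribute.
  static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='SensorData'>"
      + "<xs:complexType>"
      + "<xs:attribute name='Timestamp' type='xs:long' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

  /** Returns null if the XML validates against the XSD, otherwise the validator's message. */
  static String validate(String xsd, String xml) {
    try {
      SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
      Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
      Validator validator = schema.newValidator();
      // With no ErrorHandler installed, the first validation error is thrown as an exception.
      validator.validate(new StreamSource(new StringReader(xml)));
      return null;
    } catch (SAXException | IOException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    // A conforming instance, then one that is missing the required attribute.
    System.out.println(validate(XSD, "<SensorData Timestamp='1176759070170'/>"));
    System.out.println(validate(XSD, "<SensorData/>"));
  }
}
```

The first call should report no problem, and the second should come back with a message about the missing Timestamp attribute.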

It's possible to include a reference to the XSD file within the XML instance, which is probably what we want to do in practice.

The next step is to parse the XML. Here's a nice example of using JAXB 2.0 to parse XML efficiently (both in terms of code size and execution time).

Saturday, April 28, 2007

Version 8 Design Progress

Lots of progress this week on the design of Version 8. There is a milestone on May 15, just over two weeks away, and I've been fleshing out a bunch of pages in order to add some details and direction as to what we might want to try to accomplish by then.

Finally, I gave a talk on REST in ICS 414 yesterday, and noticed the following blog entry by Josh about REST in general and the implications for Hackystat understandability and usability in particular. This gives me hope that we're heading in the right direction!

Wednesday, April 25, 2007

H4 with David

Had an excellent H4 session with David last week, but unfortunately spaced out blogging about it until now. We spent the time looking over his JEdit sensor and trying to figure out how to get it to be correctly noticed as a plugin by JEdit at startup time. Found one significant problem during the session (the plugin main class was not named correctly), and one more significant problem after the session (the build directory was being used to store source code in the SVN repository).

What's the moral of this story? Basically the obvious one: two heads are better than one, and the process of explaining your code to someone else has the potential to be an excellent way to reveal issues and problems in a very cost-effective manner.

Friday, April 20, 2007

Near real time communication using Restlet

Some services (such as a UI service to watch the arrival of sensor data at a SensorBase) want "near real time" communication, using something like Jabber. There is a new project that integrates XMPP and Restlet which might be quite useful for this:

http://permalink.gmane.org/gmane.comp.java.restlet/1944

David might want to check this out.

Wednesday, April 18, 2007

Why REST for Hackystat?

Cedric asked a really good question on the hackystat mailing list today, and I thought it was worth posting to this blog:

> Probably my question is too late since you have already decide use REST, but I want to
> know the rationale behind it.
>
> Since you are still returning data in xml format, what makes you decide not to publish
> a collection of WSDL and go along with more industrial standard web service calls?

Excellent question! No, it's not too late at all. This is exactly the right time to be discussing this kind of thing.

It turns out that when I started the Version 8 design process, I was still thinking in terms of a monolithic server and was heading down the SOAP/WSDL route. I was, for example, investigating Glassfish as an alternative to Tomcat due to its purportedly better support for web services.

Then the Version 8 design process took an unexpected turn, and the monolithic server fragmented into a set of communicating services: SensorBase services for raw sensor data, Analysis services that would request data from SensorBases and provide higher level abstractions, and UI services that would request data from SensorBases and Analyses and display it with a user interface.

What worried me about this design initially was that every Analysis service would have to be able to both produce and consume data (kind of like being a web server and a web browser at the same time), and that Glassfish might be overkill for this situation. So, I started looking for a lightweight Java-based framework for producing/consuming web services, and came upon the Restlet Framework (http://www.restlet.org/), which then got me thinking more deeply about REST.

It's hard to quickly sum up the differences between REST and WSDL, but here are a few thoughts to get you started. WSDL is essentially based upon the remote procedure call architectural style, with HTTP used as a "tunnel". As a result, you generally have a single "endpoint", or URL, such as /soap/servlet/messagerouter, that is used for all communication. Every single communication with the service, whether it is to "get" data from the service, "put" data to the service, or modify existing data, is always implemented (from an HTTP perspective) in exactly the same way: an HTTP POST to a single URL. From the perspective of HTTP, the "meaning" of the request is completely opaque.

In REST, in contrast, you design your system so that your URLs actually "mean" something: they name a "resource". Furthermore, the type of HTTP method also "means" something: GET means "get" a representation of the resource named by the URL, POST means "create" a new resource which will have a unique URL as its name, DELETE means "delete" the resource named by the URL, and so forth.

For example, in Hackystat Version 7, to send sensor data to the server, we use Axis, SOAP, and WSDL to send an HTTP POST to http://hackystat.ics.hawaii.edu/hackystat/soap/rpcrouter, and the content of the message indicates that we want to create some sensor data. All sensor data, of all types, for all users, is sent to the same URL in the same way. If we wanted to enable programmatic access to sensor data in Version 7, we would tell clients to continue to use HTTP POST to http://hackystat.ics.hawaii.edu/hackystat/soap/rpcrouter, but tell them that the content of the POST could now invoke a method in the server to obtain data.

A RESTful interface does it differently: to request data, you use GET with a URL that identifies the data you want. To put data, you use POST with a URL that identifies the resource you are creating on the server. For example:

GET http://hackystat.ics.hawaii.edu/hackystat/sensordata/x3fhU784vcEW/Commit/1176759070170

might return the Commit sensor data with timestamp 1176759070170 for user x3fhU784vcEW. Similarly,

POST http://hackystat.ics.hawaii.edu/hackystat/sensordata/x3fhU784vcEW/Commit/1176759070170

would contain a payload with the actual Commit data contents that should be created on the server. And

DELETE http://hackystat.ics.hawaii.edu/hackystat/sensordata/x3fhU784vcEW/Commit/1176759070170

would delete that resource. (There are authentication issues, of course.)

In fact, REST asserts a direct correspondence between the CRUD (create/read/update/delete) DB operations and the POST, GET, PUT, and DELETE methods for resources named by URLs.
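To make the mapping concrete, here is a small sketch that pairs each CRUD operation with its HTTP method and builds the resource URL used in the examples above. The class, enum, and method names are invented for this illustration, and the URL layout is only the tentative one shown in this post, not a settled API.

```java
import java.net.URI;

class RestMappingSketch {

  /** CRUD operations paired with the HTTP method this post associates with each. */
  enum Crud {
    CREATE("POST"), READ("GET"), UPDATE("PUT"), DELETE("DELETE");

    final String httpMethod;

    Crud(String httpMethod) {
      this.httpMethod = httpMethod;
    }
  }

  // Base URL taken from the examples in this post; the path layout is illustrative.
  static final String BASE = "http://hackystat.ics.hawaii.edu/hackystat/sensordata";

  /** Builds the URL that names one sensor data resource. */
  static URI resource(String user, String sensorDataType, long timestamp) {
    return URI.create(BASE + "/" + user + "/" + sensorDataType + "/" + timestamp);
  }

  /** Renders "METHOD URL" for an operation on a resource. */
  static String requestLine(Crud op, URI resource) {
    return op.httpMethod + " " + resource;
  }

  public static void main(String[] args) {
    URI commit = resource("x3fhU784vcEW", "Commit", 1176759070170L);
    // One resource name, four different meanings depending on the method.
    for (Crud op : Crud.values()) {
      System.out.println(requestLine(op, commit));
    }
  }
}
```

The point of the sketch is that the URL never changes; only the method does, and both are visible to anything that understands HTTP.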

Now, why do we care? What's so good about REST anyway? In the case of Hackystat, I think there are two really significant advantages of a RESTfully designed system over an RPC/SOAP/WSDL designed system:

(1) Caching can be done by the Internet. If you obey a few more principles when designing your system, then you can use HTTP techniques as a way to cache data rather than build in your own caching system. It's exactly the same way that your browser avoids going back to Amazon to get the logo files and so forth when you move between pages. In the case of Hackystat, when someone invokes a GET on the SensorBase with a specific URL, the results can be transparently cached to speed up future GETs of the same URL, since that represents the same resource. (There are cache expiration issues, which I'm pretty sure we can deal with.)

In Hackystat Version 7, there is a huge amount of code that is devoted to caching, and this code is also a huge source of bugs and concurrency issues. With a REST architecture, it is possible that most, perhaps all, of this code can be completely eliminated without a performance hit. Indeed, performance might actually be significantly better in Version 8.
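As a rough illustration of why URL-named resources cache so easily, the sketch below keys a cache on nothing but the URL. It is a toy stand-in for what an HTTP cache would do on our behalf, not code we would need to write; the class name is hypothetical, and real caches also honor expiration and validation headers, which are ignored here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class UrlCacheSketch {

  private final Map<String, String> cache = new HashMap<>();
  private final Function<String, String> origin;

  /** Number of requests that actually reached the origin server. */
  int originHits = 0;

  /** The origin function stands in for the real server behind the cache. */
  UrlCacheSketch(Function<String, String> origin) {
    this.origin = origin;
  }

  /** GET: answer from the cache if this URL was fetched before, otherwise ask the origin. */
  String get(String url) {
    String cached = cache.get(url);
    if (cached != null) {
      return cached;
    }
    originHits++;
    String body = origin.apply(url);
    cache.put(url, body);
    return body;
  }

  /** A change to the resource (PUT or DELETE in REST terms) discards the cached copy. */
  void invalidate(String url) {
    cache.remove(url);
  }

  public static void main(String[] args) {
    UrlCacheSketch cache = new UrlCacheSketch(url -> "representation of " + url);
    String url = "/sensordata/x3fhU784vcEW/Commit/1176759070170";
    cache.get(url);
    cache.get(url); // served from the cache, the origin is not contacted again
    System.out.println("origin hits: " + cache.originHits);
  }
}
```

Because the URL alone identifies the resource, the cache needs no knowledge of Hackystat at all, which is what lets generic HTTP infrastructure do the job.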

(2) A REST API is substantially more "accessible" than a WSDL API. One thing I want from Hackystat Version 8 is a substantially simpler, more accessible interface that enables outsiders to quickly learn how to extend Hackystat for their own purposes with new services and/or extract low-level or high-level data from Hackystat for their own analyses. To do this with a RESTful API, it's straightforward: here are some URLs, here's how they translate into resources, invoke GET and you are on your way. Pretty much every programming language has library support for invoking an HTTP GET with a URL. One could expect a first semester programming student to be able to write a program to do that. Shoots, you can do it in a browser. The "barrier to entry" for this kind of API is really, really low.
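To give a feel for how low that barrier is, here is a sketch of a GET using only the standard Java library. Since no Version 8 SensorBase exists yet, the demo starts a throwaway local server to stand in for one; the class name, the /sensordata path, and the response body are all placeholders.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class SimpleGetSketch {

  /** Performs an HTTP GET and returns the response body as a string. */
  static String get(String url) {
    try {
      HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
      conn.setRequestMethod("GET");
      try (InputStream in = conn.getInputStream()) {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      } finally {
        conn.disconnect();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Starts a local stand-in server on a free port, GETs one resource from it, and stops it. */
  static String demo() {
    HttpServer server;
    try {
      server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    server.createContext("/sensordata", exchange -> {
      byte[] body = "<SensorData/>".getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start();
    try {
      int port = server.getAddress().getPort();
      return get("http://127.0.0.1:" + port + "/sensordata/someuser/Commit/1");
    } finally {
      server.stop(0);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The client side is the get method alone; everything else in the sketch exists only so that the example can run without a real server.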

Now consider a WSDL API. All of a sudden, you need to learn about SOAP, and you need to find out how to do Web Services in your chosen programming language, and you have to study the remote procedure calls that are available, and so forth. The "barrier to entry" is suddenly much higher: there are incompatible versions of SOAP, there's way more to learn, and I bet more than a few people will quickly decide to just bail and request direct access to the database, which cuts them out of 90% of the cool stuff in Hackystat.

So, from my reckoning, if we decided to use Axis/SOAP/WSDL in Version 8, we'd (1) continue to need to do all our own caching with all of the headaches that entails, and (2) be stuck with a relatively complex interface to the data.

I want to emphasize that a RESTful architecture is more subtle than simply using GET, POST, PUT, and DELETE. For example, the following is probably not RESTful, because it uses GET, which is supposed to only retrieve a representation, to trigger a deletion:

GET http://foo/bar/baz?action=delete

For more details, http://en.wikipedia.org/wiki/Representational_State_Transfer has a good intro with pointers to other readings.

Your email made another interesting assertion:

> what makes you decide not to publish
> a collection of WSDL and go along with more industrial standard web service calls?

Although I agree that WSDL is an "industry standard", this doesn't mean that REST isn't one as well. Indeed, my sense after a few weeks of research on the topic is that most significant industrial players have already moved to REST or offer REST as an alternative to WSDL: eBay, Google, Yahoo, Flickr, and Amazon all have REST-based services. I recall reading that the REST API gets far more traffic than the corresponding WSDL API for at least some of these services.

Finally, no architecture is a silver bullet, and REST is no exception. For example, if you can't effectively model your domain as a set of resources, or if the CRUD operations aren't a good fit with the kinds of manipulations you want to do, then REST isn't right. Another REST requirement is statelessness, which can be a problem for some applications. So far in my design process, however, I haven't run into any showstoppers for the case of Hackystat.

Version 8 is still in the early stages, and the advantages of REST are still hypothetical, so I'm really happy to have this conversation. There are no hard commitments to anything yet, and if there turns out to be a showstopping problem with REST, then we can of course make a change. The more we talk about it, the greater the odds we'll figure out the right thing.

Cheers,
Philip

Monday, April 16, 2007

Version 8 now appearing on Google Projects

I am happy to announce that the Hackystat Version 8 repository is starting to take shape as a set of related projects in Google Project Hosting.

The "hub" project is hackystat, which does not contain any code, but does contain wiki pages with high level planning and design documents.

The hub project also contains links to individual Google Project Hosting projects that have been set up to manage implementation of some of the initial Version 8 services. These projects are: hackystat-sensorbase-uh, hackystat-sensor-shell, hackystat-sensor-xmldata, and hackystat-ui-sensordataviewer. These projects don't actually contain any code yet, either.

Note that there are conventions for naming the Hackystat Version 8 projects.

My first focus of attention is on the SensorBase, and I am currently trying to design the REST API for that service. Details at 11.

Friday, April 13, 2007

CSDL members, please read this immediately!

Now that we're two weeks into our blog experiment, I want to test how quickly and effectively information is disseminated through the group via this mechanism.

If you are in CSDL, and you are reading this, please reply to this blog posting immediately with the comment "I read it." I want to find out the following:
  • Will everyone in CSDL reply to this comment? Who is actually reading other members' blogs?
  • How long does it take for a new blog comment to be read by other members of the group?
For this test, please do not verbally inform other CSDL members of this experiment.

I will display this blog entry and discuss the results at next Wednesday's meeting.

Tuesday, April 3, 2007

Java 5 Conversion Week

Having gotten through almost all of the Core subsystem, I think it's time to try allowing all the CSDL Hackystat Hackers to have some fun with Java 5 conversion. So, I'm declaring this week "Java 5 Conversion Week", and the goal is to update the last module in the Core subsystem and all of the code in the SDT subsystem so that no warnings remain in Eclipse (using the default setting for warnings generation, which also corresponds to the Java 5 compiler warnings setting).

The chart at left lists all of the modules to be worked on, the number of warnings currently present in each, and the developer assigned to eliminate the warnings from them. As you can see, there are about 900 warnings total, and six developers available to work on them, so that results in about 150 warnings per developer. Because it's simplest to assign the work on a module basis, I ended up giving Hongbing and me a little more to do than everyone else, but at the end of the day I don't think that will make much difference in the amount of time spent on the work.

I will create Jira issues for each developer listing the modules they are assigned and referencing this blog entry for more details. Also, please be sure to review my other blog entry on Java 5 Conversion for additional hints on how to carry out this process.
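For anyone who hasn't done one of these conversions yet, here is a sketch of what a typical fix looks like, assuming the warning comes from a raw collection type (I'm guessing that is the most common case; the class and method names below are made up and are not Hackystat code).

```java
import java.util.ArrayList;
import java.util.List;

class Java5ConversionSketch {

  // Before (pre-Java 5 style, left as a comment because it triggers raw type warnings):
  //   List names = new ArrayList();
  //   names.add("Commit");
  //   String first = (String) names.get(0);

  /** After: the element type is declared, so the warning and the cast both go away. */
  static List<String> sensorDataTypes() {
    List<String> names = new ArrayList<String>();
    names.add("Commit");
    names.add("FileMetric");
    return names;
  }

  /** The enhanced for loop, also new in Java 5, replaces explicit Iterator code. */
  static int totalLength(List<String> names) {
    int total = 0;
    for (String name : names) {
      total += name.length();
    }
    return total;
  }

  public static void main(String[] args) {
    List<String> names = sensorDataTypes();
    String first = names.get(0); // no cast needed
    System.out.println(first + " " + totalLength(names));
  }
}
```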

Good luck, have fun, send email if you run into problems, and feel free to send out an email when you've committed your last batch of changes!

Monday, April 2, 2007

Hackystat 8 and Net Option Values

In The Option Value of Modularity in Design, Carliss Baldwin and Kim Clark argue that appropriate modularity can affect the adoption of a system over other alternatives due to increases in the "net option value" of the design.

It occurs to me that Hackystat 8, by decomposing the current "black box" of the server into a set of loosely coupled, independently evolvable components, will enable new degrees of freedom in the design evolution of the system both within the CSDL research group and externally in the software engineering community.

Sunday, April 1, 2007

Archive TimeZone problem and its workaround

The problem we have been experiencing with the Archive listing being off from the blog entry listing is well known (and particularly prevalent in Hawaii!). See the thread here to follow the discussion and find out when the blogger folks implement the fix.

The temporary workaround is to use Pacific time, which of course makes the posting time off by a few hours. I've changed my blog to Pacific time and, voila, the problem disappears.