Reviewing Google AppEngine for Java (Part 2)

In the first part I left off with some good news: a successful deployment in the local GAE container. In this second part I'll talk about the following: loading data and browsing the datastore, handling transactions, table indexing, the limitations of datastore queries and other JPA issues, deployment, performance, production monitoring, usage and quotas, and some final thoughts.

Loading data and browsing
After finishing off the first successful deployment, next on the agenda was testing the persistence tier, but for that I needed some data to work with in the GAE datastore.
Along with the local container, Google provides a local datastore, but so far the only documented way to populate it is described here, and it uses Python. Likewise, the only documented way to browse the local datastore is through the local development console, which again requires the Python environment. There is a hint from Google, though, that a data viewer for the Java SDK is coming. In the meantime I voted up the feature request.

Back to bulk loading: after voting on the feature request for a bulk uploader, I decided not to use the Python solution but to handcraft a loader that fills in two tables, COUNTRY and COUNTRY_STATE (for federated countries like the US and Canada). Since there's no way to schedule worker threads yet (it's on its way), the only mechanism to trigger the data loading is via URLs. I'm using Spring 3.0, so it wasn't hard to create a new @Controller with a few methods to handle the data loading and map them to URLs:

@Controller
public class BulkLoadController {

	@RequestMapping("/importCountries")
	public ModelAndView importCountries() {
		...
	}

	@RequestMapping("/importProvinces")
	public ModelAndView importProvinces() {
		...
	}
}

Since my SQL script that creates and loads the COUNTRY table is of no use here (Google's BigTable is not an RDBMS), I chose to translate all the INSERT operations to CSV.

Before
INSERT INTO COUNTRY (NAME, TWO_L_CODE, THREE_L_CODE, CODE, TIMEZONE)
  VALUES("Antarctica","AQ","ATA",10,12);

After
Antarctica,AQ,ATA,10,12

You don't need to create the table; the GAE runtime takes care of that both locally and in the hosted environment.
Once I created the CSV file (I named it countries.txt), I made it available to the webapp by placing it in the root of the war folder, since the container supports reading files from the application folders. The country importer method then looks like this (exception handling removed for brevity):

@RequestMapping("/importCountries")
public ModelAndView importCountries() {
	FileInputStream countriesF = new FileInputStream("countries.txt");
	BufferedReader br = new BufferedReader(new InputStreamReader(countriesF));
	String line = null;
	while ((line = br.readLine()) != null) {
		persistCountry(line);
	}
	return new ModelAndView("success");
}

The only thing I had to do was implement persistCountry, which I did by adding a new method on the DAO that persists a Country entity and injecting the DAO into BulkLoadController. I fired the URL request and got the OK page back. Since I couldn't browse the datastore, I resorted to looking in the local_db.bin file located in the WEB-INF/appengine-generated folder, and I could see the country names.
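For reference, here is roughly what that pairing looks like; the setter names on Country and the CountryDao class name are my shorthand for this sketch, not necessarily the exact names in the project:

@Repository
public class CountryDao {

	@PersistenceContext
	private EntityManager em;

	// One transaction per entity: the datastore won't let a single
	// transaction span multiple root entities (more on this below).
	@Transactional
	public void saveCountry(Country country) {
		em.persist(country);
	}
}

In BulkLoadController, persistCountry then just parses one CSV line and delegates to the DAO:

private CountryDao countryDao; // injected

private void persistCountry(String line) {
	// CSV columns: NAME, TWO_L_CODE, THREE_L_CODE, CODE, TIMEZONE
	String[] fields = line.split(",");
	Country country = new Country();
	country.setName(fields[0]);
	country.setTwoLetterCode(fields[1]);
	country.setThreeLetterCode(fields[2]);
	country.setCode(Integer.parseInt(fields[3]));
	country.setTimezone(Float.parseFloat(fields[4]));
	countryDao.saveCountry(country);
}
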
Handling Transactions
One notable point regarding transactions is that you can't save multiple objects in a single transaction if they are all root entities (see comment). I bumped into this while sorting out the following exception:

Caused by: java.lang.IllegalArgumentException: can't operate on multiple entity groups in a single transaction. found
 both Element {
  type: "stations"
  id: 500
}
 and Element {
  type: "stations"
  id: 599
}

To spare the datastore unneeded work, I would normally batch multiple entity updates and issue a single transaction commit. With GAE's datastore this is not possible, so I had to open and commit a transaction for every entity update. This resulted in noticeably high CPU usage both locally and in the hosted environment, but I couldn't find a better way of handling it. In my case I had to remove the batching logic from the classes involved and either annotate the insert/update methods with @org.springframework.transaction.annotation.Transactional, as in the 'persistCountry' DAO method above, or use programmatic transaction management, which is what I ended up doing for the weather station upload.
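A minimal sketch of that programmatic variant, using Spring's TransactionTemplate and assuming a PlatformTransactionManager is already configured for the JPA setup (the importer and DAO names here are illustrative):

private StationDao stationDao; // injected, hypothetical DAO
private final TransactionTemplate txTemplate;

public WMOStationsImporter(PlatformTransactionManager txManager) {
	this.txTemplate = new TransactionTemplate(txManager);
}

private void persistStation(final Station station) {
	// One short-lived transaction per station, mirroring the @Transactional approach.
	txTemplate.execute(new TransactionCallbackWithoutResult() {
		@Override
		protected void doInTransactionWithoutResult(TransactionStatus status) {
			stationDao.saveStation(station);
		}
	});
}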

There were two more tables I still had to bulk-provision, PROVINCE and STATION, but the pattern should have been the same. Or so it seemed.
When I tried to provision the PROVINCE table I got no errors, but no rows in the datastore either. My class looks like this:

@Entity
public class Province implements Serializable {

  private ProvincePK provincePK;
  private String provinceName;
  private float timeZone;

  @EmbeddedId
  public ProvincePK getProvincePK() {
    return provincePK;
  }
...

@Embeddable
public class ProvincePK implements Serializable {

  private String stateCode;
  private String countryCode;
...

From the documentation [via groups] I learned that composite PKs are not supported. It would have been useful, though, to flag this either in the enhancement step (still hate it) or at deployment. So issue 1658 was born. Vote for it if you want it implemented.
The fix is not easy in most cases, but I switched to a single-field PK in the Province entity since that was possible in my case. The provinces imported fine after this change.
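After dropping the embedded id, the entity looks roughly like this; the concatenated-code scheme for the key below is just one way to do it and an assumption on my part, not a GAE requirement:

@Entity
public class Province implements Serializable {

  // Single String key replacing the ProvincePK embedded id,
  // e.g. countryCode + "-" + stateCode ("US-NY").
  private String provinceCode;
  private String provinceName;
  private float timeZone;

  @Id
  public String getProvinceCode() {
    return provinceCode;
  }
...
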
Now the harder part. I am using ASCII resources from the internet to import weather stations; in all, about 20,000 stations are imported from several resources, both HTTP and FTP. The first thing I learned was that AppEngine enforces a request timeout of 30 seconds. I learned this the hard way after trying to perform the same import I use in Spincloud, except triggered via a URL request (I normally use JMX, but it's not supported by the hosted AppEngine). The error I bumped into was:

com.google.apphosting.runtime.HardDeadlineExceededError: This request (8c8ca701dccad394)
 started at 2009/06/05 07:30:05.088 UTC and was still executing at 2009/06/05 07:30:33.944 UTC.

Notice the timestamps: there is an automatic timeout enforced by the hosted runtime.
Since we can't use long-running tasks, the alternative is to chain requests that incrementally add batches of items into the datastore, using a request parameter to control the batch increments.

	@RequestMapping("/updateWeatherStations")
	public ModelAndView updateWeatherStations(HttpServletRequest request) {
		int position = Integer.parseInt(request.getParameter("position"));
		wmoStationsImporter.saveWeatherStations(position);
		return new ModelAndView("success");
	}

The method saveWeatherStations saves one batch of items into the datastore; the 'position' parameter indicates which batch to process (the start index into the item list is 'position' times the batch size). The batch size is constant, so each subsequent request simply increments 'position' by one.

private static final int BATCH_SIZE = 50;

public void fetchWMOData(int position) {
	int recordNo = 0;
	int startIndex = position * BATCH_SIZE;
	BufferedReader isr = new BufferedReader(new InputStreamReader(getStationsFileInputStream()));
	String nextLine = null;
	while ((nextLine = isr.readLine()) != null) {
		recordNo++;
		if (recordNo <= startIndex) {
			continue; // already imported by a previous batch
		}
		if (recordNo > startIndex + BATCH_SIZE) {
			logger.info("Finished batch");
			return;
		}
		//parse record and save weather station
	}
}

Some observations about the above code: first, I'm reading a local ASCII file that contains the stations I want to save into the datastore. Second, I had to find a reasonable value for BATCH_SIZE; the higher the number, the longer the processing takes (reading from the file, parsing the record and saving the entity into the datastore), and you may again hit the 30-second limit. I'm also loading the ASCII data from local files because the request limit is so tight that I don't want to wait for resources to be downloaded off the internet.

What’s left is to “weld” requests together using a shell script. The number of requests is the total number of records divided by BATCH_SIZE.

#!/bin/sh
count=25
while [ "$count" -le 133 ]
do
  wget "http:///updateStations?position=$count"
  let "count += 1"
  sleep 8
done

To play nicely with the hosted environment and avoid being throttled, I pause 8 seconds after each request.
In conclusion, I really need a better way to handle bulk loading. Ideally I could use SQL data dumps to load the datastore.


Table indexing
The datastore indexes each column individually, so you don't need to do that explicitly. However, any JPA query that uses more than one field in its criteria needs a composite index. The local environment creates the datastore index config file for you (WEB-INF/datastore-indexes.xml), but it only adds an index configuration after the corresponding query has actually been executed at runtime. This is not optimal, and I'd like to see a way to auto-generate the indexes by processing the JPA queries (as part of the enhance step, maybe?). More about GAE's datastore indexing here.
Spatial indexing is not supported.
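To make the composite-index point concrete, a query like the one below, filtering on two properties, is the kind that needs an entry in datastore-indexes.xml; running it once against the local datastore is what makes the SDK write that entry (the countryCode property is hypothetical here):

// Two-property filter: requires a composite index, which the local SDK
// generates in datastore-indexes.xml after the query first executes.
Query q = em.createQuery(
	"SELECT s FROM com.newsplore.weather.bo.Station s " +
	"WHERE s.countryCode = :country AND s.usState = :state");
q.setParameter("country", "US");
q.setParameter("state", "NY");
List<Station> stations = q.getResultList();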


Limitations of datastore queries
If you're thinking of a straight migration of all your existing JPA queries to the datastore, be warned. GAE uses a datastore built on the BigTable storage system, which follows a different philosophy than a traditional RDBMS and therefore doesn't support many standard relational features. Here are the documented limitations. I have also found that queries using UPPER and LIKE are not supported:

org.datanucleus.store.appengine.query.DatastoreQuery$UnsupportedDatastoreFeatureException:
Problem with query : Unsupported method <upper> while parsing expression: InvokeExpression {[null].upper(PrimaryExpression {station, usState})}

org.datanucleus.store.appengine.query.DatastoreQuery$UnsupportedDatastoreOperatorException:
Problem with query <SELECT station FROM com.newsplore.weather.bo.Station station WHERE
 UPPER(station.name) LIKE :stationName ORDER BY station.name>:
App Engine datastore does not support operator  LIKE

I can live without UPPER, but queries using LIKE are a must, and I was quite surprised to find that they are not supported (TODO: study BigTable ;). I have yet to find a full replacement for LIKE…
Plain SQL is also unsupported (not a relational datastore…). Likewise, spatial extensions, if they exist at all, are not documented. I'm using several spatial SQL queries in Spincloud, mostly to select data from a geographic area.
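Going back to LIKE: the closest thing to a substitute I'm aware of, and only for the 'starts with' case, is the commonly suggested trick of rewriting the filter as a range scan on the same property. I haven't adopted it yet, so treat this as a sketch:

// Emulate "name LIKE 'KEN%'" with a range on the same property;
// "\ufffd" is the last legal character, so the range covers every
// name starting with the given prefix.
String prefix = "KEN";
List<Station> stations = em.createQuery(
	"SELECT s FROM com.newsplore.weather.bo.Station s " +
	"WHERE s.name >= :from AND s.name < :to ORDER BY s.name")
	.setParameter("from", prefix)
	.setParameter("to", prefix + "\ufffd")
	.getResultList();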


More datastore limitations and JPA issues
– Primary keys cannot be of int type; I had to switch a bunch of them to Long.
– JPA queries need fully qualified class names; only “select s from com.package.Station s” will work.
– Inconsistent NullPointerException when accessing some entities the first time:

ERROR MainHandlerExceptionResolver:18  Exception processing page: Name is null
java.lang.NullPointerException: Name is null
	at java.lang.Enum.valueOf(Enum.java:195)
	at com.google.appengine.api.datastore.dev.CompositeIndexManager$IndexSource.valueOf(CompositeIndexManager.java:64)

After repeating the request the entities are fetched OK.
– No composite PKs allowed.
– java.sql.Timestamp is not supported. The following exception is thrown when accessing an entity containing such a type:

nested exception is org.springframework.dao.InvalidDataAccessApiUsageException: java.sql.Timestamp is not a supported property type.; nested exception is java.lang.IllegalArgumentException: java.sql.Timestamp is not a supported property type.]

I had to change all java.sql.Timestamp fields to java.util.Date. I’m a bit baffled about this one since Timestamp is a subtype of Date. I also changed the @Temporal annotations to TemporalType.DATE.
– No support for uniqueness constraints. I removed all "unique=true" clauses, as in @Column(name="SOME_COLUMN", unique=true).
– I couldn't make @ManyToOne relationships work. I have such a relationship, for instance, between Station (many) and Province (one). Within the transaction that persists a Station, a Province is associated with it: the Province is first loaded from the datastore by PK and then set on the new Station object. The error is:

Caused by: java.lang.IllegalArgumentException: can't operate on multiple entity groups in a single transaction. found both Element {
  type: "PROVINCE"
  name: "NO"
}
 and Element {
  type: "stations"
  id: 21
}

This is the same type of constraint that prevents working on multiple root entities within the same transaction.
– The not-equals and IN operators are not supported.
– I'm using method-level annotations for my entities, and I noticed that both the attribute and the getter/setter have to follow the JavaBean naming rules. With every other JPA provider I can have an arbitrarily named field as long as the accessors follow the JavaBean rules (see the sketch after this list).
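Put together, an entity shaped to fit these constraints ends up looking something like the sketch below (class and field names are illustrative only):

@Entity
public class Sample implements Serializable {

  private Long id;          // Long, not int, for the PK
  private String name;      // field name matches getName()/setName()
  private Date observedAt;  // java.util.Date instead of java.sql.Timestamp

  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  public Long getId() { return id; }
  public void setId(Long id) { this.id = id; }

  // No unique=true: the datastore has no uniqueness constraints.
  @Column(name = "NAME")
  public String getName() { return name; }
  public void setName(String name) { this.name = name; }

  @Temporal(TemporalType.DATE)
  public Date getObservedAt() { return observedAt; }
  public void setObservedAt(Date observedAt) { this.observedAt = observedAt; }
}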


Deployment
You can use the neat Eclipse plugin to push the webapp to the hosted GAE, or use the included appcfg.sh utility; your choice. One important limitation is that the total number of files allowed to be uploaded is 1,000. This was solved in the Python environment with the introduction of the zipserve module, but there's no equivalent on the Java side yet. In the meantime, one way around it is to deploy a second webapp containing only the static files and update the web pages accordingly (or use a real CDN).
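From the command line, the upload boils down to a single call to the SDK tool; the paths below assume the default SDK layout and an exploded war directory:

# Upload the exploded war directory to the hosted AppEngine
./appengine-java-sdk/bin/appcfg.sh update ./war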


Performance
Static pages perform quite well. Querying the datastore with indexed data, on the other hand, didn't strike me as blazingly fast. I can pull, for instance, 200 weather stations from the datastore (through JPA) and render the results on a JSP page, but if I bump the number of records to 1,000 I run into another limitation of the AppEngine, a time limit on the datastore request:

ERROR MainHandlerExceptionResolver:18  Exception processing page: datastore timeout:
operation took too long.
com.google.appengine.api.datastore.DatastoreTimeoutException: datastore timeout: operation took too long.
	at com.google.appengine.api.datastore.DatastoreApiHelper.translateError(DatastoreApiHelper.java:37)
	at com.google.appengine.api.datastore.DatastoreApiHelper.makeSyncCall(DatastoreApiHelper.java:55)
...

According to my non-scientific measurement, the datastore timeout is around 5 seconds, and a query returning 1,000 records exceeds that deadline. Arguably one shouldn't issue queries that return such a high number of records in the first place, but it's better to be aware of this constraint.
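One obvious mitigation, though I haven't settled on it, is to cap the result size at query level so a single datastore call stays comfortably under the deadline:

// Cap the result size; 200 records rendered fine in my tests, 1,000 blew the deadline.
List<Station> stations = em.createQuery(
	"SELECT s FROM com.newsplore.weather.bo.Station s ORDER BY s.name")
	.setMaxResults(200)
	.getResultList();
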
I have also experienced an OutOfMemoryError in the hosted environment, and I couldn't find any facility to redeploy or otherwise reset the runtime from the browser-based admin interface.


Production monitoring
Google provides an online web console to monitor usage, check logs and quotas, and browse the live datastore. I found it useful, but a bit more work is needed on the datastore browser to support GUI-oriented querying.


Usage and quotas
Google offers free usage up to a limit; beyond that there's a paid scheme. Usage is free for up to 5 million page views per month, which seems quite reasonable.


Other limitations
A list of what's runnable on the AppEngine is found here. Of it all, I'll focus on just one item: saving files on the hosted file system is forbidden. I happen to save lots of image tiles for the radar layer, so I'll be forced to store the tiles in the datastore instead, not necessarily a bad thing since it forces me to learn more about BigTable. On the other hand, I have to point out that this limitation adds to the challenges of migrating to the AppEngine cloud.
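A sketch of how a tile entity might look, assuming the datastore Blob property type is the way to go (I haven't actually migrated this part yet):

@Entity
public class RadarTile implements Serializable {

  private String tileKey;  // e.g. zoom/x/y encoded as a string
  private Blob imageData;  // PNG bytes wrapped in com.google.appengine.api.datastore.Blob

  @Id
  public String getTileKey() {
    return tileKey;
  }
...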


Final thoughts
It’s quite clear that Google had an aggressive timeline for releasing GAE for Java. GAE(J) is still incomplete and there are some glaring omissions that should be corrected sooner rather than later:
– datastore backup, browser and bulk uploader tool
– spatial extensions support in the datastore. This is a must given the opportunity for integration with Google Maps.
– background processing (coming soon)
– extending queries to support the LIKE operator or an alternative
– better JPA support, meaning better error messages, better documentation of unsupported features, and proper Level 2 cache integration with memcache.
The roadmap is laid out here.

In conclusion, moving an existing project to Google's infrastructure is not a trivial job. The differences between a traditional database and the BigTable concepts are at times irreconcilable, and the code refactoring may prove too big a task. I happen to agree with Mr. Cranrell's remarks about the challenges of migrating to a non-relational datastore (perfect timing, the article was posted just today), but at the same time I have to appreciate the (revolutionary?) efforts from Google (BigTable) and Amazon (SimpleDB) to shift the market focus from RDBMS to perhaps better alternatives.

The AppEngine for Java is a worthy player in the new cloud-services industry that is taking shape as we speak. It is especially relevant for greenfield projects that don't carry the standard baggage of constraints (most notably the relational database mapping) and where new and innovative ways of architecting an application can be explored.