Java Data API Design Revisited

When domain entities get bigger and more complex, designing a safe, usable, future-proof modification API is tricky.


This article is on a simple but effective approach to designing for data updates.


Providing an update API for a complex domain entity is more complicated than most developers initially expect. As usual, problems start showing when complexity increases.

Here’s the setup: Suppose your software system exposes a service API for some domain entity X to be used by other modules.

When using the Java Persistence API (JPA) it is not uncommon to expose the actual domain classes to API users. That greatly simplifies simple updates: Just invoke domain class setters, and unless the whole transaction fails, updates will be persisted. There are a number of problems with that approach, though. Here are some:

  • If modifications of the domain object instance are not performed in one go, other code invoked in between may see inconsistent states (this is one reason why using immutables is favourable).
  • Updates that require non-trivial constraint checking may not be performed on the entity in full but rather require service invocations – leading to an API that is complex to use.

  • Exposing the persistent domain types, including their “transparent persistence” behavior is very much exposing the actual database structure which easily deviates from a logical domain model over time, leading to an API that leaks “internal” matters to its users.

The obvious alternative to exposing JPA domain classes is to expose read-only, immutable domain type interfaces and complement that by service-level modification methods whose arguments represent all or some state of the domain entity.

Only for very simple domain types, it is practical to offer modification methods with simple built-in types such as numbers or strings though, as that leads to hard to maintain and even harder to use APIs.

So we need some change-describing data transfer object (DTO – we use that term regardless of whether remoting is involved) that can serve as a parameter of our update method.

As soon as updates are to be prepared either remotely or in some multi-step editing process, intermediate storage of yet-to-be-applied updates needs to be implemented, and having some help for that is great in any case. So DTOs are cool.

Given a domain type X (as read-only interface), and some service XService we assume some DTO type XDto, so that the (simplified) service interface looks like this:


public interface XService {
 X find(String id);
 X create(XDto xdto);
 X update(String id, XDto xdto);
}


If XDto is a regular Java Bean with some members describing updated attributes for X, there are a few annoying issues that take away a lot of the initial attractiveness:

  • You cannot distinguish a null value from undefined. That is, suppose X has a name attribute and XDto has a name attribute as well – describing a new value for X’s attribute. In that case, null may be a completely valid value. But then: How to describe the case that no change at all should be applied?
  • This is particularly bad if setting some attribute is meant to trigger some other activity.
  • You need to write or generate a lot of value object boilerplate code to have good equals() and hashCode() implementations.
  • As with the first issue: How do you describe the change of a single attribute only?

In contrast to that, consider an XDto that is implemented as an extension of HashMap<String,Object>:

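public class XDto extends HashMap<String,Object> {
  public final static String NAME = "name";
  public XDto() { }
  // copy constructor: copies only the attributes that are defined in u
  public XDto(XDto u) {
    if (u.containsKey(NAME)) { setName(u.getName()); }
  }
  // initialize from the current state of a domain object (assuming X exposes getName())
  public XDto(X x) {
    setName(x.getName());
  }
  public String getName() {
    return (String) get(NAME);
  }
  public void setName(String name) {
    put(NAME, name);
  }
}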

Apart from having decent equals, hashCode, and toString implementations, as befits a value object, this allows for the following features:

  • We can perfectly distinguish between null and undefined using Map.containsKey.
  • This is great, as now, in the implementation of the update method for X, we can safely assume that any single attribute change was meant to be. This allows for atomic, consistent updates with very relaxed concurrency constraints.
  • Determining the difference, compared to some initial state is just an operation on the map’s entry set.
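
The difference computation mentioned in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the implementation behind this article: the class DtoDelta and its delta method are hypothetical names, and plain maps stand in for XDto.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper, for illustration only.
class DtoDelta {

    // Returns those entries of 'edited' that are missing from 'initial' or differ from it.
    // A key mapped to null in the result means "set to null", not "unchanged".
    static Map<String, Object> delta(Map<String, Object> initial, Map<String, Object> edited) {
        Map<String, Object> changes = new HashMap<>();
        for (Map.Entry<String, Object> e : edited.entrySet()) {
            boolean known = initial.containsKey(e.getKey());
            if (!known || !Objects.equals(initial.get(e.getKey()), e.getValue())) {
                changes.put(e.getKey(), e.getValue());
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, Object> initial = new HashMap<>();
        initial.put("name", "Alice");
        initial.put("email", "alice@example.org");

        Map<String, Object> edited = new HashMap<>(initial);
        edited.put("name", null); // an explicit null, distinguishable via containsKey

        Map<String, Object> changes = delta(initial, edited);
        System.out.println(changes.containsKey("name"));  // the changed attribute is part of the delta
        System.out.println(changes.containsKey("email")); // the untouched attribute is not
    }
}
```

An update method receiving such a delta can then apply exactly the keys present in the map and leave everything else alone.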


In short: We get a data operation programming model (see the drawing below) consisting of initializing some temporary update state as a DTO, operating on this as long as needed, extracting the actual change by comparing DTOs, and sending back the change.


Things get a little more tricky when adding collections of related persistent value objects to the picture. Assume X has some related Ys that are nevertheless owned by X. Think of a user with one or more addresses. As for X we assume some YDto. Where X has some method getYs that returns a list of Y instances, XDto now works with YDtos.

Our goal is to use simple operations on collections to extend the difference computation from above to this case. Ideally, we support adding and removing Ys as well as modification, where modified Ys should be represented, for update, with a “stripped” YDto as above.

Here is one way of achieving that: As Y is a persistent entity, it has an id. Now, instead of holding on to a list of YDto, we construct XDto to hold a list of pairs (id, value).

Computing the difference between two such lists of pairs means removing all pairs that are equal and, in addition, for those with the same id, recursing into the YDto instances for difference computation. Back on the list level, a pair with no id indicates a new Y to be created, and a pair with no YDto indicates a Y that is no longer part of X. This is actually rather simple to implement generically.

That is, serialized as JSON, the delta between two XDto states with a modified Y collection would look like this:

    {"id":"1", "value":{"a":"new A"}},             // update "a" in Y "1"
    {"id":"2" },                                   // delete Y "2"
    {"value" : {"a":"initial a", "b":"initial b"}} // add a new Y
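
A generic difference computation over such (id, value) pairs might look like the following sketch. The types YPair and CollectionDelta are hypothetical and only illustrate the rules described above: equal pairs are dropped, pairs with the same id are reduced to their changed attributes, pairs without id are additions, and pairs without value are removals. It assumes that all pairs of the initial state carry both an id and a value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical pair type: id == null means "new Y", value == null means "Y removed".
class YPair {
    final String id;
    final Map<String, Object> value;

    YPair(String id, Map<String, Object> value) {
        this.id = id;
        this.value = value;
    }
}

class CollectionDelta {

    // Attribute-level difference, as for the single DTO case.
    static Map<String, Object> attributeDelta(Map<String, Object> before, Map<String, Object> after) {
        Map<String, Object> changes = new HashMap<>();
        for (Map.Entry<String, Object> e : after.entrySet()) {
            if (!before.containsKey(e.getKey()) || !Objects.equals(before.get(e.getKey()), e.getValue())) {
                changes.put(e.getKey(), e.getValue());
            }
        }
        return changes;
    }

    static List<YPair> delta(List<YPair> initial, List<YPair> edited) {
        Map<String, Map<String, Object>> remaining = new LinkedHashMap<>();
        for (YPair p : initial) {
            remaining.put(p.id, p.value);
        }
        List<YPair> result = new ArrayList<>();
        for (YPair p : edited) {
            if (p.id == null) {
                result.add(p); // no id: a new Y to be created
                continue;
            }
            Map<String, Object> before = remaining.remove(p.id);
            if (before == null) {
                before = Collections.emptyMap(); // id unknown in initial state: keep all attributes
            }
            Map<String, Object> changes = attributeDelta(before, p.value);
            if (!changes.isEmpty()) {
                result.add(new YPair(p.id, changes)); // "stripped" value: changed attributes only
            }
        }
        for (String id : remaining.keySet()) {
            result.add(new YPair(id, null)); // not present anymore: remove this Y
        }
        return result;
    }
}
```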

All in all, we get a programming model that supports efficient and convenient data modifications with some natural serialization for the remote case.


The supplied DTO types serve as state types in editors (for example) and naturally extend to change computation purposes.

As a side note: Between 2006 and 2008 I was a member of the very promising Service-Data-Objects (SDO) working group. SDO envisioned a similar programming style but went much further in terms of abstraction and implementation requirements. Unfortunately, SDO seems to be pretty much dead now – probably due to scope creep and the lack of an accessible, easy-to-use implementation (last I checked). Good thing is we can achieve a lot of its goodness with a mix of existing technologies.



Local vs. Distributed Complexity

As a student or programming enthusiast, you will spend considerable time getting your head around data structures and algorithms. It is those elementary concepts that make up the essential tool set to make a dumb machine perform something useful and enjoyable.

When going professional, i. e. when building software to be used by others, typically developers end up either building enabling functionality, e. g. low level frameworks and libraries (infrastructure) or applications or parts thereof, e. g. user interfaces, jobs (solutions).

There is a cultural divide between infrastructure developers and solution developers. The former have a tendency to believe the latter do somehow intellectually inferior work, while the latter believe the former have no clue about real life.

While it is definitely beneficial to develop skills in API design and system-level programming, doing so without the experience of developing and delivering an end-to-end solution is like knowing the finest details of kitchen equipment without ever cooking for friends.

The Difference

A typical characteristic of an infrastructure library is a rather well-defined problem scope that is known to imply some level of non-trivial complexity in its implementation (otherwise it would be pointless):


Local complexity is expected and accepted.


In contrast, solution development is driven by business flows, end-user requirements, and other requirements that are typically far from stable until done, and even less so over time. Complete solutions typically consist of many spread out – if not distributed – implementation pieces, so that local complexity is simply not affordable.


Distributed complexity is expected, local complexity is not acceptable.


The natural learning order is from left to right:



Unfortunately, many careers and whole companies do not get past the infrastructure/solution line. This produces deciders who have very little idea about “the real” and tend to view it as a simplified extrapolation of their previous experience. Eventually we see astronaut architectures full of disrespect for the problem space, absurd assumptions about how markets adapt, and little idea of how much time and reality exposure solutions require to become solid problem solvers.


Java EE is not for Standard Business Software

The “official” technology choice for enterprise software development on the Java platform is the Java Enterprise Edition or Java EE for short. Java EE is a set of specifications and APIs defined within the Java Community Process (JCP) – it is a business software standard.


This post is on why it is naive to think that knowing Java EE is your ticket to creating standard business software.

I use the term standard business software for software systems that are developed by one party and used by many and that are typically extended and customized for and by specific users (customers) to integrate it with customer-specific business processes. The use of the word “standard” does not indicate that it is necessarily widely used or somehow agreed on by some committee – it just says that it is standardizing a solution approach to a business problem for a range of possible applications – and typically requires some form of adaptation before usable in a specific setting.

How hard can it be?

It is a myth that Java Enterprise development is harder than on other platforms – per se. That is, from the point of view of the programming language and, specifically, the Java EE APIs, writing the software as such is not more complex than in other environments. Complex software is complex, regardless of the technology choice.

In order to turn your software into “standard software” however, the following needs to be addressed as well:

You need an approach to customize and extend your software

This is only partially a software architecture problem. It also means providing your customers with the ability to add code, manage upgrades, and run integration tests. Java EE provides very little in terms of code extensibility, close to nothing for modularity with isolation, and obviously it says nothing about how to actually produce software.

You need an operational approach

This is the single most underestimated aspect. While any developer knows that the actual Java EE implementation, the Java EE server, makes a huge difference when things get serious, the simplistic message that an API standard is good enough to make implementations interchangeable has led organizations to standardize on some specific Java EE product.

This situation has positive side effects for two parties: IT can extend its claim, and Java EE vendors can sell more licenses. And it has a terrible side effect for one party: You as a developer.

It’s up to you to qualify your software for different Java EE implementations of different versions. It’s up to you to describe operations of your software in conjunction with the specific IT-mandated version. When things go bad however, you will still get the blame.

Why is it so limited?

There is a pattern here: There is simply no point for Java EE vendors to extend the standard with anything helping you solve those problems, there is simply no point in providing standard means to help you ship customizable extensible business solutions.

Although it is hard to tell, considering the quality of the commercial tools I know of, addressing the operational side and solving modularity questions definitely seemed to provide excellent potential for selling added value on the one side and for effective vendor lock-in on the other.

This extends to the API specifications. When I was working on JCP committees in my days at SAP, it was rather common to argue that some ability should specifically be excluded from the standard or even precluded in order to make sure that you may well be able to develop for some Java EE server product but not in competition to it. And that makes a lot of sense from a vendor’s perspective. This is saying that

Java EE is a customization and extension tool for Java EE vendor solution stacks.


Not that any vendor was particularly successful in implementing that effect – thanks to the competition stemming from open source projects that have become de-facto standards, such as the Spring Framework and Hibernate, to name only two of many more.


Outside of an established IT organization, i.e. as a party selling solutions into IT organizations, it makes very little sense to focus on supporting a wide range of Java EE implementations and have yourself pay the price for it. Instead try to bundle as much infrastructure as possible with your solution to limit operational combinatorics.

To be fair: It is a good thing that we have Java EE. But one should not be fooled into believing that it is the answer to interoperability.


  1. Java EE
  2. JCP

From a Human Perspective

When designing software that runs in a distributed environment, an extremely helpful tool is to look for slow-world analogies. As our brain thinks much more intuitively when considering human-implemented processes, finding flaws in system deployment architectures is significantly simpler in the analogy and surprisingly accurate.

In the analogy we identify

IT concept              Slow-world analogy
A thread                An activity to attend to (e.g. sorting letters)
An OS process           A worker, or more politely: A human
An OS instance (a VM)   A home
A remote message        A letter
A remote invocation     A phone call
A file                  A file

You can easily go more fine-grained: A big server running a big database for example corresponds to a big administration building with lots of workers running around piling files in some huge archive packed with file cabinets.

In contrast some legacy host running a lot of under-equipped virtual machines is more like a … trailer park.

Asynchronous communication clearly corresponds to the exchange of letters, while phone calls play the role of synchronous service calls. The analogy thus models the scalability and reliability characteristics of both communication styles rather well.

Some Examples

Example 1: De-coupling via asynchronous communication

It is not uncommon that crucial bottlenecks in a distributed architecture derive from some many-to-one state update that was simply not taken seriously, i.e. many places synchronously call one place to drop off some state update.

In the analogy it is perfectly obvious that having many people call in via phone is much more expensive in terms of capacity requirements and much less reliable than processing piles of letters – a work load that can be independently scaled, is very reliable, and makes good use of resources.
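
In code, the "pile of letters" can be approximated by a queue between the senders and the single receiver. The following is only a sketch within one JVM, with a hypothetical Mailbox class; in a distributed setup the queue would be some durable messaging facility instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical mailbox: many senders drop letters, one receiver processes them in batches.
class Mailbox {
    private final BlockingQueue<String> letters = new LinkedBlockingQueue<>();

    // Called by many senders; returns immediately, the sender does not wait for processing.
    void drop(String update) {
        letters.add(update);
    }

    // Called by the receiver at its own pace; takes at most 'max' letters off the pile.
    List<String> collect(int max) {
        List<String> batch = new ArrayList<>();
        letters.drainTo(batch, max);
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        Mailbox box = new Mailbox();
        List<Thread> senders = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int id = i;
            Thread t = new Thread(() -> box.drop("update from sender " + id));
            senders.add(t);
            t.start();
        }
        for (Thread t : senders) {
            t.join();
        }
        System.out.println(box.collect(10).size() + " updates processed in one batch");
    }
}
```

The receiver's capacity is now a matter of how fast it empties the pile, not of how many senders happen to call at the same moment.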

Example 2: Node-local search index

In online portals, a shared database can become a major data reading bottleneck that in addition needs to process most crucial updates as well. In the analogy this corresponds to a blackboard (the DB) and many remote workers (the front ends) calling in to ask for some piece of information. It is much more efficient to hand a periodically updated copy (a catalog) out to the front end workers.
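
As a rough sketch of the catalog idea: each front end keeps a local, read-only copy that is replaced periodically, so that reads never reach the shared database. The LocalCatalog class and its Supplier-based source are assumptions made for this illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical node-local catalog; 'source' stands in for a query against the shared database.
class LocalCatalog {
    private final Supplier<Map<String, String>> source;
    private volatile Map<String, String> snapshot = Collections.emptyMap();

    LocalCatalog(Supplier<Map<String, String>> source) {
        this.source = source;
    }

    // To be called periodically (e.g. by a scheduler); swaps in a fresh copy in one step.
    void refresh() {
        snapshot = Collections.unmodifiableMap(new HashMap<>(source.get()));
    }

    // Reads are served from the local copy only.
    String lookup(String key) {
        return snapshot.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> database = new HashMap<>();
        database.put("article-1", "Blue widget");
        LocalCatalog catalog = new LocalCatalog(() -> database);
        catalog.refresh();
        database.put("article-1", "Red widget");         // change in the shared source
        System.out.println(catalog.lookup("article-1")); // still served from the last copy
        catalog.refresh();
        System.out.println(catalog.lookup("article-1")); // visible after the next refresh
    }
}
```

The price is bounded staleness between two refreshes, which is usually acceptable for read-mostly data.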

Example 3: Zero-Downtime deployment

This is a particularly nice one. The problem addressed by ZDD is that in a distributed setup, a partial roll out of a new software version introduces some not completely trivial compatibility constraints. In particular, any shared resource (a database, a shared service), when upgraded, still needs to accept interactions with some range of previous software versions running on its clients. In the analogy this corresponds to remote offices where clerks still use an old form in some and a new form version in other offices. A central office needs to be able to process old forms as well as new revisions. Likewise, when sending out information to remote offices, it needs to be presented in a format comprehensible to clerks who have not been trained for the new version, and yet it needs to comply with the latter as well. All ZDD requirements for the IT analogy follow.
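
To make the two form revisions a bit more concrete, here is a small sketch of a "central office" that accepts both. Everything in it is invented for illustration: revision 1 of the form is assumed to carry a single name field, revision 2 separate firstName and lastName fields.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical central office that has been upgraded while some remote offices have not.
class CentralOffice {

    // Accepts old and new form revisions and normalizes them to the current one.
    static Map<String, String> accept(Map<String, String> form) {
        String version = form.getOrDefault("version", "1");
        Map<String, String> current = new HashMap<>();
        current.put("version", "2");
        if ("1".equals(version)) {
            // old form: one "name" field, split at the first whitespace
            String[] parts = form.getOrDefault("name", "").trim().split("\\s+", 2);
            current.put("firstName", parts[0]);
            current.put("lastName", parts.length > 1 ? parts[1] : "");
        } else if ("2".equals(version)) {
            current.put("firstName", form.getOrDefault("firstName", ""));
            current.put("lastName", form.getOrDefault("lastName", ""));
        } else {
            throw new IllegalArgumentException("Unsupported form version " + version);
        }
        return current;
    }

    // Outgoing information: new fields are added, old fields are kept for untrained clerks.
    static Map<String, String> send(String firstName, String lastName) {
        Map<String, String> out = new HashMap<>();
        out.put("version", "2");
        out.put("firstName", firstName);
        out.put("lastName", lastName);
        out.put("name", (firstName + " " + lastName).trim());
        return out;
    }
}
```

Once all offices have switched to the new form, the old branch and the redundant field can be retired in a later roll out.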

I guess, you get the point and I will stop here.

A Final Note

One last piece however, an axiom to the whole idea, if you will, is the

Underlying principle: We all are built the same – we just happen to do different things

Considering traditional labor, this is pretty much true in the real world. It should similarly be true for your solution: If your (analogy) workers are overspecialized (can only speak on the phone, will not process paper forms…) for no other reason than a deployment diagram that seemed to be a good idea at some time, you are in for trouble mid-term.

That is: As a general principle (modulo well-justified exceptions) all nodes in your deployment decomposition can – in principle – do any kind of application work, from rendering a front end to computing a report.

As a corollary this implies that: Not doing something but still being able should not incur pain in terms of added deployment and configuration complexity. (see also modularization and integratedness).


Refactoring-safe referencing of bean properties

Currently starting to overhaul an old idea for a rather handy (sort of DSL’ish) query API in Java that exposes some properties I very much miss in APIs like Query DSL.

A typical problem when designing data access APIs or any other API that binds some data structure to Java Beans is that you cannot directly refer to bean properties in a refactoring-safe way when constructing expressions. Instead you have to resort to string constants, thereby denoting property names redundantly. The advantage of bean properties over string constants however is that refactoring tools recognize usage throughout a complete codebase, so that changing internal data naming is a straightforward and low-risk task.

The approach taken by tools such as the (dreadful) JPA criteria API or Query DSL is to offer generation of Companion Types for bean types. The companion types expose access to property names and more. Code generation – in particular code generation involving the IDE – that produces code referenced by name from hand-typed code extends the compiler food chain to an even more intrusive beast and even introduces IDE dependencies. This approach is not only ugly, it asks for trouble mid-way and cannot be a sane choice long term.

Here is another approach:

Based on the Java Bean Specification we have a one-to-one relationship between bean properties and their read methods (the getters). In Java, however, method meta-data is not as directly accessible in code as class meta-data: While a class literal such as MyBean.class gives refactoring-safe access to a class and thereby to its name, there is no comparable literal for methods. In order to retrieve the property association via a getter method, we can however make some careful use of byte-code trickery. Using the Javassist library, we can generate a support extension – a meta bean – of the original bean type that, when its getters are invoked, provides us with the associated property name.

In essence this works as follows (see the code below): After retrieving a meta bean (that may be held on to), invoking a getter leaves the corresponding property name in a thread local. A helper method reads the thread local and resets it. So, continuing the example below,

MyBean mb = MetaBeans.make(MyBean.class);

would output


As a neat extension, the artificially created getters return (whenever sensibly possible) meta beans themselves, and when finding a non-empty property name held by the thread local storage, instead of setting it to the property name, the property name will be appended, so that

import static MetaBeans.*;

MyBean mb = make(MyBean.class);

would output

Using this approach requires no further tooling whatsoever, can easily be extended to other use-cases, is completely refactoring-safe, and comes at negligible cost.
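
To illustrate just the thread-local recording mechanism without any byte-code generation, here is a simplified variant that swaps in JDK dynamic proxies. Note that this is a different technique than the Javassist subclassing described in this post, and it only works for bean types declared as interfaces. The names PropertyRecorder, Customer, and Address are hypothetical.

```java
import java.lang.reflect.Array;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical bean interfaces used for the demonstration.
interface Address { String getCity(); }
interface Customer { String getName(); Address getAddress(); }

class PropertyRecorder {
    private static final ThreadLocal<String> PATH = new ThreadLocal<>();

    // Creates a recording stand-in for the given bean interface.
    @SuppressWarnings("unchecked")
    static <T> T make(Class<T> iface) {
        InvocationHandler h = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Object.class) {
                switch (method.getName()) {
                    case "hashCode": return System.identityHashCode(proxy);
                    case "equals":   return proxy == args[0];
                    default:         return iface.getName() + " recorder";
                }
            }
            // append the property name to whatever has been recorded so far
            String name = propertyName(method.getName());
            String prev = PATH.get();
            PATH.set(prev == null ? name : prev + "." + name);
            Class<?> rt = method.getReturnType();
            if (rt.isInterface()) {
                return make(rt); // allows chaining into nested properties
            }
            // boxed zero value for primitive return types, null otherwise
            return rt.isPrimitive() ? Array.get(Array.newInstance(rt, 1), 0) : null;
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] { iface }, h);
    }

    // Returns the recorded property path and resets the recording.
    static String p(Object ignored) {
        try {
            return PATH.get();
        } finally {
            PATH.remove();
        }
    }

    private static String propertyName(String getter) {
        String n = getter.startsWith("get") ? getter.substring(3)
                 : getter.startsWith("is") ? getter.substring(2) : getter;
        return n.isEmpty() ? n : Character.toLowerCase(n.charAt(0)) + n.substring(1);
    }

    public static void main(String[] args) {
        Customer c = make(Customer.class);
        System.out.println(p(c.getName()));              // name
        System.out.println(p(c.getAddress().getCity())); // address.city
    }
}
```

The generated meta beans of the actual approach play the role of the proxy here, with the advantage that they also work for concrete bean classes.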

Note that the implementation below is not made to run with module-system-type class loader setups, is somewhat crude, and is really just meant to illustrate the idea. Consult the Javassist API for more information on managing class pools.

Here is the MetaBeans class:

 import java.beans.Introspector;
 import java.beans.PropertyDescriptor;
 import java.lang.reflect.Method;
 import java.lang.reflect.Modifier;

 import javassist.ClassPool;
 import javassist.CtClass;
 import javassist.CtMethod;
 import javassist.CtNewMethod;

 public class MetaBeans {
    private static ThreadLocal<String> properties = new ThreadLocal<String>();

    /**
     * Create a meta bean instance. If not eligible, this method throws an IllegalArgumentException.
     * @param beanClass the bean class to create a meta bean instance for
     * @return instance of meta bean
     */
    public static <T> T make(Class<T> beanClass) {
        return make(beanClass,true);
    }

    /**
     * Create a meta bean instance. If not eligible, return null.
     * @param beanClass the bean class to create a meta bean instance for
     * @return instance of meta bean or null, if the class is found to be not eligible
     */
    public static <T> T makeOrNull(Class<T> beanClass) {
        return make(beanClass,false);
    }

    /**
     * Track meta bean invocations and return property path.
     */
    public static String p(Object any) {
        try {
            return properties.get();
        } finally {
            // reset, so that the next expression starts a new path
            properties.remove();
        }
    }

    /**
     * Internal. Called by the generated getters.
     */
    public static void note(String name) {
        String n = properties.get();
        if (n==null) {
            n = name;
        } else {
            n += "."+name;
        }
        properties.set(n);
    }

    // private 

    // actually provide an instance
    private static <T> T make(Class<T> beanClass, boolean failIfNotEligible) {
        try {
            Class<?> c = provideMetaBeanClass(beanClass, failIfNotEligible);
            if (c==null) {
                return null;
            }
            return beanClass.cast(c.newInstance());
        } catch (IllegalArgumentException e) {
            // not eligible: pass on as documented
            throw e;
        } catch (Exception e) {
            throw new RuntimeException("Failed to create meta bean for type "+beanClass,e);
        }
    }

    // try to provide a meta bean class or return null if not eligible
    private static Class<?> provideMetaBeanClass(Class<?> beanClass, boolean failIfNotEligible) throws Exception {
        // check eligibility
        StringBuilder b = checkEligible(beanClass);
        if (b.length()>0) {
            if (failIfNotEligible) {
                throw new IllegalArgumentException("Cannot construct meta bean for "+beanClass+" because: \n"+b.toString());
            }
            return null;
        }
        String newName = metaBeanName(beanClass);
        // check if the class can be found normally or has been defined previously
        ClassPool pool = ClassPool.getDefault();
        CtClass cc = pool.getOrNull(newName);
        if (cc==null) {
            // ok, need to construct it.
            // start constructing
            cc = pool.makeClass(newName);
            // as derivation of the bean class
            CtClass sc = pool.get(beanClass.getName());
            cc.setSuperclass(sc);

            // override getters
            for (PropertyDescriptor pd : Introspector.getBeanInfo(beanClass).getPropertyDescriptors()) {
                String pn = pd.getName();
                Method g = pd.getReadMethod();
                // crude: write-only, primitive, and array typed properties are left untouched
                if (g!=null && !g.getReturnType().isPrimitive() && !g.getReturnType().isArray()
                    && (g.getModifiers() & (Modifier.FINAL | Modifier.NATIVE | Modifier.PRIVATE)) ==0) {

                    // fetch return type (pool will retrieve or throw exception, if it cannot be found)
                    CtClass rc = pool.get(g.getReturnType().getName());
                    // create the new getter
                    String body = "{"+
                        // record the property name
                        MetaBeans.class.getName()+".note(\""+pn+"\");"+
                        // add a cast as Javassist is not great with generics it seems
                        "return ("+g.getReturnType().getCanonicalName()+") "+MetaBeans.class.getName()+".makeOrNull("+g.getReturnType().getCanonicalName()+".class);"+
                    "}";

                    CtMethod m = CtNewMethod.make(
                        rc,
                        g.getName(),
                        new CtClass[0],
                        new CtClass[0],
                        body,
                        cc
                    );
                    cc.addMethod(m);
                }
            }
            return cc.toClass();
        } else {
            return Class.forName(newName);
        }
    }

    private static String metaBeanName(Class<?> beanClass) {
        String newName = beanClass.getCanonicalName()+"__meta";
        return newName;
    }

    private static StringBuilder checkEligible(Class<?> beanClass) {
        StringBuilder b = new StringBuilder();
        Package pk = beanClass.getPackage();
        if (pk!=null && pk.getName().startsWith("java.lang")) {
            b.append("No meta beans for standard types\n");
        }
        try {
            beanClass.getConstructor();
        } catch (NoSuchMethodException nsme) {
            b.append(beanClass.toString()).append(" has no default constructor\n");
        }
        return b;
    }
 }



On integratedness or the math of updates

Last year, in a talk at Codemotion Berlin (see here) I described as one of the hurdles in keeping development productivity up when systems grow the poor model match between runtime and design time. Turns out that was an awfully abstract way of saying “well something like that”.

At last I am a little smarter now and I’d rather say it’s about the integratedness.

This post is about:

What makes us slow down when systems grow large, and what to do about it?



A lot of things happen when systems grow. And there is more work on this topic around than I could possibly know about. In fact, what I will concentrate on is some accidental complexity that is bought into at some early stage, then neglected, and typically accepted as a fact of life that would be too expensive to fix eventually: The complexity of updates as part of (generic) development turnarounds.

While all projects start small and so any potential handling problem is small as well, all but the most ignorable projects eventually grow into large systems, if they survive long enough.

In most cases, for a variety of reasons, this means that systems grow into many modules, often a distributed setup, and most definitely into a multi-team setup with split responsibilities and not so rarely with different approaches for deployment, operations, testing, etc.

That means: To make sure a change is successfully implemented across systems and organizational boundaries a lot of obstacles – requiring a diverse set of skills – have to be overcome:

Locally, it has to be made sure that all deployables that might have been affected are updated and installed. If there is a change in environment configuration this has to be documented so it can be communicated. Does the change imply a change in operational procedures? Are testing procedures affected? Was there a change in the build configuration? And so on.

Now suppose that for an arbitrary change (assuming complete open-mindedness and only the desire for better structure) there are n such steps that may potentially require human intervention or else an update will fail. Furthermore assume that at each step we have some minimal probability p that we run into failure. Then the probability that an update succeeds is at most:

\[ P(\text{success}) \le (1-p)^n \]


What we get here is a geometric distribution on the number of attempts required for a successful update. That means, the expected number of attempts for any such update is at least:

\[ E[\text{attempts}] \ge \frac{1}{(1-p)^n} = \left(\frac{1}{1-p}\right)^n \]


which says nothing else but that

Update efforts grow exponentially with the number of obstacles.

While the model may be over-simplified, it illustrates an important point: Adding complexity to the process will kill you. In order to beat an increasing n, you would have to exponentially improve in (1-p) which is … well … unlikely.
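
To get a feeling for the numbers, the model can be evaluated for a few values of n. The snippet below is a plain worked calculation under the simplified assumptions above (n independent steps, each failing with probability p); the class name UpdateMath and the chosen value p = 0.1 are arbitrary.

```java
// Worked example for the simplified update model: n steps, each failing with probability p.
class UpdateMath {

    // Upper bound for the probability that all n steps succeed.
    static double successProbability(int n, double p) {
        return Math.pow(1.0 - p, n);
    }

    // Lower bound for the expected number of attempts (mean of the geometric distribution).
    static double expectedAttempts(int n, double p) {
        return 1.0 / successProbability(n, p);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 50}) {
            System.out.printf("n=%d: expected attempts >= %.2f%n", n, expectedAttempts(n, 0.1));
        }
    }
}
```

With p = 0.1, ten steps already mean almost three attempts on average, and fifty steps push that to somewhere around 190 to 200.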

There is however another factor that adds on top of this:

In reality it all works out differently and typically into a sort of death spiral: When stuff gets harder because procedures get more cumbersome (i.e. n grows), rather than trying to fix the procedure (which may not even be within your reach) the natural tendency is to be less open-minded about changes and rather avoid the risk of failing update steps altogether by constricting one’s work to some well-understood area that has little interference with others. First symptoms are:

  • Re-implementation (copy & paste) across modules to avoid interference
  • De-coupled development installations that stop getting updates for fear of interruption

Both of these happen inevitably sooner or later. The trick is to go for later and to make sure boundaries can be removed again later (which is why in particular de-coupling of development systems can make sense, if cheap). Advanced symptoms are

  • Remote-isolation of subsystems for fear of interference

That is hard to revert, increases n, and while it may yield some short-term relief, it almost certainly establishes an architecture that is hard to maintain and makes cross-cutting concerns harder to monitor.

With integratedness of the system development environment, I am referring to small n’s and small p’s. I don’t have a better definition yet, but its role can be nicely illustrated in relation to two other forces that come into play with system growth: The system’s complexity and its modularity. As the system grows, so (normally) does its overall complexity. To keep the complexity at hand under control we apply modularization. To keep the cost of handling under control, we need integratedness:


One classic example of an integrated (in the sense above) development and execution environment is SAP’s ABAP for its traditional ERP use-case. While ABAP systems are huge to start with (check out the “Beispiele” section in here), customers are able to add impressively large extensions (see here).

The key here for ABAP is: Stuff you don’t touch doesn’t hurt you. Implementing a change makes it available right away (n=1 for dev).


  1. Lines_of_Code (Beispiele), German Wikipedia
  2. how many lines of custom ABAP code are inside your system?, SAP SCN
  3. System-Centric Development


The Linda Problem of Distributed Computing

Suppose an important function of your solution is pricing calculation for a trading good.

What is the more appropriate solution approach:

  a) You develop a software module that implements pricing computation
  b) You develop a REST server that returns pricing computation results

I am convinced that more than a few developers would intuitively choose b).

Taking a step back and thinking about it some more (waking your lazy “System 2”) it should become clear that choice a) is much stronger. If you need to integrate pricing computation in a user interface, need a single process deployment solution, AND a REST interface – it’s all simple adaptations of a). While having b) gives little hope for a). So why choose b)?
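
A sketch of why a) adapts so easily: the module is plain code behind an interface, and a remote interface is just a thin layer on top of it. All names and the pricing rule below are invented for illustration, and the actual HTTP server wiring is left out.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Choice a): the pricing computation as a plain module. The rule itself is made up.
interface Pricing {
    BigDecimal price(String articleId, int quantity);
}

class SimplePricing implements Pricing {
    public BigDecimal price(String articleId, int quantity) {
        BigDecimal unit = new BigDecimal("9.99");            // hypothetical flat unit price
        BigDecimal total = unit.multiply(BigDecimal.valueOf(quantity));
        if (quantity >= 10) {
            total = total.multiply(new BigDecimal("0.90"));  // hypothetical volume discount
        }
        return total.setScale(2, RoundingMode.HALF_UP);
    }
}

// A REST-style adapter is a thin mapping of request parameters onto the module.
// It would be invoked by whatever HTTP framework is in use (omitted here).
class PricingRestAdapter {
    private final Pricing pricing;

    PricingRestAdapter(Pricing pricing) {
        this.pricing = pricing;
    }

    String handle(String articleParam, String quantityParam) {
        int quantity = Integer.parseInt(quantityParam);
        return "{\"article\":\"" + articleParam + "\",\"price\":\""
            + pricing.price(articleParam, quantity) + "\"}";
    }
}
```

A user interface or a batch job would call Pricing directly in-process, without paying for a remote hop it does not need.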

This, I believe to be an instance of a “conjunction fallacy”. The fact that b) is more specific, more tangible, more representative as a complete solution to the problem makes it more probable to your intuition.

Back to the observation at hand: Similar to the teaser example above, I have seen more than one case where business functions got added to an integration tier (e.g. an ESB) without any technological need (like truly unmodifiable legacy systems and the like). An extremely poor choice considering that remote coupling is harder to maintain, has tremendously more complex security and consistency requirements. Still it happens and it looks good and substantial on diagrams and fools observers into seeing more meaning than justified.

Truth is:


Distribution is a function of load characteristics not of functional separation


(or more generally speaking: Non-functional requirements govern distribution).

The prototypical reason to designate boxes for different purposes is that load characteristics differ significantly and some SLA has to be met (formally or informally). For many applications this does not apply at all. For most of the rest a difference between “processing a user interaction synchronously” and “performing expensive, long-running background work asynchronously” is all that matters. All the rest is load-balancing.

Before concluding this slightly unstructured post, here’s a similar case:

People love deployment pipelines and configuration management tools that push configuration to servers or run scripts. It definitely gives rise to impressive power-plant-mission-control-style charts. In reality however: Any logical hop (human or machine) between the actual system description (code and config) and the execution environment adds to the problem and should be avoided (as probability of success decreases exponentially with the number of hops).
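As a rough illustration of the exponential decay, assume (and this is a simplification) that every hop succeeds independently with the same probability. The chance that all hops succeed is then that probability raised to the number of hops. The class name and the 95% figure are made up for the example.

```java
/** Success probability of a chain of independent hops (an assumed simplification). */
final class HopChain {
    /** Each hop succeeds with probability perHop; the update works only if all hops do. */
    static double successProbability(double perHop, int hops) {
        return Math.pow(perHop, hops);
    }

    public static void main(String[] args) {
        for (int hops : new int[] {1, 5, 10, 20}) {
            System.out.printf("%2d hops: %.3f%n", hops, successProbability(0.95, hops));
        }
    }
}
```

With hops that each work 95% of the time, ten of them in a row leave a success probability of roughly 60%.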

In fact:


The cost of system update is a function of the number of configurable intermediate operations from source to execution


and as an important corollary:


The cost of debugging an execution environment is a function of the number of configurable intermediate operations from source to execution


More on that another time though.

This post was inspired by “Thinking, Fast and Slow” by Daniel Kahneman, which has a lot of eye-opening insights on how our intuitive and non-intuitive cognitive processes work. As the back cover says: “Buy it fast. Read it slowly.”


Dependency Management for Modular Applications

While working on some rather cool support for Maven repositories in Z2 (all based on Aether – the original Maven dependency and repository implementation), I converted several scenarios to use artifacts stored in Maven Central. This little piece is on what I ran into.

It all starts with the most obvious of all questions:

How to best map Java components (Z2 term) from Maven artifacts?

As Java components (see here for more details) are the obvious way of breaking up stuff and sharing it at runtime, it is natural to turn every jar artifact into a Java component – exposing the actual library as its API. But then, given that we have so much dependency information, can we map that as well?

Can we derive class loader references from Maven dependencies?

Maven has different dependency scopes (see here). You have to make some choices when mapping to class loader references. As compile dependencies are just as directed as class loader references, mapping non-optional compile dependencies to class loader references between Java components seems to be a rather conservative and still useful approach.
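The conservative mapping described above can be sketched as a simple filter. The Dep type and the artifact names in the usage below are hypothetical stand-ins; this is not the Aether API and not how Z2 implements it.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a resolved Maven dependency (hypothetical type, not the Aether API). */
final class Dep {
    final String artifact;   // "groupId:artifactId"
    final String scope;      // "compile", "provided", "runtime", "test", ...
    final boolean optional;

    Dep(String artifact, String scope, boolean optional) {
        this.artifact = artifact;
        this.scope = scope;
        this.optional = optional;
    }
}

final class ReferenceMapping {
    /** Keeps only non-optional compile dependencies as class loader references. */
    static List<String> classLoaderReferences(List<Dep> dependencies) {
        List<String> references = new ArrayList<>();
        for (Dep d : dependencies) {
            if ("compile".equals(d.scope) && !d.optional) {
                references.add(d.artifact);
            }
        }
        return references;
    }
}
```

Everything that is filtered out here (optional, provided, runtime, and test dependencies) has to be made available by some other means, if it is needed at all.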

At first glance this looks indeed useful. I was tempted to believe that a lot of real-world cases would fit. Of course there are those adapter libraries (like spring-context-support) that serve to integrate with a large number of other frameworks and therefore have tons of optional compile dependencies. Using them with only the non-optional compile dependencies on the class path completely defeats their purpose. (For example, spring-context-support is made to hook up your Spring application context with any subset of its target technologies – and only that subset needs to be on the effective class path.)

But there are other “bad” cases. Here are two important examples:

  • Consistency: The Hibernate O/R Mapper library integrates with the Java Transaction API (JTA) – as it definitely should. So it does have a corresponding non-optional compile dependency on JTA, though not on javax.transaction:jta but rather on JBoss’s packaging of it. Spring-tx (the transaction utilities of the Spring framework), however, has a compile dependency on javax.transaction:javax.transaction-api. Using both with the trivial mapping would hence lead to a rather unresolvable “type misunderstanding”, a.k.a. class cast exception. So there are consistency problems around artifacts that have no clear ownership.
  • Runtime constraints: The spring-aspects library, which enables transparent, automatic Spring configuration for objects that are not instantiated by Spring, associates a Spring application context with a class loading namespace by holding on to the application context in a static class member. That is: Every “instance” of the spring-aspects library can be used with at most one application context. Hence, it makes no sense whatsoever to share it.

So there are problems. Then again: Why split and share at runtime with class loader isolation anyway? Check the other article for some reasoning. Long story short: There are good cases for it (but it’s no free lunch).

Assuming we want to and considering the complexities above, we can conclude that we will see some non-trivial assembly.

Here are some cases I came up with:


  1. Direct mapping: Some cases allow retaining the compile dependency graph one-to-one as class loading references. This does, however, require not relying on anything that is expected to be present on the platform. E.g. “provided” dependencies will not be fulfilled, and most likely we end up in case 3.
  2. Isolated Inclusion: spring-aspects assumes that it is a class loading singleton, one for each application context. So, if you want to have multiple application contexts, you will need multiple copies of spring-aspects. This is true for all cases where some scope that can be instantiated multiple times is bound to some static class data.
  3. Aggregation: In the not so rare case that a dependency graph has a single significant root and possibly some dependencies that should be excluded as they will be satisfied by the environment, it is most practical to form a new module by flattening the compile dependency graph (omitting all exclusions) and to satisfy other runtime requirements via platform class loading means. The role model for this is Hibernate.
  4. Adapter Inclusion: Finally, we consider the case of a library that is made to be used in a variety of possible scopes. A great example is spring-context-support, which provides adapters for using Spring with more other tools than you probably care about in any single setup. If spring-context-support were used via a class loading reference, it could not possibly be useful, as it would not “see” the referencing module’s choice of other tools. By including (copying) it into the module, it can.
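The flattening step of the Aggregation case (case 3) can be sketched as a plain graph traversal. The graph representation (artifact name to list of compile dependencies) and all names are assumptions for illustration. Note one design choice of this sketch: whatever is reachable only through an excluded artifact is dropped as well.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Aggregation {
    /**
     * Collects the root and its transitive compile dependencies into one flat set,
     * skipping excluded artifacts (those are expected to come from the environment).
     */
    static Set<String> flatten(String root,
                               Map<String, List<String>> compileDeps,
                               Set<String> excluded) {
        Set<String> module = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            String artifact = todo.pop();
            // Skip exclusions and anything already collected (also guards against cycles).
            if (excluded.contains(artifact) || !module.add(artifact)) {
                continue;
            }
            for (String dep : compileDeps.getOrDefault(artifact, Collections.emptyList())) {
                todo.push(dep);
            }
        }
        return module;
    }
}
```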


If you follow the “normal” procedure and simply assemble everything into one single web application class loader scope, you can spare some thinking effort and get around most of the problems outlined above. If you however have the need to modularize with isolated class loading hierarchies, you need to spend some time understanding toolsets and choosing an assembly approach. This does pay off however, as something that happened accidentally has now become a matter of reasoning.


  1. Z2 Maven Repo Support
  2. Z2 Java Components
  3. Maven Dependency Scopes


User Friendly Production Updates

February is almost over – so little time to spend on this blog. Here is a short post on something cool we did a while back in Z2.

The original request was along the lines of


Can we do a software upgrade without interrupting a user’s work?


This is in the context of the not so insignificant class of applications that require users to operate on some non-trivial but temporary and yet rich state before a persistent state change of the system can be performed. That is, applications that keep non-trivial session state where kicking out users means a real loss of time and nerves and is more than unfriendly.

Still – what if there is an important update to an intranet application that should be applied and, say, it should be tried by some group of users after lunch?

Applying a software upgrade to a running application without interfering with the work progress of currently logged-in users has some natural limitations. For example, smart data migrations will be extremely hard to get right (in a stateful scenario). But anything above the domain level might actually work.

Theoretically the most natural approach would be to temporarily store user session data somewhere, “replace” the application and load the user session data into memory again. Practically speaking however, it will be hard to find complex Java applications that use session serialization and would be assumed to reliably save and restore a user session. Furthermore, during the time of the application restart there would still be some downtime that may be taken as a system failure by users.

So, instead of doing something smart why not do something really obvious:


Leave the application running until the last user has logged off (or was logged off due to session expiration). Present the new application version to all users that log in after the update.


In fact, the approach we took leaves the whole (frontend) application stack running until the last session “running” has become invalid. It is implemented by the Gateway module and described in detail in the wiki. Here is a short summary:

Normally, a Z2 in “server mode” has at least two processes running: A home process that serves as a watchdog and synchronization service plus a webWorker node that runs the (Jetty) Web server and whatever Web applications are supposed to be up.

With the Gateway module this setup is altered (by configuration) in that the actual Web server entry point is now running in the home process and forwarding requests on a by-session scheme to the actual webWorker process. Worker processes, such as the webWorker, can now be detached from the synchronisation procedure. That is, instead of being stopped because of changes, detached worker processes are simply left unaffected by any updates until nobody needs them anymore.


Using that mechanism, users can decide to complete their current work and decide to upgrade at their convenience by logging off and on again.
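The by-session forwarding idea can be sketched in a few lines. This is not the Z2 Gateway code; the SessionRouter class, its methods, and the worker names used with it are hypothetical and only illustrate the bookkeeping: existing sessions stay pinned to the worker they started on, new sessions go to the current worker, and a detached worker can be stopped once no session refers to it.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of by-session request forwarding (hypothetical, not the actual Gateway module). */
final class SessionRouter {
    private final Map<String, String> workerBySession = new HashMap<>();
    private String currentWorker;

    SessionRouter(String initialWorker) {
        this.currentWorker = initialWorker;
    }

    /** After an update, new sessions go to the new worker; the old one is merely detached. */
    void switchTo(String newWorker) {
        this.currentWorker = newWorker;
    }

    /** Known sessions stay on their worker; unknown sessions are bound to the current one. */
    String route(String sessionId) {
        return workerBySession.computeIfAbsent(sessionId, id -> currentWorker);
    }

    /** Called on log-off or session expiration. */
    void endSession(String sessionId) {
        workerBySession.remove(sessionId);
    }

    /** A detached worker may be stopped once no session is pinned to it anymore. */
    boolean canStop(String worker) {
        return !worker.equals(currentWorker) && !workerBySession.containsValue(worker);
    }
}
```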

Finally, this is another cool showcase of how beneficial worker process control within the execution environment is.


Secondary Indexes in HBase

Fundamental problems in solution development can be deceiving – in particular when considered from a conceptual distance (i.e. being somewhat clueless). At one time they look too hard to address as one more of the many side topics one has to keep under control, another time they turn into something completely manageable – only to hit you hard once you turn around again.

One such topic that kept me busy lately is secondary indexing for HBase stored data.

As such, HBase does not have any built-in support for secondary indexes. All data is ordered by row keys and that’s it (in fact, HBase does not store rows but rather versioned columns).

Why is it that HBase does not have secondary indexes while products in the same space, such as Cassandra and Hypertable, do? I would put it this way:

Due to the distributed nature of Bigtable style database systems, there is no good single solution to secondary indexing – but one rather has to choose between trade-offs to find a suitable approach for a given use-case.

(Start your exploration here).

That would not necessarily need to mean that there could not be a variety of implementations available out of the box. Unless there is some stronger type system support and more agreement and experience with the various approaches, I guess, this will however still take some time.

This post is on three different approaches I learned about lately and what I think are their main qualities (with some interesting links).

Global Index Table

This is the most obvious (and still useful) approach. In its simplest version: Given some row r with row key r.key to index by some field value r.val, store a mapping

${r.val} / ${r.key}

as row keys in another table. Finding all rows with r.val==x then simply means scanning for the row keys that start with “x/”. As invalid index entries can be readily identified by checking the row for the expected field value, keeping a consistent index is really easy: Make sure that index entries are written before row updates are written, and clean out stale entries once in a while. So, the implementation is really not that hard and it is effective (see also the Global Index Table references below). Due to its simplicity and robustness, it would be highly desirable to have some utility implementation generally available.

Furthermore it is space-efficient, indexes can easily be regenerated if things go bad, you need no further software but HBase, and you can be really smart about the value to index as it’s all application based.

There are significant downsides as well: Getting a single row via the index takes two roundtrips and, worse, scanning rows via the index implies – in the worst case – randomly distributed “point gets”.



The latter is obviously the most significant downside and if you need to efficiently scan rows via an index this approach is not for you.
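Here is a minimal in-memory sketch of the global index table approach. Two sorted maps stand in for the HBase data table and the index table, so no HBase API is involved; the class and method names are hypothetical, and the sketch assumes that indexed values do not contain the “/” separator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch: sorted maps play the role of the data table and the index table. */
final class GlobalIndexSketch {
    private final TreeMap<String, String> data = new TreeMap<>();   // row key -> field value
    private final TreeMap<String, String> index = new TreeMap<>();  // "val/key" -> row key

    /** Index entry first, then the row: a failure in between only leaves a stale entry. */
    void put(String rowKey, String value) {
        index.put(value + "/" + rowKey, rowKey);
        data.put(rowKey, value);
    }

    /** Scans index keys starting with "value/" and verifies every hit against the row. */
    List<String> findByValue(String value) {
        String prefix = value + "/";
        List<String> rowKeys = new ArrayList<>();
        for (Map.Entry<String, String> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the key range of this value
            }
            String rowKey = e.getValue();
            if (value.equals(data.get(rowKey))) { // the second roundtrip, a "point get"
                rowKeys.add(rowKey);
            }
        }
        return rowKeys;
    }

    /** Removes index entries whose row no longer carries the indexed value. */
    int cleanStaleEntries() {
        int removed = 0;
        Iterator<Map.Entry<String, String>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> e = it.next();
            String rowKey = e.getValue();
            String key = e.getKey();
            String indexedValue = key.substring(0, key.length() - rowKey.length() - 1);
            if (!indexedValue.equals(data.get(rowKey))) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

Updating a row simply writes a new index entry; the old one becomes stale, is filtered out by the verification in findByValue, and is eventually removed by the cleanup.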

Region-level Index

An approach that focuses more on adding efficient-to-write secondary indexes to the database implementation itself is to add region server or region-level indexes. That is, integrated with HBase’s Write-Ahead-Log (WAL), indexes are written directly and atomically on the region server for the data in its local regions. So, every region server holds indexes for its own data. This approach is implemented by HIndex and IHBase.

This is good for the write performance. Reading data via the index does however require contacting each region in question, as there is no way of knowing where not to look for index content (unlike the global index table, which uses regular HBase row key ordering).


Effectively that means that read throughput does not scale up with the number of region servers, but due to parallelization it does not degrade with data size either. So, in a way, like a classic RDBMS it does not scale horizontally in throughput, but unlike a classic RDBMS it does scale horizontally in space.
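The read-side fan-out can be illustrated with a small in-memory sketch. It is not modeled on HIndex or IHBase internals: rows are assigned to regions by hash rather than by key range, updates and deletes are left out, and all names are hypothetical. What it shows is the asymmetry: a write touches one region, a read has to ask all of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One region with a local index that covers only its own rows (in-memory stand-in). */
final class RegionSketch {
    private final Map<String, List<String>> localIndex = new HashMap<>(); // value -> row keys

    void put(String rowKey, String value) {
        localIndex.computeIfAbsent(value, v -> new ArrayList<>()).add(rowKey);
    }

    List<String> lookup(String value) {
        return localIndex.getOrDefault(value, Collections.emptyList());
    }
}

final class RegionLevelIndexSketch {
    private final List<RegionSketch> regions = new ArrayList<>();

    RegionLevelIndexSketch(int regionCount) {
        for (int i = 0; i < regionCount; i++) {
            regions.add(new RegionSketch());
        }
    }

    /** A write touches exactly one region: the one that owns the row key. */
    void put(String rowKey, String value) {
        int owner = Math.floorMod(rowKey.hashCode(), regions.size());
        regions.get(owner).put(rowKey, value);
    }

    /** A read cannot know where matches live, so every region is asked. */
    List<String> findByValue(String value) {
        List<String> rowKeys = new ArrayList<>();
        for (RegionSketch region : regions) {
            rowKeys.addAll(region.lookup(value));
        }
        Collections.sort(rowKeys);
        return rowKeys;
    }

    int regionsContactedPerQuery() {
        return regions.size();
    }
}
```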

Covering Index

Covering indexes are essentially materialized views without a name. That is, instead of fishing for the data via some key indirection, the actual data is copied into the index table which employs row keys corresponding to the indexed data rather than the original row key.

So, index-based queries don’t need to go anywhere but just into the right table where all data is already present. Consequently read performance is great.


Updating the index however is much more complicated now. All copies need to be updated when changing data – unlike the global table approach there is no simple verification possible while still preserving the performance attributes.

While this approach does perfectly preserve the scalability attributes of HBase, indexes may still come at a steep price: Every new index potentially duplicates a significant portion of the original data.
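A covering index can be sketched in memory along the same lines as the global index table, with the difference that the index holds full copies of the rows. Sorted maps stand in for the tables; the names are hypothetical, every row is assumed to contain the indexed column, and the atomicity problem of real index updates is ignored entirely, since both maps are updated in one method call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch: index keys "val/key" map to a full copy of the row. */
final class CoveringIndexSketch {
    private final TreeMap<String, Map<String, String>> data = new TreeMap<>();
    private final TreeMap<String, Map<String, String>> index = new TreeMap<>();
    private final String indexedColumn;

    CoveringIndexSketch(String indexedColumn) {
        this.indexedColumn = indexedColumn;
    }

    /** Every write has to maintain the copy: drop the old index row, store the new one. */
    void put(String rowKey, Map<String, String> row) {
        Map<String, String> old = data.get(rowKey);
        if (old != null) {
            index.remove(old.get(indexedColumn) + "/" + rowKey);
        }
        data.put(rowKey, new HashMap<>(row));
        index.put(row.get(indexedColumn) + "/" + rowKey, new HashMap<>(row));
    }

    /** Answers entirely from the index table; the data table is never consulted. */
    List<Map<String, String>> findByValue(String value) {
        String prefix = value + "/";
        List<Map<String, String>> rows = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the key range of this value
            }
            rows.add(e.getValue());
        }
        return rows;
    }
}
```

The duplicated maps make the space cost visible: each additional index would hold yet another copy of every row.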

Secondary indexing in Phoenix, a SQL implementation for HBase, is based on this approach. And indeed query performance is one of the strong selling points, while there is still great effort put into making sure index updates are essentially atomic (and WAL-integrated) and perform well.


Not being really deep into the implementation details of the more complicated approaches, I must admit that I find Region-Level-Indexing the least attractive option. It looks like something that is asking for trouble in large setups – which is after all exactly what you expect to happen, once you come looking for HBase.

Covering indexes are really interesting – for analytical applications: If you need fast query performance over large data sets, this looks good. In analytical applications you typically know about the dimensions that need to be present in result sets.

For OLTP-style applications scanning via indexes is often a less prominent access method compared to single but complete row retrieval for display and manipulation by a variety of conditions. That is, global index tables may still be the more space-efficient and flexible means to that end, while still being sufficiently read efficient.

Ideally it should be (and quite possibly is already) possible to combine both approaches.



Indexing on HBase

  1. Secondary Indexes and Alternate Query Paths, HBase documentation
  2. Musings on Secondary Indexing in HBase, Jesse Yates
  3. HBASE-2038, HBase issue tracker discussion

Global index tables

  1. Consistent Enough Secondary Indexes, Jesse Yates
  2. Musings on Secondary Indexes, Lars Hofhansl

Region-Level indexing

  1. HIndex at Github
  2. HIndex at Intel
  3. IHBase

Covering Indexes

  1. Phoenix
  2. Phoenix Secondary Indexing

Other indexing tools for HBase

  1. Culvert
  2. Lily

Other Big Table style database products with secondary indexing

  1. Hypertable
  2. Cassandra (and here)