Wednesday 26 January 2011

Low latency distributed parallel joins

When MySQL AB bought Sun Microsystems in 2008 (or did Sun buy MySQL?), most of the MySQL team merged with the existing Database Technology Group (DBTG) within Sun. The DBTG group had been busy working on JavaDB, Postgres and other DB related projects as well as 'High Availability DB' (HADB), which was Sun's name for the database formerly known as Clustra.

Clustra originated as a University research project which spun out into a startup company and was then acquired by Sun around the era of dot-com. A number of technical papers describing aspects of Clustra's design and history can be found online, and it is in many ways similar to Ndb Cluster, not just in their shared Scandinavian roots. Both are shared-nothing parallel databases originally aimed at the Telecoms market, supporting high availability and horizontal scalability. Clustra has an impressive feature set and many years of development behind it, but limited exposure to general purpose use.

At the time of the MySQL acquisition, HADB/Clustra was embedded in a number of Sun products as a session store and metadata repository, but was not available for external customers for general purpose use. Shortly afterwards, a decision was made to move HADB into a 'sustaining' model, and most of the ex-HADB team then became available to work on other projects. MySQL has greatly benefited from the injection of skills and enthusiasm from the Sun DBTG across a number of different teams, which is maybe not well known to those outside the company.

In the Cluster team, one project which has really benefited is the SPJ (Select Project Join) project, which couldn't have happened without the expertise and energy of the ex-Clustra/HADB team working on it.

The SPJ project started around the time of the last MySQL Developers conference in Riga in September 2008. The intention at the time was to look at ways of efficiently supporting more complex queries, specifically involving table joins, reducing unnecessary data transfer, communication latencies and context switches and increasing parallelism.

The main insights at the start of the project included :
  • Join mechanisms should be based on linking existing NdbApi single-table access primitives
  • Join result sets need not and should not be fully materialised at the data nodes
  • Join mechanisms need not be fully general or fully capable initially, as full generality/capability is already available with the existing Apis
  • Small joins should be targeted (Number of rows, number of tables)

These points have simplified the project scope greatly, and allowed many painful and costly detours to be avoided.

SQL execution in MySQL Cluster

As described in a previous post, Ndb Cluster was originally designed to quickly execute small queries with a high update rate at low latency. Larger more complex queries were executed by a separate query processor. The emphasis in this design is that complex queries are possible, but not necessarily fast or efficient. A main goal is that complex queries do not adversely affect the properties of the high volume, low latency requests.

All data access in MySQL Cluster is via the NdbApi interface. The NdbApi gives access to data stored in tables in the Cluster via four primitives :

  • Read row by primary key
  • Read row by secondary unique key
  • Scan a range of rows in an ordered index with optional conditions
  • Scan all rows in a table with optional conditions

Each of these primitives operates on a single table, and any table joins must be done by the NdbApi user. Different primary key/unique key operations and individual scan operations run in parallel across the data nodes in a cluster. NdbApi allows different operations and scan requests to be batched together to minimise latency due to communication delays, and it is essential to use this batching to get minimal latencies with Ndb.

In MySQL Cluster, attached MySQL Servers act as query processors, and MySQL's SQL execution engine breaks complex queries down into calls to its generic Storage Engine (SE) Api, which also deals with data access one table at a time.

The Ndb storage engine then further decomposes these SE Api calls into NdbApi primitive operations on individual tables.

MySQL supports SQL queries by performing the SE Api calls to read data, then comparing and matching results, sorting, buffering and reformatting. This works very well and gives MySQL Cluster great SQL functionality and compatibility, although users may find that the latency of their individual queries is not as low as with other engines such as MyISAM and InnoDB, which do not have to perform inter-process communication to implement their SE Api calls.

For minimum latency, Ndb requires that the MySQL Server makes efficient requests for data, requesting as much data as possible at once, and not using the data until it is essential to make forward progress - e.g. when there is a real data dependency.

MySQL features such as Insert, Update and Delete batching, and Batched Key Access minimise the MySQLD to Data node round trips required to execute certain types of operations, but they are unable to help when there are real data dependencies in a query. For example when the server needs to read some value from table t1 to know which rows to read from table t2 then there is no alternative but to read the t1 rows into memory before issuing any reads from t2.

Linked Operations

To reduce the need for extra Api to Data node round trips for every data dependency, we must allow operations to be linked. If we can describe the data dependency as a link between NdbApi operations, then it can be resolved amongst the data nodes. For example, rather than stating :


SQL > select t1.b, t2.c from t1,t2 where t1.pk=22 and t1.b=t2.pk;
ndbapi > read column b from t1 where pk = 22;


[round trip]


(b = 15)
ndbapi > read column c from t2 where pk = 15;


[round trip]


(c = 30)
[ return b = 15, c = 30 ]


We would state the join/operation linkage at the ndbapi level :


ndbapi > read column @b:=b from t1 where pk = 22;
read column c from t2 where pk=@b;


[round trip]


(b = 15, c = 30)
[ return b = 15, c = 30 ]



We allow read operations to be parameterised on the results of previous operations, and have the linking of the operations, the flow of results into parameters, handled by the data nodes. The data dependency still results in some execution serialisation at the data node layer, but not at the api layer, so data dependencies within queries needn't result in extra round trips between the MySQL server and the data nodes. Where the dependent data happens to be on the same data node, the dependency can be resolved with no inter-process communication at all.

Viewing the database software as a stack, the execution of the join is being 'pushed down' the stack, to a lower layer. For this reason, the SPJ functionality is also sometimes referred to as pushed-down joins or just pushed joins. Pushing functionality closer to the data can result in improved performance due to lower latency, reduced data transfer etc. In the case of MySQL Cluster, it can avoid inter-process communication, as well as enable parallelism across the data nodes.

In theory, linking can occur between any of the 4 primitive operation types :
  • Primary key access (PK)
  • Unique key access (UK)
  • Ordered index range scan (OI)
    (Range bounds and optional conditions parameterised)
  • Table scan (TS)
    (Optional conditions parameterised)

In practice, linking the cardinality (0|1) operations (Primary key, Unique key) together is simpler than linking with the scans. In turn, linking a scan to a cardinality (0|1) operation is simpler than linking a scan to another scan.

Linking a table scan to a table scan results in a cross-join and is probably going to be unpleasantly expensive for anything other than small tables.

The initial SPJ implementation supports combinations of Primary/Unique key operations linked together with at most one ordered index scan.

A future implementation will support multiple ordered index scans in a single request. This is more complex to handle due to the buffering required of the different scan result sets, and the resulting result ordering versus efficiency tradeoffs.

The SPJ Api is implemented as an extension to the existing NdbApi, with similar primitive concepts, but with the addition of the means to link the primitives together. As with the existing NdbApi, the usage pattern is along the lines :

  • Define operation(s)
  • Define further linked operation(s)
  • Execute() // One round trip to the data nodes
  • Examine results

In terms of batching, a tree of linked operations, with one const-parameterised root operation, and one or more child operations, is considered to be a single operation. Multiple SPJ operations, each actually a tree of primitive operations, can be executed simultaneously in a batch, along with other 'basic' NdbApi operations.

Where a scan is included, the scan can be advanced using the normal nextResult() mechanism, which also advances the results returned by any cardinality (0|1) child operations.

NoJoins - Not only Joins

While the SPJ extensions are described here in terms of joins, at the NdbApi level they are really 'linked operations'. One design goal which is not completely aligned with the join concept was to allow scans of multiple different tables to be parameterised on a single root operation. For example :

  • read @eid:= entity_id from map_table where username="jan";
  • scan blog_titles from blog_posts where entity_id=@eid;
  • scan latest_tweets from twitter_feed where entitiy_id = @eid;
  • scan share_prices from stock_feed where entity_id = @eid;
  • ....
Here there is a data dependency between the first lookup and n peer child scans. I want to read all of this data in one round trip, but I don't necessarily want to have to express this in a single 'join' query. If we had a more relational/SQL oriented Api we might have had to create some unholy union of the different results, with masses of repeated values or nulls, or repeat the first lookup for each of n two-way joins.

With the linked operation concept, we can clearly state that the child scans are parameterised by the first lookup, without having to introduce some further unnatural coupling between the rows returned by each scan, which are otherwise independent.

So although SPJ is named after and described as supporting joins, it doesn't mean that you have to be 'join-oriented' or a SQL Samurai to benefit from it. It may be quite useful for efficiently traversing graphs, hierarchies and other links between rows where the concept of a 'join' is quite alien.

Check it out

A mysql-5.1-cluster-7.1 source tree with the SPJ enhancements can be downloaded from here. You can see the NdbApi extensions in the storage/ndb/include/ndbapi directory of the source tree. This source also includes extensions to the MySQL Ndb handler to make use of the new SPJ Api for SQL queries, which I hope to describe a little next time. If you want to download and try out SPJ then see some of the other blog posts about how to get started with it.