Thursday, July 30, 2009

Middleware (Intersystems Cache)

...as in connected systems, the original title to which this blog just reverted. I just noticed lately that there has been a bewildering proliferation of offerings in this space, some having to do with the Cloud, some having to do with verticals such as health care. So I will try to make sense of some of these things next.

Intersystems offers a database platform (Caché), a RAD toolset (Ensemble), a BI system (DeepSee), and two products specifically targeted at healthcare, an information exchange platform (HealthShare) and a web based IS (TrakCare). So if I was to put everything in a matrix comparing different vendors, it would almost have to be a 3D one - one dimension would have to cover the platform, and another (the depth) would have to cover the vertical (healthcare), as for example, Microsoft offers both the platform and the vertical app.

Caché is a combination of several products, some of which originate in MUMPS, which is a healthcare-specific programming language developed in the 1960's. MUMPS used hierarchical databases and was an underpinning of some of the earliest HIS developments (Wikipedia is our friend); at some point it ran on PDP-11, which incidentally was the first computer I did ever see.

It makes one wonder what would have happened had MUMPS became the database standard as opposed to what would become Oracle, as MUMPS predates R2 (and C, by the way). But the close connection between the language and the database, which might strike some today as strange, goes back to its origins.

Caché's performance stems from its sparse record architecture, and from its hierarchical (always-sorted) structure.

Caché has been modernized to provide a ODBC-compliant interface (and derivatives: ADO.NET) and an object-oriented 'view' of the data and functionality embedded in MUMPS (ObjectScript). The development environment also offers a BASIC-type of programming language and a website construction toolkit, quite a lot for one standalone package.

It seems that Caché is a document-oriented database, which would make it similar to a XML database in some ways - the main 'entities' are arrays in one case, nodes in the other as opposed to relational tables.

At the same time, for a hierarchical database, Intersystems somewhat confusingly portrays it as an "object" database, which is probably not technically incorrect, since one of the views of the data is "object"-based as I mentioned above.

Creating a class in Caché also creates a table accessible in SQL (via the database interfaces, or through the System Management Portal). The table has a few additional fields on top of the class' properties - an ID and a class name (used for reflection, I assume). The System Management Portal also provides a way to execute SQL statements against the database, although at first sight I cannot seem to create a new data source in Visual Studio - and have to access the data programmatically.

One of the ways using the database from Microsoft Visual Studio requires the use of a stub creator app - CacheNetWizard, which failed every time I tried to use it. The other is to use the Caché ADO.NET provider:

command = new CacheCommand(sql, cnCache);
CacheDataReader reader = command.ExecuteReader();
while (reader.Read())
{
noRecsRead++;
if (noRecs > 0 && noRecsRead >= noRecs)
break;
}

Running a large operation (a DELETE, in this case) from one client seems to spawn multiple CACHE.EXE processes.

There are several ways of exporting data from Caché - exporting classes, which only exports the class definition (in fact, the table definition) and exporting the table itself to text, which exports the contents.

The multidimensional array view of Caché reminds me somewhat of dictionary and array types in languages such as Ruby and Python, while the untyped data elements are also used in SQLite. Arrays can be embedded together to provide a sort of materialized view (in SQL terms) in effect.

Ultimately, the gateway to Caché's hierarchical engine is the Application Server, which takes care of the virtual machines for each of the supported languages, of the SQL interface, of the object views, and of the web/SOAP/XML accessibility, as well as providing communication mechanisms with other Caché instances and other databases (via ODBC and JDBC). The VM's access Caché tables as if they were variables.

When it comes to languages, Caché offers a BASIC variant and ObjectScript. The BASIC can be accessed from the (integrated) Studio (used for writing code) or from a (DOS-based) Terminal (used for launching code). It operates within the defined namespaces (class namespaces or table schemas). A difference from other variants of the language, which is due to the tight connection with the Caché engine, is the presence of disk-stored "global" variables, whose name is prefixed by ^; BASIC function names are actually global variables. Another difference is the presence of multidimensional arrays, similar to Python or Ruby, but which in this case are closely related to the Caché database engine (to which they are a core native feature - hierarchical databases' tables are ordered B-Trees, and these B-Trees provide the actual implementation of arrays; the SQL "tables" and OO "classes" are just views into these B-Trees/arrays); they do not have to be declared.

The array "index" is nothing else than a notation for the key of the node of the B-Tree. Non-numeric indexes are therefore possible.

Architecturally, I would be curious to know if these trees are always stored on disk, or they are cached in memory and some lazy-writer process at some point commits them to disk.



The image above - which I stole from the official docs, and modified - shows the structure of the tree; 4 is a value that the official example stored in all nodes, but any value in any node can be anything.

It can be seen that this "array" implementation actually does not need the d1 x d2 * d3 * ... dn storage for a n-dimension array.

This lack of structure allows for small size but it also can create problems at run time, especially if the consumer of the array and the producer are different; the consumer might not be aware of all the indexes/dimensions of the array. A function exists, traverse(), which can be called recursively to yield all existing subscripts.

If called with the same number of arguments, traverse() does a sibling search. An increase of the number in arguments will make it go down one level; an empty argument will yield the first index of the child (quite naturally, since you don't know what that might be at runtime). However I am still not sure how you can fully discover an array with a potentially unlimited number of dimensions, so the application must enforce at least some structure to the arrays/tables.

Now that the actual storage is better understood, it is interesting to see how these features show up in the table/class structure. What is the mechanism that allows for arbitrary indices to pop up at runtime?

A ^global variable is a persistent object of a class and a row in a SQL table; the latter are OO/relational "views" of the B-Tree/array. To answer a question from above, instantiating a new object creates it in memory; opening it (via its ID property) loads it in memory from disk. It is important to understand that an object is a row in a table. This is a sub-structure of the tree/array, e.g. ^SALARY("Julian", 36, 8) = 125000.75: ^SALARY is the entire structure, and ^SALARY("Scott") represents a different person's salary, and a different row in the table.

Does the tree's dynamic indexing means that classes are effectively dynamic as well and can be changed at runtime? Not really. Neither does the SQL table structure change to reflect changes in the underlying array.

As it can be seen, the value of the global (^USER) is a pointer to the index of the first element, which also is the $ID column of that row.




Interestingly, adding a ^USERD(2, "Dummy") creates an empty record in the table, and adding a ^USERD(2) actually populates the record. So the second level in the ^USERD(2) does not actually show in the table at all. Is this child the next table in the hierarchy?

Mapping the other concepts, the class' package does become the database schema. Creating a table or a class does not instantiate the ^global (array), that only happens when data populates the array. The array's name becomes package.classNameD.



ObjectScript is another language supported by Caché. It is available from the Terminal (one of the ways of interacting with Caché, besides the Studio and the System Management Portal), where you can directly issue ObjectScript commands - you use ObjectScript to launch Basic routines stored in the system. Commands can be abbreviated, which unfortunately makes for unreadable code, as the MUMPS example at Wikipedia shows (it compiled fine in Studio!).

ObjectScript is also an untyped language, allowing for interesting effects such as this:

> set x = 2 --> 2
> set x = 2 + "1a" --> 3, since "1a" is interpreted as 1

System routine names are preceded by %, and routine names are always preceded by ^ as they are globals. Routines can be called from specific (tagged) entry points by executing DO tag^routine. The language is case- AND space-sensitive.

Creating a class also creates ObjectScript routines, which, as far as I can tell, deal with the database persisting operations of the class. for allows for argument list iteration, (similar to Ruby?). It supports regular expressions (through the ? pattern), a fairly robust support for lists, and an interesting type named bit-string (similar to BCD?).

Routines are saved with the .mac extension.

Creating a ^global variable in ObjectScript in Terminal makes it visible in the System Management Portal under "Globals". However, this does not create a table available in SQL.

"Writing" a global only renders that particular node, e.g. ^z is not the same as ^z(1) (the zwrite command does that). However, killing ^z removes the whole tree.

It can be seen that, not unlike with XML (node values vs. attributes), data can be stored in nodes (^global(subscript) = value), or in the subscripts themselves.

There are a couple of handy packages that let you run Oracle/SQLServer DDL to create Caché tables.

There is a lot more about the OO/Relational features of Caché that I have not covered; e.g., it is possible to create objects with hierarchies in ObjectScript, or have array properties of classes, that become flattened tables or related tables in SQL. More details here, with the caveat that Reference properties appear a referential integrity mechanism of sorts which could perhaps have been implemented more "relationally" through foreign keys (supported by Caché, but Caché SQL also supports a pointer dereferencing type of notation, e.g. SELECT User->Name; I am not sure how useful that is since most SQL is actually generated by reporting tools - and I don't think Crystal Reports can generate this Caché-specific SQL; I might be wrong, perhaps this is dealt with in the ODBC/ADO.NET layer).

More on MUMPS' hierarchical legacy here. On OO, XML, hierarchical (and even relational!) databases, here.

This is just a brief overview of several aspects of the Caché platform. Next I will go over the rest of Intersystems' offerings.