Friday, July 31, 2009

dbMotion


  • similar to: Intersystems (partially)


dbMotion presents itself as a SOA-based platform enabling medical information interoperability and HIE. It is made up of several layers (from data integration, the lowest, to presentation, the highest) which are tied together and expose to the outside a 'unified medical schema' (UMS), a patient-centric medical record. A business layer handles data aggregation.

There are also a few other components, such as shared services (which deal with, among other things, unique patient identification). The UMS is based on the HL7 v3 Reference Information Model. Other features include custom views into the data, data-triggered events, and an EMR gateway.



As I understand it, without having seen an actual deployment, dbMotion's offering is similar to Intersystems' Ensemble without the underlying infrastructure (no Caché included; it relies on the user's database), but with the HealthShare component (so it offers healthcare-specific application infrastructure, whereas Intersystems' offerings are more segmented). What would be the benefit compared to Ensemble? It does not require a whole Caché installation, so it might (?) be cheaper, and the dev skills might be more widespread; it is also more mainstream-RAD. It seems to be a solution for patching together an existing infrastructure, whereas my feeling about Ensemble is that it would perhaps work best with a brand-new setup.

Interestingly enough, dbMotion is developed using the Microsoft stack, and the company is in fact a Microsoft partner.

What I don't quite get from the description is how HL7 interfacing works with dbMotion - the UMS is (perhaps logically) based on the XML-based HL7 v3 RIM, but is there a conversion mechanism for the other versions? What about v2 endpoints?

Oracle

  • similar to: IBM, Microsoft

As far as I can tell, other than platform offerings, Oracle's only healthcare-specific product is Transaction Base, an IHE solution. While the full spec is here, my initial assessment is that it would make sense in an environment with an already-significant Oracle investment. There is a life sciences product as well (Argus Safety Suite), which I believe Oracle just purchased; the other life sciences product, Clinical Data Management, deals with managing clinical trials data.

Interesting, but apparently not as exhaustive as some of the other products discussed here.

Microsoft

  • similar to: Intersystems, Oracle/IBM

Through acquisitions, Microsoft has built an impressive array of offerings in the healthcare space:

  • HIS/PACS-RIS
  • LifeSciences
  • Unified Intelligence System

HIS is pretty clear - direct competition to the Intersystems TrakCare discussed here.

UIS is a data aggregator and is somewhat similar to dbMotion and Ensemble. It integrates with HealthVault as an EMR solution.

LifeSciences is similar to Oracle's and IBM's offerings in that it is a superstructure built on an existing pure technology platform, targeted at the needs of life sciences.

Like Oracle and IBM, Microsoft has arrived at the healthcare apps arena from the pure-tech extreme - leveraging a platform into a specific vertical - quite the opposite of Intersystems, which started with an industry-specific application which it then moved (more or less) downstream as a general-purpose platform.

FairCom C-Tree

  • similar to: SQLite

FairCom is not an illogical choice to follow InterSystems; both companies' databases claim to be among the fastest on the market.

Also, both are "developers'" platforms, designed less with a general-purpose audience in mind and more with a techie audience. Both originate from successful companies that have been in business for a long time, and yet are not so well known outside tech circles.

So what are the differences?

What most people like about FairCom cTree is the access they get to the source code, which allows them to interact with the database through various interfaces: native, ADO, ODBC, etc. I imagine this is also possible with MySQL, SQLite, and perhaps PostgreSQL; FairCom predates (or is a contemporary of) most of these products.

Where FairCom differs from Intersystems is that its product is even less open, the cTree Ace SQLExplorer tool notwithstanding. It takes minimal admin effort and seems targeted at developers of turnkey or embedded systems, with its heavy emphasis on C application-layer programming. You can certainly access cTree from C#, but the product is written in C and has a C developer audience in mind first; if performance is its main selling point (which makes sense: connecting from a JVM through a JDBC/ODBC bridge to, say, a remote Caché gateway which will in turn translate the calls into native requests is probably akin to entering virtual machine hell), then staying close to the core system is compulsory. More on performance later.

Another thing that Caché and C-Tree have in common (but where they also differ) is that they provide different "views" into the database engine: hierarchical/sparse arrays/B-Trees in the case of Caché; c-trees with ISAM and SQL interfaces in the case of C-Tree. Relational databases are based, if memory serves, on B-Trees (or B+ trees). SQL Server, for example, keeps the relational engine very close to the B-Tree structure (time to review those Kalen Delaney books); in fact, I found the whole interaction between set-based SQL and the row-based processing engine quite fascinating.

Both Caché and C-Tree take a slightly different approach; the various interfaces into their storage engines are clearly provided for convenience only. Back in the day, as far as I recall, DB-Library was the library of choice for SQL Server as well (makes you wonder where TDS lives now). The bottom line is that if you are going to use Caché or C-Tree, you should use the native interfaces; otherwise there is no reason to choose C-Tree over a mainstream product such as SQL Server or Oracle, or even MySQL.

C-Tree uses ISAM as its innermost data structure; this harkens back to the mainframe days, and what it means is that data is accessed directly through indexes, as opposed to letting a query optimizer decide which indexes to use (as in a relational database).

As per Wikipedia, ISAM data is fixed-length. Indexes are stored in separate tables, not in the leaves of the data tables. MySQL (with MyISAM) functions on the same principle. A relational mechanism can exist on top of the ISAM structures. A more detailed presentation of the technicalities of working with the system can be found here.

You can see more details of the structure here - how each table corresponds to a data/index file pair.
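The fixed-length-record idea can be sketched in a few lines of Python. This is a toy model of the ISAM principle described above - equal-sized records in a data file plus a separate index mapping keys to record offsets - not FairCom's actual file format:

```python
import struct

# Toy ISAM-style store: fixed-length records in a "data file",
# with a separate "index file" mapping keys to record offsets.
RECORD_FMT = "20s i"                       # 20-byte name + integer value
RECORD_SIZE = struct.calcsize(RECORD_FMT)

data = bytearray()                         # stands in for the data file
index = {}                                 # stands in for the index file

def insert(key: str, value: int) -> None:
    index[key] = len(data)                 # remember the record's offset
    data.extend(struct.pack(RECORD_FMT, key.encode(), value))

def lookup(key: str) -> int:
    offset = index[key]                    # go straight through the index...
    name, value = struct.unpack_from(RECORD_FMT, data, offset)
    return value                           # ...no query optimizer involved

insert("Julian", 125000)
insert("Scott", 98000)
print(lookup("Scott"))                     # -> 98000
```

The point is that every access goes directly through an index the application chose, which is exactly the trade-off against a relational optimizer mentioned above.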


The reason I am likening it to SQLite is that it is a niche product catering to a well-defined group: developers of embedded or turnkey systems (not dissimilar to whom SQLite targets - remember that it is the db of choice for iTunes and Adobe AIR).

Middleware, continued (Intersystems HealthShare/TrakCare)

HealthShare is an extension of Caché and Ensemble into the healthcare vertical. It is called a 'health information exchange', and I am going to try to see whether this matches what HealthVault and Google Health are attempting to be; terminology can be tricky. 'EHR' comes up frequently when describing HealthShare, but is it just an EHR data platform? Or a HIS?

A consortium called IHE, affiliated with HIMSS, attempts to establish interconnectivity standards for the healthcare IT industry. It documents how established standards (DICOM, HL7) should be used to exchange information between clinical applications. HealthShare operates along those lines.

IHE does things such as this:


A cardiology use case

Or as this:



Actors and transactions involved in an electrocardiography IHE "profile"

(images from the IHE Cardiology Profile documentation)

HealthShare organizes data from clinical providers and makes it accessible to clinicians via a web browser. Although it does store some data locally and performs some data transformations, it is essentially a repository of clinical data from one or multiple providers. A similar product that I can think of is Amalga UIS. Some of the components first introduced in Ensemble (gateways) are used here to provide connectivity to the various clinical information sources. HealthVault would be the equivalent of the HealthShare Edge Cache Repository, a store of shared data defined at each clinical data provider's level.

Another component is the Hub, developed in Ensemble, which connects all the data sources together and, among other things, performs patient identification - something I am all too familiar with. I am curious how the Hub is updated (event-based? a day-end process?)

The Edge Cache can replicate some or all of the clinical data from the original sources. At a minimum, it requests data through the gateways of the original sources, at the request of the Hub. It therefore serves another role that I am quite familiar with: that of a backup system for the HIS or practice management system.


(image from the official HealthShare docs)




TrakCare is a web-based HIS; (un?)surprisingly, just like Amalga, it is not available in the US. It covers both financial and clinical apps, and it is built on top of Ensemble. Since it is a full-fledged HIS, its description is beyond the scope of this post, but it can be found here.

The whole Intersystems portfolio of applications can be depicted as follows:



I will try to use this model when dealing with other vendors as well.

A few concluding remarks:
  • this is an integrated stack; you just need the OS, and it gives you a storage system and an application development environment
  • however, the app dev environment isn't for the faint of heart, the VB-inspired offering notwithstanding; and some of the other languages offered are somewhat unusual by today's standards - but this is a throwback to the system's 1960s roots; it must perform quite well, in fact, since it has not gone the way of COBOL (does anyone really use Object COBOL?)
  • the above makes it less known than, say, MS Visual Studio - but the environment is in fact targeted at specialized business developers and not at a mass audience
  • in the verticals that it targets (healthcare, finance) it seems to do quite well - Intersystems, the flagbearer for MUMPS, has been in business for over 3 decades
  • my question would be why there isn't an offering for finance (similar to the healthcare solutions) - perhaps the industry is much more fragmented than healthcare?
  • so the vendor's strategy in this case (Intersystems) is to offer a platform, a development environment, and a foray into an industry vertical. I am not sure which came first (apparently, all at the same time, if you read the history behind MUMPS), while, as we will see, other vendors' routes have been different.

Thursday, July 30, 2009

Middleware, continued (Intersystems Ensemble)

Incidentally, I started a project that will test different databases' performance. The (as yet incomplete) version is here, and the C# class that specifically deals with Caché interaction is here (a subclass of the main database class). The I/O ops are very simple, dealing with only 2 fields.




Ensemble is a RAD platform; it allows users to create workflows, RIAs, and rule-based logic (hence it can work as an interface engine). It contains an application server, a workflow server, and a document server all in one - not surprising, given the Caché platform's own relatively wide array of offerings, on which Ensemble is based. As far as I can tell without having seen the product, it is really a set of extensions built in the Caché environment to provide messaging, workflow, and portal services, with some industry-specific features such as HL7/EDI, plus endpoints for BPEL applications, database access, and other EAI connectors. Ensemble also offers data transformation (SSIS-style? not too difficult to understand, and resulting in federated databases, as already implemented in the Caché application server's external database gateway) and object transformation (Java --> .NET ORBing? I am not sure how this is done; I assume through instantiating VMs for each of the supported platforms and marshaling between the objects).

I assume that messaging is implemented in the Caché application server - not entirely different from the original MSMQ.

As for RAD capabilities (so far I have mostly talked about infrastructure), Ensemble offers some graphical code generators for BPM; I assume it also supports the Caché development environment (ObjectScript, MVBasic, and Basic).

In Microsoft terms, Ensemble is basically VS + SQLServer + SSIS + WCF + WWF + BPEL parser + BizTalk + customizations. In bold, the middleware stack.


On closer inspection, it appears that the inter-VM object conversion is in fact introspection- and proxy-based instantiation of .NET objects, which are made available by Ensemble to Caché's native VMs. Ensemble runs a .NET VM which can execute .NET objects natively through worker threads. I am curious whether this requires a Windows Server to be available at runtime - I am not sure how distributed the Ensemble installation can be.

Middleware (Intersystems Cache)

...as in connected systems, the original title to which this blog just reverted. I just noticed lately that there has been a bewildering proliferation of offerings in this space, some having to do with the Cloud, some having to do with verticals such as health care. So I will try to make sense of some of these things next.

Intersystems offers a database platform (Caché), a RAD toolset (Ensemble), a BI system (DeepSee), and two products specifically targeted at healthcare: an information exchange platform (HealthShare) and a web-based HIS (TrakCare). So if I were to put everything in a matrix comparing different vendors, it would almost have to be a 3D one - one dimension would have to cover the platform, and another (the depth) would have to cover the vertical (healthcare), since, for example, Microsoft offers both the platform and the vertical app.

Caché is a combination of several products, some of which originate in MUMPS, a healthcare-specific programming language developed in the 1960s. MUMPS used hierarchical databases and was an underpinning of some of the earliest HIS developments (Wikipedia is our friend); at some point it ran on the PDP-11, which incidentally was the first computer I ever saw.

It makes one wonder what would have happened had MUMPS become the database standard, as opposed to what would become Oracle, since MUMPS predates R2 (and C, by the way). But the close connection between the language and the database, which might strike some today as strange, goes back to its origins.

Caché's performance stems from its sparse record architecture, and from its hierarchical (always-sorted) structure.

Caché has been modernized to provide an ODBC-compliant interface (and derivatives: ADO.NET) and an object-oriented 'view' of the data and functionality embedded in MUMPS (ObjectScript). The development environment also offers a BASIC-type programming language and a website construction toolkit - quite a lot for one standalone package.

It seems that Caché is a document-oriented database, which would make it similar to an XML database in some ways - the main 'entities' are arrays in one case and nodes in the other, as opposed to relational tables.

At the same time, for a hierarchical database, Intersystems somewhat confusingly portrays it as an "object" database, which is probably not technically incorrect, since one of the views of the data is "object"-based as I mentioned above.

Creating a class in Caché also creates a table accessible in SQL (via the database interfaces, or through the System Management Portal). The table has a few additional fields on top of the class's properties: an ID and a class name (used for reflection, I assume). The System Management Portal also provides a way to execute SQL statements against the database, although at first sight I cannot seem to create a new data source in Visual Studio, and have to access the data programmatically.

One of the ways of using the database from Microsoft Visual Studio requires a stub-creator app - CacheNetWizard, which failed every time I tried to use it. The other is to use the Caché ADO.NET provider:

// cnCache is an open CacheConnection (from InterSystems.Data.CacheClient);
// sql, noRecs, and noRecsRead are defined elsewhere in the class
var command = new CacheCommand(sql, cnCache);
CacheDataReader reader = command.ExecuteReader();
while (reader.Read())
{
    noRecsRead++;
    // stop early once the requested number of rows has been read
    if (noRecs > 0 && noRecsRead >= noRecs)
        break;
}

Running a large operation (a DELETE, in this case) from one client seems to spawn multiple CACHE.EXE processes.

There are several ways of exporting data from Caché: exporting classes, which only exports the class definition (in fact, the table definition), and exporting the table itself to text, which exports the contents.

The multidimensional-array view of Caché reminds me somewhat of the dictionary and array types in languages such as Ruby and Python, while the untyped data elements are also found in SQLite. Arrays can be embedded together to provide, in effect, a sort of materialized view (in SQL terms).
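The dictionary analogy can be made concrete. Here is a rough Python model of a Caché global (my own analogy, not Caché's actual storage): a map keyed by subscript tuples, sparse by construction, reusing the ^SALARY example that comes up later in this post:

```python
# Toy model of a Caché global: a map keyed by subscript tuples.
# Only the nodes that were actually set take up any space.
salary = {}

salary[("Julian", 36, 8)] = 125000.75   # like ^SALARY("Julian",36,8)=125000.75
salary[("Scott",)] = 98000              # subscript depth can vary per node
salary[("Julian", 36)] = "manager"      # interior nodes can hold data too

# Subscripts need not be numeric, no "shape" is declared up front, and
# an n-dimensional array does not cost d1 x d2 x ... x dn slots.
print(len(salary))   # -> 3 nodes stored, whatever "dimensions" are implied
```

A real global is additionally kept sorted (it is a B-Tree underneath), which is what makes the sibling-search traversal described below possible.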

Ultimately, the gateway to Caché's hierarchical engine is the Application Server, which takes care of the virtual machines for each of the supported languages, of the SQL interface, of the object views, and of the web/SOAP/XML accessibility, as well as providing communication mechanisms with other Caché instances and other databases (via ODBC and JDBC). The VMs access Caché tables as if they were variables.

When it comes to languages, Caché offers a BASIC variant and ObjectScript. The BASIC can be accessed from the (integrated) Studio (used for writing code) or from a (DOS-style) Terminal (used for launching code). It operates within the defined namespaces (class namespaces or table schemas). One difference from other variants of the language, due to the tight connection with the Caché engine, is the presence of disk-stored "global" variables, whose names are prefixed by ^; BASIC function names are actually global variables. Another difference is the presence of multidimensional arrays, similar to Python's or Ruby's, but closely related here to the Caché database engine, to which they are a core native feature: hierarchical databases' tables are ordered B-Trees, and these B-Trees provide the actual implementation of the arrays; the SQL "tables" and OO "classes" are just views into these B-Trees/arrays. The arrays do not have to be declared.

The array "index" is nothing else than a notation for the key of the node of the B-Tree. Non-numeric indexes are therefore possible.

Architecturally, I would be curious to know whether these trees are always stored on disk, or whether they are cached in memory and some lazy-writer process commits them to disk at some point.



The image above - which I stole from the official docs, and modified - shows the structure of the tree; 4 is the value that the official example stored in all nodes, but any node can hold any value.

It can be seen that this "array" implementation does not actually need d1 x d2 x ... x dn storage for an n-dimensional array.

This lack of structure allows for small size, but it can also create problems at run time, especially if the consumer and the producer of the array are different; the consumer might not be aware of all the indexes/dimensions of the array. A function, traverse(), can be called recursively to yield all existing subscripts.

If called with the same number of arguments, traverse() does a sibling search. Increasing the number of arguments makes it go down one level; an empty argument yields the first index of the child (quite naturally, since you don't know what that might be at runtime). However, I am still not sure how you can fully discover an array with a potentially unlimited number of dimensions, so the application must enforce at least some structure on the arrays/tables.
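The sibling-search semantics can be mimicked over sorted keys. traverse() itself is real Caché; the walk below is my own Python sketch of its behavior, modeling a global as a dict keyed by subscript tuples:

```python
# Sketch of traverse()-style sibling search over a sparse global,
# modeled as a dict keyed by subscript tuples (my analogy, not Caché code).
g = {
    ("a",): 1,
    ("a", 1): 10,
    ("a", 2): 20,
    ("b",): 2,
}

def next_sibling(g, path):
    """Return the next subscript at path's level, or "" when exhausted.
    An empty last subscript asks for the first child at that level."""
    *parent, last = path
    level = len(path)
    siblings = sorted({k[level - 1] for k in g
                       if len(k) >= level and list(k[:level - 1]) == parent})
    for s in siblings:
        if last == "" or s > last:
            return s
    return ""

print(next_sibling(g, ("a", 1)))    # -> 2   (same arity: sibling search)
print(next_sibling(g, ("a", "")))   # -> 1   (empty argument: first child)
print(next_sibling(g, ("b",)))      # -> ""  (no more siblings)
```

Note that discovery still requires looping until the empty result at every level, which is exactly why a fully unconstrained number of dimensions is awkward to enumerate.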

Now that the actual storage is better understood, it is interesting to see how these features show up in the table/class structure. What is the mechanism that allows for arbitrary indices to pop up at runtime?

A ^global variable is a persistent object of a class and a row in a SQL table; the latter are OO/relational "views" of the B-Tree/array. To answer a question from above: instantiating a new object creates it in memory; opening it (via its ID property) loads it into memory from disk. It is important to understand that an object is a row in a table, i.e. a sub-structure of the tree/array. E.g., in ^SALARY("Julian", 36, 8) = 125000.75, ^SALARY is the entire structure, while ^SALARY("Scott") represents a different person's salary, and hence a different row in the table.

Does the tree's dynamic indexing mean that classes are effectively dynamic as well and can be changed at runtime? Not really. Nor does the SQL table structure change to reflect changes in the underlying array.

As can be seen, the value of the global (^USER) is a pointer to the index of the first element, which is also the $ID column of that row.




Interestingly, adding a ^USERD(2, "Dummy") creates an empty record in the table, while adding a ^USERD(2) actually populates the record. So the second level under ^USERD(2) does not actually show in the table at all. Is this child the next table in the hierarchy?

Mapping the other concepts: the class's package becomes the database schema. Creating a table or a class does not instantiate the ^global (array); that only happens when data populates the array. The array's name becomes package.classNameD.



ObjectScript is another language supported by Caché. It is available from the Terminal (one of the ways of interacting with Caché, besides the Studio and the System Management Portal), where you can directly issue ObjectScript commands - and where you use ObjectScript to launch Basic routines stored in the system. Commands can be abbreviated, which unfortunately makes for unreadable code, as the MUMPS example at Wikipedia shows (it compiled fine in Studio!).

ObjectScript is also an untyped language, allowing for interesting effects such as this:

> set x = 2 --> 2
> set x = 2 + "1a" --> 3, since "1a" is interpreted as 1

System routine names are preceded by %, and routine names are always preceded by ^, as they are globals. Routines can be called from specific (tagged) entry points by executing DO tag^routine. The language is both case- AND space-sensitive.

Creating a class also creates ObjectScript routines, which, as far as I can tell, deal with the database-persisting operations of the class. for allows for argument-list iteration (similar to Ruby?). The language supports regular expressions (through the ? pattern), fairly robust list handling, and an interesting bit-string type (similar to BCD?).

Routines are saved with the .mac extension.

Creating a ^global variable in ObjectScript in the Terminal makes it visible in the System Management Portal under "Globals". However, this does not create a table available in SQL.

"Writing" a global only renders that particular node, e.g. ^z is not the same as ^z(1) (the zwrite command does that). However, killing ^z removes the whole tree.

It can be seen that, not unlike XML (node values vs. attributes), data can be stored in nodes (^global(subscript) = value) or in the subscripts themselves.

There are a couple of handy packages that let you run Oracle/SQLServer DDL to create Caché tables.

There is a lot more about the OO/relational features of Caché that I have not covered; e.g., it is possible to create object hierarchies in ObjectScript, or to have array properties of classes, which become flattened tables or related tables in SQL. More details here, with the caveat that Reference properties appear to be a referential integrity mechanism of sorts, which could perhaps have been implemented more "relationally" through foreign keys (supported by Caché). Caché SQL also supports a pointer-dereferencing type of notation, e.g. SELECT User->Name; I am not sure how useful that is, since most SQL is actually generated by reporting tools, and I don't think Crystal Reports can generate this Caché-specific SQL (I might be wrong; perhaps this is handled in the ODBC/ADO.NET layer).

More on MUMPS' hierarchical legacy here. On OO, XML, hierarchical (and even relational!) databases, here.

This is just a brief overview of several aspects of the Caché platform. Next I will go over the rest of Intersystems' offerings.

Thursday, July 23, 2009

Open Source, Cloud-based Approach to Describing Solution Architectures

Mike Walker discusses in a recent issue of the Microsoft Architecture Journal a set of tools that can be used to document solution architectures - based, not surprisingly, on Microsoft tools. Together, these make up the Enterprise Architecture Toolkit.

Since I don't have a Windows Server to run Sharepoint (I could, presumably, use Azure), I came up with a similar application setup using open source or cloud-based tools:




The only thing that needs to be built is the manager ("gateway", in the chart above), which can be an RIA application whose role is to tie everything together. Sounds simple enough?

Sunday, July 19, 2009

Mobile EHR

It makes sense for mobile carriers to get involved in the EHR arena. However, the way I see this working is via cloud-stored EHR info accessible through a handset; how else would you carry the record should you decide to move to another carrier?

I still think it is far-fetched for a mobile carrier to roll out an entire HIS application, though. Practically all verticals make use of mobile communications one way or another - should mobile communications providers create solutions for everything?

And a 'global' mEHR, while a nice idea indeed, will I think always be hindered by competing standards and lack of acceptance - after all, even the mobile infrastructure worldwide is fragmented (CDMA, GSM, etc.). Why would the application layer be any different?

Worth keeping an eye on though.

Wednesday, July 15, 2009

Google Maps knows where you are

This is way cool: if you connect to the Internet using WiFi, Google Maps 'knows' where you are and shows your location by default.

Slowly it is all coming together - the 'cloud' means that you can keep your data (and processes!) in one place, and you can access it (via WiFi) from anywhere, even using a lightweight client. Also both the client and the cloud backend 'know' where you are so functionality can be tailored to the time/location.

I'm not sure how much computing power is needed on the (portable) client - probably only enough for rich media rendering. Other than specialized applications, most of what an average user really needs should be easily done using a client that combines media, communication, and lightweight computing services. I don't think the iPhone is there yet (as the all-purpose 'client'), but perhaps a combination of iPhone and Kindle, three versions from now, might become just that.

Tuesday, July 14, 2009

XDb


Documentum/EMC offers an XML database named XDb. XProc is in fact designed to work with XDb.

Perhaps 'XML database' is a misnomer. It really is a way of storing XML documents, without (apparently) enforcing any relational integrity constraints other than those defined by the DTD (and perhaps XLink, although so far I don't know whether XLink is declarative only). Therefore XDb and XProc work hand in hand: one allows for the storage of XML documents, the other for the manipulation of those documents (and perhaps in-place updates).

The logical design is therefore done at a higher level. The 'database' concept comes into play when various stored documents are manipulated as sets - XDb supports XQuery (preferred), as well as XPath and XPointer.

Each XML document is stored as a DOM.Document and can be manipulated using the standard methods (createAttribute, createTextNode, etc).
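Those are the standard W3C DOM calls, so Python's built-in DOM gives a feel for the manipulation model (generic DOM code below, not XDb's API; the element names are my own invented example):

```python
from xml.dom.minidom import Document

# Build a small document with the standard DOM calls mentioned above.
doc = Document()
chart = doc.createElement("chart")
doc.appendChild(chart)

attr = doc.createAttribute("patient")      # createAttribute, as in XDb
attr.value = "12345"
chart.setAttributeNode(attr)

note = doc.createElement("note")
note.appendChild(doc.createTextNode("allergy: penicillin"))
chart.appendChild(note)

print(doc.toxml())
# -> <?xml version="1.0" ?><chart patient="12345"><note>allergy: penicillin</note></chart>
```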

I can see a possible usage in, for example, GoogleHealth - where XDb would store well-formatted templates for charts, diagnoses, allergies, vaccines, etc, which would be populated for each patient encounter and loaded into GH.

While in normal usage write contention should not be an issue, I am curious how XDb deals with document versioning and multiple writes against the same documents - or are reads/writes simply throttled through a single pipe? (Later - here it is: clicking Refresh in the Db manager while an update process was underway yielded the following error:)



Interesting XML database reference information here.

Sunday, July 12, 2009

XProc

Documentum's XProc XDesigner is a first step towards what I see as a full online development environment, although for now it is more similar to Yahoo Pipes. The technology is there (web-based GUI + cloud for compilation and possibly even for deployment); I think it's only a matter of tool developers finding a way to monetize it. Is Microsoft really making money on Visual Studio, though?


Wednesday, July 08, 2009

Over mashed-up

It occurred to me that I started this blog to, well, blog about my thoughts on various aspects of computing. A while ago, though, it became my testing ground for various mashups, widgets, embedded code, and so on - mainly because Blogger allows all kinds of code to be inserted, which Wordpress (free hosted Wordpress, that is) doesn't. Anyway, this doesn't make for a nice reading experience, so perhaps it is time I refocused on writing and moved the coding elsewhere.

An interesting experiment

And worth reading... if I can find the time.

FREE (full book) by Chris Anderson

Tuesday, July 07, 2009

Platform Convertor Strategy Analysis

A work in progress, a consulting project that discusses the positioning of a platform convertor.

Sunday, July 05, 2009

Visualization, again

Visualization of data seems to be the new new wave of BI. I already mentioned IBM's offering a few months ago, but there are quite a few other players in this space: startups with interesting products, such as Tableau Desktop and Omniscope; SAP has a product as well (Xcelsius); and so do research institutions (such as Stanford, with Protovis).

What can I say: Tufte meets SQL. And perhaps Processing should get in the game - I'm surprised I haven't seen any rich visualization libraries for it yet.

Thursday, June 18, 2009

Modeling

The RUP (Rational Unified Process) is very nice, and so is UML. For smaller projects, though, the following will do:



I would really like to know how much code is written according to diagrams. The mental image that programmers have of a problem's universe is a fascinating topic indeed - and far reaching, since how a software system works determines, ultimately, how a user has to work to accommodate the system.

Wednesday, June 17, 2009

HealthVault


Yes, it does have an API, and some sample apps. The .NET samples include a web application that talks to the service - however, I am giving up on it for the time being, as the certificate-creation utility seems to crash all the time (nice unhandled error, by the way; the crash seems related to the fact that the app is installed in Program Files as opposed to Documents, and Visual Studio doesn't have full rights to PF). I will come back to it later, but so far it is remarkably similar to Google Health.


One additional thing: the SDK features some device drivers that enable medical devices to talk directly to HV. Nice - as long as they don't cause any crashes...

Tuesday, June 16, 2009

AIR and Google Health

Recently I've been tinkering with AIR and Google Health (GH). It's been surprisingly easy, if one can overlook the endless stream of XML returned by GH - but there is no other way; HL7 would be just as nasty-looking. I don't know yet how it returns the file/image data that can be attached to the GH account.

AIR seems an ideal environment to build a desktop client to front a GH cloud-based application: it's lightweight, Javascript-compatible, and portable across platforms.



Speaking of which, it seems that AIR will be ported to mobile devices as well. I would argue that the paragraph above (and not just the stronger OO features found in Flash and available to AIR) is a strong reason to do this port, although I am not sure how easy it is to develop and maintain AIR applications, nor how well these applications perform given the several layers of virtual environments they have to execute in.

Will write more thoughts as soon as I finish the small-scale project I am working on right now: 4-5 forms of reduced complexity (but with a significant amount of functionality built in; the underpinnings of this relatively simple project are amazingly complex and would have been hard to imagine only a few years ago).

I haven't looked at Google Tables yet, but I have read some things about YQL and can see a scenario where medical (and other) personal information (e.g. reverse phone lookups, credit history, white and yellow pages) is queryable over the web via a SQL type of language, with the right security in place. In fact, the infrastructure already exists! So it would be just a matter of connecting the pipes.

And, I haven't even started to look at HealthVault's API (if it has one).

Wednesday, June 10, 2009

Monday, June 01, 2009

LinkedIn widget


My LinkedIn Profile.

Donkey Kong

I never really played this game in the 80's - it seemed to be available only on computers I did not own, such as the C64. I can, finally - will someone make a Sentinel widgety game available please?



Whatever one thinks of video games, I find it amazing that today you can run what was once a significant programming effort in a 'virtual' OS through several layers of interpreted code (widget > flash > browser > OS process > ...). I wonder how similar the machine code ultimately generated on the OS is to the machine code of the original program :)

Tuesday, April 14, 2009

Slideshare

Some of the academic presentations I have worked on can be found here.
View my profile on slideshare

LiveEarth

Sunday, April 05, 2009

More tweets visualizations

For quite some time now I have been finding visualizations cool. There is a whole list of blogs and web sites dedicated to this rather obscure area of - computing? Web 2.0? It's an Edward Tufte-meets-open source type of thing... and even (SF) author Bruce Sterling is in on the game. And now, even IBM: they too are visualizing tweets (real-time Internet seems to be the in thing now). Still unsure about the usefulness of it all, but it makes for nice mind-map-like charting.

Thursday, March 26, 2009

Visualizing tweets live

I'm not sure how useful it is, but it's certainly cool: Twittervision.

Sunday, March 15, 2009

Virtual Worlds in Asia

Check out this SlideShare Presentation:

Sunday, February 08, 2009

About me....

Or rather, about embedded Google presentations:

Friday, February 06, 2009

Flight Stats widget

Flight Status
By Flight or Route

examples: CX 709 or JFK to LHR
Don't Know the Code?

Tuesday, February 03, 2009

BBC World Music Widget

Who knew that the BBC was so cool?

Tuesday, December 02, 2008

Semantic space

A semantic space explorer. Unfinished, but a great metaphor for exploring all kinds of databanks and databases.

Monday, November 03, 2008

Pricing of information

An interesting piece of research from one of my professors at INSEAD, Markus Christen: it seems that it is more profitable for information providers to offer lower-quality information to their clients. This forces the clients to use multiple providers to arrive at a 'truth', and also allows the providers to raise the prices of the (unreliable) information they offer. This works if there is low correlation between the information offered by the providers, and if there is a relatively low number of providers.
Reminds me of IDG, Gartner, et al. - which, incidentally, are among the examples mentioned in the paper.

Friday, September 26, 2008

Reality?


Not sure what to think about this, other than having my vanity reach new levels as I look at my face in a Blade Runner-esque virtual environment... thanks to ExitReality. The navigation metaphor is ok-ish, I had a similar idea for exploring databases; but I'd guess that most of the information-rich web sites aren't amenable to this kind of visualization.

Super cool toy, though.

Monday, August 18, 2008

Friday, August 15, 2008

Moolah

Nice, Wordpress. To redirect a WP-hosted blog, one must pay $10/year. To use a custom CSS (I hate the font used by the current one I have there), it's another $15. I understand they have to make a living. They should understand I can use Blogger instead.

Applets once were embeddable

Ok most of my recent posts have actually been embedded scripts or badges from various external sites; long live mashups! Today however, I had to deal with a problem I thought had been fixed since 1996: embedding a Java applet in a web page. Such a simple task was made unpleasant by the need to have it done in XHTML and supported across various browsers, including IE, Firefox, and Safari.

Well, the OBJECT tag works slightly differently in each of the above. And it is extremely poorly documented. If you want your JAR to reside in a different directory than the one where the HTML file is, tough luck.

This is plain stupid. Why is it so hard to standardize such a basic premise of the internet? Ok Java applets aren't used that much anymore, but why did they have to go and make something that was once simple, complicated?

Friday, June 13, 2008

Landing page project for my personal web site

Courtesy of Gliffy.

Work in progress:
...here

Image:


Editable chart:


Monday, June 09, 2008

Alexa Site Compare widget

...cannot be pasted here apparently. So it can be found here.

Sunday, June 01, 2008

Twitter's db maint


...is down for database maintenance. DBCC I guess? A bit funny to see yet another reason why the service is down. I can think of several ways to monetize it, but certainly not with the amount of downtime they are having.

Saturday, May 24, 2008

Widgets ecosystem

Alright, widgets are the new rage. Here are your options:

Then there are the 'social sites' plug-in widgets (e.g., for Facebook: the Washington Post's collection of widgets). Also, offerings such as this one that provide the backdrop for widgets to work. Or, for a really useful collection of widgets, look here.

Finally, there are aggregator sites such as these: WidgetBox and SexyWidget.

A cool company that authors widgets.

Saturday, March 22, 2008

Blogger API

Using the Blogger API: of course, from JavaScript you are restricted by the same-origin security requirement. Google has a nice API that allows JS to work... still trying to figure out a way to use this feature. The irony is that the demo I link to reads these very postings.

More programmable web

Lately the 'programmable web' has become more and more of a reality. Amazon, Google, and Adobe among others have made big contributions. Here are some:
  • Amazon S3
    • REST/SOAP based storage
  • Amazon SQS
    • SOAP based queuing
  • Amazon SimpleDB (still in beta)
    • similar to GoogleBase?
    • attribute-based semistructured repository
  • Amazon EC2
    • on demand virtual servers accessible via web services
  • GoogleBase
    • shared repository that can be queried using a SQL-type of language
  • GoogleGears Database
    • browser extension allowing the use of a local SQLite database (SQLite is just taking off: Adobe AIR also uses it)
  • GoogleGears WorkerThread
    • browser extension allowing multithreading
    • the threads are created locally; they are real OS threads spawned within the browser process
    • the threads communicate via text messages
On one hand, Adobe and Google bring web functionality to the desktop (stepping out of the browser sandbox: the AIR environment; GoogleGears); on the other hand, Amazon and Yahoo offer services that extend typical desktop or client/server functionality to the web.
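The WorkerThread model above - real OS threads that communicate only via text messages - can be sketched in plain Python. This is an illustration of the concept, not the Gears API:

```python
import queue
import threading

def worker(inbox, outbox):
    # Like a Gears WorkerThread: a real thread that only talks to the
    # rest of the program through text messages.
    while True:
        msg = inbox.get()
        if msg == "quit":
            break
        outbox.put(msg.upper())   # "process" the text message

inbox = queue.Queue()
outbox = queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

inbox.put("hello")
inbox.put("quit")
t.join()
reply = outbox.get()
print(reply)   # HELLO
```

Restricting the threads to message passing (no shared mutable state) is what keeps this model safe to expose inside a browser.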

I do not think the day is too far away when you will be able to compile a program online (if Office is going the software-as-a-service way, why not Visual Studio?), deploy it to a virtual environment in the cloud, and run it there. Possibly, the only thing that is missing is a set of libraries to abstract the communication protocols (SOAP, XML-RPC, ATOM, etc... still too many).

Friday, March 21, 2008

Google Gears demo

Had some fun with Google Gears. To run this you need Google Gears installed, and it will not work with Safari.

Google Gears demo

Ruby and SQLite3 in Windows

If you need to run Ruby with SQLite in Windows, since you cannot set up the environment in the script like you can in Unix (#!), make sure you have the sqlite3.rb and sqlite3.dll in the same directory where the Ruby script is. Figuring this out caused me much grief.

Thursday, March 13, 2008

Postgres and Ruby







On Mac OS X the Postgres binaries are saved in /usr/local/bin. Here is a handy Postgres launcher that can be saved in the user’s directory (as postgres_start.sh for example):

su -l postgres -c "/usr/local/bin/pg_ctl start -D /Users/postgres/datadir"

Where datadir is the directory (owned by the postgres user) where Postgres has the data files.

To create a custom tablespace in Postgres, make an empty directory first and use that as the location for the tablespace in pgAdmin3.

Sample Ruby code to access Postgres; notice that the first line is needed to set up the environment; without it the require fails.

#! /usr/bin/env ruby
#
# original file src/test/examples/testlibpq.c
# Modified by Razvan
# Calls a PL/pgSQL function in Postgres

require 'postgres'

def main
  norecs = 0
  pghost = "localhost"
  pgport = 5432
  pgoptions = nil
  pgtty = nil
  dbname = "razvan"

  begin
    conn = PGconn.connect(pghost, pgport, pgoptions, pgtty, dbname)

    res = conn.exec("BEGIN")
    res.clear
    res = conn.exec("SELECT * FROM insertrt('another row')")

    if res.status != PGresult::TUPLES_OK
      raise PGError, "RB-Error executing command.\n"
    end

    printf("\nRB-Results\n")
    res.result.each do |tupl|
      tupl.each do |fld|
        printf("RB-%-15s", fld)
        norecs = norecs + 1
      end
    end

    res = conn.exec("END")
    printf("\nRB-Records: %i\n", norecs)
    res.clear
    conn.close

  rescue PGError
    if conn.status == PGconn::CONNECTION_BAD
      printf(STDERR, "RB-Connection lost.")
    else
      printf(STDERR, "RB-Error:")
      printf(STDERR, conn.error)
    end
    exit(1)
  end # rescue
end # def main

main # invoke code

This calls the following Postgres plpgsql function:

CREATE OR REPLACE FUNCTION insertrt(data character varying)
  RETURNS bigint AS
$BODY$
DECLARE
  id bigint;
BEGIN
  id := 0;
  IF EXISTS(SELECT * FROM "RTable") THEN
    SELECT MAX("Id") INTO id FROM "RTable";
  END IF;
  id := id + 1;
  INSERT INTO "RTable" ("Data", "Id")
    VALUES(data, id);
  RAISE NOTICE 'New id is %', id;
  RETURN id;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION insertrt(character varying) OWNER TO postgres;
GRANT EXECUTE ON FUNCTION insertrt(character varying) TO postgres;


To execute this function, do a SELECT * FROM insertrt('parameter'). Interesting, in the function: SELECT ... INTO variable. Also notice the double quotes used to enclose field and table names. The output of RAISE NOTICE is displayed by the Ruby console.

Since gem does not work with OS X Tiger's Ruby, the Postgres adapter has to be built manually.

Tuesday, May 22, 2007

Back again

It has been some time since I managed to post any entries. Currently I am busy learning Ruby; not yet sure how it relates to distributed systems, but I have forever been looking for a language that would not be too syntactically twisted yet powerful enough - this little scripting language might just be it. So watch out for some projects coming soon.

At the same time, I have been playing with OS X's Automator. While the choice of actions is limited (really, who would need to script iCal or GarageBand actions?), the possibilities offered by an all-pervasive workflow engine seem intriguing. I'm not crazy about AppleScript, but then there is a library that enables Ruby to perform AppleScript actions :D

Thursday, December 07, 2006

Roger Wolter lecture notes

A few ideas from an article by Microsoft's Roger Wolter on the MS infrastructure support for reliability in connected systems (in the MS Architecture Journal, vol. 8):

- SOA = connected systems
- services communicate through well defined message formats => reliability = reliability of the communication infrastructure
- message handling between services more complex than client/server because server must make client decisions
- message infrastructures in the MS world: MSMQ, SQLS Broker, BizTalk (also offers data transformation)
- problems:
1. execution reliability (handling volume): different technologies deal with this differently, e.g. stored procedure 'activation' in Service Broker
2. lost message (communication reliability)
3. data reliability

Monday, November 27, 2006

Quartz Composer

For the last few days I have been experimenting with Apple's Quartz Composer. While this is primarily a motion-design tool (and a very powerful one indeed), it is also an example of a very effective graphical programming environment. Prior to this, I had seen such tools in the Windows environment and was less than impressed, but QC is really amazingly powerful; you can parse structures, use variables, loops, and everything somehow fits together very well. I see uses for this metaphor in the BPEL world, at the very least, but the whole world of distributed computing seems a good fit for it.


By the way, this is the 'code' behind one of the rather phenomenal demos that can be found here.

Thursday, November 02, 2006

Web OS part II

To further illustrate the merging of Web, Database, and fileshare services, I'm currently involved in a Sharepoint installation process where data from users' file shares will be moved to the Sharepoint collaborative environment, with a SQL Server database as physical storage. Thus, the Internet replaces the file storage functionality, by delegating the actual storage to the SQL engine.

Saturday, October 21, 2006

MacOS run loops and console mode

Run loops do NOT run automatically in console mode, not even on the main thread. It kind of makes sense: run loops are one of the mechanisms that support GUI events. So you have to create a run loop manually when running in console mode; it can use multiple timers, and it will pre-empt the main thread, whose execution will only resume after the run loop's timers finish running.

Sunday, October 15, 2006

Adventures in multithreading

An insidious race condition arises in the following situation (which I encountered in Objective-C, but any language that passes by reference will allow for the same):

I have a consumer function which writes a message to a file or database and which can be called by multiple threads - so it is LOCKed. This function is called by threads generated by a loop, where the thread instantiation function takes as parameter a string which is modified by each loop iteration. E.g., in pseudo-C:

string msg;
for( i = 0; i < 10; i++ ){
    sprintf( msg, "parameter: %i", i );
    launchThread( msg );
}

void launchThread( string parm ){
    plock lock;
    fprintf( fHandle, parm );
    plock unlock;
}

Of course, since the fprintf needs to be atomic (in order not to generate a bus error), it is the one that has to be locked. However, msg is a shared resource as well. If you run this code as it is, you will get output similar to the following:

parameter: 2
parameter: 2
parameter: 3

Instead of the expected:

parameter: 1
parameter: 2
parameter: 3

That is because msg is modified by the loop and by the time thread #x has picked it up, who knows what value it has - certainly not one in sync with #x.

The solution is to provide as parameter to the thread a full immutable copy of msg and not a reference to msg.
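The fix can be illustrated in Python (my transcription, not the original Objective-C): each thread is handed its own immutable copy of the message, so later loop iterations cannot change what a thread will log.

```python
import threading

lock = threading.Lock()
out = []

def consumer(parm):
    # The write is the critical section, like the locked fprintf above.
    with lock:
        out.append(parm)

threads = []
for i in range(10):
    msg = "parameter: %i" % i      # a fresh, immutable string per iteration
    # Passing a reference to one shared, reused buffer here would
    # reproduce the garbled output above; passing the value itself is safe.
    t = threading.Thread(target=consumer, args=(msg,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

print(sorted(out))
```

The threads may still append in any order (hence the sort), but every thread logs the value it was launched with.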

Monday, September 18, 2006

Again, GoogleMaps

..it seems it's an issue of latency. From a really fast connection it was able to recognize both Tokyo and Hong Kong. London is still un-geocodable though I am afraid.

Objective-C

I have recently started looking into Objective-C. In my experience, one of the biggest hurdles when learning C++ is understanding who does what; textbooks focus on OO and spend a lot of time discussing building classes and coming up with silly examples, and little is said about how this translates into executable code. With C you still have a pretty good idea of how the code becomes machine code. With C++ the connection is broken; a class does not map to registers and you are left with a major gap in the continuum. The same problem occurs (to an even larger extent) with SQL or VM-type languages such as C#, Java, or ActionScript.

The ObjC instructions from Apple are the only ones I have seen so far that do a good job at explaining the runtime, and how OO constructs become procedural code. I am very impressed.

I have a first attempt at writing Mac OS X code here, a Cocoa front end to Unix queues. IPCS seems to not work with queues on Mac, but other than that the system calls seem to work quite well.

Friday, September 15, 2006

GoogleMaps v2

I finally rewrote the Maps application in a more objectified JavaScript. The source code is here and you can see it in action here. It seems that the Google Geocoder also does not recognize London, besides Hong Kong (actually, HK is sometimes recognized! what gives?) and Tokyo.

Flickr (FlickrMaps) uses it in a similar fashion.

Here is the UML Sequence Diagram of the interactions caused by this application:


Thursday, September 14, 2006

SQLite

...is very easy to use. To use from C (assuming gcc is the compiler):

- #include <sqlite3.h>
- compile like this: gcc file.c -lsqlite3 (the library should come after the source file for some linkers)
- you really only need 3 API functions, sqlite3_open, sqlite3_exec, sqlite3_close
- for sqlite3_exec, you need to provide a callback that takes the number of columns, the name of each column, and the value of each column; the callback is called by the library for each component of the resultset
- you can create a database using 'sqlite3 database_name.db' from a shell prompt.
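The same open/exec/close shape exists in Python's built-in sqlite3 module; a quick self-contained sketch, where iterating the result set replaces the C callback:

```python
import sqlite3

# open (":memory:" keeps the example self-contained; a filename works too)
db = sqlite3.connect(":memory:")

# exec - each row comes back as a tuple instead of via a C callback
db.execute("CREATE TABLE t (id INTEGER, name TEXT)")
db.execute("INSERT INTO t VALUES (1, 'one'), (2, 'two')")
rows = list(db.execute("SELECT id, name FROM t ORDER BY id"))
print(rows)   # [(1, 'one'), (2, 'two')]

# close
db.close()
```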

Monday, August 28, 2006

GoogleMaps

Since I will soon have my 'summer' vacation, here is a link to my first GoogleMaps app: a list of places I have been to. It's a V1 thing, hampered by my inadequate JavaScript skills, that I promise to improve.

I'm kind of surprised though that the Google Geocoder does not seem to recognize (cities in) Japan and China?? I even tried their online demo and it cannot find these two locations.

The API is here.

Of databases and connections

A few notes to self regarding SqlClient and OleDb connections:
  • link between program and data

  • opening a connection is expensive

  • hence connection pooling: ADO does not destroy the connection object even after you close it

  • the connection is kept in a pool

  • it is destroyed after a time out interval (60 seconds => disconnect in SQL Trace)

  • reusable connection: one which matches the connection details (data store, user name, password)

  • to turn connection pooling off in OLE DB: append to ConnectionString 'OLE DB SERVICES = -2':

    • significant differences: 6 seconds for 100 000 connections to ....?

    • not using this leaves the connection in SQL logged in at the time of the initial logon even after it is closed and reopened in code

    • the connection disappears from Activity Monitor when the program exits

    • however, if a connection is closed and the program is still running, after a while it disappears from the Activity Monitor (after the time out)

    • if the connection is not closed, it stays open in the Activity Monitor

    • multiple connections are opened if a Connection.Open is issued even if they have the same authentication and data store

  • setting the connection to null/Nothing clears it from the pool (? does not seem to affect the Activity Monitor)

    • if the connection is set to nothing without closing it, it shows in Activity Monitor

    • not closing the connection causes it not to time out even when set to nothing

    • it is not clear what effect has setting the connection to Nothing/Dispose-ing in OleDb

  • in ODBC: use the control panel (how do you turn it off???)

  • using the SqlClient instead of OleDb shows the application in Activity Monitor as .Net SqlClient Data Provider

  • using SqlClient seems to keep the connection alive even after closed for longer than OleDb (does it ever time out?)

  • using OleDb shows the application as the exe not the OleDb Data Provider

Ok, that is a lot of notes to self. I'm investigating this stuff: it's fairly well known, but when you have to debug performance problems every little detail counts and you have to be considerably more careful reading the fine print.
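The reuse-and-expire behaviour in these notes can be sketched generically; this is a toy illustration of the pooling mechanism (names are mine), not ADO's actual implementation:

```python
import time

class Pool:
    """Toy connection pool: reuse by matching details, expire on timeout."""
    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self.idle = {}        # connection details -> (connection, closed_at)
        self.opened = 0       # count of real (expensive) opens

    def open(self, key):
        conn, closed_at = self.idle.pop(key, (None, 0.0))
        if conn is not None and time.monotonic() - closed_at < self.timeout:
            return conn       # reuse: same details, not yet timed out
        self.opened += 1      # otherwise pay for a real open
        return object()       # stand-in for a real connection

    def close(self, key, conn):
        # ADO-style close: park the connection instead of destroying it
        self.idle[key] = (conn, time.monotonic())

pool = Pool(timeout=60.0)
key = ("datastore", "user", "password")
c1 = pool.open(key)
pool.close(key, c1)
c2 = pool.open(key)           # reopened with the same details within 60s
print(c2 is c1, pool.opened)  # True 1
```

This is why a closed connection still shows up in Activity Monitor: 'close' only parks it, and the real disconnect happens when the timeout (or the process exit) catches up with it.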

Which reminds me, each OS should provide some kind of relational/transactional storage service. Unix/Linux/Mac OS already does - SQLite.

Wednesday, August 23, 2006

Web Operating System?

Various web cognoscenti have been ballyhooing the 'OS' concept in a web context: Google OS. This has more to do with the coolness factor of any new software development arena than with actual functionality provided by the respective software.

The Google suite offers the following:

  • GoogleDesktop (supposedly at the core of the 'OS', and a resource hog to boot!)

  • Audacity (audio editing)

  • Orkut (social networking)

  • GoogleTalk

  • GoogleVideo

  • GoogleCalendar

  • Writely(word processing)

  • Gdrive (internet data storage)

Other than Gdrive, none of the above belong to an OS.

Leaving coolness aside, there are genuinely innovative Web-based applications whose complexity is close enough to that of desktop-based applications. For example, computadora.de's shell is not that different from Windows 95's shell as far as the basic functionality it offers. Flash is a kind of Win32 in this case.

On the middle layer, salesforce.com is a good example of application domain functionality provided by a Web-based layer. It should be entirely possible to offer a payroll processing service.

And yes, I have a computadora.de account. You can even upload mp3's there and play them using the integrated mp3 player – which I did, an Alejandro Fernandes song, in keeping with the Mexican origin of the software.

SQL 2005 Endpoints

SQL Server can act as an application server by the means of endpoints (listeners). These can be defined over TCP or over HTTP, and support SOAP, TSQL, service broker, and database mirroring payloads.

To create a SOAP endpoint, create the stored procedures or functions that provide the functionality. Then run a CREATE ENDPOINT ... AS HTTP... FOR SOAP. Important parameters: SITE, and the WEBMETHODs collection.

A SOAP request returns an object array or a DataSet object. The default SOAP wrapper is created by Visual Studio.

This is quite nice. If you need a data-centric web service, just create one directly in SQL. To use it, just define the Web reference in the VS IDE; this creates an object of type SITE with an endpoint member through which you can access the data exposed by SQL Server (e.g., if your SITE parameter was set to 'mySite' and the endpoint was named 'myEndPoint', you get a mySite object with a myEndPoint member, which exposes the functions/stored procedures defined on the SQL Server).

Monday, August 21, 2006

getUserPhotos part II

Zuardi in the previous post means Fabricio Zuardi, and he is the author of a Flickr API kit: a (REST-based) implementation of the Flickr API client in ActionScript. For some reason, he forgot or overlooked to implement one of the methods in the API, getUserPhotos.

Friday, August 18, 2006

Finally, getUserPhotos

I finally completed the missing flickr.urls.getUserPhotos from Zuardi's Flickr library for ActionScript. It was easier than I thought. Here it is:

public function getUserPhotos(api_key:String, user_id:String):Void{
    var method_url:String = _rest_endpoint +
        "?method=flickr.urls.getUserPhotos&api_key=";
    var flickrUrlsObjPointer:FlickrUrls = this;

    if(!api_key){
        throw new Error("api_key is required");
    }
    else
        this._api_key = api_key;

    method_url += this._api_key;

    if( user_id )
        method_url += "&user_id=" + user_id;
    else
        method_url += "&user_id=" + this._user_id;

    this._response.onLoad = function(success:Boolean)
    {
        var error:String = "";
        var isRsp:Boolean = false;

        if( success )
        {
            if( this.firstChild.nodeName == "rsp" )
            /* got a valid REST response */
            {
                isRsp = true;
                if( this.firstChild.firstChild.nodeName == "user" )
                /* got a usable return */
                {
                    flickrUrlsObjPointer._user_photos_url =
                        this.firstChild.firstChild.attributes['url'];
                } // end usable
                else if( this.firstChild.firstChild.nodeName == "err" )
                /* got an error */
                {
                    error = "ERROR CODE:" +
                        this.firstChild.firstChild.attributes['code'] +
                        " msg:" + this.firstChild.firstChild.attributes['msg'];
                } // end error
            } /* end valid REST */
            else
                error = this.firstChild.attributes['code'] +
                    " msg: " + this.firstChild.attributes['msg'];
        } // end success
        else
            error = "Cannot load user photos: " + method_url;

        flickrUrlsObjPointer.onGetUserPhotos(error, flickrUrlsObjPointer);
    } // end onLoad

    this._response.load( method_url );

} // end function



Some of his original coding is a bit grating to a perfectionist such as myself :) However, I find the self-reference (var flickrUrlsObjPointer:FlickrUrls) that enables him to reach the parent object from within onLoad a nice touch. ActionScript 2 is still a mess as far as readability goes, though.

For the uninitiated, all we are trying to do is capture the output of a REST call such as this: http://www.flickr.com/services/rest/?method=flickr.urls.getUserPhotos&api_key=404d98e10174604c8050f4f732e2162e&user_id=66489324%40N00

via an XML object. The expected response is something like this:

<rsp stat="ok">
  <user nsid="66489324%40N00" url="http://www.flickr.com/photos/zzkj/" />
</rsp>
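Outside Flash, the same response is trivial to pick apart; e.g., with Python's standard XML parser (the sample XML below mirrors the response above; the rsp/user/err branching mirrors the onLoad handler):

```python
import xml.etree.ElementTree as ET

rsp = ('<rsp stat="ok">'
       '<user nsid="66489324%40N00" url="http://www.flickr.com/photos/zzkj/" />'
       '</rsp>')

root = ET.fromstring(rsp)
result = ""
if root.tag == "rsp" and root[0].tag == "user":
    result = root[0].attrib["url"]          # the user's photos URL
elif root[0].tag == "err":
    result = "ERROR CODE:" + root[0].attrib["code"]
print(result)
```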

Tuesday, August 15, 2006

I hate embedded Crystal

These last few days I had the dubious pleasure of setting up an integrated Crystal Reports XI solution. Reporting is certainly one of the less glamorous but more crucial aspects of corporate computing. This was a simple RPT-to-PDF converter, yet the compiled distributable clocked in at a hefty 70 MB, mostly due to the dreaded merge modules (a 150+ MB separate download; at least they seem to have fixed the annoying KeyCode bug). Not to mention that in order to get it to work with Visual Studio 2005 I had to download Release 2 - two downloads of 700 and 300 MB, respectively.

Clearly, this is unacceptable. MS Reporting Services all of a sudden makes sense although that is no walk in the park either.

Most of CR's (now Business Objects') heft comes, I think, from the fact that it covers the entire processing workflow - connecting to data, parsing the data streams, rendering the results. It connects to a dizzying array of data sources and may have its own internal query processor for all I can tell. In fact, there would be a nice architectural solution to this if one considers that reporting is really just the basic processing of a firehose data stream. If vendors could come up with and agree on an ODBC-type interface, the life of report-tool creators (and of their programmer users) would be so much easier. Everything that Crystal connects to today, for example, could be a data store supporting a 'reporting' interface. Querying the data store via this interface (if binary objects can describe their supported methods via IUnknown, certainly data can describe its own structure!) would result in XML output that could ideally be streamed directly to a XAML visual layer.

It would be nice.

Wednesday, August 09, 2006

A web crawl algorithm

For a while now I have been very intrigued by web crawlers. After all it’s the stuff of hacker movies… So a few nights ago I came up with a nice little algorithm.

Starting with a given URL, we want to open the web page found at that URL, build the list of pages it references, and do the same for each page referenced therein. Of course, since this could potentially scan the entire www, I decided to limit the actual exploration to pages in the same domain (the other pages will be terminal nodes). And, I wanted to build an unordered list of references; so the first output would be a list of all the pages found this way, and the second a list of page pairs (unordered: if page A href’s page B and page B href’s page A, I wanted only one A,B pair to be shown).

It’s done like this: we start with two (empty) collections, one for the pages (indexed by the page’s link), the other for the page pairs (the first one is an object collection, the second, an object reference collection).

We add the root to the object collection. Then we call the root’s discover method, which:

  • builds a list of links in the document (this is the weak or rather inelegant point of the algorithm; I am using regular expressions to extract the links, and problems stem from the variety of ways in which a href can be coded in HTML: href=www, href=’www’, and there can be relative and absolute links);
  • for each link, if the respective address exists in the object collection (remember, this collection is indexed by the (object) page’s link), add the pair (the current page, at this step, the root, and the page identified by the link) to the pairs collection (if: the link does not refer to the current node, to avoid self-referential links, and if the pair does not already exist in the reverse order);
  • if the address does not exist, create a new page object, add it to the objects and to the links collection, and call this object’s discover method (unless the page points to a different domain, or to a non-readable object such as a jpeg).

Nice object recursion. Again, most of the code deals with reading the HTML stream, parsing the href’s, etc, the algorithm itself takes a mere 30 lines or so. I implemented it in .NET and after banging my head against the wall with a few limit cases, I got it to work very nicely, and a Java implementation would be just a transcription. I’ll look into porting it to Actionscript next.

Actually I could have used only one collection. Instead of inspecting the objects collection I could have inspected the pairs collection (after making it a full object collection). This would have been a more awkward and time consuming search, since each page object could be in that collection multiple times, whereas in the current object collection it is found only once.

I am not sure how else would you do a crawler (probably, the Google search index algorithm employs a similar crawling methodology) without being able to DIR the web server. Which raises the question: would a page that is never referenced by any other page ever be found by Google?
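A minimal Python transcription of the algorithm, run against a hard-coded single-domain 'web' instead of live HTTP (the regex link extraction and the two collections are as described above; I also record the pair when a brand-new page is discovered, which the write-up leaves implicit):

```python
import re

# a toy single-domain "web": link -> HTML body
WEB = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">back to a</a>',
    "/c": '<a href="/c">self link</a>',
}

HREF = re.compile(r'href="([^"]+)"')  # the inelegant regex extraction

def crawl(root):
    pages = {}       # the object collection, indexed by the page's link
    pairs = set()    # the unordered page-pair collection

    def discover(link):
        pages[link] = True
        for target in HREF.findall(WEB.get(link, "")):
            if target == link:
                continue                        # skip self-referential links
            if target in pages:
                if (target, link) not in pairs: # skip reverse duplicates
                    pairs.add((link, target))
            else:
                pairs.add((link, target))       # record the new pair...
                discover(target)                # ...and recurse into the page

    discover(root)
    return sorted(pages), sorted(pairs)

print(crawl("/a"))   # (['/a', '/b', '/c'], [('/a', '/b'), ('/a', '/c')])
```

Note how B's back-link to A produces no second pair: (A, B) already exists, so the reverse is suppressed, exactly as intended.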

Here’s hoping this impresses Sandra Bullock (The net) or Angelina Jolie (Hackers).

Saturday, August 05, 2006

REST

There is a whole philosophy behind REST. This is worth a read. The world of software is not free of philosophies.... GOTO, Unix, open source, and now procedure calls.