Business Intelligence Products and Trends: columnar

Saturday, July 09, 2011

Discovering hierarchies in columnar databases

I recently blogged about columnar and Wayne Eckerson asked me for a clearer explanation of what I mean by columnar databases "discovering hierarchies".

For example, consider the approach of two well-known products: IBM Cognos TM1, which is multidimensional, and QlikView, which is columnar.

My definition of a data model is a structure that is informed by an administrator, or set down in the master data. To me this is different to a structure derived from analyzing the transactions. In the following simple example, let's say I have two sales teams, one for dental hygiene products and one for soap.

If I were designing a data model in TM1, then I could create a hierarchy, which is a set of parent-child relationships between the departments and the products they sell. If the soap people cross-sold some toothpaste, it would have no effect on the hierarchy, because the hierarchy is predetermined by my idea of how my company is supposed to work.

If I were to import the same data in QlikView I could create a report that showed me the relationship between the sales teams and the products without defining the model. Once the data is imported, QlikView recognizes the relationships automatically.

When the soap guys cross-sell toothpaste, QlikView discovers the new relationship, but the hierarchies stay the same in TM1, because that's how I defined the model. To me this is the key difference. On the one hand the structures come directly from the actuals, and on the other hand they reflect my predefined perception (or "model") of what is going on.

So columnar databases typically discover the relationships automatically, and multidimensional databases allow you to define the relationships as you want them. Another way to look at this is that the transactional data drives the master data structure in a columnar database, but those structures are wired into the multidimensional model.
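To make the contrast concrete, here is a minimal Python sketch of the soap and toothpaste example. It is only an illustration of the idea, not how TM1 or QlikView work internally. The team names, product names, amounts and the discover_hierarchy helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical predefined hierarchy (the "model"): each team and the
# products it is supposed to sell. It is set by an administrator and
# does not change when new transactions arrive.
PREDEFINED = {
    "Dental": {"toothpaste", "floss"},
    "Soap": {"bar soap", "liquid soap"},
}


def discover_hierarchy(transactions):
    """Derive team -> products purely from (team, product, amount) rows."""
    found = defaultdict(set)
    for team, product, _amount in transactions:
        found[team].add(product)
    return dict(found)


# Made-up actuals, including one cross-sale by the soap team.
transactions = [
    ("Dental", "toothpaste", 120.0),
    ("Dental", "floss", 40.0),
    ("Soap", "bar soap", 75.0),
    ("Soap", "toothpaste", 15.0),  # the soap team cross-sells toothpaste
]

discovered = discover_hierarchy(transactions)

# Relationships that show up in the actuals but not in the model.
unexpected = {
    team: products - PREDEFINED.get(team, set())
    for team, products in discovered.items()
    if products - PREDEFINED.get(team, set())
}
```

In this sketch the discovered structure picks up the Soap to toothpaste relationship as soon as the transaction exists, while PREDEFINED stays exactly as it was defined.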

So which approach is better? It depends on the application.

Thursday, June 30, 2011

Column oriented databases are not the same as in-memory databases

In recent years, thanks not least to aggressive marketing by QlikTech (or Qlik Technologies, as they are now often called), Tableau and Tibco Spotfire, columnar databases and in-memory databases have become very fashionable. Microsoft's VertiPaq engine, which is behind the PowerPivot product, is a good example of a tool that came in on the wave of this trend.

One result is that there seems to be some confusion about what the terms "in-memory" and "column oriented" mean, and attributes of one are often ascribed to the other.

Just to be perfectly clear: A columnar database is not necessarily in-memory, and an in-memory database is not necessarily columnar.

In-memory is a somewhat vague term, since, as Nigel Pendse likes to point out, all databases have to hold data in memory to process it -- the CPU cannot directly access the hard drive. However, I would say that unlike some other tools, IBM Cognos TM1 and QlikView are in-memory. These products load everything into memory before they do anything. If there is not enough memory to fit the entire data set, the load fails and that's that. The same applies to SAP HANA. But unlike QlikView and HANA, TM1 is a multi-dimensional database.

The loading behavior of an in-memory database is very different from that of the MOLAP engine in Analysis Services, which is fundamentally disk-based but has sophisticated paging abilities to keep as much of the data as possible in memory, or that of the column oriented Spotfire, which attempts to load everything but uses paging if there is not enough memory.

Columnar is a much clearer and simpler term. It simply means that the data is stored by column instead of by row. There are a large number of analytic databases with this architecture, such as Exadata, SAND, Greenplum, Aster, or Sybase IQ, just to name a few. Some, like Vertica and VertiPaq, even refer to their columnar architecture in their names. Some columnar databases are in-memory, but many are designed to deal with huge amounts of data, up to the petabyte range, and cannot possibly hold it all in memory.
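As a rough illustration of "stored by column instead of by row", the following Python sketch lays out the same small table both ways. The data and the to_columns helper are invented for the example. Real columnar engines add compression, encoding and on-disk formats that are not shown here.

```python
# Row-oriented layout: one record per sale, all fields kept together.
rows = [
    {"team": "Dental", "product": "toothpaste", "amount": 120.0},
    {"team": "Dental", "product": "floss", "amount": 40.0},
    {"team": "Soap", "product": "bar soap", "amount": 75.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: one list per field, values in row order.
columns = to_columns(rows)

# An aggregate over one field reads only that column's values,
# instead of walking through every field of every row.
total_amount = sum(columns["amount"])
```

Nothing in this layout says where the lists live. They could sit entirely in memory or be read from disk one column at a time, which is why columnar and in-memory are separate properties.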

By the way, what got me off on this rant is this blog about Endeca Latitude 2, which actually equates the two technologies, and a LinkedIn discussion the author started (which is private, so I can't link it here) with the title "Is Data Modeling Dead?"

The idea that in-memory databases kill data modelling comes from the fact that columnar databases are often used to discover hierarchies, and a whole generation of so-called "agile" in-memory database tools use this method. But in-memory multi-dimensional databases are still around and still very useful for analyzing data on well-defined structures such as financial data.