Wednesday, November 11, 2009

PowerPivot as a database

A couple of people disagreed with my somewhat pessimistic analysis of PowerPivot's (Gemini) prospects. So I decided to jot down a few more details on how the product works as a database. I still stand by my original comments, but it is interesting to look at what the advantages of the product are as well.

PowerPivot is an interesting approach to dealing with the problem of data storage in Excel. Microsoft Excel users often store data in Excel as if it were a database. But Excel is not a good database for several reasons:
  • It fails to separate formats and raw data in an orderly way.
  • It does not provide an easy way to share data among users.
  • It has no straightforward means of applying calculations to more than a single cell.
As a result, there are lots of issues that come up in projects making heavy use of Excel. In particular performance, accuracy, reliability and versioning are compromised. Gemini deals with some of these issues.

Gemini addresses the issue of data storage in Excel in the most simple and obvious way. It provides a new method of storing data in an Excel file that is separate from the formatting and calculations in Excel. It does not, by default. store the data outside the Excel file. But although PowerPivot data is stored in the same file as Excel it is stored separately from the rest of the Excel data.

To understand exactly what is going on we need to take a step back and look at how the latest Office products store data. Basically the new Office format is a zip file which contains several XML files. One Excel file contains the Excel spreadsheet Excel itself. Other objects such as charts are stored separately in the same zip file as the spreadsheet file. (Actually this is a simplification. Each sheet is stored in a separate file, and there are other complications involving shared content.) The zip file has the format of a normal zip file even though it has the extension xlsx and not the .zip extension. But don't take my word for it -- you can see the exact internal format of Excel for yourself by simply opening any Excel 2007 file with a data compression program like WinRAR.

In Office 2010 it will be possible to store additional data in another separate file in that same zip file and to access that data using the database interfaces provided by Excel. These database interfaces are the pivot table and the database spreadsheet formulas. From the outside the file won't look any different -- except maybe a little bigger. Or maybe a lot bigger. The much ballyhooed column compression only works in memory.

Office 2010 also addresses the issue of sharing data between users. I guess this option is really intended for Word more than for Excel, but it works across te entire product line. In any event an Excel sheet stored on a server will be available to multiple concurrent users with a local installation of Excel or in Sharepoint. The idea is that you create an Excel file and put it on a server where all your colleagues can access it, and the product allows you to edit the file simulataneously, even updating changes made by others. This sounds like a rough and ready solution to the second problem listed above, although it does not provide any security.

The final issue that seems important to me is applying a calculations to a large number of elements. And here, too, PowerPivot does offer some respite. Calculated columns can be added to the table. This is a Good Thing because it save the user from the difficult and error prone procedure of copying the formula down the entire column. It is not a replacement for multidimensionality, because the "database explosion" effect means that you will need to define a lot of columns to imitate multidimensional aggregation. But plenty of Excel-based solutions get by without this so I guess it won't always be an issue.

The upshot is that the product really does offer departmental users some relief from Excel chaos by treating the worst symptoms. And it does so without being disruptive, a key design consideration in viewof the size of the Excel user base. So I can understand some people's enthusiasm.