Friday, April 1, 2011

Open Source Business Intelligence

Open source solutions have become serious alternatives to traditional proprietary licensed software, with over 25 open source projects providing a wide variety of tools for data warehousing and full BI suites.  Clarise Z. Doval Santos and Joseph A. di Paolantonio have been studying open source projects related to data analytics, data warehousing and business intelligence for over five years.  This lens, with a supporting blog and wiki on the subject, provides the results of that research and an additional tool for finding and recording information related to open source solutions for BI.  This research is sponsored by InterActive Systems & Consulting, Inc., which provides strategic consulting and project management for BI, collaboration and distributed workgroup solutions through InterASC Professional Services, and hosting of open source applications through the TeleInterActive Network.  Since 1995, Clarise and Joseph have worked together helping people gather data, turn it into information through analysis, and share the results through collaboration tools.

Links to OSS BI Suites 

Communities, Projects and Companies supporting OSS BI Suites

One thing with which we're struggling is how to define a BI Suite. Must it be a comprehensive, end-to-end solution? Since we don't know of any commercial BI Suite that started out as a full tilt boogie, everything from ETL to Portal solution, we're not going to demand that of open source software BI Suites. So, if a project unites more than one tool for creating a BI solution, 'tis a suite. We think.
BEE Project
BEE is one of the first open source BI Suites, having been around since 2002. It provides ETL, ROLAP, reporting and integration with the R Project. It is written in Perl and primarily supports MySQL.
JasperSoft BI Suite
The Jasper BI Suite provides a framework for report automation and ad hoc reporting, as well as full OLAP and ETL capabilities. Components include JasperReports, iReport, JasperServer, JasperAnalysis and JasperETL.
Openi provides a web-driven interface to OLAP, relational, statistical and data mining sources giving BI integrators user interface, report definition and connector tools.
Pentaho BI Suite provides a framework for a full array of capabilities: Reporting, Analysis, Dashboards, Data Integration, Data Mining and Workflow.
SpagoBI is a BI platform drawing its components from the ObjectWeb consortium. Tools include metadata management, ETL, Reporting, Analysis, and Dashboards.

Links to OSS ETL Tools 

Communities, Projects and Companies supporting OSS ETL Tools

Extract, Transform and Load is often the most difficult and time-consuming aspect of a data warehouse project. Tools that help the BI integrator to create, manage and maintain the rules for extracting disparate data from multiple sources, transforming it into a standard and clean data set, and loading it in a timely fashion into the data repository, ODS or data warehouse are very important. Some of these tools provide EAI capability as well.
KETL is an ETL for high volume transactions developed by Kinetic Networks.
Enhydra Octopus
Enhydra Octopus is part of the ObjectWeb GForge project, providing JDBC data transformations.
Pequel ETL
Pequel ETL is, according to their SourceForge description, a comprehensive and high performance data processing/transform system. It features a simple, user-friendly event driven scripting interface that transparently generates & executes highly efficient Perl/C code. Uses: ETL, datawarehousing, statistics, and data-cleansing.
Clover ETL
Clover ETL is an open source Java based framework for building data transformations (ETL applications).
The cplusql distributed ETL tool extracts and transforms row based data from databases and flat files for terabyte scale datawarehouse loading.
JetStream is the first open source ETL tool that we used. It is described as a Java Extraction Transformation Service for Transmitting Records & Exchanging Application Metadata: a Java-based ETL/EAI tool.
Apatar ETL tool's modular architecture delivers: (1) a visual job designer/mapping; (2) connectivity to all major data sources; (3) flexible deployment options (GUI, server engine with JVM, or embedded).
Don't confuse KETL and KETTLE - they're not related. K.E.T.T.L.E (Kettle ETTL Environment) is a meta-data driven ETTL tool (Extraction, Transformation, Transportation & Loading). Kettle is also available as Pentaho Data Integration.
OpenDigger is a Java-based compiler for the xETL language. xETL is a language specifically designed to read, manipulate and write data in any format and database. With OpenDigger/xETL you can build powerful Extraction-Transformation-Loading (ETL) programs.
Talend Open Studio is a mature product, three years in the making before coming out for download. It comprises developer tools to create processes, an administrator to manage distributed processes on a grid architecture, launcher tools to launch processes, and PAM, the Process Activities Monitor. The ETL languages are Perl and Java, but Perl provides many more connectors than the Java libraries do.
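To make the extract, transform and load steps described at the top of this section a little more concrete, here is a minimal sketch in plain Python. It is not how any of the tools listed above work internally. The two sources, the column names and the `sales` table are all hypothetical, and SQLite stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export with inconsistent formatting.
CSV_SOURCE = """customer,region,amount
 Acme ,west,100.50
Globex,EAST,200
Initech,East,not-a-number
"""

# Hypothetical source 2: rows from another system, with different field names.
OTHER_SOURCE = [
    {"name": "Umbrella", "area": "West", "total": "75.25"},
]


def extract():
    """Extract: read raw rows from both sources into one common shape."""
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield {"customer": row["customer"], "region": row["region"], "amount": row["amount"]}
    for row in OTHER_SOURCE:
        yield {"customer": row["name"], "region": row["area"], "amount": row["total"]}


def transform(rows):
    """Transform: trim names, normalize region case, drop rows with bad amounts."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real tool would log or quarantine the rejected row
        yield (row["customer"].strip(), row["region"].strip().lower(), amount)


def load(conn, rows):
    """Load: write the cleaned rows into the (hypothetical) target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn
```

Calling `run_etl()` returns a connection whose `sales` table holds only the rows that survived cleaning. The real tools add what this sketch leaves out: scheduling, metadata, connectors, error handling and volume.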

Links to OSS OLAP Tools 

Communities, Projects and Companies supporting OSS OLAP Tools

On-Line Analytical Processing tools come in several flavors: MDDB OLAP or MOLAP, Relational OLAP or ROLAP, and then there is "H is for hybrid" HOLAP. There are open source software projects for each type. The list below includes engines or servers as well as front-ends for OLAP or MDDB use.
Mondrian is one of the oldest open source BI components, having been registered in 2001. It is also used as the OLAP engine in other open source software OLAP and BI Suite projects such as JasperAnalysis and Pentaho Analysis. Pentaho provides support for the Mondrian forums.
JasperAnalysis is part of the Jasper BI Suite, available from the JasperForge through JasperIntelligence. Based upon Mondrian and jPivot, JasperAnalysis provides full OLAP standards compliance and analytical capabilities.
PALO is a recent entry to the open source software OLAP field. It's different in that it is essentially an add-in for Microsoft Excel. PALO provides a MDDB for Excel, with future plans to allow access through other APIs as well. From their homepage... "Palo is an advanced data store for Microsoft Excel that allows you to handle large amounts of Excel data on a small number of worksheets. In addition, it also allows you to share Excel data real-time with your colleagues."
Pentaho Analysis Mondrian
Pentaho Analysis uses Mondrian at its core to provide for variable analysis, graphical representations of data, and drill down.
JPivot is a JSP tag library supporting XMLA that provides a front-end OLAP table to the Mondrian OLAP engine, allowing typical OLAP functions such as slice-and-dice, drill-down and roll-up.
pocOLAP is a web-based, cross-tab reporting tool written in Java that also allows for drill-down. The name comes from "poco", meaning "little" in Italian and Spanish.
OpenOLAP for MySQL
Currently a Japanese only version of OpenOLAP ported from PostgreSQL to MySQL. The PostgreSQL version is hosted on
OLAP tool for PostgreSQL
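For readers new to the vocabulary used in this section, here is a small sketch of roll-up, drill-down and slice over a tiny fact table in plain Python. It only illustrates the operations. It is not how Mondrian, JPivot or any other tool listed here implements them, and the fact rows, dimension names and helper functions are all made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales)
FACTS = [
    (2010, "Q1", "West", "Widgets", 100),
    (2010, "Q1", "East", "Widgets", 80),
    (2010, "Q2", "West", "Gadgets", 50),
    (2011, "Q1", "West", "Widgets", 120),
    (2011, "Q1", "East", "Gadgets", 70),
]
DIMENSIONS = ("year", "quarter", "region", "product")


def aggregate(facts, group_by):
    """Sum the sales measure over the given dimensions."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_facts(facts, dimension, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]


# Roll-up: summarize sales at the coarse "year" level.
by_year = aggregate(FACTS, ["year"])

# Drill-down: break the same measure out by year and quarter.
by_year_quarter = aggregate(FACTS, ["year", "quarter"])

# Slice on region = West, then summarize the slice by product.
west_by_product = aggregate(slice_facts(FACTS, "region", "West"), ["product"])
```

An OLAP engine does the same kind of grouping, but against a cube definition and a much larger store, usually with pre-computed or cached aggregates.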

Links to OSS Reporting Tools 

Communities, Projects and Companies supporting OSS Reporting Tools

Reporting tools can be simple or complex, web-based or not, with designers or not. Here's the list.
JasperReports is one of the oldies as well, starting in 2001. More recently a company, JasperSoft, has been formed to invest in JasperReports, as well as to provide support, training and various other services.
Agata Report
From their web site... "Agata Report is a Database Reporting Tool and EIS tool, MIS tool (graph generation), like Crystal Reports. It's written in PHP-GTK and allows you to edit and get SQL results from several databases (PostgreSQL, MySQL, Oracle, SyBase, MsSql, FrontBase, DB2, Informix and InterBase) as PostScript, plain text, HTML, XML, PDF, or spreadsheet (CSV) formats through its graphical interface. You can also define levels, subtotals, and a grand total for the report, merge the data into a document, generate address labels, or even generate a complete ER-diagram from your database."
DataVision is an Open Source Report Writer that allows drag-and-drop report design through its GUI. It is written in Java and can connect to any database supporting JDBC.
From their website... "OpenReports is a flexible open source web reporting solution that allows users to generate dynamic reports in a browser. OpenReports uses JasperReports, an excellent full featured open source reporting engine, and was developed using leading open source components including WebWork, Velocity, Quartz, and Hibernate."
OpenRPT is a full featured, cross-platform SQL report writer that stores its report definitions as XML, and has a WYSIWYG report writer that can be used in stand-alone or embedded fashion.
jFreeReport is a standalone Java report library with a nice series of capabilities and a decent community around it. In January 2006, jFreeReport became a part of the Pentaho suite.
iReport is now part of the JasperSoft tools and is available as an individual download or as part of the Jasper BI Suite.

Links to OSS Databases Projects 

Communities, Projects and Companies supporting OSS RDBMS Projects

There are quite a few open source RDBMS, though few are optimized for query within a VLDB environment.
The EnterpriseDB project takes PostgreSQL and adds Oracle and PL/SQL compatibility to it, making a rather powerful RDBMS open source solution.
Derby is the Apache database project, written in pure Java to have a small footprint.
Firebird is a RDBMS written in C and C++ that provides many ANSI SQL-99 features as well as stored procedures and triggers. It is based on the source code released by Borland -> Inprise -> Borland in 2000, and has existed in one form or another since 1981.
CA released the source code for the venerable Ingres RDBMS under their own CATOS license and formed the new Ingres Corporation in November 2005.
MySQL is reportedly the most deployed open source RDBMS out there. It has proven suitable for VLDB implementations, allowing a robust "store now, query anytime" architecture.
Professor Michael Stonebraker of the University of California at Berkeley created Postgres as a successor to his other database, Ingres, in 1986. Postgres became Illustra, which joined Informix, which was acquired by IBM. It also found new life in the lab as Postgres95, which was redone and open sourced in 1996 as PostgreSQL. Click on the "About" and then "History" link from the main site. Fun stuff, and I even remember it all happening. PostgreSQL is being revamped in a branch distribution specifically for data warehousing in the Bizgres project; see the BI Suites links.
Oracle Berkeley DB (Sleepycat)
Oracle bought Sleepycat Software. Sleepycat's open source DB is now released as Oracle Berkeley DB. The three flavors are: Berkeley DB, a transactional storage engine for un-typed data in basic key/value data structures; Berkeley DB Java Edition, a pure Java version of Berkeley DB optimized for the Java environment; and Berkeley DB XML, a native XML database with XQuery-based access to documents stored in containers and indexed based on their content.

Links to OSS BI Development Tools 

Communities, Projects and Companies supporting OSS Roll Your Own

There are open source tools, platforms and standards to help you "roll your own" BI solutions. These are often very good starting points for experienced development teams, and can help to fill in the gaps in both proprietary and open source solutions.
Eclipse BIRT
Eclipse is the IDE for Java and J2EE, and BIRT is, basically, its reporting plug-in.
EFEU is a programming environment to develop C-programs and libraries. It is often pointed to as facilitating the development of ETL and reporting software.
JpGraph is an OO Graph drawing library for PHP that is very useful for data visualization and presentation.
The linked article describes using EFEU with PostgreSQL to create a multi-dimensional database for use in OLAP.

Links to DW Sources 

Developments in the business intelligence and data warehousing industry tracked by The Data Warehousing Institute (TDWI)
DM Review Portal, the website of DM Review magazine, provides a portal of information on business intelligence, analytics, integration and data warehousing


