Information Faster Blog


Introducing Trafodion

[Image: Trafodion architecture diagram]

 

On Tuesday, June 10, 2014 at HP Discover, Hewlett-Packard announced that the Trafodion project (http://www.trafodion.org) is now available as open source. Trafodion is an enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads. Many SQL engines and SQL subsets have been developed for the Apache Hadoop space, but most of them are focused on analytics. Trafodion meets a different need: the need for transactional and operational management of data. Using Trafodion with Apache HBase, it is now possible to support OLTP applications on Hadoop. As one of the architects of Trafodion, I’d like to give you a high-level overview of its features.

 

At a glance, Trafodion is a transactional ANSI SQL engine on top of Apache HBase. It includes performance optimizations for OLTP workloads, supports large data sets through parallelism, and offers JDBC and ODBC connectivity for Linux and Windows clients. We’ll go through some of these features in more detail.

 

Transactions: Trafodion features ACID distributed transactions. This is a key feature for supporting OLTP. A transaction might involve multiple rows, multiple tables, and/or multiple statements. Trafodion uses the Multi-Version Concurrency Control (MVCC) model: instead of locking, it checks for conflicts between transactions at commit time. This is a good model when transaction conflicts are few.
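
To make this concrete, here is a minimal sketch of a multi-statement transaction issued from a JDBC client. The accounts table and its columns are hypothetical, and the snippet assumes a Connection to Trafodion has already been opened (connectivity is covered in the JDBC/ODBC section below):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class TransferExample {
        // Moves an amount between two hypothetical account rows as one ACID transaction.
        static void transfer(Connection conn, long fromId, long toId, double amount) throws SQLException {
            conn.setAutoCommit(false);  // group the statements into a single transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setDouble(1, amount);
                debit.setLong(2, fromId);
                debit.executeUpdate();
                credit.setDouble(1, amount);
                credit.setLong(2, toId);
                credit.executeUpdate();
                conn.commit();    // conflicting transactions are detected here
            } catch (SQLException e) {
                conn.rollback();  // undo both updates if anything fails
                throw e;
            }
        }
    }

Because conflicts are detected at commit time rather than through locks, an application written against this model should be prepared to catch a commit-time failure and retry the transaction.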

 

ANSI SQL: Trafodion supports the core feature set of ANSI SQL-99, with many extensions. Some noteworthy extensions include support for secondary indexes and cost-based optimization. You can create Trafodion tables within HBase; these tables are salted across regions on a set of columns you specify. You can also access native HBase tables directly, with syntax for reading individual HBase values and for retrieving multiple versions of a value in a single SQL statement.
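
As a rough illustration of the salting and secondary-index extensions, here is how a table might be created through JDBC. The table, its columns, and the choice of 8 partitions are hypothetical, and the exact SALT syntax should be verified against the Trafodion reference documentation:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class CreateSaltedTable {
        static void createOrders(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                // Hypothetical table, salted across 8 HBase regions on the customer_id key column.
                stmt.executeUpdate(
                    "CREATE TABLE orders (" +
                    "  order_id    LARGEINT NOT NULL," +
                    "  customer_id INT NOT NULL," +
                    "  order_date  DATE," +
                    "  total       NUMERIC(12,2)," +
                    "  PRIMARY KEY (customer_id, order_id))" +
                    " SALT USING 8 PARTITIONS ON (customer_id)");
                // Secondary index to support lookups on a non-key column.
                stmt.executeUpdate("CREATE INDEX orders_by_date ON orders (order_date)");
            }
        }
    }

Salting the table on a leading key column spreads rows across HBase regions and helps avoid write hotspots, while the secondary index supports lookups on a non-key column.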

 

Performance/parallelism: The Trafodion engine supports many optimizations, both compile-time and run-time. For OLTP, there are fast paths for single-row queries. For large data sets, the optimizer can choose a parallel plan. Trafodion’s optimizer searches a space of possible plans for one that is low cost. Determining cost depends in part on statistics about table data gathered with the UPDATE STATISTICS utility. The Trafodion optimizer also caches query plans, so frequently recurring queries need not be repeatedly optimized. Trafodion leverages the natural parallelism inherent in the multiple region servers in the HBase architecture. But it can go beyond that. Imagine, for example, that you are joining two tables, and the join is on a non-key column. Trafodion can generate a query plan using multiple execution servers, each of which will receive a dynamically generated partition of each table. The data are partitioned on a hash of the join column. Join methods supported include hash, nested loop and merge joins. Hash joins do not require all data to be in memory; the algorithm can spill intermediate results to disk files.
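
As a small example of feeding the optimizer, statistics are gathered with an ordinary SQL statement. The table name below is the hypothetical one from the earlier sketch, and the ON EVERY COLUMN clause should be checked against the Trafodion documentation:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class RefreshStats {
        // Gathers the histogram statistics the optimizer consults when costing plans.
        static void updateStats(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("UPDATE STATISTICS FOR TABLE orders ON EVERY COLUMN");
            }
        }
    }

With current histograms in place, the optimizer has the cardinality information it needs when choosing between serial and parallel plans and among the join methods described above.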

 

JDBC/ODBC: JDBC and ODBC drivers for Trafodion are available now for Linux and Windows clients. Connectivity via JDBC and ODBC is provided by Database Connectivity Services (DCS), which provides fault-tolerant access to the Trafodion cluster.
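
Below is a minimal connection sketch for a Java client. The driver class name, URL format, and port reflect my understanding of the Trafodion Type 4 JDBC driver and DCS defaults; treat them as assumptions and confirm them, along with the placeholder host name and credentials, against your installation’s client documentation:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TrafodionConnect {
        public static void main(String[] args) throws Exception {
            // Assumed class name for the Trafodion Type 4 JDBC driver.
            Class.forName("org.trafodion.jdbc.t4.T4Driver");
            // Assumed URL format; myhost and 23400 stand in for your DCS master host and port.
            String url = "jdbc:t4jdbc://myhost:23400/:";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
                while (rs.next()) {
                    System.out.println("row count: " + rs.getLong(1));
                }
            }
        }
    }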

 

We welcome users and contributors to the Trafodion community. If you’d like to download the software, learn more about its architecture or contribute to its future development, visit our Trafodion web site.

The Unstructured Leprechaun

Most data that people describe as "unstructured" really has structure. The key to finding the value within it is to approach it with a specific method, rather than looking for an overgeneralized way of doing it.

The Real-Time Unicorn

This is part one of what I call the "de-mythification" series, in which I'll aim to clear up some of the more widespread myths in the big data marketplace.

 

In the first of this multi-part series, I’ll address one of the most common myths my colleagues and I have to confront in the Big Data marketplace today: the notion of “real-time” data visibility. Whether it’s real-time analytics or real-time data, the same misconception always seems to come up. So I figured I’d address this, define what “real-time” really means, and provide readers some advice on how to approach this topic in a productive way.

 

First of all, let’s establish the theoretical definition of “real-time” data visibility. In the purest interpretation, it means that as some data is generated – say, a row of log data in an Apache web server – the data would immediately be queryable. What does that imply? Well, we’d have to parse the row into something readable by a query engine – so some program would have to ingest the row, parse the row, characterize it in terms of metadata, and understand enough about the data in that row to determine a decent machine-level plan for querying it. Now since all our systems are limited by that pesky “speed of light” thing, we can’t move data any faster than that – considerably slower in fact. So even if we only need to move the data through the internal wires of the same computer where the data is generated, it would take measurable time to get the row ready for query. And let’s not forget the time required for the CPU to actually perform the operations on the data. It may be nanoseconds, milliseconds, or longer, but in any event it’s a non-zero amount of time.

 

So “real-time” never, ever means real-time, despite marketing myths to the contrary.

 

There are two exceptions to this: slowing down the application that produces the data, or technology that queries a stream of data as it flows by (typically called complex event processing, or CEP). With regard to the first option: let's say we wanted to make data queryable as soon as the row is generated. We could make the flow from the logger to the query engine part of one synchronous process, so the weblog row wouldn't actually be written until it had also been processed and made ready for query. Those of you who administer web and application infrastructures are probably getting gray hair just reading this, as you can imagine the performance impact on a web application. So, in the real world, this is a non-starter. The other option, CEP, is exotic and typically very expensive, and while it will tell you what's happening at the current moment, it's not designed to build analytics models. It's largely used to put those models to work in a real-time application such as currency arbitrage.

 

So, given all this, what’s a good working definition of “real-time” in the world of big data analytics?

 

Most organizations define it this way: “As fast as it can be done providing a correct answer and not torpedoing the rest of the infrastructure or the technology budget”.

 

Once everyone gets comfortable with that definition, then we can discuss the real goal: reducing the time to useful visibility of the data to an optimal minimum. This might mean a few seconds, it might mean a few minutes, or it might mean hours or longer. In fact, for years now I’ve found that once we get the IT department comfortable with the practical definition of real-time, it invariably turns out that the CEO/CMO/CFO/etc. really meant exactly that when they said they needed real-time visibility to the data. So, in other words, when the CEO said “real-time”, she meant “within fifteen minutes” or something along those lines.

 

This then becomes a realistic goal we can work towards in terms of engineering product, field deployment, customer production work, etc. Ironically, chasing the real-time unicorn can actually impede efforts to develop high-speed data flows by forcing the team to pursue unrealistic targets for which, at the end of the day, there is no quantifiable business value.

 

So when organizations say they need “real-time” visibility to the data, I recommend not walking away from that conversation until you fully understand just what that phrase means, and then using that understanding as the guiding principle in technology selection and design.

 

I hope readers found this helpful! In the remaining segments of this series, I’ll address other areas of confusion in the Big Data marketplace. So stay tuned!

 

Next up: The Unstructured Leprechaun

 

 
