The backend database and HP Operations Analytics Part 2

Guest post by Eli Revach, Database expert, HP Software

 

The next generation of HP Operations Analytics will be available soon, and it will include amazing new capabilities; you can read more about it here. While developing this version we faced many database challenges, and I would like to share one of them with you. In this second blog post I will focus on how database joins are executed on Big Data platforms. You can read my first blog post here.

 

One of the major challenges on Big Data platforms is joining two different data sets (two different tables). Apache Hadoop, for example, performs a join by leveraging a distributed cache of the small table, by using an external in-memory database, or by using a Bloom filter. All of these options are reasonable from a performance point of view for offline analytics, but they are not good enough for online analytics.
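To make the Bloom filter variant concrete, here is a minimal Python sketch (not Hadoop code; the class, table names, and data are made up for illustration) of how a filter built from the small table's join keys can discard non-matching records from the big table before the actual join work happens:

```python
# Sketch: Bloom-filter pre-filtering of FACT records before a join,
# as commonly done in Hadoop-style pipelines. All data is made up.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the small (DIMENSION-like) table's join keys.
dimension_keys = ["host-01", "host-02", "host-03"]
bloom = BloomFilter()
for k in dimension_keys:
    bloom.add(k)

# During the scan of the big (FACT-like) table, drop records whose key
# cannot possibly match; only the survivors are sent on to the join.
fact_rows = [("host-01", 0.93), ("host-42", 0.17), ("host-03", 0.55)]
candidates = [row for row in fact_rows if bloom.might_contain(row[0])]
print(candidates)  # false positives are possible, false negatives are not
```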

 

 

Many Big Data customers leverage Hadoop as their backend data store for processing lower-level raw data and normalizing it for online analytics on top of analytic databases. HP Operations Analytics is an online analytics platform that leverages parts of the Hadoop ecosystem and uses HP Vertica as its analytic database. We don't use Hadoop as a data store, so how are joins executed on HP Vertica?

 

HP Vertica supports SQL joins:

HP Vertica supports two major join methods: merge and hash. Nested-loop join is not supported, as it is not suitable for large join processing.
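For intuition about the merge method, here is a minimal Python sketch (not Vertica's engine; the table names and data are made up, and DIMENSION keys are assumed unique) of a merge join over two inputs that are already sorted on the join key, which is what makes this method cheap when the data is stored pre-sorted:

```python
# Sketch: merge join of two inputs pre-sorted on the join key.
dim_sorted = [(1, "db-01"), (2, "db-02"), (4, "web-01")]    # sorted by host_id
fact_sorted = [(1, 0.91), (2, 0.40), (2, 0.75), (5, 0.10)]  # sorted by host_id

result, i, j = [], 0, 0
while i < len(dim_sorted) and j < len(fact_sorted):
    d_key, f_key = dim_sorted[i][0], fact_sorted[j][0]
    if d_key == f_key:
        result.append((dim_sorted[i][1], fact_sorted[j][1]))
        j += 1          # advance the FACT side; DIMENSION keys are unique here
    elif d_key < f_key:
        i += 1
    else:
        j += 1
print(result)  # [('db-01', 0.91), ('db-02', 0.4), ('db-02', 0.75)]
```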

 

A classic hash join in a relational row-store database is performed using the following steps (a simplified code sketch follows Figure 1):

  • The relatively small DIMENSION table is hashed in memory (steps 1 and 2).
  • The big FACT table is scanned (full table scan) to probe the hash (build) table for matches (steps 3 and 4).


 

Figure 1: HASH join in row store databases
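Here is a minimal Python sketch of the build/probe flow from Figure 1 (the table names and contents are made up for illustration; a real database does this inside its execution engine, not in application code):

```python
# Sketch: classic build/probe hash join. All data is made up.
dim_hosts = [(1, "db-01"), (2, "db-02"), (3, "web-01")]      # small DIMENSION table
fact_metrics = [(1, 0.91), (3, 0.20), (2, 0.75), (9, 0.40)]  # large FACT table

# Steps 1-2: hash the DIMENSION table in memory (the "build" side).
build = {host_id: host_name for host_id, host_name in dim_hosts}

# Steps 3-4: scan the entire FACT table and probe the hash table for matches.
joined = [(build[host_id], value)
          for host_id, value in fact_metrics
          if host_id in build]
print(joined)  # [('db-01', 0.91), ('web-01', 0.2), ('db-02', 0.75)]
```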

                       

The scan of the FACT table is done in parallel; however, it is still a heavy and time-consuming operation. The main reasons for this are:

  1. In row-store databases, a full table scan reads the entire FACT table (a lot of I/O).
  2. All FACT records are joined against the DIMENSION hash table, which results in a very big join set.

 

How is this join performed in Vertica?

Vertica uses Sideways Information Passing (SIP), a method developed in academia and used by several database vendors to reduce the size of a join set. To simplify, SIP is a way of pushing predicates down in order to reduce the size of the join between the FACT and the DIMENSION table (less memory and less temporary disk space during the join).

 

During query optimization, the optimizer rewrites the query by pushing the predicate down, which adds a runtime filter to the FACT table scan phase; this runtime filter contains values from the DIMENSION table. As a result, a very compact FACT row set is produced in this step, which leads to a very small join set (a small FACT row set joined with the DIMENSION table, which is by nature relatively small). You can think of this runtime filter as a SQL IN-list filter containing a list of keys from the DIMENSION table (Figure 2, step 3). This IN-list is pushed down to the FACT scan phase and leads to an optimally sized FACT row set and an optimal hash join (Figure 3).
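Here is a minimal Python sketch (again with made-up table names and data, and not Vertica's actual implementation) of the effect of the SIP runtime filter: the keys that survive the DIMENSION predicate are pushed into the FACT scan, so only a compact FACT row set ever reaches the join:

```python
# Sketch: the effect of a SIP / runtime IN-list filter on a hash join.
dim_hosts = [  # small DIMENSION table: (host_id, host_name, datacenter)
    (1, "db-01", "EMEA"), (2, "db-02", "EMEA"), (3, "web-01", "APJ"),
]
fact_metrics = [  # large FACT table: (host_id, cpu_util)
    (1, 0.91), (3, 0.20), (2, 0.75), (3, 0.33), (7, 0.50),
]

# Steps 1-2: apply the predicate to the DIMENSION table and build the hash table.
build = {h_id: name for (h_id, name, dc) in dim_hosts if dc == "EMEA"}

# SIP: the build-side keys become a runtime IN-list filter that is pushed
# down into the FACT scan, so non-matching rows are dropped early.
sip_keys = set(build)

# Steps 3-4: scan the FACT table, keep only rows passing the runtime filter,
# then probe the hash table; the join set stays small.
result = [(build[h_id], cpu) for (h_id, cpu) in fact_metrics if h_id in sip_keys]
print(result)  # [('db-01', 0.91), ('db-02', 0.75)]
```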

 


Figure 2: Vertica HASH join

 

 


Figure 3: Execution plan, including SIP

 

 

 

If you are interested in Vertica, you can download the community version here.

 

If you have any questions, feel free to reach out to us in the comments section below. You can also find more information on the Operations Analytics homepage.

 

References:

HP HAVEn data sheet  

SIP

Bloom filter

Comments
Vivek Singh (anon) | 04-03-2014 01:56 AM

I would like to know the set of best practices to follow when implementing a DW at a scale of hundreds of TB.

I haven't seen any documentation regarding such systems from Vertica.

I would like to know more about:

1. Capacity planning
2. Data loading
3. Query performance optimization at such scale
4. Resource allocation
5. Troubleshooting

Guest Blogger (HPSW-Guest) | 04-22-2014 09:24 AM

Hi Vivek,

 

Below are some answers:

Capacity planning: Vertica has a capacity planning document on their site; if you like, I can send it to you.

Performance: there is no magic. You need to set projections to answer your use cases, otherwise performance can be bad; to get the best performance, try to avoid joins as much as you can.

Data load: very fast. The best practice is to load in parallel, so that each loader connects to a different Vertica server.

Resource allocation: Vertica has the concept of a resource manager; you can manage your memory and CPU resources per resource pool.

 

Thanks,

Eli

 
