The backend database and HP Operations Analytics

Guest post by Eli Revach, database expert, HP Software

 

I am very excited to write my first post, which is part of a series about the HP Operations Analytics backend database and how we leverage its abilities. Before I start, I want to say a few words about HP Operations Analytics.

 

Everyone has a big data horror story in which an organization discovers that data it purged in the past is now critical to a current decision and could have driven revenue growth.

 

HP Operations Analytics is a new capability of HP Business Service Management that delivers advanced analytic intelligence for structured and unstructured data, including machine data, logs, events, topologies, and performance statistics. You can read more about it here.

 

But as I promised, this post focuses on the backend database and the special abilities it brings to Operations Analytics.

 

In the recent past, a classic business intelligence reporting solution consisted mainly of fact and dimension tables. To deal with large data sets, such a solution had to maintain:

  • Time-sliced, pre-aggregated tables, populated by some kind of batch processing.
  • The aggregation tables themselves, which carry real costs: the batch process is not transparent to schema changes, it is asynchronous to the raw data, and it requires extra storage and extra maintenance, among other things.

Those burdens, plus the fact that our solution has to deal with unlimited scale, led us to choose the HP Vertica database as our structured backend data store, with no need for time-sliced pre-aggregation tables. We simply run our queries on top of the raw data.

 

Vertica is an analytic database with linear scalability and fantastic performance. It includes a rich set of built-in analytic functions, and it also provides the ability to write your own analytic functions in C++ or in R. Vertica is a columnar, shared-nothing database in which data is distributed among the nodes of the cluster.
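
As a small taste of those built-in analytic functions, here is a minimal sketch that ranks each day's transactions by amount at query time. It assumes a hypothetical AMOUNT column on the SALE_TRANSACTIONS table used later in this post; the column is illustrative, not part of any product schema:

SELECT SALE_DATE,
       TRANSACTION_ID,
       AMOUNT,
       -- Standard analytic (window) function, computed at query time
       RANK() OVER (PARTITION BY SALE_DATE ORDER BY AMOUNT DESC) AS AMOUNT_RANK
FROM SALE_TRANSACTIONS;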

 

Running on top of raw data and still fast?

Vertica is a columnar database: each column of a table is stored in a separate file, and the data is compressed. Vertica can scan the data in parallel, directly on top of the compressed files, and postpone decompression to a later stage of execution, so scan time is very short. This technique is called “late materialization” (one of several techniques Vertica uses, and of course only part of the story). Consider this query:

 

SELECT SALE_DATE, TRANSACTION_ID
FROM SALE_TRANSACTIONS
WHERE SALE_DATE < '10/01/2013'
  AND TRANSACTION_ID < 100;

 

Both the SALE_DATE and TRANSACTION_ID columns are scanned in parallel, and the data is filtered on top of the compressed files (two separate files in this case). The output of the scan filters is two position vectors, one per file, marking the rows that match each SQL condition. These position vectors are mapped to a binary representation (1 for a match, 0 for no match), which allows a binary AND to be executed on top of them. The result is a final position vector that is used to access the files again and fetch the final select-list columns. With this method the data files are scanned twice; however, overall I/O improves, because only the positions that survive the filter are decompressed and materialized.
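
To make the position-vector idea concrete, here is a small hand-worked illustration (hypothetical row positions; this is not Vertica's actual internal format):

Row position:                1  2  3  4  5  6
SALE_DATE < '10/01/2013':    1  1  0  1  0  1
TRANSACTION_ID < 100:        1  0  0  1  1  1
Binary AND (final vector):   1  0  0  1  0  1

Only rows 1, 4, and 6 are revisited to materialize the select-list columns.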

 

Below is a high-level illustration of “late materialization” in a columnar database.

 

 

[Image: late materialization.jpg]

On-the-fly dynamic time slice aggregation

 

In Operations Analytics, end users can define their own dashboards in addition to the list of dashboards that come with the product. Behind the scenes, each dashboard executes SQL on top of the Vertica database with a different time-slice aggregation. Given that the data is not pre-aggregated, the time-slice aggregation runs as part of the query itself, on top of the raw data. Vertica provides a unique analytic function for this (I have not seen it in other databases): the TIME_SLICE function. The query accepts the slice length as a parameter, and the function does the job for you.

Below is an example of a five-minute time-slice aggregation:

[Image: Dynamic Slice.jpg]
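
In case the image does not load, here is a minimal sketch of such a query, assuming a hypothetical METRICS table with SAMPLE_TIME and CPU_UTIL columns (the names are illustrative, not the product schema):

-- Aggregate raw samples into five-minute buckets at query time
SELECT TIME_SLICE(SAMPLE_TIME, 5, 'MINUTE') AS SLICE_START,
       AVG(CPU_UTIL) AS AVG_CPU_UTIL
FROM METRICS
GROUP BY TIME_SLICE(SAMPLE_TIME, 5, 'MINUTE')
ORDER BY SLICE_START;

Changing the slice is just a matter of changing the parameters (for example, 1, 'HOUR'); there is no pre-aggregation table to rebuild.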

The same query on other databases would require a join to a time dimension table that includes the different combinations of time units. The fact that no extra join is required improves the query dramatically.
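
For contrast, here is a hedged sketch of what that classic approach might look like on a database without TIME_SLICE, assuming a hypothetical TIME_DIM table with one row per five-minute bucket (SLICE_START, SLICE_END):

-- Classic alternative: join raw data to a pre-built time dimension
SELECT TD.SLICE_START,
       AVG(M.CPU_UTIL) AS AVG_CPU_UTIL
FROM METRICS M
JOIN TIME_DIM TD
  ON M.SAMPLE_TIME >= TD.SLICE_START
 AND M.SAMPLE_TIME <  TD.SLICE_END
GROUP BY TD.SLICE_START
ORDER BY TD.SLICE_START;

The time dimension has to be built and maintained in advance, and the join adds work to every query.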

 

I hope you have enjoyed this in-depth view of the backend database of HP Operations Analytics. Stay tuned for additional posts that discuss the different aspects of HP Operations Analytics.

 

If you have any questions, feel free to reach out to us in the comments section below. You can also find more information on the Operations Analytics homepage.

 

References:

Vertica Time Slice: https://my.vertica.com/docs/6.1.x/HTML/index.htm#13389.htm

HP HAVEn: http://h20195.www2.hp.com/V2/GetPDF.aspx%2F4AA4-7102ENW.pdf

Late materialization:
http://cs-www.cs.yale.edu/homes/dna/papers/abadiicde2007.pdf
http://cs-www.cs.yale.edu/homes/dna/papers/abadi-sigmod08.pdf
http://cs-www.cs.yale.edu/homes/dna/talks/abadi-sigmod-award.pdf

 
