Guide to NoSQL

Introduction

As part of the HP Software agenda of moving applications to Cloud and SaaS models, many developers and projects are dealing with the NoSQL dilemma. Do we need to use NoSQL? (Everybody is talking about NoSQL, so probably we need to use it too.) If yes, what is the most effective way to use it? This blog is not the definitive guide to a specific NoSQL solution. Instead, we will review the different classes of NoSQL solutions and examples of when each one best suits your problem. So, if you came here to learn about Redis, MongoDB, Hadoop or Vertica, you will be very disappointed!

 

Let’s start from the simple question – What is NoSQL?

Wikipedia definition: “NoSQL is a class of database management system identified by its non-adherence to the widely used relational database management system (RDBMS) model.”

 

This definition is not very precise: there are NoSQL systems that are relational. The first NoSQL system, created by Carlo Strozzi in 1998, was relational and was called NoSQL simply because it did not express its queries in SQL. The term could therefore be read as non-SQL, but that is also incorrect, since many NoSQL databases support the SQL99 standard.

Another definition found in many communities is that NoSQL stands for Not Only SQL. This is also misleading, since not all NoSQL databases support SQL.

 

Finally, the latest movement in this rapidly changing domain is to use a different term – NewSQL. The best definition I found is in Matthew Aslett’s blog post: NewSQL refers to SQL databases that provide scalable, high-performance services while retaining SQL as the language for manipulating the data. Since this implies horizontal scalability, which is not necessarily a feature of all the products, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

 

So what is the problem with Relational Database Management Systems (RDBMS)?

 

 

Most RDBMS systems don’t scale to the size required by 21st-century systems. They focus on reliability and on supporting the ACID model (Atomicity, Consistency, Isolation and Durability). As Michael Stonebraker, VoltDB CTO, presented in “Urban Myths about SQL”, existing RDBMS systems suffer from archaic architectures that are slow by design: they don’t leverage the new hardware capabilities that would let them get rid of the disk buffer pool, locking, crash recovery and inefficient multi-threading. The following figure shows how TPC-C CPU cycles are distributed during ACID operation:

 

[Figure: distribution of TPC-C CPU cycles during ACID operation]

In addition, ODBC/JDBC data access is costly because of the round-trips between client and server. Stored procedures compiled either to an intermediate representation or to C++ (or another programming language) are significantly more efficient.

 

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees (you need to pick two).

 

  • Consistency - all database nodes see the same data, even with concurrent updates.
  • Availability - every request receives a response about whether it succeeded or failed.
  • Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system.

[Figure: CAP theorem – pick any two of Consistency, Availability, Partition tolerance]

 

To support ACID, an RDBMS needs to choose C + A.

 

Now we need to understand an additional term frequently used by NewSQL/NoSQL solutions – MapReduce.

 

MapReduce is a patented software framework introduced by Google to simplify data processing across massive data sets. As people rapidly increase their online activity and digital footprint, a huge amount of data is generated continuously. This data can be of multiple types (text, rich text, RDBMS, graph, etc.), and organizations find it vital to quickly analyze this huge amount of data generated by their customers and audiences in order to better understand and serve them. MapReduce is a tool that helps those organizations perform quick and efficient analysis and brings business value to the organization.

 

The MapReduce framework is inspired by two functions, map and reduce, which are commonly used in functional programming languages like LISP. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In other words, data sets are broken down into several subsets that are individually processed in parallel over multiple machines, and the resulting data sets are merged together. The requirements for MapReduce can be broken down as follows (a minimal code sketch follows the list):

 

  • The data set should be big enough that splitting it up increases overall performance rather than being detrimental to it.
  • The computations are generally not dependent on external input; the only required input is the data set being processed.
  • The results of processing one subset of the data can be merged with the results from the other subsets.
  • The resultant data set should be smaller than the initial data set.
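
To make the map/reduce contract concrete, here is a minimal, framework-agnostic Java sketch. The map step turns each raw record into an intermediate (key, value) pair, and the reduce step merges all values that share a key; parallelStream() stands in for splitting the data set across machines. The log-line format and field names are hypothetical.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class MapReduceSketch {
        public static void main(String[] args) {
            // Hypothetical input records: "country,bytesServed" log lines.
            List<String> logLines = List.of("US,512", "DE,1024", "US,256", "FR,2048", "DE,128");

            Map<String, Long> bytesPerCountry = logLines.parallelStream()
                // "Map": emit an intermediate (country, bytes) pair per record.
                .map(line -> {
                    String[] parts = line.split(",");
                    return Map.entry(parts[0], Long.parseLong(parts[1]));
                })
                // "Reduce": merge all intermediate values that share the same key.
                .collect(Collectors.groupingBy(
                    Map.Entry::getKey,
                    Collectors.summingLong(Map.Entry::getValue)));

            System.out.println(bytesPerCountry); // e.g. {US=768, DE=1152, FR=2048}
        }
    }

Hadoop, Disco and the other frameworks reviewed later implement the same contract, but distribute the map and reduce steps across a cluster and handle splitting, shuffling, storage and failure recovery for you.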

 

The MapReduce framework is implemented in many solutions, sometimes exposed explicitly to user programs and sometimes only implicitly, accessible through a different data-retrieval approach.

 

As we’ll see shortly, there are many different NewSQL/NoSQL solutions, each focused on solving different problems. Let’s review the problems and their possible solutions.

 

1.      New RDBMS systems

 

The following systems support ACID (C + A) and also provide horizontal scalability:

  1. VoltDB – the product of Michael Stonebraker which is designed to support ACID for millions of complex transactions per second and linear scalability.
  2. NuoDB - a re-think of relational database technology, targeted at an elastic cloud of computers rather than a single computer system.
  3. SAP HANA - real-time platform combines high-volume transactions with analytics and text mining.
  4. JustOne - radically re-designed from the ground up and exploits modern hardware characteristics.
  5. Akiban – uses table-grouping™ technology that enables breakthrough performance, scalability and programmability.
  6. ScaleArc - provides an abstraction layer that sits in between application servers and database servers.
  7. ScaleBase - provides an abstraction layer with a single point of management for a distributed database environment.
  8. GenieDB - fuses the best of scalable NoSQL and hard-to-scale SQL architectures.
  9. Clustrix - optimized MySQL-compatible software and hardware appliance.
  10. Drizzle - a community-driven open source project that is forked from the popular MySQL database.

The Drizzle team has removed non-essential code, re-factored the remaining code into a plugin-based architecture and modernized the code base, moving it to C++.

  11. TokuDB – MySQL extended with Fractal Tree® Indexes.
  12. Citrusleaf - real-time NoSQL distributed database with ACID compliance, immediate consistency and 24x7 uptime.

Let’s review the architecture of VoltDB in order to understand this category:

 

  • Designed by Michael Stonebraker, one of the best computer scientists specializing in database research. His previous projects include Vertica, StreamBase, Ingres, Postgres and many others.
  • Stored procedures. To avoid ODBC/JDBC interactions, VoltDB works only with stored procedures (see the sketch after this list).
  • Based on in-memory storage. To avoid buffer pools and blocking disk operations, VoltDB stores its data in memory.
  • Single-threaded and transactional. VoltDB executes transactions using a single-threaded architecture in timestamp order. When a transaction must run on multiple nodes, VoltDB uses “speculative execution”. In this model every CPU guarantees that it will process transactions in timestamp order. However, multi-partition transactions require inter-CPU messages and there may be some delay involved. Instead of waiting to process transactions with higher timestamps (and incurring a stall), the idea is to process a transaction in “tentative” mode until the multi-partition transaction is committed. If no conflict is observed, the tentative transactions can be committed; otherwise one or more need to be backed out. Effectively, this is a form of optimistic concurrency control (OCC).
  • No logs. Instead of using logs, persistence is guaranteed by replicating the data to multiple nodes. Each transaction is run on each node in the same timestamp order, and every replica is enabled for active-active usage.
  • Sharding. Each table can be either replicated to all nodes or partitioned based on the sharding configuration.
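
To illustrate the stored-procedure model, here is a hedged sketch of a VoltDB procedure in Java: SQL statements are declared as SQLStmt fields so their plans can be compiled ahead of time, and run() queues and executes them as a single ACID transaction on the partition that owns the key. The customers table, its columns and the procedure name are hypothetical.

    import org.voltdb.SQLStmt;
    import org.voltdb.VoltProcedure;
    import org.voltdb.VoltTable;

    // Hypothetical single-partition procedure; assumes a customers table
    // partitioned on customer_id.
    public class AddCustomer extends VoltProcedure {

        // Declared as fields so VoltDB can pre-compile the execution plans.
        public final SQLStmt insertCustomer = new SQLStmt(
            "INSERT INTO customers (customer_id, name) VALUES (?, ?);");
        public final SQLStmt getCustomer = new SQLStmt(
            "SELECT customer_id, name FROM customers WHERE customer_id = ?;");

        public VoltTable[] run(long customerId, String name) throws VoltAbortException {
            voltQueueSQL(insertCustomer, customerId, name);
            voltQueueSQL(getCustomer, customerId);
            // Executes the queued statements together as one transaction;
            // true marks this as the final batch of the procedure.
            return voltExecuteSQL(true);
        }
    }

Because the logic runs inside the database engine, the client pays one network round-trip per transaction rather than one per statement, which is exactly the ODBC/JDBC overhead discussed above.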

VoltDB will not replace DWH and OLAP systems, nor specialized data stores: graph-oriented, document-oriented, stream-oriented, object-oriented and key-value stores. As Michael Stonebraker mentions in his blog, non-ACID NoSQL solutions should be considered for applications only when:

 

  • The data does not lend itself to relational organization.
  • You can say with certainty that your application will NEVER need competent transaction processing AND no future application that will use the database you’re building will need competent transaction processing.
  • Use of a standard data language (SQL) is not beneficial to your team/organization.

Unless ALL of the above are true, it’s recommended to find an RDBMS that meets your needs.

 

2.      MPP data warehouse solutions

 

“MPP (massively parallel processing) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. Typically, MPP processors communicate using some messaging interface. In some implementations, up to 200 or more processors can work on the same application. An "interconnect" arrangement of data paths allows messages to be sent between processors. Typically, the setup for MPP is more complicated, requiring thought about how to partition a common database among processors and how to assign work among the processors. An MPP system is also known as a "loosely coupled" or "shared nothing" system. An MPP system is considered better than a symmetrically parallel system (SMP) for applications that allow a number of databases to be searched in parallel. These include decision support system and data warehouse applications” – from http://whatis.techtarget.com

 

MPP solutions can be column-oriented, row-oriented, or a combination of both. In some cases the solution is based only on a software product; in other cases it also leverages a specific hardware architecture.

 

The following are examples of MPP DWH solutions:

 

  • Vertica - High performance MPP columnar database with a User-Defined Load System combined with the suite of built-in analytic functions including time-series, event-series and pattern matching.
  • Aster Data - Hybrid row- and column-oriented massively parallel processing (MPP) analytic platform, a software solution that embeds both SQL and MapReduce analytic processing with data stores for deeper insights on multi-structured data sources and types, delivering new analytic capabilities with breakthrough performance and scalability.
  • Greenplum - Shared-nothing hybrid (row- and column-oriented storage) MPP database with a high-performance parallel dataflow engine, software interconnect technology, and advanced high-performance parallel import and export of compressed and uncompressed data from Hadoop clusters.
  • Netezza - High-performance data warehouse appliances and advanced analytics application for uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning.
  • InfiniDB - Column database designed to service the needs of Business Intelligence (BI), data warehouse, and analytical applications where scalability, performance and simplicity are paramount.

Let’s review the major points of Vertica’s architecture and understand their importance:

 

  • Column architecture

      i. Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.

      ii. Column-oriented systems are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.

  • Aggressive data compression

      i. Compression can save big bucks on storage costs, increase data center density, or allow more data to be kept, while simultaneously increasing query performance in cases where I/O is the bottleneck.

      ii. Compression is better on a column store - if you walk down attribute columns there is more similarity than if you cut across the rows.

      iii. Data that is well organized compresses better than data that is located haphazardly.

      iv. Vertica doesn’t do in-place updates. Since new values could come along that don’t compress as well as the old values, some empty space must be left, or updates foregone. Since Vertica puts updates in a separate place (the Write Optimized Store), it can squeeze every last bit out of the data.

  • A hybrid transaction architecture that supports concurrent, parallelized data loading and querying

      i. Data warehouses are often queried by day and bulk-loaded by night. The problem is, there’s too much data to load at night and users are demanding more real-time data. Vertica features a hybrid architecture that allows querying and loading to occur in parallel across multiple projections. Each Vertica site contains a memory-resident Write-Optimized Store (WOS) for recording inserts, updates and deletes, and a Read-Optimized Store (ROS) for handling queries. WOS contents are continuously moved into the associated ROS asynchronously.

      ii. Lightweight transaction management prevents database reads and writes from conflicting, so queries can run against data in the ROS, the WOS or both.

  • Automatic physical database design

      i. Based on DBA-provided logical schema definitions and SQL queries, Vertica automatically determines which projections to construct and where to store them to optimize query performance and high availability.

  • Multiple physical sort orders (“projections”) and grid/shared-nothing hardware architecture

      i. Vertica supports logical relational models; physically, it stores data as “projections” - collections of sorted columns (similar to materialized views).

      ii. Multiple projections stored on networked, shared-nothing machines (“sites”) can contain overlapping subsets of columns with different sort orders, to ensure high availability and enhance performance by executing queries against the projection(s) with the most appropriate columns and sort orders.

  • Automatic “log-less” recovery by query, and high availability without hardware redundancy

      i. Rather than having a mirrored database backup sitting idle for failover purposes, Vertica leverages the redundancy built into the database’s projections. It queries projections not only to handle user requests, but also to rebuild the data in a recently restored projection or site. The Database Designer builds the necessary redundancy into the projections it creates so that a DBA-specified number of site failures can occur without compromising the system.

      ii. Vertica’s approach to recovery avoids bogging down database performance with expensive logging and two-phase commit operations.

  • Connectivity to applications, ETL, and reporting via SQL, JDBC and ODBC (a minimal JDBC sketch follows this list)

      i. Vertica offers industry-standard connectivity through ODBC, JDBC, ADO.Net and its rich API, as well as native integrations and certifications with a variety of tools like Cognos, MicroStrategy, Tableau and others.

      ii. Vertica offers an Informatica plug-in for ETL processing.

      iii. Vertica provides Hadoop and Pig connectors; users have unprecedented flexibility and speed in loading data from Hadoop into Vertica and querying data from Vertica within Hadoop, for example as part of MapReduce jobs. The Vertica Hadoop and Pig connectors are open source, supported by Vertica, and available for download.
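
As a concrete illustration of that connectivity, here is a minimal JDBC sketch in Java. The jdbc:vertica:// URL form and default port 5433 follow Vertica’s published driver conventions, but the host, credentials and the events table are hypothetical; check your driver version for the exact details.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;

    public class VerticaQueryExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("user", "dbadmin");      // hypothetical credentials
            props.put("password", "secret");

            // Hypothetical host and database name.
            String url = "jdbc:vertica://vertica-host:5433/analyticsdb";

            try (Connection conn = DriverManager.getConnection(url, props);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT event_type, COUNT(*) FROM events GROUP BY event_type")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }

An aggregate over a couple of columns like this is exactly the kind of query the columnar storage described above is optimized for, since only the referenced columns need to be read.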

 

3.      Distributed Batch processing

 

This type of data processing is getting a lot of buzz these days and many projects are evaluating it or using it in their content life cycle. Mike Olson (Cloudera CEO) said “these technologies attempt to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. These technologies apply to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But distributed batch processing can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them.”

 

The best examples of this type of solution are:

 

  • Apache™ Hadoop™ - The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
  • HPCC Systems - High Performance Computing Cluster is a massively parallel-processing computing platform by LexisNexis that solves Big Data problems.
  • The Disco Project - Disco is an open-source project developed by Nokia Research Center to solve real problems in handling massive amounts of data.

 

Let’s review the Hadoop ecosystem to better understand the problems it attempts to solve.

 

A little history (by Mike Olson): The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.

 

Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.

 

In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
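
The canonical example of this pattern is Hadoop's WordCount job, sketched below against the org.apache.hadoop.mapreduce API: each mapper emits (word, 1) pairs from its local slice of the input, and the reducers sum the counts for every word. Treat it as an indicative sketch rather than a tuned production job; the input and output HDFS paths are passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: runs on each server against its local slice of the data and
        // emits an intermediate (word, 1) pair per token.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: merges all counts that share the same word into one total.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }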

 

Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.

 

The major parts of Hadoop ecosystem are:

  • Hadoop Distributed File System (HDFS™) - A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce - A software framework for distributed processing of large data sets on compute clusters.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

 

Let’s review the HPCC solution and see how it differs from Hadoop.

 

HPCC Systems was created 10 years ago within the LexisNexis Risk Solutions division that analyzes huge amounts of data for its customers in intelligence, financial services and other high-profile industries.

 

According to Armando Escalante, CTO of LexisNexis Risk Solutions, the company decided to release HPCC now because it wanted to get the technology into the community before Hadoop became the de facto option for big data processing (from Derrick Harris blog). But in order to compete for mindshare and developers, he said, the company felt it had to open-source the technology.

 

Instead of MapReduce technique, HPCC uses ECL (Enterprise Control Language): a declarative, data-centric language that abstracts a lot of the work necessary within MapReduce. For certain tasks that take a thousand lines of code in MapReduce, he said, ECL only requires 99 lines. Furthermore, he explained, ECL doesn’t care how many nodes are in the cluster because the system automatically distributes data across however many nodes are present. Technically, though, HPCC could run on just a single virtual machine. And, says Escalante, HPCC is written in C++ — like the original Google MapReduce on which Hadoop MapReduce is based — which he says makes it inherently faster than the Java-based Hadoop version.

 

HPCC offers two options for processing and serving data: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. Escalante said Thor — so named for its hammer-like approach to solving the problem — crunches, analyzes and indexes huge amounts of data a la Hadoop. Roxie, on the other hand, is more like a traditional relational database or data warehouse that can even serve transactions to a web front end.

 

The community support around Hadoop gives that project a significant advantage; on the other hand, the simplicity of ECL may allow analytics to become more pervasive across the enterprise market.

 

 Part 2 >>

Tags: big data | nosql