The R programming language is fast gaining popularity among data scientists to perform statistical analyses. It is extensible and has a large community of users, many of whom contribute packages to extend its capabilities. However, it is single-threaded and limited by the amount of RAM on the machine it is running on, which makes it challenging to run R programs on big data.
There are efforts under way to remedy this situation, which essentially fall into one of the following two categories:
- Integrate R into a parallel database, or
- Parallelize R so it can process big data
In this post, we look at Vertica’s take on “Integrating R into a parallel database” and the two major areas that allow for the performance improvement. A follow on blog will be posted to describe alternatives to the first approach.
1.) Running multiple instances of the R algorithm in parallel (query partitioned data)
The first major performance benefit from Vertica R implementation has to do with running multiple instances of the R algorithm in parallel with queries that chunk the data independently. In the recently launched Vertica 6.0, we added the ability to write sophisticated R programs and have them run in parallel on a cluster of machines. At a high level Vertica threads communicate with R processes to compute results. It uses optimized data conversion from Vertica tables to R data frames and all ‘R’ processing is automatically parallelized between Vertica servers. The diagram below shows how the Vertica R integration has been implemented from a parallelization perspective.
The parallelism comes from processing independent chunks of data simultaneously (referred to as data parallelism). SQL, being a declarative language, allows database query optimizers to figure out the order of operations, as well as which of them can be done in parallel, due to the well-defined semantics of the language. For example, consider the following query that computes the average sales figures for each month:
SELECT avg(qty*price) FROM sales GROUP BY month;
The semantics of the GROUP BY operation are such that the average sales of a particular month are independent of the average sales of a different month, which allows the database to compute the average for different months in parallel. Similarly, the SQL-99 standard defines analytic functions (also referred to as window functions) – these functions operate on a sliding window of rows and can be used to compute moving averages, percentiles etc. For example, the following query assigns student test scores into quartiles for each grade:
SELECT name, grade, score, NTILE(4) OVER (PARTITION BY grade ORDER BY score DESC) FROM test_scores;
Again, the semantics of the OVER clause in window functions allows the database to compute the quartiles for each grade in parallel, since they are independent of one another. Unlike some of our competitors, instead of inventing yet another syntax to perform R computations inside the database, we decided to leverage the OVER clause, since it is a familiar and natural way to express data parallel computations. A prior blog post shows how easy it is to create, deploy and use R functions on Vertica.
Listed below is an example comparing using R and ODBC vs Vertica’ R implementation with the UDX framework.
Looking at the chart above as your data volumes increase Vertica’s implementation using the UDX framework scales much better compared to an ODBC approach. Note: Numbers indicated on the chart should only be used for relative comparisons since this is not a formal benchmark.
2.) Leveraging column-store technology for optimized data exchange (query non-partitioned data).
It is important to note that even for non-data parallel tasks (functions that operate on input that is basically one big chunk of non-partitioned data) , Vertica’s implementation provides better performance since computation runs on a server instead of client, and we have optimized data flow between DB and R (no need to parse data again).
The other major benefits of Vertica’s R integration has to do with the UDX framework and the avoidance of ODBC and by the efficiencies obtained by Vertica’s column store. Here are some examples showing how much more efficient Vertica’s integration with ‘R’ is compared to a typical ODBC approach for a query having non-partitioned data.
As the chart above indicates performance improvements are also achieved by the optimizing the data transfers between Vertica and R. Since Vertica is a column store and R is vector based it is very efficient to move data from a Vertica column in very large blocks to R vectors. Note: Numbers indicated on the chart should only be used for relative comparisons since this is not a formal benchmark.
This blog focused on performance and ‘R’ algorithms that are amenable to data parallel solutions. A following post will talk about our approach to parallelizing R for problems not amenable to data parallel solutions such as if you want to make one decision tree and “Parallelize R” so it can process the results more effectively.
For more details on how to implement R in Vertica please go to the following blog http://www.vertica.com/2012/10/02/how-to-implement