Advanced Computing in the Age of AI | Tuesday, October 3, 2023

ScaleOut Introduces Real-Time Analytics to MapReduce 

ScaleOut Software today announced the general availability of ScaleOut hServer V2, incorporating new technology that runs Hadoop MapReduce on live data. ScaleOut hServer V2 provides a self-contained execution engine for Hadoop MapReduce applications to significantly accelerate performance and eliminate overheads inherent in standard Hadoop distributions. Initial benchmark tests with ScaleOut hServer V2 have demonstrated a 20x speedup in Hadoop execution times.

The initial ScaleOut hServer release in April 2013 provided low-latency data access for Hadoop. ScaleOut hServer V2 takes the next step in delivering real-time analytics by accelerating the execution of standard Hadoop MapReduce code. Importantly, it also enables fast, concurrent access to and updating of data sets held in the in-memory data grid (IMDG) while continuous MapReduce analyses are being performed. This opens the door to the use of Hadoop MapReduce in operational systems that host live, fast-changing data and need to perform real-time analytics within seconds instead of minutes or hours. It also enables scenarios that require fast execution times on static data sets.

ScaleOut hServer is not intended to replace Hadoop, but it does not require Hadoop to be installed. Instead, the product integrates MapReduce functionality and selected Hadoop components within ScaleOut's in-memory data grid and analytics engine, which reduces installation time from days to a few minutes and simplifies deployment. This capability also enables ScaleOut hServer V2 to be used as a fast, easy-to-use development platform for Hadoop MapReduce applications.

ScaleOut hServer is designed to be compatible with most Java-based Hadoop MapReduce applications developed for the standard Hadoop distributions, requiring only a one-line code change to execute applications using ScaleOut hServer. Applications can input and output data stored either in ScaleOut hServer's IMDG or in external storage repositories, such as the Hadoop Distributed File System (HDFS). The product does not impose a specific limit on the size of the input or result data sets. Instead, only the intermediate data set, which the application inputs to the reducers, must fit within the memory of the IMDG.
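The memory constraint described above can be illustrated with a minimal, self-contained sketch of the MapReduce pattern. It is not ScaleOut hServer or Hadoop code; the function names (`map_words`, `sum_counts`, `run_mapreduce`) and the sample input are hypothetical and chosen only for illustration. The point it shows is that input records can be streamed and results emitted one at a time, while the grouped mapper output passed to the reducers is the structure held in memory.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, counts):
    """Reduce phase: combine all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run a mapper and reducer over an iterable of input records."""
    # Intermediate data set: every mapper output, grouped by key.
    # In this sketch it is the only structure held in memory all at once.
    intermediate = defaultdict(list)
    for record in records:  # the input may be a stream from any source
        for key, value in mapper(record):
            intermediate[key].append(value)

    # Hand each key's group to the reducer and yield results one at a time.
    for key in sorted(intermediate):
        yield reducer(key, intermediate[key])


# Hypothetical sample input standing in for a much larger data set.
lines = ["live data in memory", "live data on disk"]
result = dict(run_mapreduce(iter(lines), map_words, sum_counts))
print(result)
```

In a real deployment the grouping step is distributed across servers, so the intermediate data set would be bounded by the combined memory of the cluster rather than by a single process.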

To reduce execution time, ScaleOut hServer employs numerous optimizations that minimize data motion during the execution of MapReduce applications, and it can automatically cache HDFS data sets within the IMDG (a feature introduced with ScaleOut hServer V1). In addition, ScaleOut hServer's memory capacity and throughput can be scaled by adding servers to the IMDG's cluster. The product automatically rebalances the data set and execution workload when servers are added or removed.
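The announcement does not say how the rebalancing is implemented. As a rough illustration of the general idea, the sketch below uses rendezvous hashing, one common technique for spreading keys across a changing set of servers. It is an assumption made for illustration, not ScaleOut's algorithm, and the server names, key names, and helper functions are hypothetical. The property it demonstrates is that adding a server moves only the keys the new server takes over, and removing it restores the earlier placement.

```python
import hashlib


def owner(key, servers):
    """Return the server with the highest hash score for this key."""
    def score(server):
        digest = hashlib.sha256(f"{server}:{key}".encode()).hexdigest()
        return int(digest, 16)
    return max(servers, key=score)


def placement(keys, servers):
    """Map every key to the server that owns it."""
    return {key: owner(key, servers) for key in keys}


# Hypothetical keys and a three-server cluster.
keys = [f"order-{i}" for i in range(1000)]
before = placement(keys, ["s1", "s2", "s3"])

# Add a fourth server: only keys that the new server wins are relocated.
after = placement(keys, ["s1", "s2", "s3", "s4"])
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to the new server:",
      all(after[k] == "s4" for k in moved))

# Remove the server again: every key returns to its previous owner.
restored = placement(keys, ["s1", "s2", "s3"])
print("placement restored:", restored == before)
```

Limiting data motion to the keys that change owner is what keeps a rebalance cheap; a naive scheme such as `hash(key) % server_count` would relocate most keys whenever the server count changes.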