e-Newsletter Issue 57
Apache Spark -- The Most Famous of Big Data Analysis Tools

Beginning in 2011, driven in part by the growing popularity of Hadoop, cloud technologies and big data computing emerged as major trends in IT. Around the same time, Hadoop MapReduce took hold as the big data processing tool of choice. However, MapReduce has one major drawback: the data it generates must be saved to the Hadoop Distributed File System (HDFS). In other words, when running an iterative algorithm [1], every iteration must read its input from HDFS and write its output back to HDFS, so much of the running time is spent accessing data rather than computing.

In contrast to MapReduce, Spark is an open source cluster computing system built on in-memory primitives. Spark keeps intermediate results in memory instead of writing them to HDFS, which makes it much more efficient at executing iterative algorithms. In the Sort Benchmark competition, Spark sorted 100 TB of data in 23 minutes, breaking the 70-minute record previously held by MapReduce and completing the task in less than one third of the time.
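The difference between the two access patterns can be sketched with a toy model in plain Python. This is not Spark or Hadoop code: a local JSON file stands in for HDFS, and the function names (`step`, `iterate_via_storage`, `iterate_in_memory`, `compare`) are hypothetical names chosen for illustration. The sketch only counts storage round trips; it makes no claim about real cluster performance.

```python
import json
import os
import tempfile


def step(values):
    """One iteration of a toy algorithm: move every value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def iterate_via_storage(values, rounds, path):
    """MapReduce-style pattern: each round reads its input from storage
    and writes its output back (a local file stands in for HDFS here)."""
    with open(path, "w") as f:
        json.dump(values, f)
    io_ops = 1  # initial write of the input data
    for _ in range(rounds):
        with open(path) as f:
            current = json.load(f)
        with open(path, "w") as f:
            json.dump(step(current), f)
        io_ops += 2  # one read and one write per round
    with open(path) as f:
        result = json.load(f)
    io_ops += 1  # final read of the result
    return result, io_ops


def iterate_in_memory(values, rounds):
    """Spark-style pattern: intermediate results stay in memory between rounds."""
    current = list(values)
    for _ in range(rounds):
        current = step(current)
    return current, 0  # no storage access during the iterations


def compare(values, rounds):
    """Run both variants on the same input and report the storage operations used."""
    with tempfile.TemporaryDirectory() as tmp:
        stored, io_ops = iterate_via_storage(
            values, rounds, os.path.join(tmp, "state.json")
        )
    in_memory, _ = iterate_in_memory(values, rounds)
    return stored, in_memory, io_ops
```

Both variants compute the same values; only the number of storage round trips differs, and in the storage-backed variant that number grows with every additional iteration.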

Spark has several distinct advantages over MapReduce, including the ability to keep data in memory and its resilient distributed dataset (RDD) abstraction. Components such as the stream processing system Spark Streaming, the machine learning library MLlib, and the data query module Spark SQL are all built on RDDs, so Spark can easily be extended to these workloads. In the Hadoop environment, separate tools such as Storm, Mahout, and Hive are needed to provide the same functions, whereas Spark covers them within a single framework.

Currently, several Internet companies, including Tencent, Alibaba, Taobao, and Yahoo, use Spark instead of MapReduce to build their own big data analysis platforms. Taobao previously used MapReduce to solve complicated machine learning problems, but the resulting jobs were inefficient and hard to maintain. Eventually, Taobao switched to Spark to run its iterative machine learning algorithms and, at the same time, began using it for its recommendation system as well.

Tencent is another company that switched from MapReduce to Spark, using it to create its recommendation system and build its log data query system. Tencent made the switch to take advantage of Spark’s ability to keep data in memory, as well as its query efficiency, which is 2 to 10 times that of Hive. Given the ongoing development of and strong demand for big data analysis, Spark is becoming increasingly important in big data processing.

[1] An iterative algorithm is a computing process that starts from initial numerical values. Each computed result then becomes an input to the next round of computation, and the process does not stop until the results converge on a target value.
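As a minimal illustration of this definition, the Python sketch below uses the Babylonian method for square roots, chosen here only because it is a familiar iterative algorithm. The function name `iterate_sqrt`, the tolerance, and the step limit are illustrative assumptions, not part of Spark or Hadoop.

```python
def iterate_sqrt(value, initial_guess=1.0, tolerance=1e-12, max_steps=100):
    """Approximate sqrt(value) by feeding each result back in as the next input.

    Returns the final estimate and the number of steps taken.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    guess = initial_guess  # the initial numerical value
    for step in range(1, max_steps + 1):
        # The previous result becomes the input of this round.
        next_guess = 0.5 * (guess + value / guess)
        # Stop once successive results no longer change meaningfully.
        if abs(next_guess - guess) <= tolerance:
            return next_guess, step
        guess = next_guess
    return guess, max_steps


if __name__ == "__main__":
    root, steps = iterate_sqrt(2.0)
    print(f"estimate={root}, steps={steps}")
```

Each pass through the loop is one iteration; in a MapReduce job, every such pass would involve a read from and a write to HDFS.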
