Parallel Processing of Big Data

Executive Summary

The volume of data generated online from multiple sources has grown rapidly, giving rise to the concept of big data. Such data originates from diverse databases. To process it, firms are turning to parallel computing, which allows multiple data sets to be analyzed and multiple tasks to be executed concurrently. Parallel computing requires a computer with multiple processors. Through parallel processing, firms have been able to carry out in-depth data analytics and extract significant information from massive data sets that would otherwise have been difficult to analyze. Companies such as IBM have established parallel processes for analyzing big data. This has widened firms' access to artificial intelligence as they analyze data to extract specific meaning.

Introduction

Contemporary organizations make increasing use of the internet for shopping, communication, and other goals. A large number of emails are sent to different entities on a daily basis. According to IBM research, every day about 100 terabytes of data are uploaded to Facebook, 294 billion emails are sent, and 230 million tweets are posted on Twitter; together these activities generate about 5 exabytes of data (Luo 2017, p8). Data at this scale is difficult to process on a single system.

Literature Review

Big data refers to collections of data sets that are so large that they are hard to analyze manually without suitable software. Big data is characterized by high volume, high velocity, wide variety, veracity, and differing value. Volume refers to the quantity of data that is generated. Because so much data is generated, high-capacity storage is necessary.

Velocity refers to the speed at which data is created. Velocity matters to companies because it enables the collection of real-time information. Obtaining data in real time allows organizations to make the related decisions promptly. Veracity is another element of big data. It refers to the trustworthiness and accuracy of the data that is collected (Al-Oraiqat, 2010, p35). Accurate data results in better decision making in organizations.

 

Sources of Big Data

Big data comes from multiple sources. One key source is social networks, which people use to share information. Another key source is the government, which shares a huge volume of data with the public so that it can be put to use (Almeida, 2015, p25). Media outlets are a third key source; they lead to the sharing of extensive information within organizations. The sharing of data is thus an important driver of the growing transfer of data online. To understand all the information shared through media outlets, it must be analyzed extensively, which requires the right data analysis software.

Parallel Processing of Big Data

Parallel processing is one of the key cost-effective methods employed in the analysis of data. Efficient parallel processing mechanisms are important for handling and processing large data sets. Combining parallel processing techniques with parallel computing hardware enhances organizational performance. It allows the inputs to be partitioned, which increases the organization's capacity to analyze data.
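
The partitioning of inputs can be illustrated with a minimal Python sketch. The function names (partition, chunk_total, parallel_total) and the sample data are assumptions made for illustration only. A thread pool is used so that the sketch runs anywhere; for CPU-bound work in Python, a process pool would normally be substituted to obtain true parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def chunk_total(chunk):
    """Work done independently on one partition (here, a simple sum)."""
    return sum(chunk)


def parallel_total(items, n_workers=4):
    """Partition the input, process each partition concurrently, combine."""
    chunks = partition(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(chunk_total, chunks))
    return sum(partial_sums)


# Hypothetical input: the integers 1 to 100.
readings = list(range(1, 101))
total = parallel_total(readings)
```

Each partition is processed without reference to the others, which is what makes it safe to hand the partitions to separate workers and combine the partial results at the end.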

In computers, parallel processing involves dividing a program into parts so that it can run in a shorter time. In the earliest computers, only one program ran at a time. A computation-intensive program that took one hour to run and a tape-copying program that took one hour to run would take a total of two hours to run (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015). An early form of parallel processing allowed the interleaved execution of both programs. The next improvement was multiprogramming. In multiprogramming, the multiple programs that an organization runs share the processor, and their execution is properly synchronized.

Vector processing is another approach, in which a single instruction operates on many data elements at the same time. This is valuable in various engineering applications where the data collected is normally in the form of vectors, thus necessitating vector analysis (Bajpeyee, Sinha, & Kumar, 2015). A further step is symmetric multiprocessing (SMP), in which several processors share the same memory. As the number of processors in an SMP system increases, the time it takes for data to propagate from one part of the system to all other parts also grows. When the number of processors becomes large, the performance benefit of adding more processors is too small to justify the additional expense. To get around the problem of long propagation times, message passing systems were created (Cattell, 2011). In these systems, programs that share data send messages to each other to announce that particular operands have been assigned a new value. Instead of broadcasting an operand's new value to all parts of a system, the new value is communicated only to those programs that need to know it. Instead of shared memory, a network supports the transfer of messages between the various programs. This allows multiple processors to be used in analyzing the data.
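
The message passing idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than a description of any particular system: threads stand in for separate processors, each worker's queue stands in for its network connection, and the names (worker, announce, subscriptions) are hypothetical. A new operand value is sent only to the workers that use that operand, not broadcast to all of them.

```python
import queue
import threading

STOP = None  # sentinel message telling a worker to finish


def worker(name, inbox, results):
    """Keep a private copy of operands, updated only through messages."""
    local = {}
    while True:
        message = inbox.get()
        if message is STOP:
            break
        operand, value = message
        local[operand] = value
    results[name] = local


# Hypothetical setup: which worker needs which operand.
subscriptions = {"x": ["A", "B"], "y": ["B"]}
inboxes = {name: queue.Queue() for name in ("A", "B", "C")}
results = {}

threads = [
    threading.Thread(target=worker, args=(name, inbox, results))
    for name, inbox in inboxes.items()
]
for t in threads:
    t.start()


def announce(operand, value):
    """Send the new value only to the workers that use this operand."""
    for name in subscriptions.get(operand, []):
        inboxes[name].put((operand, value))


announce("x", 10)
announce("y", 7)
announce("x", 11)

# Shut the workers down and wait for them to record their final state.
for inbox in inboxes.values():
    inbox.put(STOP)
for t in threads:
    t.join()
```

Worker C never receives a message about x or y because it does not use either operand, which is the saving that message passing offers over a system-wide broadcast.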

In the parallel processing of big data, the work is broken down into many separate, independent operations on vast quantities of data. In data mining, there is a need to perform multiple searches of a static database. Such a database assists the organization in addressing the phenomenon under inquiry. In the use of artificial intelligence, it is essential to evaluate multiple alternatives in the decision-making process. This ensures that multiple solutions are generated and can be compared before a decision is made (Chen & Zhang, 2014, p340). It is also important to analyze clusters of information, because it is in such clusters that information is hidden which may help the organization meet its goals. In parallel analysis, the data selected should have a high chance of providing a solution to the problem under inquiry. Nevertheless, it is important to draw on multiple data sets so that the solutions produced by big data analysis are diverse and provide the required artificial intelligence to the organization.

Parallel Database

In the early period of the adoption of computer-based data, organizations held little data in their databases. As the operations of organizations have intensified, the volume of data shared within them has increased. This has created a need for such organizations to establish databases for analysis. Relational queries are ideally suited to parallel execution (Chen, Mao, & Liu, 2014, p200). Every relational query can be translated into operations such as scan and sort. Through these operators, the source data is transformed into the result that the client requested. Each stream comes from a data source and becomes the input of the first operator, which produces an output that is used as the input of the second operator; eventually, the final output is generated from the merged streams.

Through the use of databases, organizations are in a position to gather large volumes of data and to analyze them. This process helps the organization develop a better understanding of the data collected. Online, organizations can gather big data regarding the operations of firms in their industry. The analysis of such data assists organizations in making decisions regarding the opportunities and threats that exist in the industry. The analysis of big data is thus of high value in organizational decision making.

The hardware infrastructure influences how organizations manage big data. For an organization to process data well, it needs reliable random access memory (Luo 2017). This ensures that the organization can hold a lot of data in the database and that the database has adequate speed. Without the right speed, it would be difficult for the organization to process the data effectively. It is thus important for the organization to scale up its hardware in order to improve the rate at which data is analyzed.
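
The operator pipeline described for relational queries can be sketched in Python. This is a simplified illustration, not a database engine: the table contents, the partitioning, and the function names are assumptions. Each partition is scanned and sorted independently, and the sorted streams are then merged into one result, roughly what a query such as SELECT id, amount FROM t WHERE amount >= 100 ORDER BY id would return.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table, already split into two partitions of (id, amount) rows.
partitions = [
    [(3, 250), (1, 90), (5, 400)],
    [(4, 120), (2, 310), (6, 75)],
]


def scan(rows, min_amount):
    """First operator: read a partition and keep the rows that match."""
    return [row for row in rows if row[1] >= min_amount]


def sort(rows):
    """Second operator: order the scanned rows by id."""
    return sorted(rows)


def scan_then_sort(rows):
    """Pipeline for one partition: the output of scan is the input of sort."""
    return sort(scan(rows, min_amount=100))


# Run the same pipeline on every partition concurrently.
with ThreadPoolExecutor() as pool:
    sorted_streams = list(pool.map(scan_then_sort, partitions))

# Merge the sorted streams into the single result the client asked for.
result = list(heapq.merge(*sorted_streams))
```

Because scan and sort touch only the rows of their own partition, the partitions can be processed in parallel; only the final merge needs to see all of the streams.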

For an organization to enhance data processing, it is important that it has an adequate shared memory architecture. This ensures that the data is easy for every processor to access. Adding processors, however, increases the internal traffic. It is thus the role of the organization to ensure that synchronization is carried out effectively.
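
The need for synchronization in a shared memory design can be shown with a short Python sketch. The data and names are hypothetical, and threads stand in for processors that share one memory. Every worker adds its result to the same shared total, and a lock ensures that only one worker updates that total at a time.

```python
import threading

shared = {"total": 0}            # memory visible to every worker
total_lock = threading.Lock()    # synchronizes access to the shared value


def count_matches(records, keyword):
    """Count matching records, then add the count to the shared total."""
    local = sum(1 for record in records if keyword in record)
    with total_lock:             # only one worker updates at a time
        shared["total"] += local


# Hypothetical log records, split into one batch per worker.
batches = [
    ["error: disk", "ok", "error: net"],
    ["ok", "ok"],
    ["error: cpu"],
]
workers = [
    threading.Thread(target=count_matches, args=(batch, "error"))
    for batch in batches
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Each worker does most of its work on local data and holds the lock only for the brief update, which keeps the contention on the shared memory low as more workers are added.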

Another key element associated with effective data processing is the shared-disk architecture. Each processor in this architecture has its own memory. Nevertheless, through proper networking, every processor can access all of the disks. There are many merits associated with the use of shared-disk infrastructure. For instance, data on the disks remains accessible even if a node fails. It is also easy to add more processors to a shared-disk system, something that increases the ability of the system to handle big data. Nevertheless, coordination of the processors is essential if the organization is to engage in useful analysis of the data. Proper processing of information is thus critical in organizations and should be enhanced.

Comparisons

Traditionally, software has been written for serial computation. The instructions are executed sequentially, one after the other, on a single processor. Only one instruction is executed at a time, as indicated in the diagram below.

Source: Matsuura & Miyatake (2014).

When the above approach is used in payroll processing, the outcome is as indicated in the diagram below.

Source: Gandomi & Haider (2015, p143).

The above approach to computing has evolved over time, resulting in the development of parallel computing for big data. Under parallel computing, a problem is broken into different parts which are then computed concurrently. Each part is further broken down into a series of instructions. The instructions from each part are executed simultaneously on different processors. To achieve this, the entire process must be controlled and coordinated. An example of parallel processing is depicted in the chart below.

Source: Gandomi & Haider (2015, p142).

The above process of parallel computing may be employed in payroll processing, as indicated in the diagram below.

Source: Han & Ahn (2014, p85).
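
The payroll case can be sketched in Python to contrast the two approaches. The employee records, the flat 20 percent tax, and the function names are assumptions made for illustration. The serial version handles one employee after another, while the parallel version distributes the employees across workers; both produce the same result because each employee's pay is independent of the others.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical employee records: (name, hours worked, hourly rate).
employees = [
    ("Ana", 160, 25),
    ("Ben", 120, 30),
    ("Chi", 150, 20),
    ("Dev", 100, 40),
]


def net_pay(employee, tax_percent=20):
    """Compute one employee's pay; independent of every other employee."""
    name, hours, rate = employee
    gross = hours * rate
    return name, gross * (100 - tax_percent) // 100


def payroll_serial(records):
    """One employee after another on a single worker."""
    return [net_pay(employee) for employee in records]


def payroll_parallel(records, n_workers=4):
    """Employees are distributed across workers and computed concurrently."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(net_pay, records))


serial_result = payroll_serial(employees)
parallel_result = payroll_parallel(employees)
```

The only coordination the parallel version needs is collecting the results in the original order, which pool.map does automatically.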

In parallel computing, the computational problem should be one that can be broken into discrete parts that can be solved simultaneously. It should also be possible to execute multiple program instructions at any moment. Today, all computers are parallel from a hardware perspective. They contain multiple functional units, such as L1 cache, L2 cache, decode, floating-point, graphics, and integer units, as well as multiple execution units, cores, and hardware threads. Networks connect multiple standalone computers to form larger parallel computer clusters, as indicated in the diagram below.

Source: Hashem, Yaqoob, Anuar, Mokhtar, Gani, & Khan (2015, p88).

Within a parallel computation, the processing can also be organized into stages: the inputs fed to the system, the mappers, the combiners, the reducers, and finally the output. This indicates that multiple components are involved in parallel computing. Just as the name suggests, multiple activities are executed at the same time. This is depicted in the diagram below.

Source: Kambatla, Kollias, Kumar, & Grama (2014, p2565).
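
The stages named above (input, mappers, combiners, reducers, output) can be traced in a small Python word count. This is a single-machine sketch with hypothetical input text, not the implementation of any particular framework; a thread pool stands in for the machines that would run the mappers and combiners.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Input: hypothetical text, already divided into splits (one per mapper).
splits = [
    "big data needs parallel processing",
    "parallel processing of big data",
    "data data everywhere",
]


def mapper(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def combiner(pairs):
    """Pre-aggregate one mapper's output locally to cut data sent onward."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def map_and_combine(split):
    """Run the mapper and its combiner on the same split."""
    return combiner(mapper(split))


def reducer(word, counts):
    """Sum all partial counts received for one word."""
    return word, sum(counts)


# Mappers and combiners run concurrently, one task per split.
with ThreadPoolExecutor() as pool:
    combined = list(pool.map(map_and_combine, splits))

# Shuffle: group the partial counts by word so each reducer sees one key.
grouped = defaultdict(list)
for pairs in combined:
    for word, count in pairs:
        grouped[word].append(count)

# Output: final count per word.
output = dict(reducer(word, counts) for word, counts in grouped.items())
```

The combiner matters most in the third split, where the two occurrences of "data" are collapsed into a single pair before anything is passed to the reducers.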

Based on the above analysis, it is evident that parallel computing is important. It allows multiple instructions to be processed within a short time. Equally, it ensures that multiple inputs can be processed at the same time to generate the desired outputs. Outputs are not based on a single instruction (Rajan, van Ginkel, Sundaresan, Bardhan, Chen, Fuchs, & Manadhata, 2013, p2). Rather, they may involve a combination of instructions whose results are combined in the right manner. Parallel computing is thus a more advanced approach than other forms of computing, and many contemporary organizations use it to analyze big data. The differences between serial and parallel computing are indicated in the table below.

 

Serial computing | Parallel computing
One task is executed at a time | Tasks are executed simultaneously
Takes a long time to complete all of the tasks | Saves time, since the tasks are completed in a shorter time
Tasks are completed one after another | Independent tasks may be completed at the same time
The full load falls on a single processor, which may cause the processor to heat up (Hashem, Yaqoob, Anuar, Mokhtar, Gani, & Khan, 2015, p105) | The load is shared among several processors
Data is transferred bit by bit | Data is transferred several bits at a time, for example in bytes
Serial processors are cheaper | Parallel processors are more expensive

 

As indicated in the above table, there are significant differences between parallel computing and serial computing. The similarities between serial and parallel computing are summarized below.

  • Both aim to provide an output as a solution.
  • Both rely on input in order to generate the desired outputs.
  • Both require processors to process the data that is input into the system.
  • Both require that the computer has adequate memory in which the processed data is stored.

As indicated above, while there are significant differences between serial and parallel computing, there are also significant similarities. Nevertheless, with the advancement of technology, the majority of computers will be based on parallel processors (Zaharia, Chowdhury, Franklin, Shenker, & Stoica, 2010). Such processors will enable computers to process data within a short time.

Methodology

MapReduce is a parallel programming model that is used in the processing of large data sets. There are two key functions in a MapReduce program: the map function and the reduce function. Both are usually written by the users of these systems. The map function accepts the input that the user feeds to the system and produces a set of intermediate results, which are then sent to the reduce function. The reduce function merges these intermediate results into the final output. This method assists in the data mining process.
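
The division of labor between the framework and the user-written functions can be sketched in Python. The driver below (map_reduce) is a hypothetical, sequential stand-in for a real framework, which would distribute the map and reduce calls across many machines; the sales records and function names are likewise assumptions. The user supplies only the map function and the reduce function.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Minimal sequential driver: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


# User-written map function: one (store, amount) pair per sale record.
def sale_to_pair(record):
    store, amount = record.split(",")
    yield store, int(amount)


# User-written reduce function: total of all amounts for one store.
def total_sales(store, amounts):
    return sum(amounts)


# Hypothetical input records in "store,amount" form.
sales = ["north,120", "south,80", "north,30", "east,50", "south,20"]
totals = map_reduce(sales, sale_to_pair, total_sales)
```

Swapping in a different pair of functions, for example one that emits (customer, item) pairs and one that counts distinct items, changes the analysis without any change to the driver.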

Case Study

The method of analysis used in this study is the case study method. This approach ensures that the phenomenon under study is examined in depth and that it is understood in the context of a real organization. The study is based on secondary data collected from reliable sources. IBM is one of the leading organizations in the information technology sector. Besides the mainframe power systems that it makes use of, the organization is involved in data warehousing and data mining. Data mining involves the retrieval and analysis of the many kinds of data that organizations use. The organization makes use of SPSS and Cognos in its data analytics. Big data is emerging as a challenge for the many organizations that lack access to tools like those applied by IBM (IBM, 2019, p1). It is worth noting that big data management involves the management of data from multiple sources. According to IBM, artificial intelligence, social media, and the Internet of Things (IoT) are driving data complexity, and big data analytics assists in the analysis of both structured and unstructured data (IBM 2019, p1). Big data refers to data sets whose type and size are beyond the ability of traditional databases to capture.

One of the key initiatives that IBM has taken in the analysis of big data is its teaming up with Hortonworks to provide an enterprise-grade Hadoop distribution. This cost-effective solution enables the organization and its clients to analyze big data successfully. The key big data approaches employed by the organization include the adoption of new data formats. It uses new ways of analyzing semi-structured and unstructured data, including data that cannot be ingested into the enterprise data warehouse. This approach results in more accurate analytic decisions in response to modern technology demands, which include the IoT and artificial intelligence.

The organization equally engages in data lake analytics. This form of parallel analysis gives its customers real-time, self-service access to advanced analytics. The Hadoop data lake provides a key capability for data analysis which is likely to be advanced further in the future. The organization also offers data offload services, which are used to optimize an organization's in-house data warehouse, resulting in better data analysis outcomes. Not only does the organization use these elements internally, but it also connects its customers with a platform on which they can make use of these applications.

In its operations, IBM often makes use of parallel computing in data analysis and mining. In its data analytics, the organization encounters differing forms of data (IBM, 2019). To extract meaning from such data, it uses parallel data analysis, in which multiple processors are employed. This allows the data that the organization gathers from diverse sources to be analyzed through multiple processes at once. Parallel analysis ensures that multiple tasks are executed concurrently.

In its products and operations, IBM has established solutions that enable parallel processing. One of its key applications is Db2. For parallel processing, Db2 must run on a central processor complex that contains multiple coupled processors, also called central processors (CPs). If only one CP is online when the query is bound, Db2 considers only parallel I/O operations (IBM 2018, p1). Query I/O parallelism is deprecated and is likely to be removed in the future. Db2 also considers only parallel I/O operations when a cursor is declared WITH HOLD and the application is bound with isolation RR or RS. IBM tunes parallel processing through a number of steps. The first step involves increasing the size of the buffer pools. It is followed by an adjustment of the buffer pool parameters in order to enhance the degree of parallelism. The ALTER BUFFERPOOL command is used to increase the total buffer pool size. These measures enable the organization to engage in parallel computing of big data, allowing multiple activities to be carried out on the same machine with multiple processors.

Conclusion

As indicated above, there has been an increase in big data online. Such data is gathered from social sites, organizational sites, and individuals' feeds online. To process big data in parallel, multiple processors are placed in the same computer, which ensures that multiple actions are carried out at the same time. This is unlike serial processing, which handles one task at a time. Parallel processing is thus essential to the processing of big data in organizations. Companies such as IBM have established applications that enable firms to engage in significant data analysis.

References

Al-Oraiqat, A. (2010). Parallel implementation of a vehicle rail dynamical model for multi-core systems. Int J Adv Stud Comput Sci Eng, 6(1), 34–41.

Almeida, J. B. (2015). A comprehensive overview of open source big data platforms and frameworks. International Journal of Big Data (IJBD), 2(3), 15-33.

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79(1), 3-5.

Bajpeyee, R., Sinha, S. P., & Kumar, V. (2015). Big Data: A Brief Investigation on NoSQL Databases. International Journal of Innovations & Advancement in Computer Science, 4(1), 28-35.

Cattell, R. (2011). Scalable SQL and NoSQL data stores. Acm Sigmod Record, 39(4), 12-27.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques, and technologies: A survey on Big Data. Information Sciences, 275(1), 314-347.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.

IBM (2018). Tuning parallel processing. Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/perf/src/tpc/db2z_tuneparallelprocess.html

IBM (2019). Methods of parallel processing. Retrieved from https://computing.llnl.gov/tutorials/parallel_comp/

Luo, C. (2017). Survey of parallel processing on big data. Retrieved from 58914aae88752d095d9602364a8b9410f471.pdf

Matsuura, G., & Miyatake, M. (2014). Optimal train speed profiles by dynamic programming with parallel computing and the fine-tuning of mesh. WIT Trans Built Environ, 135(1), 767–777.

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.

Han, U., & Ahn, J. (2014). Dynamic load balancing method for apache flume log processing. Advanced Science and Technology Letters, 79, 83-86.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of big data on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Kambatla, K., Kollias, G., Kumar, V., & Grama, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7), 2561-2573.

Rajan, S., van Ginkel, W., Sundaresan, N., Bardhan, A., Chen, Y., Fuchs, A., & Manadhata, P. (2013). Expanded top ten big data security and privacy challenges. Cloud Security Alliance, available at https://cloudsecurityalliance.org/research/big-data/

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. HotCloud, 10(1), 95.