Big Data Theory
Generally speaking, the term “Big Data” refers to any data that, for whatever reason (not just volume), cannot be affordably managed by traditional systems.
What Comes Under Big Data?
Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data:
- Black Box Data: A component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
- Social Media Data: Social media such as Facebook and Twitter hold the information and views posted by millions of people across the globe.
- Structured data: Relational data.
- Semi-structured data: XML data.
- Unstructured data: Word, PDF, Text, Media Logs.
Advantages of Big Data
- More accurate data
- Improved business decisions
- Improved marketing strategy and targeting
- Cost savings: Although implementing real-time big data analytics tools may be expensive, they eventually save a lot of money. Business leaders no longer have to wait for reports, and in-memory databases (useful for real-time analytics) reduce the burden on a company’s overall IT landscape, freeing up resources previously devoted to responding to requests for reports.
- Fraud can be detected the moment it happens, and proper measures can be taken to limit the damage. The financial world is very attractive to criminals. With a real-time safeguard system, attempts to hack into your organization trigger instant notifications, and your IT security department can take appropriate action immediately.
- Using the information kept in social networks like Facebook, marketing agencies learn about the response to their campaigns, promotions, and other advertising media.
- Using data on patients’ previous medical history, hospitals are providing better and quicker service.
Disadvantages of Big Data
- It requires specialized computing power: The standard version of Hadoop is, at the moment, not yet suitable for real-time analysis. New tools need to be bought and used.
- To manage growing volumes of big data, it is crucial to create a fast, efficient, and simple data integration environment. Despite technological advancements, these tools and technologies are still new and not easily usable in an enterprise environment. Often they require large technical teams; the hardest part is balancing the effectiveness of the technology against capital and operational cost constraints.
- If you were still in the process of combining and constructing one data warehouse for all your enterprise functions, big data may stomp out those plans for good. In its expectations of disruptive tech trends for 2013, Gartner Research writes that the maturity of “strategic big data” will move enterprises toward multiple systems.
Working with big data involves operations such as:
- Capturing data
- Searching, sharing, transfer, and analysis
Traditional Approach
In this approach, an enterprise has a computer to store and process big data. Data is stored in an RDBMS such as Oracle Database, MS SQL Server, or DB2, and sophisticated software can be written to interact with the database, process the required data, and present it to users for analysis.
This approach works well when the volume of data can be accommodated by standard database servers, or up to the limit of the processor handling it. But when it comes to huge amounts of data, processing them through a traditional database server becomes a tedious task.
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.
In other words, MapReduce is a parallel and distributed processing approach developed by Google for handling large datasets. MapReduce has been used by Google and Yahoo to power their web search.
MapReduce has two key components: Map and Reduce. Map is a function applied to a set of input values that produces a set of key/value pairs. Reduce is a function that takes these results and applies a further aggregation to them. In other words, Map transforms a set of data into key/value pairs, and Reduce aggregates the values for each key into a single result; a reducer receives all the data for an individual key from all the mappers.
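To make the Map/Reduce idea concrete, here is a minimal sketch in plain Python (not Hadoop itself) of the classic word-count example. It shows the three conceptual steps: a map phase emitting key/value pairs, a shuffle that groups values by key (which the framework normally does for you), and a reduce phase aggregating each key's values:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group all values for the same key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all the counts for one key into a single total."""
    return key, sum(values)

documents = ["big data is big", "data is everywhere"]
# Each document is mapped independently (on a cluster, on different nodes).
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each call to `map_phase` and each per-key `reduce_phase` is independent, the framework can run them on many machines in parallel, which is exactly what makes the model scale.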
Doug Cutting, Mike Cafarella, and their team took the solution described by Google and in 2005 started an open-source project called Hadoop, which Doug named after his son's toy elephant. Apache Hadoop is now a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, with the data processed in parallel on different CPU nodes. In short, the Hadoop framework lets you develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
The Hadoop framework is implemented in Java, and you can develop MapReduce applications in Java or any JVM-based language or use one of the following interfaces:
- Hadoop Streaming - a utility that allows you to create and run jobs with any executables (for example, shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes - a SWIG-compatible (not based on JNI) C++ API to implement MapReduce applications.
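As a sketch of the Streaming model, a word-count job can be built from two small scripts that read lines on stdin and write tab-separated key/value pairs on stdout; Hadoop sorts the mapper output by key before it reaches the reducer. The code below (file layout and the local simulation in `__main__` are illustrative, not part of Hadoop) shows the logic both scripts would contain:

```python
from itertools import groupby

def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for each word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: input arrives sorted by key; sum the counts per word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Simulate the framework locally: map, sort by key, then reduce.
    # On a cluster, the mapper and reducer would be separate executables
    # passed to the hadoop-streaming jar via -mapper/-reducer, with
    # -input/-output pointing at HDFS paths.
    sample = ["big data is big", "data is everywhere"]
    for result in reducer(sorted(mapper(sample))):
        print(result)
```

Because the contract is just "lines in, lines out", the same job could be written in any language, which is the point of Streaming.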