Preface

What are data? This seems like a simple enough question; however, depending on the interpretation, the definition of data can be anything from “something recorded” to “everything under the sun.” Data can be summed up as everything that is experienced, whether it is a machine recording information from sensors, an individual taking pictures, or a cosmic event recorded by a scientist. In other words, everything is data. However, recording and preserving data has always been the challenge, because technology has limited the ability to capture and store them.

The human brain’s memory storage capacity is estimated to be around 2.5 petabytes (or about 2.5 million gigabytes). Think of it this way: If your brain worked like a digital video recorder in a television, 2.5 petabytes would be enough to hold 3 million hours of TV shows. You would have to leave the TV running continuously for more than 300 years to use up all of that storage space. The available technology for storing data pales in comparison, which has given rise to a rapidly growing technology segment called Big Data.

Today, businesses are recording more and more information, and that information (or data) is growing, consuming more and more storage space and becoming harder to manage, thus creating Big Data. The reasons for recording such massive amounts of information vary. Sometimes the reason is adherence to compliance regulations; at other times it is the need to preserve transactions; and in many cases it is simply part of a backup strategy.

Nevertheless, it costs time and money to save data, even if it’s only for posterity. Therein lies the biggest challenge: How can businesses continue to afford to save massive amounts of data? Fortunately, those who have come up with the technologies to mitigate these storage concerns have also come up with a way to derive value from what many see as a burden. It is a process called Big Data analytics.

The concepts behind Big Data analytics are actually nothing new. Businesses have been using business intelligence tools for many decades, and scientists have been studying data sets to uncover the secrets of the universe for many years. However, the scale of data collection is changing, and the more data you have available, the more information you can extrapolate from them.

The challenge today is to find the value of the data and to explore data sources in more interesting and applicable ways to develop intelligence that can drive decisions, find relationships, solve problems, and increase profits, productivity, and even the quality of life.

The key is to think big, and that means Big Data analytics.

This book will explore the concepts behind Big Data, how to analyze that data, and the payoff from interpreting the analyzed data.

Chapter 1 deals with the origins of Big Data analytics, explores the evolution of the associated technology, and explains the basic concepts behind deriving value.
Chapter 2 delves into the different types of data sources and explains why those sources are important to businesses that are seeking to find value in data sets.
Chapter 3 helps those who want to leverage data analytics build a business case that spurs investment in the technologies, and develop the skill sets needed to extract intelligence and value from data sets.
Chapter 4 brings the concepts of the analytics team together, describes the necessary skill sets, and explains how to integrate Big Data into a corporate culture.
Chapter 5 assists in the hunt for data sources to feed Big Data analytics, covers the various public and private sources for data, and identifies the different types of data usable for analytics.
Chapter 6 deals with storage, processing power, and platforms by describing the elements that make up a Big Data analytics system.
Chapter 7 describes the importance of security, compliance, and auditing—the tools and techniques that keep large data sources secure yet available for analytics.
Chapter 8 delves into the evolution of Big Data and discusses the short-term and long-term changes that will materialize as Big Data evolves and is adopted by more and more organizations.
Chapter 9 discusses best practices for data analysis, covers some of the key concepts that make Big Data analytics easier to deliver, and warns of the potential pitfalls and how to avoid them.
Chapter 10 explores the concept of the data pipeline and how Big Data moves through the analysis process and is then transformed into usable information that delivers value.

Sometimes the best information on a particular technology comes from those who are promoting that technology for profit and growth, hence the birth of the white paper. White papers are meant to educate and inform potential customers about a particular technology segment while gently goading those potential customers toward the vendor’s product.

That said, it is always best to take white papers with a grain of salt. Nevertheless, white papers prove to be an excellent source for researching technology and have significant educational value. With that in mind, I have included the following white papers in the appendix of this book, and each offers additional knowledge for those who are looking to leverage Big Data solutions: “The MapR Distribution for Apache Hadoop” and “High Availability: No Single Points of Failure,” both from MapR Technologies.