Chapter 10

Bringing It All Together

The promises offered by data-driven decision making have been widely recognized. Businesses have been using business intelligence (BI) and business analytics for years now, realizing the value offered by smaller data sets and offline advanced processing. However, businesses are just starting to realize the value of Big Data analytics, especially when paired with real-time processing.

That has led to a growing enthusiasm for the notion of Big Data, with businesses of all sizes starting to throw resources behind the quest to extract value from large data stores composed of structured, semistructured, and unstructured data. Although the promises wrapped around Big Data are very real, there is still a wide gap between its potential and its realization.

That gap is highlighted by the organizations that applied Big Data concepts successfully from the outset. For example, it is estimated that Google alone contributed $54 billion to the U.S. economy in 2009, a significant economic effect, mostly attributed to the ability to handle large data sets in an efficient manner.

That alone is probably reason enough for the majority of businesses to start evaluating how Big Data analytics can affect the bottom line, and those businesses should probably start evaluating Big Data promises sooner rather than later.

Delving into the value of Big Data analytics reveals that elements such as heterogeneity, scale, timeliness, complexity, and privacy concerns can impede progress at every phase of the process that creates value from data. The primary problem begins at the point of data acquisition, when the data tsunami forces us to make decisions, currently in an ad hoc manner, about what data to keep, what to discard, and how to reliably store what we keep with the right metadata.

Adding to the confusion is that most data today are not natively stored in a structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display but not for semantic content and search. Transforming such content into a structured format for later analysis is a major challenge.
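To make that transformation concrete, the short Python sketch below pulls a handful of structured fields (hashtags, mentions, a capture timestamp) out of a raw tweet so the record can be stored and queried in tabular form. The field names and the function itself are illustrative only, not part of any standard schema.

```python
import re
from datetime import datetime, timezone

def structure_tweet(raw_text, author):
    """Turn a weakly structured tweet into a flat, queryable record.

    The field names here are illustrative only, not a standard schema.
    """
    return {
        "author": author,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "hashtags": re.findall(r"#(\w+)", raw_text),
        "mentions": re.findall(r"@(\w+)", raw_text),
        "text": raw_text.strip(),
    }

record = structure_tweet("Loving the new #BigData dashboard, thanks @acme_analytics!", "jdoe")
print(record["hashtags"], record["mentions"])
```

Even this trivial example illustrates the point: the structure has to be imposed by the application, because the source data do not carry it.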

Nevertheless, the value of data explodes when they can be linked with other data; thus data integration is a major creator of value. Since most data are generated directly in digital form today, businesses have both the opportunity and the challenge to influence how data are created so as to facilitate later linkage, and to automatically link previously created data.
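As a minimal sketch of why linkage creates value, assume two illustrative customer lists, one from a CRM system and one from web analytics, joined on a deliberately crude normalized-name key; real record linkage would use far more robust matching.

```python
def link_key(name, zip_code):
    """Build a crude linkage key: lowercased alphanumeric name plus ZIP code."""
    normalized = "".join(ch for ch in name.lower() if ch.isalnum())
    return f"{normalized}|{zip_code}"

# Two illustrative sources: revenue lives in the CRM, behavior in the web logs.
crm_records = [{"name": "Ann O'Hara", "zip": "02139", "lifetime_value": 1800}]
web_records = [{"name": "ann ohara", "zip": "02139", "page_views": 412}]

crm_index = {link_key(r["name"], r["zip"]): r for r in crm_records}
for web in web_records:
    match = crm_index.get(link_key(web["name"], web["zip"]))
    if match:
        # Only after linkage can behavior (page views) be tied to revenue.
        print(match["name"], match["lifetime_value"], web["page_views"])
```

Neither source alone answers the question of which high-value customers are actually engaging online; the linked view does.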

Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications because of the lack of scalability of the underlying algorithms as well as the complexity of the data that need to be analyzed. Finally, presentation of the results and their interpretation by nontechnical domain experts is crucial for extracting actionable knowledge.

THE PATH TO BIG DATA

During the last three to four decades, primary data management principles, including physical and logical independence, declarative querying, and cost-based optimization, have created a multibillion-dollar industry that has delivered added value to collected data. The evolution of these technical advances has led to the creation of BI platforms, which have become one of the primary tenets of value extraction and corporate decision making.

The foundation laid by BI applications and platforms has created the ideal environment for moving into Big Data analytics. After all, many of the concepts remain the same; it is just the data sources and the quantity that primarily change, as well as the algorithms used to expose the value.

That creates an opportunity in which investment in Big Data and its associated technical elements becomes a must for many businesses. That investment will spur further evolution of the analytical platforms in use and will strive to create collaborative analytical solutions that look beyond the confines of traditional analytics. In other words, appropriate investment in Big Data will lead to a new wave of fundamental technological advances that will be embodied in the next generations of Big Data management and analysis platforms, products, and systems.

The time is now. Using Big Data to solve business problems and promote research initiatives will most likely create huge economic value in the U.S. economy for years to come, making Big Data analytics the norm for larger organizations. However, the path to success is not easy and may require that data scientists rethink data analysis systems in fundamental ways.

A major investment in Big Data, properly directed, not only can result in major scientific advances but also can lay the foundation for the next generation of advances in science, medicine, and business. So business leaders must ask themselves the following: Do they want to be part of the next big thing in IT?

THE REALITIES OF THINKING BIG DATA

Today, organizations and individuals are awash in a flood of data. Applications and computer-based tools are collecting information on an unprecedented scale. The downside is that the data have to be managed, which is an expensive, cumbersome process. Yet the cost of that management can be offset by the intrinsic value offered by the data, at least when looked at properly.

The value is derived from the data themselves. Decisions that were previously based on guesswork or on painstakingly constructed models of reality can now be grounded directly in the data. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.

Certain market segments have had early success with Big Data analytics. For example, scientific research has been revolutionized by Big Data, a prime case being the Sloan Digital Sky Survey, which has become a central resource for astronomers the world over.

Big Data has transformed astronomy from a field in which taking pictures of the sky was a large part of the job to one in which the pictures are all in a database already and the astronomer’s task is to find interesting objects and phenomena in the database.

Transformation is taking place in the biological arena as well. There is now a well-established tradition of depositing scientific data into a public repository and of creating public databases for use by other scientists. In fact, there is an entire discipline of bioinformatics that is largely devoted to the maintenance and analysis of such data. As technology advances, particularly with the advent of next-generation sequencing, the size and number of available experimental data sets are increasing exponentially.

Big Data has the potential to revolutionize more than just research; the analytics process has started to transform education as well. A recent detailed quantitative comparison of different approaches taken by 35 charter schools in New York City has found that one of the top five policies correlated with measurable academic effectiveness was the use of data to guide instruction.

This example is only the tip of the iceberg; as access to data and analytics improves and evolves, much more value can be derived. The potential here leads to a world where authorized individuals have access to a huge database in which every detailed measure of every student’s academic performance is stored. That data could be used to design the most effective approaches for education, ranging from the basics, such as reading, writing, and math, to advanced college-level courses.

A final example is the health care industry, in which everything from insurance costs to treatment methods to drug testing can be improved with Big Data analytics. Ultimately, Big Data in the health care industry will lead to reduced costs and improved quality of care, which may be attributed to making care more preventive and personalized and basing it on more extensive (home-based) continuous monitoring.

More examples are readily available to prove that data can deliver value well beyond one’s expectations. The key issues are the analysis performed and the goal sought. The previous examples only scratch the surface of what Big Data means to the masses. The essential point here is to understand the intrinsic value of Big Data analytics and extrapolate the value as it can be applied to other circumstances.

HANDS-ON BIG DATA

The analysis of Big Data involves multiple distinct phases, each of which introduces challenges. These phases include acquisition, extraction, aggregation, modeling, and interpretation. However, most people focus just on the modeling (analysis) phase.

Although that phase is crucial, it is of little use without the other phases of the data analysis process, which, if neglected, can create problems such as false outcomes and uninterpretable results. The analysis is only as good as the data provided. The problem stems from the fact that there are poorly understood complexities in the context of multitenanted data clusters, especially when several analyses are being run concurrently.

Many significant challenges extend beyond and underneath the modeling phase. For example, Big Data has to be managed in context: the data may include spurious information, can be heterogeneous in nature, and often lack an upfront model. This means that data provenance must be accounted for and that methods must be created to handle uncertainty and error.

Perhaps the problems can be attributed to ignorance or, at the very least, a lack of consideration for primary topics that define the Big Data process yet are often afterthoughts. This means that questions and analytical processes must be planned and thought out in the context of the data provided. One has to determine what is wanted from the data and then ask the appropriate questions to get that information.

Accomplishing that will require smarter systems as well as better support for those making the queries, perhaps by empowering those users with natural language tools (rather than complex mathematical algorithms) to query the data. The key issue is the level of artificial intelligence that is achievable and how much it can be relied on. Currently, IBM's Watson is a major step toward bringing artificial intelligence into the Big Data analytics space, yet the sheer size and complexity of the system preclude its use by most analysts.

This means that other methodologies to empower users and analysts will have to be created, and they must remain affordable and be simple to use. After all, the current bottleneck with processing Big Data really has become the number of users who are empowered to ask questions of the data and analyze them.

THE BIG DATA PIPELINE IN DEPTH

Big Data does not arise from a vacuum (except, of course, when studying deep space). Basically, data are recorded from a data-generating source. Gathering data is akin to sensing and observing the world around us, from the heart rate of a hospital patient to the contents of an air sample to the number of Web page queries to scientific experiments that can easily produce petabytes of data.

However, much of the data collected is of little interest and can be filtered and compressed by many orders of magnitude, which creates a bigger challenge: the definition of filters that do not discard useful information. For example, suppose one data sensor reading differs substantially from the rest. Can that be attributed to a faulty sensor, or are the data real and worth inclusion?
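One conservative answer is to flag suspect readings rather than discard them outright. The sketch below, a simple median-absolute-deviation filter in Python, is one illustrative way to do this; the 3.5 cutoff is a common rule of thumb, not a universal constant.

```python
from statistics import median

def flag_suspect_readings(readings, threshold=3.5):
    """Flag readings far from the median instead of silently discarding them.

    Uses the median absolute deviation (MAD), which is robust to the very
    outliers it is trying to detect; 3.5 is a common rule-of-thumb cutoff.
    """
    med = median(readings)
    mad = median(abs(x - med) for x in readings) or 1e-9  # avoid dividing by zero
    results = []
    for x in readings:
        score = 0.6745 * (x - med) / mad  # approximate z-score under normality
        results.append((x, abs(score) > threshold))
    return results

for value, suspect in flag_suspect_readings([71.9, 72.1, 72.0, 98.4, 71.8]):
    print(value, "SUSPECT" if suspect else "ok")
```

Flagging rather than deleting leaves the judgment about faulty sensors versus genuine anomalies to a later, better-informed stage of the pipeline.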

Further complicating the filtering process is how the sensors gather data. Are they based on time, transactions, or other variables? Are the sensors affected by environment or other activities? Are the sensors tied to spatial and temporal events such as traffic movement or rainfall?

Before the data are filtered, these considerations and others must be addressed. That may require new techniques and methodologies to process the raw data intelligently and deliver a data set in manageable chunks without throwing away the needle in the haystack. Further filtering complications come with real-time processing, in which the data are in motion and streaming on the fly, and one does not have the luxury of being able to store the data first and process them later for reduction.

Another challenge comes in the form of automatically generating the right metadata to describe what data are recorded and how they are recorded and measured. For example, in scientific experiments, considerable detail on specific experimental conditions and procedures may be required to be able to interpret the results correctly, and it is important that such metadata be recorded with observational data.

When implemented properly, automated metadata acquisition systems can minimize the need for manual processing, greatly reducing the human burden of recording metadata. Those who are gathering data also have to be concerned with data provenance. Recording information about the data at the time of their creation becomes important as the data move through the data analysis process. A processing error at one step can render all subsequent analysis useless; with suitable provenance, the downstream processing steps that depend on the faulty one can be quickly identified. Tracking the accuracy of the data is accomplished by generating suitable metadata that carry the provenance of the data through the data analysis process.
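A minimal sketch of what such provenance capture might look like follows, assuming an in-memory log and illustrative step and data-set names; a production system would write these entries to a durable store alongside the data.

```python
import hashlib
import json
from datetime import datetime, timezone

provenance_log = []  # illustrative in-memory log; real systems need durable storage

def record_step(step_name, inputs, params, output):
    """Record one processing step: what went in, which settings were used,
    and a digest of what came out, so downstream work can be traced back."""
    provenance_log.append({
        "step": step_name,
        "inputs": inputs,          # identifiers of upstream data sets or steps
        "params": params,          # parameters the step was run with
        "output_digest": hashlib.sha256(
            json.dumps(output, sort_keys=True).encode()).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    return output

raw_readings = [71.9, 72.1, 98.4]
cleaned = record_step("drop_suspect_readings", ["sensor_feed_17"],
                      {"ceiling": 90}, [x for x in raw_readings if x < 90])
print(provenance_log[-1]["step"], provenance_log[-1]["output_digest"][:12])
```

Because every entry names its inputs, a faulty step can later be traced forward to everything that consumed its output.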

Another step in the process consists of extracting and cleaning the data. The information collected will frequently not be in a format ready for analysis. For example, consider electronic health records in a medical facility that consist of transcribed dictations from several physicians, structured data from sensors and measurements (possibly with some associated anomalous data), and image data such as scans. Data in this form cannot be effectively analyzed. What is needed is an information extraction process that draws out the required information from the underlying sources and expresses it in a structured form suitable for analysis.

Accomplishing that correctly is an ongoing technical challenge, especially when the data include images (and, in the future, video). Such extraction is highly application dependent; the information in an MRI, for instance, is very different from what you would draw out of a surveillance photo. The ubiquity of surveillance cameras and the popularity of GPS-enabled mobile phones, cameras, and other portable devices means that rich and high-fidelity location and trajectory (i.e., movement in space) data can also be extracted.

Another issue is the honesty of the data. For the most part, data are assumed to be accurate and truthful. However, in some cases, those who are reporting the data may choose to hide or falsify information. For example, patients may choose to hide risky behavior, or potential borrowers filling out loan applications may inflate income or hide expenses. The ways in which data can be misreported or misinterpreted are endless. The act of cleaning data before analysis should rely on well-recognized constraints on valid data or well-understood error models, which may be lacking in Big Data platforms.
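Such constraints are most useful when they are declared explicitly and checked mechanically. The sketch below shows one illustrative way to do that for a hypothetical loan-application record; the field names and bounds are assumptions for the example, not a real error model.

```python
# Declared constraints on what a valid loan-application record looks like.
# Field names and bounds are illustrative, not a real underwriting model.
CONSTRAINTS = {
    "annual_income": lambda v: 0 <= v <= 5_000_000,
    "monthly_expenses": lambda v: v >= 0,
    "age": lambda v: 18 <= v <= 120,
}

def validate(record):
    """Return the list of constraint violations rather than silently 'fixing' data."""
    violations = []
    for field, check in CONSTRAINTS.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not check(record[field]):
            violations.append(f"out of range: {field}={record[field]}")
    return violations

print(validate({"annual_income": -40_000, "monthly_expenses": 2_500, "age": 34}))
```

Reporting violations, rather than quietly repairing them, keeps the cleaning step auditable and preserves the evidence of possible misreporting.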

Moving data through the process requires concentration on integration, aggregation, and representation of the data—all of which are process-oriented steps that address the heterogeneity of the flood of data. Here the challenge is to record the data and then place them into some type of repository.

Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis, all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are machine readable and then computer resolvable. It may take a significant amount of work to achieve automated error-free difference resolution.
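A machine-readable mapping table is one common way to express such differences so that a program can resolve them. The sketch below assumes two hypothetical clinical sources with different field names and units; the mappings shown are illustrative only.

```python
# Machine-readable mappings from two hypothetical source schemas onto one
# canonical schema; the field names and unit conversion are assumptions.
SCHEMA_MAPPINGS = {
    "clinic_a": {"patient": "patient_id", "wt_kg": "weight_kg"},
    "clinic_b": {"pid": "patient_id",
                 "weight_lb": ("weight_kg", lambda lb: lb * 0.4536)},
}

def to_canonical(source, record):
    """Resolve naming and unit differences automatically using the mapping table."""
    canonical = {}
    for src_field, target in SCHEMA_MAPPINGS[source].items():
        if src_field not in record:
            continue
        if isinstance(target, tuple):   # rename plus unit conversion
            name, convert = target
            canonical[name] = round(convert(record[src_field]), 2)
        else:                           # straight rename
            canonical[target] = record[src_field]
    return canonical

print(to_canonical("clinic_b", {"pid": "P-77", "weight_lb": 165}))
```

The hard part, of course, is building and maintaining the mapping table itself; automating that step reliably is exactly where the open research problems lie.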

The data preparation challenge even extends to analysis that uses only a single data set. Here there is still the issue of suitable database design, further complicated by the many alternative ways in which to store the information. Particular database designs may have certain advantages over others for analytical purposes. A case in point is the variety in the structure of bioinformatics databases, in which information on substantially similar entities, such as genes, is represented in markedly different ways from one database to the next.

Examples like these clearly indicate that database design is an artistic endeavor that has to be carefully executed in the enterprise context by professionals. When creating effective database designs, professionals such as data scientists must have the tools to assist them in the design process, and more important, they must develop techniques so that databases can be used effectively in the absence of intelligent database design.

As the data move through the process, the next step is querying the data and then modeling them for analysis. Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis. Big Data is often noisy, dynamic, heterogeneous, interrelated, and untrustworthy: a very different informational source from the small data sets used for traditional statistical analysis.

Even so, noisy Big Data can be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge. In addition, interconnected Big Data creates large heterogeneous information networks with which information redundancy can be explored to compensate for missing data, cross-check conflicting cases, and validate trustworthy relationships. Interconnected Big Data resources can disclose inherent clusters and uncover hidden relationships and models.

Mining the data therefore requires integrated, cleaned, trustworthy, and efficiently accessible data, backed by declarative query and mining interfaces that feature scalable mining algorithms. All of this relies on Big Data computing environments that are able to handle the load. Furthermore, data mining can be used concurrently to improve the quality and trustworthiness of the data, expose the semantics behind the data, and provide intelligent querying functions.

Striking examples of introduced data errors can be readily found in the health care industry. As noted previously, it is not uncommon for real-world medical records to have errors. Further complicating the situation is the fact that medical records are heterogeneous and are usually distributed in multiple systems. The result is a complex analytics environment that lacks any type of standard nomenclature to define its respective elements.

The value of Big Data analysis can be realized only if it can be applied robustly under those challenging conditions. However, the knowledge developed from that data can be used to correct errors and remove ambiguity. An example of the use of that corrective analysis is when a physician writes “DVT” as the diagnosis for a patient. This abbreviation is commonly used for both deep vein thrombosis and diverticulitis, two very different medical conditions. A knowledge base constructed from related data can use associated symptoms or medications to determine which of the two the physician meant.
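A toy version of that disambiguation logic is sketched below; the associations in the knowledge base are illustrative only and are not clinical guidance.

```python
# A toy knowledge base linking each expansion of "DVT" to evidence that tends
# to co-occur with it; the associations are illustrative, not clinical guidance.
KNOWLEDGE_BASE = {
    "deep vein thrombosis": {"leg swelling", "leg pain", "anticoagulant"},
    "diverticulitis": {"abdominal pain", "fever", "antibiotic"},
}

def disambiguate(candidates, chart_evidence):
    """Pick the expansion whose known associations best overlap the patient's chart."""
    scores = {c: len(KNOWLEDGE_BASE[c] & chart_evidence) for c in candidates}
    return max(scores, key=scores.get), scores

evidence = {"leg swelling", "anticoagulant"}
best, scores = disambiguate(["deep vein thrombosis", "diverticulitis"], evidence)
print(best, scores)
```

The same overlap-scoring idea generalizes: the larger and better curated the knowledge base, the more ambiguity in the raw records it can resolve.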

It is easy to see how Big Data can enable the next generation of interactive data analysis, which, by using automation, can deliver real-time answers. This means that machine intelligence can be used in the future to direct automatically generated queries toward Big Data, a key capability that will extend the value of data: it can drive automatic content creation for web sites, populate hot lists or recommendations, and provide an ad hoc analysis of the value of a data set to decide whether to store or discard it.

Achieving that goal will require scaling complex query-processing techniques to terabytes while enabling interactive response times, and currently this is a major challenge and an open research problem. Nevertheless, advances are made on a regular basis, and what is a problem today will undoubtedly be solved in the near future as processing power increases and data become more coherent.

Solving that problem will require eliminating the current lack of coordination between the database systems that host the data and provide SQL querying and the analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses. Today's analysts are impeded by a tedious process of exporting data from a database, performing a non-SQL process, and bringing the data back. This is a major obstacle to providing the interactive automation that was provided by the first generation of SQL-based OLAP systems. What is needed is a tight coupling between declarative query languages and the functions of Big Data analytics packages, which will benefit both the expressiveness and the performance of the analysis.
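One small illustration of what tighter coupling looks like is pushing an analytic function into the SQL engine itself instead of exporting the data. The sketch below registers a user-defined standard-deviation aggregate with SQLite through Python's built-in sqlite3 module; the table and values are made up for the example, and a production system would use a full-scale engine rather than SQLite.

```python
import sqlite3
import statistics

class StdDev:
    """User-defined aggregate: the statistical step runs inside the SQL engine
    rather than being exported to a separate tool and re-imported."""
    def __init__(self):
        self.values = []
    def step(self, value):
        if value is not None:
            self.values.append(value)
    def finalize(self):
        return statistics.pstdev(self.values) if self.values else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("stdev", 1, StdDev)
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 120.0), ("east", 150.0), ("west", 90.0), ("west", 95.0)])
for row in conn.execute(
        "SELECT region, AVG(amount), stdev(amount) FROM sales GROUP BY region"):
    print(row)
```

The declarative query and the statistical function execute in one place, which is the essence of the coupling argued for above.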

One of the most important steps in processing Big Data is the interpretation of the data analyzed. That is where business decisions can be formed based on the contents of the data as they relate to a business process. The ability to analyze Big Data is of limited value if the users cannot understand the analysis. Ultimately, a decision maker, provided with the result of an analysis, has to interpret these results. Data interpretation cannot happen in a vacuum. For most scenarios, interpretation requires examining all of the assumptions and retracing the analysis process.

An important element of interpretation comes from the understanding that there are many possible sources of error, ranging from processing bugs to improper analysis assumptions to results based on erroneous data—a situation that logically prevents users from fully ceding authority to a fully automated process run solely by the computer system. Proper interpretation requires that the user understands and verifies the results produced by the computer. Nevertheless, the analytics platform should make that easy to do, which currently remains a challenge with Big Data because of its inherent complexity.

In most cases, crucial assumptions behind the data go unrecorded, and those unstated assumptions can taint the overall analysis. Those analyzing the data need to be aware of these situations. Since the analytical process involves multiple steps, assumptions can creep in at any point, making documentation and explanation of the process especially important to those interpreting the data. Ultimately that will lead to improved results and will introduce self-correction into the data process as those interpreting the data inform those writing the algorithms of their needs.

It is rarely enough to provide just the results. Rather, one must provide supplementary information that explains how each result was derived and what inputs it was based on. Such supplementary information is called the provenance of the data. By studying how best to acquire, store, and query provenance, in conjunction with using techniques to accumulate adequate metadata, we can create an infrastructure that provides users with the ability to interpret the analytical results and to repeat the analysis with different assumptions, parameters, or data sets.
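One illustrative way to bundle a result with that supplementary information is to return it together with its inputs, its parameters, and a way to recompute it under different assumptions. The sketch below is a minimal version of that idea, with made-up field names and an arbitrary example analysis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class DerivedResult:
    """A result bundled with enough provenance to re-derive it under new assumptions."""
    value: float
    inputs: List[float]
    parameters: Dict[str, float]
    recompute: Callable

def average_above_floor(values, floor):
    kept = [v for v in values if v >= floor]
    return sum(kept) / len(kept)

def analyze(values, floor=0.0):
    # The floor parameter stands in for any analysis assumption worth recording.
    return DerivedResult(
        value=average_above_floor(values, floor),
        inputs=list(values),
        parameters={"floor": floor},
        recompute=lambda new_floor: average_above_floor(values, new_floor),
    )

result = analyze([3.0, 4.0, -50.0, 5.0], floor=0.0)
print(result.value, result.parameters)
print(result.recompute(-100.0))  # repeat the analysis under a different assumption
```

Because the assumption travels with the answer, a decision maker can see it, question it, and rerun the analysis without going back to the original analyst.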

BIG DATA VISUALIZATION

Systems that offer a rich palette of visualizations are important in conveying to the users the results of the queries, using a representation that best illustrates how data are interpreted in a particular situation. In the past, BI systems users were normally offered tabular content consisting of numbers and had to visualize the data relationships themselves. However, the complexity of Big Data makes that difficult, and graphical representations of analyzed data sets are more informative and easier to understand.

It is usually easier for a multitude of users to collaborate on analytical results when they are presented in graphical form, simply because much of the burden of interpretation is lifted and the users are shown the results directly. Today's analysts need to present results in powerful visualizations that assist interpretation and support user collaboration.

These visualizations should be based on interactive sources that allow the users to click and redefine the presented elements, creating a constructive environment where theories can be played out and other hidden elements can be brought forward. Ideally, the interface will allow visualizations to be affected by what-if scenarios or filtered by other related information, such as date ranges, geographical locations, or statistical queries.

Furthermore, with a few clicks the user should be able to go deeper into each piece of data and understand its provenance, which is a key feature to understanding the data. Users need to be able to not only see the results but also understand why they are seeing those results.

Raw provenance, particularly regarding the phases in the analytics process, is likely to be too technical for many users to grasp completely. One alternative is to enable the users to play with the steps in the analysis—make small changes to the process, for example, or modify values for some parameters. The users can then view the results of these incremental changes. By these means, the users can develop an intuitive feeling for the analysis and also verify that it performs as expected in corner cases, those that occur outside normal circumstances. Accomplishing this requires the system to provide convenient facilities for the user to specify analyses.

BIG DATA PRIVACY

Data privacy is another huge concern, and one that grows as the power of Big Data grows. For electronic health records, there are strict laws governing what can and cannot be done. For other data, regulations, particularly in the United States, are less forceful. However, there is great public fear about the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is effectively both a technical and a sociological problem, and it must be addressed jointly from both perspectives to realize the promise of Big Data.

Take, for example, the data gleaned from location-based services. A situation in which new architectures require a user to share his or her location with the service provider results in obvious privacy concerns. Hiding the user’s identity alone without hiding the location would not properly address these privacy concerns.

An attacker or a (potentially malicious) location-based server can infer the identity of the query source from its location information. For example, a user’s location information can be tracked through several stationary connection points (e.g., cell towers). After a while, the user leaves a metaphorical trail of bread crumbs that lead to a certain residence or office location and can thereby be used to determine the user’s identity.

Several other types of private information, such as health issues (e.g., presence in a cancer treatment center) or religious preferences (e.g., presence in a church), can also be revealed by just observing anonymous users' movement and usage patterns over time.

Furthermore, with the current platforms in use, it is more difficult to hide a user location than to hide his or her identity. This is a result of how location-based services interact with the user. The location of the user is needed for successful data access or data collection, but the identity of the user is not necessary.
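When exact positions are not strictly necessary, one commonly discussed mitigation (not described above, and offered here only as a sketch) is spatial cloaking: reporting the center of a coarse grid cell rather than the precise coordinates. The cell size below is an arbitrary choice; real systems tune it to the anonymity level they need.

```python
import math

def cloak_location(lat, lon, cell_degrees=0.05):
    """Report only the center of a coarse grid cell instead of the exact position.

    The roughly 5 km cell implied by 0.05 degrees is an arbitrary choice for this
    sketch; real systems tune the cell size to the required anonymity level.
    """
    snap = lambda x: (math.floor(x / cell_degrees) + 0.5) * cell_degrees
    return round(snap(lat), 4), round(snap(lon), 4)

print(cloak_location(40.748817, -73.985428))  # exact point in, coarse cell center out
```

Cloaking trades service quality for privacy, which is precisely the tension between data access and identity protection described above.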

There are many additional challenging research problems, such as defining the ability to share private data while limiting disclosure and ensuring sufficient data utility in the shared data. The existing methodology of differential privacy is an important step in the right direction, but it unfortunately cripples the data payload too severely to be useful in most practical cases.
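For concreteness, the most widely cited building block of differential privacy is the Laplace mechanism, sketched below for a simple counting query; the epsilon value and the patient records are illustrative only.

```python
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon=0.5):
    """Release a count with Laplace noise scaled to sensitivity 1 / epsilon.

    epsilon is the privacy budget: smaller values mean stronger privacy
    and noisier answers.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

patients = [{"age": a} for a in (34, 41, 67, 72, 29, 55)]
print(private_count(patients, lambda p: p["age"] > 60, epsilon=0.5))
```

The noise added to each answer is exactly the payload cost the text refers to: the tighter the privacy budget, the less useful each released statistic becomes.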

Real-world data are not static; they grow and change over time, which renders the prevailing techniques almost useless, since they leave too little useful content for future analytics. This requires a rethinking of how security for information sharing is defined for Big Data use cases. Many online services today require us to share private information (think of Facebook applications), but beyond record-level access control we do not understand what it means to share data, how the shared data can be linked, and how to give users fine-grained control over this sharing.

Those issues will have to be worked out to preserve user security while still providing the most robust data set for Big Data analytics.