Chapter 8

The Evolution of Big Data

To truly understand the implications of Big Data analytics, one has to reach back into the annals of computing history, specifically business intelligence (BI) and scientific computing. The ideology behind Big Data can most likely be traced back to the days before the age of computers, when unstructured data were the norm (paper records) and analytics was in its infancy. Perhaps the first Big Data challenge came in the form of the 1880 U.S. census, when the information concerning approximately 50 million people had to be gathered, classified, and reported on.

With the 1880 census, just counting people was not enough information for the U.S. government to work with—particular elements, such as age, sex, occupation, education level, and even the “number of insane people in household,” had to be accounted for. That information had intrinsic value to the process, but only if it could be tallied, tabulated, analyzed, and presented. New methods of relating the data to other data collected came into being, such as associating occupations with geographic areas, birth rates with education levels, and countries of origin with skill sets.

The 1880 census truly yielded a mountain of data to deal with, yet only severely limited technology was available to do any of the analytics. The Big Data problem of the 1880 census could not be solved with the technology of the day, so it took over seven years to manually tabulate and report on the data.

With the 1890 census, things began to change, thanks to the introduction of the first Big Data platform: a mechanical device called the Hollerith Tabulating System, which worked with punch cards that could hold about 80 variables. The Hollerith Tabulating System made census data actionable and increased its value immeasurably. Analysis now took six weeks instead of seven years, which allowed the government to act on the information in a reasonable amount of time.

The census example points out a common theme with data analytics: Value can be derived only by analyzing data in a time frame in which action can still be taken to utilize the information uncovered. For the U.S. government, the ability to analyze the 1890 census led to an improved understanding of the populace, which the government could use to shape economic and social policies ranging from taxation to education to military conscription.

In today’s world, the information contained in the 1890 census would no longer be considered Big Data, according to the definition: data sets so large that common technology cannot accommodate and process them. Today’s desktop computers certainly have enough horsepower to process the information contained in the 1890 census by using a simple relational database and some basic code.
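
As a rough illustration, the entire job would now fit comfortably in an ordinary desktop database. The short sketch below, built on a hypothetical census_1890 table filled with invented records, shows the kind of tabulation that once took an army of clerks:

    # A minimal sketch, not the actual census schema: tabulating
    # hypothetical 1890-style records with an in-memory relational
    # database (SQLite). Table, columns, and records are invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE census_1890 (
            person_id  INTEGER PRIMARY KEY,
            state      TEXT,
            age        INTEGER,
            sex        TEXT,
            occupation TEXT
        )
    """)

    # A handful of invented records stand in for tens of millions of real ones.
    cur.executemany(
        "INSERT INTO census_1890 (state, age, sex, occupation) VALUES (?, ?, ?, ?)",
        [
            ("New York", 34, "M", "machinist"),
            ("New York", 29, "F", "teacher"),
            ("Ohio", 41, "M", "farmer"),
            ("Ohio", 8, "F", None),
        ],
    )

    # The kind of cross-tabulation Hollerith's machines mechanized:
    # head count and average age by state.
    for state, count, avg_age in cur.execute(
        "SELECT state, COUNT(*), AVG(age) FROM census_1890 GROUP BY state"
    ):
        print(state, count, round(avg_age, 1))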

That realization transforms what Big Data is all about. Big Data involves having more data than you can handle with the computing power you already have, and you cannot easily scale your current computing environment to address the data. The definition of Big Data therefore continues to evolve with time and advances in technology. Big Data will always remain a paradigm shift in the making.

That said, the momentum behind Big Data continues to be driven by the realization that large unstructured data sources, such as those from the 1890 census, can deliver almost immeasurable value. The next giant leap for Big Data analytics came with the Manhattan Project, the U.S. development of the atomic bomb during World War II. The Manhattan Project not only introduced the concept of Big Data analysis with computers but also served as the catalyst for “Big Science,” which in turn depends on Big Data analytics for success. The next major Big Science project began in the late 1950s with the launch of the U.S. space program.

As the term Big Science gained currency in the 1960s, the Manhattan Project and the space program became paradigmatic examples. However, the International Geophysical Year, an international scientific project that lasted from July 1, 1957, to December 31, 1958, provided scientists with an alternative model: a synoptic collection of observational data on a global scale.

This new, potentially complementary model of Big Science encompassed multiple fields of practice and relied heavily on the sharing of large data sets that spanned multiple disciplines. The change in data gathering techniques, analysis, and collaboration also helped to redefine how Big Science projects are planned and accomplished. Most important, the International Geophysical Year project laid the foundation for more ambitious projects that gathered more specialized data for specific analysis, such as the International Biological Program and later the Long-Term Ecological Research Network. Both increased the mountains of data gathered, incorporated newer analysis technologies, and pushed IT technology further into the spotlight.

The International Biological Program encountered difficulties when the institutional structures, research methodologies, and data management implied by the Big Science mode of research collided with the epistemic goals, practices, and assumptions of many of the scientists involved. By 1974, when the program ended, many participants viewed it as a failure.

Nevertheless, what many viewed as a failure really was a success. The program transformed the way data were collected, shared, and analyzed and redefined how IT can be used for data analysis. Historical analysis suggests that many of the original ambitions of the program (such as the emphasis on Big Data and the implementation of the organizational structure of Big Science) were in fact realized by the program’s visionaries and its immediate investigators. Even though the program failed to follow the exact model of the International Geophysical Year, it ultimately succeeded in providing renewed legitimacy for synoptic data collection.

The lessons learned from the birth of Big Science spawned new Big Data projects: weather prediction, physics research (supercollider data analytics), astronomy images (planet detection), medical research (drug interaction), and many others. Of course, Big Data doesn’t apply only to science; businesses have latched onto its techniques, methodologies, and objectives, too. This has allowed the businesses to uncover value in data that might previously have been overlooked.

BIG DATA: THE MODERN ERA

Big Science may have led to the birth of Big Data, but it was Big Business that brought Big Data through its adolescence into the modern era. Big Science and Big Business differ on many levels, of course, especially in analytics. Big Science uses Big Data to answer questions or prove theories, while Big Business uses Big Data to discover new opportunities, measure efficiencies, or uncover relationships among what were thought to be unrelated data sets.

Nonetheless, both use algorithms to mine data, and both need technologies that can work with mountains of data. But the similarities end there. Big Science gathers data based on experiments and research conducted in controlled environments. Big Business gathers data from sources that are transactional in nature, often with little control over the origin of the data.

For Big Business, and businesses of almost any size, there is an avalanche of data available that is increasing exponentially. Perhaps Google CEO Eric Schmidt said it best: “Every two days now we create as much information as we did from the dawn of civilization up until 2003. That’s something like five exabytes of data.” An exabyte is an almost unimaginably large amount of information: 10 to the 18th power bytes, or a 1 followed by 18 zeros.
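
Simple arithmetic, taking Schmidt’s figure at face value, shows just how fast that stream of data runs:

    # Back-of-the-envelope arithmetic behind the quoted figure:
    # roughly five exabytes generated every two days.
    EXABYTE = 10 ** 18                      # bytes: a 1 followed by 18 zeros
    generated = 5 * EXABYTE                 # bytes per two days
    seconds = 2 * 24 * 60 * 60              # seconds in two days

    rate = generated / seconds              # bytes per second
    print(f"about {rate / 10 ** 12:,.0f} terabytes every second")  # roughly 29 TB/s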

It is that massive amount of exponentially growing data that defines the future of Big Data. Once again, we may need to look at the scientific community to determine where Big Data is headed for the business world. Farnam Jahanian, the assistant director for computer and information science and engineering for the National Science Foundation (NSF), kicked off a May 1, 2012, briefing about Big Data on Capitol Hill by calling data “a transformative new currency for science, engineering, education, and commerce.” That briefing, which was organized by TechAmerica, brought together a panel of leaders from government and industry to discuss the opportunities for innovation arising from the collection, storage, analysis, and visualization of large, heterogeneous data sets, all the while taking into consideration the significant security and privacy implications.

Jahanian noted that “Big Data is characterized not only by the enormous volume of data but also by the diversity and heterogeneity of the data and the velocity of its generation,” the result of modern experimental methods, longitudinal observational studies, scientific instruments such as telescopes and particle accelerators, Internet transactions, and the widespread deployment of sensors all around us. In doing so, he set the stage for why Big Data is important to all facets of the IT discovery and innovation ecosystem, including the nation’s academic, government, industrial, entrepreneurial, and investment communities.

Jahanian further explained the implications of the modern era of Big Data with three specific points:

First, insights and more accurate predictions from large and complex collections of data have important implications for the economy. Access to information is transforming traditional businesses and is creating opportunities in new markets. Big Data is driving the creation of new IT products and services based on business intelligence and data analytics and is boosting the productivity of firms that use it to make better decisions and identify new business trends.

Second, advances in Big Data are critical to accelerate the pace of discovery in almost every science and engineering discipline. From new insights about protein structure, biomedical research and clinical decision making, and climate modeling to new ways to mitigate and respond to natural disasters and new strategies for effective learning and education, there are enormous opportunities for data-driven discovery.

Third, Big Data also has the potential to solve some of the nation’s most pressing challenges—in science, education, environment and sustainability, medicine, commerce, and cyber and national security—delivering enormous societal benefit and laying the foundations for U.S. competitiveness for many decades to come.

Jahanian shared the President’s Council of Advisors on Science and Technology’s recent recommendation for the federal government to “increase R&D investments for collecting, storing, preserving, managing, analyzing, and sharing increased quantities of data,” because “the potential to gain new insights [by moving] from data to knowledge to action has tremendous potential to transform all areas of national priority.”

Partly in response to this recommendation, the White House Office of Science and Technology Policy, together with other agencies, announced a $200 million Big Data R&D initiative to advance core techniques and technologies. According to Jahanian, within this initiative, the NSF’s strategy for supporting the fundamental science and underlying infrastructure enabling Big Data science and engineering involves the following:

  • Advances in foundational techniques and technologies (i.e., new methods) to derive knowledge from data.
  • Cyberinfrastructure to manage, curate, and serve data to science and engineering research and education communities.
  • New approaches to education and workforce development.
  • Nurturing of new types of collaborations—multidisciplinary teams and communities enabled by new data access policies—to address the grand challenges of today’s computation- and data-intensive world.

Ultimately, Jahanian concluded, “realizing the enormous potential of Big Data requires a long-term, bold, sustainable, and comprehensive approach, not only by NSF but also throughout the government and our nation’s research institutions.”

The panel discussions that followed echoed many of Jahanian’s remarks. For example, Nuala O’Connor Kelly, the senior counsel for information governance and chief privacy leader at General Electric (GE), said, “For us, it’s the volume and velocity and variety of data [and the opportunity that’s presented for using] that data to achieve new results for the company and for our customers and clients [throughout the world].” She cited as an example that GE Healthcare collects and monitors maintenance data from its machines deployed worldwide and can automatically ship replacement parts just days in advance of their malfunctioning, based on the analytics of machine functionality. “Much of [this] is done remotely and at tremendous cost savings,” she said.

Caron Kogan, the strategic planning director at Lockheed Martin, and Flavio Villanustre, the vice president of technology at LexisNexis Risk Solutions, described similar pursuits within their companies—particularly in intelligence and fraud prevention, respectively.

GE’s Kelly touched on privacy aspects. “Control may no longer be about not having the data at all,” she pointed out. “A potentially more efficient solution is one of making sure there are appropriate controls technologically and processes and policies and laws in place and then ensuring appropriate enforcement.” She emphasized striking the right balance between policies that ensure the protection of individuals and those that enable technological innovation and economic growth.

Bill Perlowitz, the chief technology officer in Wyle Laboratories’ science, technology, and engineering group, referenced a paradigm shift in scientific exploration:

Before, if you had an application or software, you had value; now that value is going to be in the data. For scientists that represents a shift from [hypothesis-driven] science to data-driven research. Hypothesis-driven science limits your exploration to what you can imagine, and the human mind . . . can only go so far. Data-driven science allows us to collect data and then see what it tells us, and we don’t have a pretense that we may understand what those relationships are and what we may find. So for a research scientist, these kinds of changes are very exciting and something we’ve been trying to get to for some time now.

Perhaps Nick Combs, the federal chief technology officer at EMC Corporation, summed it up best when describing the unprecedented growth in data: “It’s [no longer about finding a] needle in a haystack or connecting the dots. That’s child’s play.”

What all of this means is that the value of Big Data is already being realized, and the transformation of its ideologies and technologies is already under way. The government and scientific communities are preparing themselves for the next evolution of Big Data, planning how to address the new challenges and figuring out better ways to leverage the data.

TODAY, TOMORROW, AND THE NEXT DAY

As the amount of data gathered grows exponentially, the technology used to process the data must evolve just as quickly. According to the International Data Corporation, the volume of digital content in the world will grow to 2.7 billion terabytes in 2012, up 48 percent from 2011, and will reach 8 billion terabytes by 2015. That will be a lot of data!

The flood of data is coming from both structured corporate databases and unstructured data from Web pages, blogs, social networking messages, and other sources. Currently, for example, there are countless digital sensors worldwide in industrial equipment, automobiles, electrical meters, and shipping crates. Those sensors can measure and communicate location, movement, vibration, temperature, humidity, and even chemical changes in the air. Today, Big Business wields data like a weapon. Giant retailers, such as Walmart and Kohl’s, analyze sales, pricing, economic, demographic, and weather data to tailor product selections at particular stores and determine the timing of price markdowns.

Logistics companies like United Parcel Service mine data on truck delivery times and traffic patterns to fine-tune routing. A whole ecosystem of new businesses and technologies is springing up to engage with this new reality: companies that store data, companies that mine data for insight, and companies that aggregate data to make them manageable. However, it is an ecosystem that is still emerging, and its exact shape has yet to make itself clear.

Even though Big Data has been around for some time, one of the biggest challenges of working with it still remains, and that is assembling data and preparing them for analysis. Different systems store data in different formats, even within the same company. Assembling, standardizing, and cleaning data of irregularities—all without removing the information that makes them valuable—remain a central challenge.

Currently, Hadoop, an open source software framework derived from Google’s MapReduce and Google File System papers, is being used by several technology vendors to do just that. Hadoop maps tasks across a cluster of machines, splitting them into smaller subtasks, before reducing the results into one master calculation. It’s really an old grid-computing technique given new life in the age of cloud computing. Many of the challenges of yesterday remain today, and technology is just now catching up with the demands of Big Data analytics. However, Big Data remains a moving target.
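
The pattern itself is simple enough to sketch in a few lines of ordinary Python. The toy example below is not Hadoop’s actual API; it only illustrates the MapReduce idea of mapping records to key-value pairs, grouping them by key, and reducing each group to a result:

    # A toy, single-process illustration of the MapReduce pattern Hadoop
    # implements across a cluster: map each record to key-value pairs,
    # group (shuffle) by key, then reduce each group to one result.
    from collections import defaultdict

    def map_phase(record):
        # Emit (word, 1) for every word in a line of text.
        for word in record.split():
            yield word.lower(), 1

    def reduce_phase(key, values):
        # Collapse all counts for one word into a total.
        return key, sum(values)

    records = ["Big Data is big", "data about data"]

    # Shuffle: group the intermediate pairs by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)

    results = dict(reduce_phase(k, v) for k, v in groups.items())
    print(results)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}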

As the future brings more challenges, it will also deliver more solutions, and Big Data has a bright future, with tomorrow delivering technologies that make it easier to leverage the data. For example, Hadoop is converging with other technology advances such as high-speed data analysis, made possible by parallel computing, in-memory processing, and lower-cost flash memory in the form of solid-state drives.

The prospect of being able to process troves of data entirely in memory, without time-consuming forays to retrieve information stored on disk drives, will be a major enabler, allowing companies to assemble, sort, and analyze data much more rapidly. For example, T-Mobile is using SAP’s HANA to mine data on its 30 million U.S. customers from stores, text messages, and call centers to tailor personalized deals.

What used to take T-Mobile a week to accomplish can now be done in three hours with the SAP system. Organizations that can utilize this capability to make faster and more informed business decisions will have a distinct advantage over competitors. In a short period of time, Hadoop has transitioned from relative obscurity as a consumer Internet project into the mainstream consciousness of enterprise IT.

Hadoop is designed to handle mountains of unstructured data. However, as it exists, the open source code is a long way from meeting enterprise requirements for security, management, and efficiency without some serious customization. Enterprise-scale Hadoop deployments require costly IT specialists who are capable of guiding a lot of somewhat disjointed processes. That currently limits adoption to organizations with substantial IT budgets.

As tomorrow delivers refined platforms, Hadoop and its derivatives will start to fit into the enterprise as a complement to existing data analytics and data warehousing tools, available from established business process vendors, such as Oracle, HP, and SAP. The key will be to make Hadoop much more accessible to enterprises of all sizes, which can be accomplished by creating high availability platforms that take much of the complexity out of assembling and preparing huge amounts of data for analysis.

Aggregating multiple steps into a streamlined, automated process with significantly enhanced security will prove to be the catalyst that drives Big Data from today to tomorrow. Add those enhancements to new technologies, such as appliances, and the momentum should continue to pick up, thanks to easy management through user-friendly GUIs.

The true value of Big Data lies in the amount of useful information that can be derived from it. The future of Big Data is therefore to do for data and analytics what Moore’s Law has done for computing hardware: exponentially increase the speed and value of business intelligence. Whether the need is to link geography and retail availability, use patient data to forecast public health trends, or analyze global climate trends, we live in a world full of data. Effectively harnessing Big Data will give businesses a whole new lens through which to see it.

However, the advance of Big Data technology doesn’t stop with tomorrow. Beyond tomorrow probably holds surprises that no one has even imagined yet. As technology marches ahead, so will the usefulness of Big Data. A case in point is IBM’s Watson, an artificial intelligence computer system capable of answering questions posed in natural language. In 2011, as a test of its abilities, Watson competed on the quiz show Jeopardy!, in the show’s only human-versus-machine match to date. In a two-game, combined-point match broadcast as three episodes on February 14–16, Watson beat Brad Rutter, the biggest all-time money winner on Jeopardy!, and Ken Jennings, the record holder for the longest championship streak (74 wins).

Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage, including the full text of Wikipedia, but was not connected to the Internet during the game. Watson demonstrated that there are new ways to deal with Big Data and new ways to measure results, perhaps exemplifying where Big Data may be headed.

So what’s next for Watson? IBM has stated publicly that Watson was a client-driven initiative, and the company intends to push Watson in directions that best serve customer needs. IBM is now working with financial giant Citi to explore how the Watson technology could improve and simplify the banking experience. Watson’s applicability doesn’t end with banking, however; IBM has also teamed up with health insurer WellPoint to turn Watson into a machine that can support the doctors of the world.

According to IBM, Watson is best suited for use cases involving critical decision making based on large volumes of unstructured data. To drive the Big Data–crunching message home, IBM has stated that 90 percent of the world’s data was created in the last two years, and 80 percent of that data is unstructured. Furthering the value proposition of Watson and Big Data, IBM has also stated that five new research documents come out of Wall Street every minute, and medical information is doubling every five years.

IBM views the future of Big Data a little differently than other vendors do, most likely based on its Watson research. In IBM’s future, Watson becomes a service—as IBM calls it, Watson-as-a-Service—which will be delivered as a private or hybrid cloud service.

Watson aside, the health care industry seems a particularly good place to look for predictions of how Big Data will evolve. Examples abound of the benefits of Big Data in the medical field; however, getting there is another story altogether. Health care (or in this context, “Big Medicine”) has some specific challenges to overcome and some specific goals to achieve to realize the potential of Big Data:

  • Big Medicine is drowning in information while also dying of thirst. That axiom sums up a situation most medical personnel face: when you are in the institution trying to figure out what is going on and how to report on it, you are dying of thirst in a sea of information. There is so much information that it becomes a Big Data problem. How does one tap into that information and make sense of it? The answer has implications not only for patients but also for service providers, ranging from nurses, physicians, and hospital administrators to government and insurance agencies. The big issue is that the data are not organized; they are a mixture of structured and unstructured data. How the data will ultimately be handled over the next few years will be driven by the government, which will require a tremendous amount of information to be recorded for reporting purposes.
  • Technologies that tap into Big Data need to become more prevalent and even ubiquitous. From the patient’s perspective, analytics and Big Data will aid in determining which hospital in a patient’s immediate area is the best for treating his or her condition. Today there are a huge number of choices available, and most people choose by word of mouth, insurance requirements, doctor recommendations, and many other factors. Wouldn’t it make more sense to pick a facility based on report cards derived by analytics? That is the goal of the government, which wants patients to be able to look at a report card for various institutions. However, the only way to create that report card is to unlock all of the information and impose regulations and reporting. That will require various types of IT to tap into unstructured information, like dashboard technologies and analytics, business intelligence technologies, clinical intelligence technologies, and revenue cycle management intelligence for institutions.
  • Decision support needs to be easier to access. Currently in medical institutions, evidence-based medicine and decision support are not as easy to access as they should be. Utilizing Big Data analytics will make the decision process easier and will provide the hard evidence to validate a particular decision path. For example, when a patient presents with a particular condition, his or her history may indicate a high likelihood of specific complications. The likely outcomes or progressions can be surfaced at the beginning of the care cycle, and the treating physician can be informed immediately. Information like that, and much more, will come from the Big Data analytics process.
  • Information needs to flow more easily. From a patient’s perspective, health care today restricts the flow of information. Patients often have little insight into what exactly is happening, at least until a physician comes in, and the majority of patients are apprehensive about talking to the physician. That creates an informational blockade that makes it more difficult for both physicians and patients to make choices. Big Data has the potential to solve that problem as well; the flow of information will be easier not only for physicians to manage but also for patients to access. For example, physicians will be able to look at their tablets or smartphones and see that there is a 15-minute emergency-room wait at one facility and a 5-minute wait at another. Scheduling, diagnostic support, and evidence-based medicine support in the work flow will improve.
  • Quality of care needs to be increased while driving costs down. From a cost perspective and a quality-of-care point of view, there are a number of different areas that can be improved by Big Data. For example, if a patient experiences an injury while staying in a hospital, the hospital will not be reimbursed for his or her care. The system can see that this has the potential to happen and can alert everyone. Big Data can enable a proactive approach for care that reduces accidents or other problems that affect the quality of care. By preventing problems and accidents, Big Data can yield significant savings.
  • The physician–patient relationship needs to improve. Thanks to social media and mobile applications, which are benefiting from Big Data techniques, it is becoming easier to research health issues and allow patients and physicians to communicate more frequently. Stored data and unstructured data can be analyzed against social data to identify health trends. That information can then be used by hospitals to keep patients healthier and out of the facility. In the past, hospitals made more money the sicker a patient was and the longer they kept him or her there. However, with health care reform, hospitals are going to start being compensated for keeping patients healthier. Because of that there will be an explosion of mobile applications and even social media, allowing patients to have easier access to nurses and physicians. Health care is undergoing a transformation in which the focus is more on keeping patients healthy and driving down costs. These two major areas are going to drive a great deal of change, and a lot of evolution will take place from a health information technology point of view, all underpinned by the availability of data.

Health care proves that Big Data has definite value and will arguably be the leader in Big Data developments. However, the lessons learned by the health care industry can readily be applied to other business models, because Big Data is all about knowing how to utilize and analyze data to fit specific needs.

CHANGING ALGORITHMS

Just as data evolve and force the evolution of Big Data platforms, the very basic elements of analytics evolve as well. Most approaches to dealing with large data sets within a classification learning paradigm attempt to increase computational efficiency. Given the same amount of time, a more efficient algorithm can explore more of the hypothesis space than a less efficient algorithm. If the hypothesis space contains an optimal solution, a more efficient algorithm has a greater chance of finding that solution (assuming the hypothesis space cannot be exhaustively searched within a reasonable time). It is that desire for efficiency and speed that is forcing the evolution of algorithms and the supporting systems that run them.

However, a more efficient algorithm results in more searching or a faster search, not a better search. If the learning biases of the algorithm are inappropriate, an increase in computational efficiency may not equal an improvement in prediction performance. Therein lies the problem: More efficient algorithms normally do not lead to additional insights, just improved performance. However, improving the performance of a Big Data analytics platform increases the amount of data that can be analyzed, and that may lead to new insights.
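
A toy experiment makes the point concrete. In the hypothetical sketch below, two learners search the same space of simple threshold rules under the same time budget; the only difference is the cost of evaluating each candidate hypothesis:

    # Toy illustration of the efficiency argument: with a fixed time budget,
    # a cheaper evaluation step lets the learner try more candidate
    # hypotheses, raising the odds of finding a good one, provided the
    # hypothesis space (here, simple thresholds) contains a good answer.
    import random, time

    random.seed(0)
    data = [(x, int(x > 0.62)) for x in (random.random() for _ in range(5000))]

    def accuracy(threshold):
        return sum((x > threshold) == bool(y) for x, y in data) / len(data)

    def search(time_budget_s, per_eval_cost_s):
        best, tried = (0.0, 0.0), 0
        deadline = time.perf_counter() + time_budget_s
        while time.perf_counter() < deadline:
            candidate = random.random()      # draw a hypothesis at random
            time.sleep(per_eval_cost_s)      # simulate the evaluation cost
            best = max(best, (accuracy(candidate), candidate))
            tried += 1
        return {"best_accuracy": round(best[0], 3),
                "threshold": round(best[1], 3),
                "hypotheses_tried": tried}

    print("slow learner:", search(0.5, 0.010))  # examines few hypotheses
    print("fast learner:", search(0.5, 0.001))  # explores far more of the space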

The trick here is to create new algorithms that are more flexible, that incorporate machine learning techniques, and that remove the bias from analysis. Computer systems are now becoming powerful and subtle enough to help reduce human bias in our decision making. And this is the key: Computers can do it in real time. That will inevitably transform the objective observer concept into an organic, evolving database.

Today these systems can chew through billions of bits of data, analyze them via self-learning algorithms, and package the insights for immediate use. Neither we nor the computers are perfect, but in tandem we might neutralize our biased, intuitive failings when we price a car, prescribe a medicine, or deploy a sales force.
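
The “self-learning” loop can be sketched in miniature. The hypothetical example below updates a simple linear pricing model one record at a time as a simulated data stream arrives, which is the essence of the online learning such systems apply at vastly larger scale:

    # A miniature self-learning loop: online stochastic gradient descent on
    # a linear pricing model, updated one record at a time as data stream in.
    # The feature names, coefficients, and data stream are all invented.
    import random

    random.seed(1)
    weights = [0.0, 0.0]      # effects of [mileage, age] on price
    bias = 0.0
    lr = 0.01                 # learning rate

    def predict(features):
        return bias + sum(w * x for w, x in zip(weights, features))

    def stream(n=20_000):
        # Simulated stream of (features, observed_price) records, e.g. used
        # cars priced by an underlying rule of 30 - 12*mileage - 8*age plus noise.
        for _ in range(n):
            mileage, age = random.uniform(0, 1), random.uniform(0, 1)
            yield [mileage, age], 30 - 12 * mileage - 8 * age + random.gauss(0, 1)

    for features, price in stream():
        error = predict(features) - price    # learn a little from each record
        bias -= lr * error
        weights = [w - lr * error * x for w, x in zip(weights, features)]

    print(round(bias, 1), [round(w, 1) for w in weights])  # drifts toward 30, -12, -8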

In the real world, accurate algorithms will translate to fewer hunches and more facts. Take, for example, the banking and mortgage market, where even the most knowledgeable human can quickly be outdone by an algorithm. Big Data systems are now of such scale that they can analyze the value of tens of thousands of mortgage-backed securities by picking apart the ongoing, dynamic creditworthiness of tens of millions of individual home owners. Such a system has already been built for Wall Street traders.
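
A drastically simplified sketch, with invented loan data, shows the bottom-up logic of such a system: the pool is valued from the changing creditworthiness of the individual mortgages inside it:

    # Hypothetical valuation sketch: the expected loss of a mortgage pool is
    # rebuilt from the creditworthiness of each underlying loan. Balances,
    # default probabilities, and loss rates are invented for illustration.
    def expected_loss(loans):
        # Each loan: (outstanding_balance, prob_of_default, loss_given_default)
        return sum(balance * pd * lgd for balance, pd, lgd in loans)

    pool = [
        (250_000, 0.02, 0.40),   # solid borrower
        (180_000, 0.08, 0.55),   # creditworthiness recently downgraded
        (320_000, 0.01, 0.35),
    ]

    face_value = sum(balance for balance, _, _ in pool)
    loss = expected_loss(pool)
    print(f"expected loss: ${loss:,.0f} ({loss / face_value:.1%} of the pool)")

Rerunning that calculation continuously, across tens of millions of loans whose credit profiles shift every day, is what turns it into a Big Data problem.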

By crunching billions of data points about traffic flows, an algorithm might find that on Fridays a delivery fleet should stick to the highways, despite the gut instinct of a dispatcher for surface road shortcuts.
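
The underlying analysis can be as plain as grouping historical delivery times by day and route type. The sketch below uses invented records to show the comparison such an algorithm would run across billions of real data points:

    # A toy version of the traffic example: compare average delivery times
    # by weekday and route type to surface patterns a dispatcher's gut
    # instinct might miss. The records are invented.
    from collections import defaultdict

    deliveries = [                    # (weekday, route_type, minutes)
        ("Friday", "highway", 42), ("Friday", "surface", 58),
        ("Friday", "highway", 45), ("Friday", "surface", 61),
        ("Tuesday", "highway", 47), ("Tuesday", "surface", 40),
    ]

    totals = defaultdict(lambda: [0, 0])      # (day, route) -> [sum, count]
    for day, route, minutes in deliveries:
        totals[(day, route)][0] += minutes
        totals[(day, route)][1] += 1

    for (day, route), (total, count) in sorted(totals.items()):
        print(f"{day:8} {route:8} average {total / count:.0f} min")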

Big Data is at an evolutionary juncture where human judgment can be improved or even replaced by machines. That may sound ominous, but the same systems are already predicting hurricanes, warning of earthquakes, and mapping tornadoes.

Businesses are seeing the value, and the systems and algorithms are starting to supplement human judgment and are even on a path to replace it in some cases. Until recently, however, businesses have been thwarted by the cost of storage, slow processing speeds, and the flood of the data themselves, spread sloppily across scores of different databases inside a single company.

With advances in technology and falling price points now solving those problems, the evolution of algorithms and Big Data platforms is bound to accelerate and change the very way we do predictive analysis, research, and even business.