To truly understand the implications of Big Data analytics, one has to reach back into the annals of computing history, specifically business intelligence (BI) and scientific computing. The ideology behind Big Data can most likely be traced back to the days before the age of computers, when unstructured data were the norm (paper records) and analytics was in its infancy. Perhaps the first Big Data challenge came in the form of the 1880 U.S. census, when the information concerning approximately 50 million people had to be gathered, classified, and reported on.
With the 1880 census, just counting people was not enough information for the U.S. government to work with—particular elements, such as age, sex, occupation, education level, and even the “number of insane people in household,” had to be accounted for. That information had intrinsic value to the process, but only if it could be tallied, tabulated, analyzed, and presented. New methods of relating the data to other data collected came into being, such as associating occupations with geographic areas, birth rates with education levels, and countries of origin with skill sets.
The 1880 census truly yielded a mountain of data to deal with, yet the technology available for analytics was severely limited. The Big Data problem of the 1880 census could not be solved with the tools of the day, so it took more than seven years to manually tabulate and report on the data.
With the 1890 census, things began to change, thanks to the introduction of the first Big Data platform: a mechanical device called the Hollerith Tabulating System, which worked with punch cards that could hold about 80 variables. The Hollerith Tabulating System revolutionized the value of census data, making it actionable and increasing its value an untold amount. Analysis now took six weeks instead of seven years. That allowed the government to act on information in a reasonable amount of time.
The census example points out a common theme with data analytics: Value can be derived only by analyzing data in a time frame in which action can still be taken to utilize the information uncovered. For the U.S. government, the ability to analyze the 1890 census led to an improved understanding of the populace, which the government could use to shape economic and social policies ranging from taxation to education to military conscription.
In today’s world, the information contained in the 1890 census would no longer be considered Big Data, according to the definition: data sets so large that common technology cannot accommodate and process them. Today’s desktop computers certainly have enough horsepower to process the information contained in the 1890 census by using a simple relational database and some basic code.
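The point about "a simple relational database and some basic code" can be made concrete. The sketch below uses SQLite with a hypothetical schema and a tiny invented sample (not actual census records) to show the kind of cross-tabulation that once took years by hand:

```python
import sqlite3

# In-memory database standing in for a desktop relational store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE census (
        name TEXT, age INTEGER, sex TEXT,
        occupation TEXT, state TEXT
    )
""")

# A tiny hypothetical sample; the real 1890 census covered roughly
# 63 million people, still well within a modern desktop's capacity.
rows = [
    ("A. Smith", 34, "M", "farmer", "OH"),
    ("B. Jones", 29, "F", "teacher", "OH"),
    ("C. Brown", 41, "M", "farmer", "KY"),
]
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?, ?)", rows)

# Relating data to other data, 1880-style: occupations by state.
for state, occupation, count in conn.execute(
    "SELECT state, occupation, COUNT(*) FROM census "
    "GROUP BY state, occupation ORDER BY state, occupation"
):
    print(state, occupation, count)
```

A query like this runs in milliseconds on commodity hardware, which is exactly why the 1890 data no longer qualify as Big Data.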
That realization transforms what Big Data is all about. Big Data involves having more data than you can handle with the computing power you already have, and you cannot easily scale your current computing environment to address the data. The definition of Big Data therefore continues to evolve with time and advances in technology. Big Data will always remain a paradigm shift in the making.
That said, the momentum behind Big Data continues to be driven by the realization that large unstructured data sources, such as those from the 1890 census, can deliver almost immeasurable value. The next giant leap for Big Data analytics came with the Manhattan Project, the U.S. development of the atomic bomb during World War II. The Manhattan Project not only introduced the concept of Big Data analysis with computers, it was also the catalyst for “Big Science,” which in turn depends on Big Data analytics for success. The next largest Big Science project began in the late 1950s with the launch of the U.S. space program.
As the term Big Science gained currency in the 1960s, the Manhattan Project and the space program became paradigmatic examples. However, the International Geophysical Year, an international scientific project that lasted from July 1, 1957, to December 31, 1958, provided scientists with an alternative model: a synoptic collection of observational data on a global scale.
This new, potentially complementary model of Big Science encompassed multiple fields of practice and relied heavily on the sharing of large data sets that spanned multiple disciplines. The change in data gathering techniques, analysis, and collaboration also helped to redefine how Big Science projects are planned and accomplished. Most important, the International Geophysical Year project laid the foundation for more ambitious projects that gathered more specialized data for specific analysis, such as the International Biological Program and later the Long-Term Ecological Research Network. Both increased the mountains of data gathered, incorporated newer analysis technologies, and pushed IT technology further into the spotlight.
The International Biological Program encountered difficulties when the institutional structures, research methodologies, and data management implied by the Big Science mode of research collided with the epistemic goals, practices, and assumptions of many of the scientists involved. By 1974, when the program ended, many participants viewed it as a failure.
Nevertheless, what many viewed as a failure really was a success. The program transformed the way data were collected, shared, and analyzed and redefined how IT can be used for data analysis. Historical analysis suggests that many of the original incentives of the program (such as the emphasis on Big Data and the implementation of the organizational structure of Big Science) were in fact realized by the program’s visionaries and its immediate investigators. Even though the program failed to follow the exact model of the International Geophysical Year, it ultimately succeeded in providing a renewed legitimacy for synoptic data collection.
The lessons learned from the birth of Big Science spawned new Big Data projects: weather prediction, physics research (supercollider data analytics), astronomy images (planet detection), medical research (drug interaction), and many others. Of course, Big Data doesn’t apply only to science; businesses have latched onto its techniques, methodologies, and objectives, too. This has allowed the businesses to uncover value in data that might previously have been overlooked.
Big Science may have led to the birth of Big Data, but it was Big Business that brought Big Data through its adolescence into the modern era. Big Science and Big Business differ on many levels, of course, especially in analytics. Big Science uses Big Data to answer questions or prove theories, while Big Business uses Big Data to discover new opportunities, measure efficiencies, or uncover relationships among what were thought to be unrelated data sets.
Nonetheless, both use algorithms to mine data, and both need technologies that can work with mountains of data. But the similarities end there. Big Science gathers data from experiments and research conducted in controlled environments. Big Business gathers data from sources that are transactional in nature, often with little control over where the data originate.
For Big Business, and businesses of almost any size, there is an avalanche of data available that is increasing exponentially. Perhaps Google CEO Eric Schmidt said it best: “Every two days now we create as much information as we did from the dawn of civilization up until 2003. That’s something like five exabytes of data.” An exabyte is an incredibly large, almost unimaginable amount of information: 10 to the 18th power bytes. Think of an exabyte as the number 1 followed by 18 zeros.
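To put Schmidt's figure in perspective (taking the five-exabyte number at face value), the unit arithmetic is simple:

```python
# One exabyte is 10**18 bytes; Schmidt's figure is five of them.
EXABYTE = 10 ** 18
five_exabytes = 5 * EXABYTE

# Expressed in more familiar units (decimal prefixes, not binary):
terabytes = five_exabytes / 10 ** 12
gigabytes = five_exabytes / 10 ** 9

print(f"{terabytes:,.0f} TB")   # 5,000,000 TB
print(f"{gigabytes:,.0f} GB")   # 5,000,000,000 GB
```

Five million terabytes every two days is the scale at which "common technology cannot accommodate and process" the data, per the definition given earlier.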
It is that massive amount of exponentially growing data that defines the future of Big Data. Once again, we may need to look at the scientific community to determine where Big Data is headed for the business world. Farnam Jahanian, the assistant director for computer and information science and engineering for the National Science Foundation (NSF), kicked off a May 1, 2012, briefing about Big Data on Capitol Hill by calling data “a transformative new currency for science, engineering, education, and commerce.” That briefing, which was organized by TechAmerica, brought together a panel of leaders from government and industry to discuss the opportunities for innovation arising from the collection, storage, analysis, and visualization of large, heterogeneous data sets, all the while taking into consideration the significant security and privacy implications.
Jahanian noted that “Big Data is characterized not only by the enormous volume of data but also by the diversity and heterogeneity of the data and the velocity of its generation,” the result of modern experimental methods, longitudinal observational studies, scientific instruments such as telescopes and particle accelerators, Internet transactions, and the widespread deployment of sensors all around us. In doing so, he set the stage for why Big Data is important to all facets of the IT discovery and innovation ecosystem, including the nation’s academic, government, industrial, entrepreneurial, and investment communities.
Jahanian further explained the implications of the modern era of Big Data with three specific points:
First, insights and more accurate predictions from large and complex collections of data have important implications for the economy. Access to information is transforming traditional businesses and is creating opportunities in new markets. Big Data is driving the creation of new IT products and services based on business intelligence and data analytics and is boosting the productivity of firms that use it to make better decisions and identify new business trends.
Second, advances in Big Data are critical to accelerate the pace of discovery in almost every science and engineering discipline. From new insights about protein structure, biomedical research and clinical decision making, and climate modeling to new ways to mitigate and respond to natural disasters and new strategies for effective learning and education, there are enormous opportunities for data-driven discovery.
Third, Big Data also has the potential to solve some of the nation’s most pressing challenges—in science, education, environment and sustainability, medicine, commerce, and cyber and national security—with enormous societal benefit, laying the foundation for U.S. competitiveness for many decades to come.
Jahanian shared the President’s Council of Advisors on Science and Technology’s recent recommendation for the federal government to “increase R&D investments for collecting, storing, preserving, managing, analyzing, and sharing increased quantities of data,” because “the potential to gain new insights [by moving] from data to knowledge to action has tremendous potential to transform all areas of national priority.”
Partly in response to this recommendation, the White House Office of Science and Technology Policy, together with other agencies, announced a $200 million Big Data R&D initiative to advance core techniques and technologies. According to Jahanian, within this initiative, the NSF’s strategy for supporting the fundamental science and underlying infrastructure enabling Big Data science and engineering involves the following:
Ultimately, Jahanian concluded, “realizing the enormous potential of Big Data requires a long-term, bold, sustainable, and comprehensive approach, not only by NSF but also throughout the government and our nation’s research institutions.”
The panel discussions that followed echoed many of Jahanian’s remarks. For example, Nuala O’Connor Kelly, the senior counsel for information governance and chief privacy leader at General Electric (GE), said, “For us, it’s the volume and velocity and variety of data [and the opportunity that’s presented for using] that data to achieve new results for the company and for our customers and clients [throughout the world].” She cited as an example that GE Healthcare collects and monitors maintenance data from its machines deployed worldwide and can automatically ship replacement parts just days in advance of their malfunctioning, based on the analytics of machine functionality. “Much of [this] is done remotely and at tremendous cost savings,” she said.
Caron Kogan, the strategic planning director at Lockheed Martin, and Flavio Villanustre, the vice president of technology at LexisNexis Risk Solutions, described similar pursuits within their companies—particularly in intelligence and fraud prevention, respectively.
GE’s Kelly touched on privacy aspects. “Control may no longer be about not having the data at all,” she pointed out. “A potentially more efficient solution is one of making sure there are appropriate controls technologically and processes and policies and laws in place and then ensuring appropriate enforcement.” She emphasized striking the right balance between policies that ensure the protection of individuals and those that enable technological innovation and economic growth.
Bill Perlowitz, the chief technology officer in Wyle Laboratories’ science, technology, and engineering group, described a paradigm shift in scientific exploration:
Before, if you had an application or software, you had value; now that value is going to be in the data. For scientists that represents a shift from [hypothesis-driven] science to data-driven research. Hypothesis-driven science limits your exploration to what you can imagine, and the human mind . . . can only go so far. Data-driven science allows us to collect data and then see what it tells us, and we don’t have a pretense that we may understand what those relationships are and what we may find. So for a research scientist, these kinds of changes are very exciting and something we’ve been trying to get to for some time now.
Perhaps Nick Combs, the federal chief technology officer at EMC Corporation, summed it up best when describing the unprecedented growth in data: “It’s [no longer about finding a] needle in a haystack or connecting the dots. That’s child’s play.”
What all of this means is that the value of Big Data and the transformation of the ideologies and technologies are already here. The government and scientific communities are preparing themselves for the next evolution of Big Data and are planning how to address the new challenges and figure out better ways to leverage the data.
As the amount of data gathered grows exponentially, so does the evolution of the technology used to process the data. According to the International Data Corporation, the volume of digital content in the world will grow to 2.7 billion terabytes in 2012, up 48 percent from 2011, and will reach 8 billion terabytes by 2015. That will be a lot of data!
The flood of data is coming from both structured corporate databases and unstructured data from Web pages, blogs, social networking messages, and other sources. Currently, for example, there are countless digital sensors worldwide in industrial equipment, automobiles, electrical meters, and shipping crates. Those sensors can measure and communicate location, movement, vibration, temperature, humidity, and even chemical changes in the air. Today, Big Business wields data like a weapon. Giant retailers, such as Walmart and Kohl’s, analyze sales, pricing, economic, demographic, and weather data to tailor product selections at particular stores and determine the timing of price markdowns.
Logistics companies like United Parcel Service mine data on truck delivery times and traffic patterns to fine-tune routing. A whole ecosystem of new businesses and technologies is springing up to engage with this new reality: companies that store data, companies that mine data for insight, and companies that aggregate data to make them manageable. However, it is an ecosystem that is still emerging, and its exact shape has yet to make itself clear.
Even though Big Data has been around for some time, one of the biggest challenges of working with it still remains, and that is assembling data and preparing them for analysis. Different systems store data in different formats, even within the same company. Assembling, standardizing, and cleaning data of irregularities—all without removing the information that makes them valuable—remain a central challenge.
Currently, Hadoop, an open source software framework derived from Google’s MapReduce and Google File System papers, is being used by several technology vendors to do just that. Hadoop maps tasks across a cluster of machines, splitting them into smaller subtasks, before reducing the results into one master calculation. It’s really an old grid-computing technique given new life in the age of cloud computing. Many of the challenges of yesterday remain today, and technology is just now catching up with the demands of Big Data analytics. However, Big Data remains a moving target.
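The map-then-reduce pattern described above can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop's actual API; the document list is invented, and each document stands in for a shard of input assigned to one machine in the cluster:

```python
from collections import Counter
from functools import reduce

# Toy input: each string stands in for the shard of data
# one machine in the cluster would process.
documents = [
    "big data needs big tools",
    "hadoop maps tasks across machines",
    "big tools reduce results",
]

# Map phase: each "machine" turns its shard into intermediate
# (word, count) pairs, here represented as a Counter per shard.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce phase: merge the per-shard results into one master tally,
# just as Hadoop reduces subtask outputs into a final calculation.
totals = reduce(lambda a, b: a + b, mapped, Counter())

print(totals.most_common(3))
```

The value of the real framework is that the map phase runs in parallel across many machines and Hadoop handles the shuffling, fault tolerance, and merging; the logical structure, however, is exactly this.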
As the future brings more challenges, it will also deliver more solutions, and Big Data has a bright future, with tomorrow’s technologies making the data far easier to leverage. For example, Hadoop is converging with other technology advances such as high-speed data analysis, made possible by parallel computing, in-memory processing, and lower-cost flash memory in the form of solid-state drives.
The prospect of processing troves of data quickly, in memory, without time-consuming forays to disk will be a major enabler, allowing companies to assemble, sort, and analyze data much more rapidly. For example, T-Mobile is using SAP’s HANA to mine data on its 30 million U.S. customers from stores, text messages, and call centers to tailor personalized deals.
What used to take T-Mobile a week to accomplish can now be done in three hours with the SAP system. Organizations that can utilize this capability to make faster and more informed business decisions will have a distinct advantage over competitors. In a short period of time, Hadoop has transitioned from relative obscurity as a consumer Internet project into the mainstream consciousness of enterprise IT.
Hadoop is designed to handle mountains of unstructured data. However, as it exists, the open source code is a long way from meeting enterprise requirements for security, management, and efficiency without some serious customization. Enterprise-scale Hadoop deployments require costly IT specialists who are capable of guiding a lot of somewhat disjointed processes. That currently limits adoption to organizations with substantial IT budgets.
As tomorrow delivers refined platforms, Hadoop and its derivatives will start to fit into the enterprise as a complement to existing data analytics and data warehousing tools, available from established business process vendors, such as Oracle, HP, and SAP. The key will be to make Hadoop much more accessible to enterprises of all sizes, which can be accomplished by creating high availability platforms that take much of the complexity out of assembling and preparing huge amounts of data for analysis.
Aggregating multiple steps into a streamlined automated process with significantly enhanced security will prove to be the catalyst that drives Big Data from today to tomorrow. Add those enhancements to new technologies, such as appliances, and the momentum should continue to pick up, thanks to easy management through user-friendly GUIs.
The true value of Big Data lies in the amount of useful data that can be derived from it. The future of Big Data is therefore to do for data and analytics what Moore’s Law has done for computing hardware and exponentially increase the speed and value of business intelligence. Whether the need is to link geography and retail availability, use patient data to forecast public health trends, or analyze global climate trends, we live in a world full of data. Effectively harnessing Big Data will give businesses a whole new lens through which to see it.
However, the advance of Big Data technology doesn’t stop with tomorrow. Beyond tomorrow probably holds surprises that no one has even imagined yet. As technology marches ahead, so will the usefulness of Big Data. A case in point is IBM’s Watson, an artificial intelligence computer system capable of answering questions posed in natural language. In 2011, as a test of its abilities, Watson competed on the quiz show Jeopardy!, in the show’s only human-versus-machine match to date. In a two-game, combined-point match, broadcast in three episodes aired February 14–16, Watson beat Brad Rutter, the biggest all-time money winner on Jeopardy!, and Ken Jennings, the record holder for the longest championship streak (74 wins).
Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage, including the full text of Wikipedia, but was not connected to the Internet during the game. Watson demonstrated that there are new ways to deal with Big Data and new ways to measure results, perhaps exemplifying where Big Data may be headed.
So what’s next for Watson? IBM has stated publicly that Watson was a client-driven initiative, and the company intends to push Watson in directions that best serve customer needs. IBM is now working with financial giant Citi to explore how the Watson technology could improve and simplify the banking experience. Watson’s applicability doesn’t end with banking, however; IBM has also teamed up with health insurer WellPoint to turn Watson into a machine that can support the doctors of the world.
According to IBM, Watson is best suited for use cases involving critical decision making based on large volumes of unstructured data. To drive the Big Data–crunching message home, IBM has stated that 90 percent of the world’s data was created in the last two years, and 80 percent of that data is unstructured. Furthering the value proposition of Watson and Big Data, IBM has also stated that five new research documents come out of Wall Street every minute, and medical information is doubling every five years.
IBM views the future of Big Data a little differently than other vendors do, most likely based on its Watson research. In IBM’s future, Watson becomes a service—as IBM calls it, Watson-as-a-Service—which will be delivered as a private or hybrid cloud service.
Watson aside, the health care industry seems ripe as a source of prediction for how Big Data will evolve. Examples abound for the benefits of Big Data and the medical field; however, getting there is another story altogether. Health care (or in this context, “Big Medicine”) has some specific challenges to overcome and some specific goals to achieve to realize the potential of Big Data:
Health care proves that Big Data has definite value and will arguably be the leader in Big Data developments. However, the lessons learned by the health care industry can readily be applied to other business models, because Big Data is all about knowing how to utilize and analyze data to fit specific needs.
Just as data evolve and force the evolution of Big Data platforms, the very basic elements of analytics evolve as well. Most approaches to dealing with large data sets within a classification learning paradigm attempt to increase computational efficiency. Given the same amount of time, a more efficient algorithm can explore more of the hypothesis space than a less efficient algorithm. If the hypothesis space contains an optimal solution, a more efficient algorithm has a greater chance of finding that solution (assuming the hypothesis space cannot be exhaustively searched within a reasonable time). It is that desire for efficiency and speed that is forcing the evolution of algorithms and the supporting systems that run them.
However, a more efficient algorithm results in more searching or a faster search, not a better search. If the learning biases of the algorithm are inappropriate, an increase in computational efficiency may not equal an improvement in prediction performance. Therein lies the problem: More efficient algorithms normally do not lead to additional insights, just improved performance. However, improving the performance of a Big Data analytics platform increases the amount of data that can be analyzed, and that may lead to new insights.
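The distinction between a faster search and a better search can be made concrete with a toy random search over a hypothesis space (illustrative only; the budget, costs, and scoring function are all invented). Two searches get the same time budget, and the more efficient one simply evaluates more candidates:

```python
import random

random.seed(42)

def search(budget, cost_per_eval, hypothesis_space, score):
    """Random search: a lower per-evaluation cost means more of the
    hypothesis space gets explored within the same fixed budget."""
    evals = budget // cost_per_eval
    best = None
    for _ in range(evals):
        h = random.choice(hypothesis_space)
        if best is None or score(h) > score(best):
            best = h
    return best, evals

# Hypothesis space: candidate thresholds in [0, 1]; score peaks at 0.7.
space = [i / 100 for i in range(101)]
score = lambda h: -(h - 0.7) ** 2

slow_best, slow_evals = search(budget=1000, cost_per_eval=100,
                               hypothesis_space=space, score=score)
fast_best, fast_evals = search(budget=1000, cost_per_eval=10,
                               hypothesis_space=space, score=score)

# The efficient search explores 10x more hypotheses, so it is far more
# likely to land near the optimum -- but it is still the same search:
# if the space itself (the learning bias) excludes the true answer,
# no amount of efficiency will find it.
print(fast_evals, slow_evals)
```

This is the crux of the argument: efficiency buys coverage of the hypothesis space, not a better hypothesis space.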
The trick here is to create new algorithms that are more flexible, that incorporate machine learning techniques, and that remove the bias from analysis. Computer systems are now becoming powerful and subtle enough to help reduce human biases from our decision making. And this is the key: Computers can do it in real time. That will inevitably transform the objective observer concept into an organic, evolving database.
Today these systems can chew through billions of bits of data, analyze them via self-learning algorithms, and package the insights for immediate use. Neither we nor the computers are perfect, but in tandem we might neutralize our biased, intuitive failings when we price a car, prescribe a medicine, or deploy a sales force.
In the real world, accurate algorithms will translate to fewer hunches and more facts. Take, for example, the banking and mortgage market, where even the most knowledgeable human can quickly be outdone by an algorithm. Big Data systems are now of such scale that they can analyze the value of tens of thousands of mortgage-backed securities by picking apart the ongoing, dynamic creditworthiness of tens of millions of individual home owners. Such a system has already been built for Wall Street traders.
By crunching billions of data points about traffic flows, an algorithm might find that on Fridays a delivery fleet should stick to the highways, despite the gut instinct of a dispatcher for surface road shortcuts.
Big Data is at an evolutionary juncture where human judgment can be improved or even replaced by machines. That may sound ominous, but the same systems are already predicting hurricanes, warning of earthquakes, and mapping tornadoes.
Businesses are seeing the value, and the systems and algorithms are starting to supplement human judgment and are even on a path to replace it, in some cases. Until recently, however, businesses have been thwarted by the cost of storage, slower processing speeds, and the flood of the data themselves, spread sloppily across scores of different databases inside one company.
With technology and pricing points now solving those problems, the evolution of algorithms and Big Data platforms is bound to accelerate and change the very way we do predictive analysis, research, and even business.