Chapter 3

Big Data and the Business Case

Big Data is quickly becoming more than just a buzzword. A plethora of organizations have made significant investments in the technologies that surround Big Data and are now starting to mine that data for real value.

Even so, there is still a great deal of confusion about Big Data, similar to what many information technology (IT) managers have experienced in the past with disruptive technologies. Big Data is disruptive in the way that it changes how business intelligence (BI) is used in a business—and that is a scary proposition for many senior executives.

That situation puts chief technology officers, chief information officers, and IT managers in the unenviable position of trying to prove that a disruptive technology will actually improve business operations. Further complicating this situation is the high cost associated with in-house Big Data processing, as well as the security concerns that surround running Big Data analytics off-site.

Perhaps some of the strife comes from the term Big Data itself. Nontechnical people may think of Big Data literally, as something associated with big problems and big costs. Presenting Big Data as “Big Analytics” instead may be the way to win over apprehensive decision makers while building a business case for the staff, technology, and results that Big Data relies upon.

The trick is to move beyond the accepted definition of Big Data—which implies that it is nothing more than data sets that have become too large to manage with traditional tools—and explain that Big Data is a combination of technologies that mines the value of large databases.

And large is the key word here, simply because massive amounts of data are being collected every second, more than was ever imaginable, and these data sets have grown larger than today’s strategies and technologies can practically manage.

The result is a revolution centered on this tsunami of data and on how it will change the execution of business processes. These changes include introducing greater efficiencies, building new processes for revenue discovery, and fueling innovation. Big Data has quickly grown from a buzzword tossed around technology circles into a practical label for what it is really all about: Big Analytics.

REALIZING VALUE

A number of industries—including health care, the public sector, retail, and manufacturing—can obviously benefit from analyzing their rapidly growing mounds of data. Collecting and analyzing transactional data gives organizations more insight into their customers’ preferences, so the data can then be used as a basis for the creation of products and services. This allows the organizations to remedy emerging problems in a timely and more competitive manner.

The use of Big Data analytics is thus becoming a key foundation for competition and growth for individual firms, and it will most likely underpin new waves of productivity, growth, and consumer surplus.

THE CASE FOR BIG DATA

Building an effective business case for a Big Data project involves identifying several key elements that can be tied directly to a business process and are easy to understand as well as quantify. These elements are knowledge discovery, actionable information, short-term and long-term benefits, the resolution of pain points, and several others that are aligned with making a business process better by providing insight.

In most instances, Big Data is a disruptive element when introduced into an enterprise, and this disruption includes issues of scale, storage, and data center design. The disruption normally involves costs associated with hardware, software, staff, and support, all of which affect the bottom line. That means that return on investment (ROI) and total cost of ownership (TCO) are key elements of a Big Data business plan. The trick is to accelerate ROI while reducing TCO. The simplest way to do this is to associate a Big Data business plan with other IT projects driven by business needs.

While that might sound like a real challenge, businesses are actually investing in storage technologies and improved processing to meet other business goals, such as compliance, data archiving, cloud initiatives, and continuity planning. These initiatives can provide the foundation for a Big Data project, thanks to the two primary needs of Big Data: storage and processing.

Lately, business IT has naturally gravitated toward distributed solutions, in which storage and applications are spread across multiple systems and locations. This also proves to be a natural companion to Big Data, further helping to lay the foundation for Big Analytics.

Building a business case involves using case scenarios and providing supporting information. An extensive supply of examples exists, with several draft business cases, case scenarios, and other collateral, all courtesy of the major vendors involved with Big Data solutions. Notable vendors with massive amounts of collateral include IBM, Oracle, and HP.

While there is no set formula for building a business case, there are some critical elements that can be used to define how a business case should look, which helps to ensure the success of a Big Data project.

A solid business case for Big Data analytics should include the following:

  • The complete background of the project. This includes the drivers of the project, how others are using Big Data, what business processes Big Data will align with, and the overall goal of implementing the project.
  • Benefits analysis. It is often difficult to quantify the benefits of Big Data in static, tangible terms. Big Data analytics is all about the interpretation of data and the visualization of patterns, which amounts to a subjective analysis, highly dependent on humans to translate the results. However, that does not prevent a business case from including benefits driven by Big Data in nonsubjective terms (e.g., identifying sales trends, locating possible inventory shrinkage, quantifying shipping delays, or measuring customer satisfaction). The trick is to align the benefits of the project with the needs of a business process or requirement. An example would be to identify a business goal, such as 5 percent annual growth, and then show how Big Data analytics can help to achieve that goal.
  • Options. There are several paths to take to the destination of Big Data, ranging from in-house big iron solutions (data centers running large mainframe systems) to hosted offerings in the cloud to a hybrid of the two. It is important to research these options and identify how each may work for achieving Big Data analytics, as well as the pros and cons of each. Preferences and benefits should also be highlighted, allowing a financial decision to be tied to a technological decision.
  • Scope and costs. Scope is more of a management issue than a physical deployment issue; it comes down to how the implementation scope affects resources, especially staffing. Scope questions should identify the who and the when of the project, defining personnel hours and technical expertise as well as training and other ancillary needs. Costs should also be associated with staffing and training, which helps to create the big picture for TCO calculations and provides the basis for accurate ROI calculations.
  • Risk analysis. Calculating risk can be a complex endeavor. However, since Big Data analytics is truly a business process that provides BI, risk calculations can include the cost of doing nothing compared to the benefits delivered by the technology. Other risks to consider are security implications (where the data live and who can access them), CPU overhead (whether the analytics will limit the processing power available for line-of-business applications), compatibility and integration issues (whether the installation and operation will work with the existing technology), and disruption of business processes (whether installation will create downtime). All of these elements can be considered risks in a large-scale project and should be accounted for to build a solid business case.

Of course, the most critical theme of a business case is ROI. The return, or benefit, that an organization is likely to receive in relation to the cost of the project is a ratio that can change as research is done and information is gathered while building the business case. Ideally, that ratio improves as the business case writers discover additional value from implementing a Big Data analytics solution. Nevertheless, ROI is usually the most important factor in determining whether a project will ultimately go forward, and calculating it has become one of the primary reasons that companies and nonprofit organizations engage in the business case process in the first place.
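To make the arithmetic concrete, the following sketch works through a simplified TCO and ROI calculation in Python. The cost categories mirror those discussed above, but every figure and function name here is hypothetical, standing in for numbers that would come out of the research phase of a real business case:

# Illustrative only: hypothetical figures for framing TCO and ROI in a
# Big Data business case. Real numbers come from the research phase.

def total_cost_of_ownership(hardware, software, staffing, training, support, years):
    """TCO over the project horizon: up-front costs plus recurring annual costs."""
    up_front = hardware + software
    recurring = (staffing + training + support) * years
    return up_front + recurring

def return_on_investment(total_benefit, total_cost):
    """ROI expressed as net benefit divided by cost."""
    return (total_benefit - total_cost) / total_cost

# Hypothetical three-year projection
tco = total_cost_of_ownership(hardware=250_000, software=100_000,
                              staffing=150_000, training=20_000,
                              support=30_000, years=3)
benefit = 1_400_000  # projected revenue gains plus cost avoidance (assumed)

print(f"TCO: ${tco:,.0f}")                               # TCO: $950,000
print(f"ROI: {return_on_investment(benefit, tco):.0%}")  # ROI: 47%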

THE RISE OF BIG DATA OPTIONS

Teradata, IBM, HP, Oracle, and many other companies have been offering terabyte-scale data warehouses for more than a decade, but those offerings were tuned for processes in which data warehousing was the primary goal. Today, data tend to be collected and stored in a wider variety of formats and can include structured, semistructured, and unstructured elements, each of which tends to have different storage and management requirements. For Big Data analytics, data must be processed in parallel across multiple servers; this is a necessity, given the amounts of information being analyzed.

In addition to having exhaustively maintained transactional data from databases and carefully culled data residing in data warehouses, organizations are reaping untold amounts of log data from servers, other forms of machine-generated data, customer comments from internal and external social networks, and other sources of loose, unstructured data.

Such data sets are growing at an exponential rate, a trend often linked to Moore’s Law, the observation that the number of transistors that can be placed on an integrated circuit doubles roughly every 18 to 24 months. Each new generation of processors is substantially more powerful than its predecessor, and server capacity grows at a similar pace, which means the applications running on those servers generate correspondingly larger data sets.
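As a rough, back-of-the-envelope illustration of what such a doubling cadence implies, the following projection assumes a hypothetical starting volume and doubling period; it is an arithmetic sketch, not a forecast:

# Back-of-the-envelope sketch: growth of a data set that doubles on a fixed
# cadence. The starting volume and doubling period are illustrative assumptions.

def projected_volume_tb(initial_tb, doubling_period_months, horizon_months):
    """Data volume after horizon_months, given one doubling per doubling_period_months."""
    doublings = horizon_months / doubling_period_months
    return initial_tb * (2 ** doublings)

initial = 10  # terabytes today (hypothetical)
for years in (1, 3, 5, 10):
    volume = projected_volume_tb(initial, doubling_period_months=18,
                                 horizon_months=years * 12)
    print(f"After {years:2d} years: ~{volume:,.0f} TB")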

The Big Data approach represents a major shift in how data are handled. In the past, carefully culled data were piped through the network to a data warehouse, where they could be further examined. However, as the volume of data increases, the network becomes a bottleneck. That is the kind of situation in which a distributed platform, such as Hadoop, comes into play. Distributed systems allow the analysis to occur where the data reside.
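The sketch below compresses the map-and-reduce pattern popularized by platforms such as Hadoop into a single Python process. It illustrates only how work is split and regrouped by key; on a real cluster the framework runs these steps on the nodes that already hold the data, and the log format here is invented for the example:

# Minimal, single-process illustration of the map/shuffle/reduce pattern behind
# platforms such as Hadoop. A real cluster runs these steps on many nodes,
# close to where the data reside. The log lines are invented for the example.
from collections import defaultdict

log_lines = [
    "2012-06-01 GET /sports/story-123",
    "2012-06-01 GET /finance/quote-xyz",
    "2012-06-02 GET /sports/story-456",
]

def map_phase(line):
    """Emit a (section, 1) pair for each page view, like a mapper's key/value output."""
    path = line.split()[-1]
    section = path.split("/")[1]
    yield section, 1

def shuffle(pairs):
    """Group values by key, as the framework does between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate each key's values, here a simple count of clicks per section."""
    return key, sum(values)

pairs = (pair for line in log_lines for pair in map_phase(line))
print([reduce_phase(key, values) for key, values in shuffle(pairs).items()])
# [('sports', 2), ('finance', 1)]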

Traditional data systems are not able to handle Big Data effectively, either because those systems are not designed to handle the variety of today’s data, which tend to have much less structure, or because the data systems cannot scale quickly and affordably. Big Data analytics works very differently from traditional BI, which normally relies on a clean subset of user data placed in a data warehouse to be queried in a limited number of predetermined ways.

Big Data takes a very different approach, in which all of the data an organization generates are gathered in their raw form, leaving administrators and analysts to decide how to structure and use the data later. In that sense, Big Data solutions prove to be more scalable than traditional databases and data warehouses.
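A small sketch of that “gather now, decide later” idea follows. Raw, loosely structured records are kept exactly as they arrive, and structure is imposed only when a question is asked (often called schema on read); the record layout is invented for the example:

# Illustration of "gather everything now, decide how to use it later": records
# are stored as raw, loosely structured JSON and interpreted only at query time
# (schema on read). The field names are invented for the example.
import json

raw_records = [
    '{"user": "u1", "event": "click", "page": "/deals"}',
    '{"user": "u2", "event": "comment", "text": "love the new layout"}',
    '{"user": "u1", "event": "purchase", "amount": 29.99}',
]

def query(records, event_type):
    """Apply structure at read time: parse each record and filter on one field."""
    for raw in records:
        record = json.loads(raw)
        if record.get("event") == event_type:
            yield record

print(list(query(raw_records, "purchase")))
# [{'user': 'u1', 'event': 'purchase', 'amount': 29.99}]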

To understand how the options around Big Data have evolved, one must go back to the birth of Hadoop and the dawn of the Big Data movement. Hadoop’s roots can be traced to white papers Google published in 2003 and 2004 describing the infrastructure it had built to store and analyze data across many inexpensive servers: the Google File System and the MapReduce programming model. Google kept that infrastructure for internal use, but Doug Cutting, a developer who had already created the Lucene search library and the Nutch open source search engine, built an open source implementation of it, naming the technology Hadoop after his son’s stuffed elephant.

One of Hadoop’s first adopters was Yahoo, which dedicated large amounts of engineering work to refining the technology around 2006. Yahoo’s primary challenge was to make sense of the vast amount of interesting data stored across separate systems. Unifying those data and analyzing them as a whole became a critical goal for Yahoo, and Hadoop turned out to be an ideal platform to make that happen. Today Yahoo is one of the biggest users of Hadoop and has deployed it on more than 40,000 servers.

The company uses the technology for multiple business cases and analytics chores. Yahoo’s Hadoop clusters hold massive log files of what stories and sections users click on; advertisement activity is also stored, as are lists of all of the content and articles Yahoo publishes. For Yahoo, Hadoop has proven to be well suited for searching for patterns in large sets of text.

BEYOND HADOOP

Another name to become familiar with in the Big Data realm is the Cassandra database, a technology that can store some two billion columns in a single row. That makes Cassandra ideal for appending more data onto existing user accounts without knowing ahead of time how the data should be formatted.
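Conceptually, a wide row behaves like an open-ended map of column names to values attached to a single row key, so new attributes can be appended whenever they appear. The toy sketch below shows the idea in plain Python; it is not Cassandra’s actual API or data model:

# Toy illustration of the wide-row idea: each row key holds an open-ended map of
# columns, so new attributes can be appended without a predefined schema.
# Plain Python, not Cassandra's actual API.
from collections import defaultdict

wide_rows = defaultdict(dict)  # row key -> {column name: value}

def append_column(row_key, column, value):
    wide_rows[row_key][column] = value

append_column("user:1001", "email", "pat@example.com")
append_column("user:1001", "login:2013-02-01T08:15", "mobile")
append_column("user:1001", "login:2013-02-01T21:40", "web")  # new columns as events arrive

print(wide_rows["user:1001"])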

Cassandra’s roots can also be traced to an online service provider, in this case Facebook, which needed a massive distributed database to power the service’s inbox search. Like Yahoo, Facebook wanted to use the Google Bigtable architecture, described in a 2006 Google paper, which could provide a column- and row-oriented database structure that could be spread across a large number of nodes.

However, Bigtable had a serious limitation: it used a master-oriented design that depended on a single master node to coordinate the cluster. This meant that if that node went down, the whole system would be unusable.

Cassandra was built on a distributed architecture called Dynamo, which the Amazon engineers who developed it described in a 2007 white paper. Amazon uses Dynamo to keep track of what its millions of online customers are putting in their shopping carts.

Dynamo gave Cassandra an advantage over Bigtable, since Dynamo is not dependent on any one master node. Any node can accept data for the whole system, as well as answer queries. Data are replicated on multiple hosts, creating resiliency and eliminating the single point of failure.
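A simplified sketch of that masterless placement idea appears below: each key hashes to a position on a ring of nodes and is copied to the next few nodes along the ring, so any node can accept a write and no single coordinator becomes a point of failure. The ring layout and replication factor are teaching simplifications, not Dynamo’s or Cassandra’s exact scheme:

# Simplified sketch of masterless data placement via consistent hashing:
# each key hashes to a position on a ring, and copies go to the next
# `replicas` nodes along the ring, so there is no single coordinating master.
# A teaching simplification, not Dynamo's or Cassandra's exact scheme.
import bisect
import hashlib

def ring_position(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def replica_nodes(key, replicas=3):
    """Return the nodes responsible for `key`: the next `replicas` nodes on the ring."""
    start = bisect.bisect(positions, ring_position(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replicas)]

print(replica_nodes("cart:customer-42"))  # e.g., ['node-c', 'node-d', 'node-a']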

WITH CHOICE COME DECISIONS

Many of the tools first developed by online service providers are becoming more widely available to enterprises as open source software. These days, Big Data tools are being tested by a wider range of organizations, beyond the large online service providers: financial institutions, telecommunications carriers, government agencies, utilities, retailers, and energy companies are all testing Big Data systems.

Naturally, more choices can make a decision harder, which is perhaps one of the biggest challenges in putting together a business plan that meets project needs without introducing additional uncertainty into the process. Ideally, a Big Data business plan will support both long-term strategic analysis and one-off transactional and behavioral analysis, delivering both immediate and long-term benefits.

While Hadoop is applicable to the majority of businesses, it is not the only game in town (at least when it comes to open source implementations). Once an organization has decided to leverage its heaps of machine-generated and social networking data, setting up the infrastructure will not be the biggest challenge. The biggest challenge may come from deciding whether to go it alone with open source software or to turn to one of the commercial implementations of Big Data technology. Vendors such as Cloudera, Hortonworks, and MapR are commercializing Big Data technologies, making them easier to deploy and manage.

Add to that the growing crop of Big Data on-demand services from cloud services providers, and the decision process becomes that much more complex. Decision makers will have to invest in research and perform due diligence to select the proper platform and implementation methodology to make a business plan successful. However, most of that legwork can be done during the business plan development phase, when the pros and cons of the various Big Data methodologies can be weighed and then measured against the overall goals of the business plan. Which technology will get there the fastest, with the lowest cost, and without mortgaging future capabilities?