Chapter 5

Big Data Sources

One of the biggest challenges for most organizations is finding data sources to use as part of their analytics processes. As the name implies, Big Data is large, but size is not the only concern. There are several other considerations when deciding how to locate and parse Big Data sets.

The first step is to identify usable data. While that may sound obvious, it is anything but simple. Locating the appropriate data to push through an analytics platform can be complex and frustrating. Each source must be vetted to determine whether its data set is appropriate for use, and in practice that vetting amounts to detective work or investigative reporting.

Considerations should include the following:

  • Structure of the data (structured, unstructured, semistructured, table based, proprietary)
  • Source of the data (internal, external, private, public)
  • Value of the data (generic, unique, specialized)
  • Quality of the data (verified, static, streaming)
  • Storage of the data (remotely accessed, shared, dedicated platforms, portable)
  • Relationship of the data (superset, subset, correlated)

All of those elements and many others can affect the selection process and can have a dramatic effect on how the raw data are prepared (“scrubbed”) before the analytics process takes place.

In the IT realm, once a data source is located, the next step is to import the data into an appropriate platform. That process may be as simple as copying data onto a Hadoop cluster or as complicated as scrubbing, indexing, and importing the data into a large SQL-type table. That importation, or gathering of the data, is only one step in a multistep, sometimes complex process.

Once the importation (or real-time updating) has been performed, templates and scripts can be designed to ease further data-gathering chores. Once the process has been designed, it becomes easier to execute in the future.
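
To make the process concrete, here is a minimal ingestion sketch in Python, assuming the raw data arrive as CSV files destined for a SQL-type table. The file name, table name, and column names are hypothetical placeholders; a real pipeline would parameterize them per source.

    import csv
    import sqlite3

    def scrub(row):
        """Reject rows missing key fields; normalize whitespace."""
        if not row.get("customer_id") or not row.get("amount"):
            return None
        return {k: (v or "").strip() for k, v in row.items()}

    def ingest(csv_path, db_path="warehouse.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS sales "
                     "(customer_id TEXT, sku TEXT, amount REAL)")
        with open(csv_path, newline="", encoding="utf-8") as f:
            for raw in csv.DictReader(f):
                row = scrub(raw)
                if row is None:
                    continue  # skip unusable records
                conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
                             (row["customer_id"], row["sku"],
                              float(row["amount"])))
        conn.commit()
        conn.close()

    ingest("daily_sales.csv")

Once a script like this exists for one source, pointing it at the next source is largely a matter of changing the parameters, which is exactly the reuse that templates provide.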

Building a Big Data set ultimately serves one strategic purpose: to mine the data, or dig for something of value. Mining data involves a lot more than just running algorithms against a particular data source. Usually, the data have to be first imported into a platform that can deal with the data in an appropriate fashion. This means the data have to be transformed into something accessible, queryable, and relatable. Mining starts with a mine or, in Big Data parlance, a platform. Ultimately, to have any value, that platform must be populated with usable information.

HUNTING FOR DATA

Finding data for Big Data analytics is part science, part investigative work, and part assumption. Some of the most obvious sources are electronic transactions, web site logs, and sensor information. Any data the organization gathers while doing business are included. The idea is to locate as many data sources as possible and bring the data into an analytics platform. Additional data can be gathered using network taps and data replication clients. The more data that can be captured, the more raw material the analytics process has to draw on.

Finding the internal data is the easy part of Big Data. It gets more complicated once data considered unrelated, external, or unstructured are brought into the equation. With that in mind, the big question with Big Data now is, “Where do I get the data from?” This is not easily answered; it takes some research to separate the wheat from the chaff, knowing that the chaff may have some value as well.

Setting out to build a Big Data warehouse takes a concentrated effort to find the appropriate data. The first step is to determine what Big Data analytics is going to be used for. For example, is the business looking to analyze marketing trends, predict web traffic, gauge customer satisfaction, or achieve some other lofty goal that can be accomplished with the current technologies?

It is this knowledge that will determine where and how to gather Big Data. Perhaps the best way to build such knowledge is to better understand the business analytics (BA) and business intelligence (BI) processes to determine how large-scale data sets can be used to interact with internal data to garner actionable results.

SETTING THE GOAL

Every project usually starts out with a goal and with objectives to reach that goal. Big Data analytics should be no different. However, defining the goal can be a difficult process, especially when the goal is vague and amounts to little more than something like “using the data better.” It is imperative to define the goal before hunting for data sources, and in many cases, proven examples of success can be the foundation for defining a goal.

Take, for example, a retail organization. The goal for Big Data analytics may be to increase sales, a chore that spans several business disciplines and departments, including marketing, pricing, inventory, advertising, and customer relations. Once there is a goal in mind, the next step is to define the objectives, the exact means by which to reach the goal.

For a project such as the retail example, it will be necessary to gather information from a multitude of sources, some internal and others external. Some of the data may have to be purchased, and some may be available in the public domain. The key is to start with the internal, structured data first, such as sales logs, inventory movement, registered transactions, customer information, pricing, and supplier interactions.

Next come the unstructured data, such as call center and support logs, customer feedback (perhaps e-mails and other communications), surveys, and data gathered by sensors (store traffic, parking lot usage). The list can include many other internally tracked elements; however, it is critical to watch for diminishing returns on the data being sourced. In other words, some log information may not be worth the effort to gather, because it will not affect the analytics outcome.

Finally, external data must be taken into account. There is a vast wealth of external information that can be used to calculate everything from customer sentiments to geopolitical issues. The data that make up the public portion of the analytics process can come from government entities, research companies, social networking sites, and a multitude of other sources.

For example, a business may decide to mine Twitter, Facebook, the U.S. census, weather information, traffic pattern information, and news archives to build a complex source of rich data. Some controls need to be in place, and that may even include scrubbing the data before processing (i.e., removing spurious information or invalid elements).
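
As a sketch of what such scrubbing might look like in Python, the snippet below drops empty and duplicate records from externally gathered data. The field names are hypothetical stand-ins for whatever a real feed provides.

    def scrub_records(records):
        """Drop records with no usable text and de-duplicate the rest."""
        seen = set()
        for rec in records:
            text = (rec.get("text") or "").strip()
            if not text:
                continue  # spurious: no usable content
            key = (rec.get("source"), text)
            if key in seen:
                continue  # duplicate of a record already kept
            seen.add(key)
            yield {"source": rec.get("source"), "text": text}

    raw = [
        {"source": "twitter", "text": "Great service at the downtown store!"},
        {"source": "twitter", "text": "Great service at the downtown store!"},
        {"source": "survey", "text": "   "},
    ]
    print(list(scrub_records(raw)))  # only the first record survives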

The richness of the data is the basis for predictive analytics. A company looking to increase sales may compare population trends, along with social sentiment, to customer feedback and satisfaction to identify where the sales process could be improved. The data warehouse can be used for much more after the initial processing, and real-time data could also be integrated to identify trends as they arise.

The retail situation is only one example; there are dozens of others, each of which may have a specific applicability to the task at hand.

BIG DATA SOURCES GROWING

Multiple sources are responsible for the growth in data that is applicable to Big Data technology. Some are entirely new data sources, while others represent a change in the resolution of the data already being generated. Much of that growth can be attributed to the digitization of content across industries.

With companies now creating digital representations of existing data and capturing everything that is new, data growth rates over the last few years have been effectively unbounded in percentage terms, simply because most of the businesses involved started from zero.

Many industries fall under the umbrella of new data creation and digitization of existing data, and most are becoming appropriate sources for Big Data resources. Those industries include the following:

  • Transportation, logistics, retail, utilities, and telecommunications. Sensor data are being generated at an accelerating rate from fleet GPS transceivers, RFID (radio-frequency identification) tag readers, smart meters, and cell phones (call data records); these data are used to optimize operations and drive operational BI to realize immediate business opportunities.
  • Health care. The health care industry is quickly moving to electronic medical records and images, which it wants to use for short-term public health monitoring and long-term epidemiological research programs.
  • Government. Many government agencies are digitizing public records, such as census information, energy usage, budgets, Freedom of Information Act documents, electoral data, and law enforcement reporting.
  • Entertainment media. The entertainment industry has moved to digital recording, production, and delivery in the past five years and is now collecting large amounts of rich content and user viewing behaviors.
  • Life sciences. Low-cost gene sequencing (less than $1,000) can generate tens of terabytes of information that must be analyzed to look for genetic variations and potential treatment effectiveness.
  • Video surveillance. Video surveillance is still transitioning from closed-circuit television (CCTV) to Internet protocol (IP) cameras and recording systems that organizations want to analyze for behavioral patterns (security and service enhancement).

For many businesses, the additional data can come from self-service marketplaces, which record the use of affinity cards and track the sites visited, and can be combined with social networks and location-based metadata. This creates a goldmine of actionable consumer data for retailers, distributors, and manufacturers of consumer packaged goods.

The legal profession is adding to the multitude of data sources, thanks to the discovery process, which is dealing more frequently with electronic records and requiring the digitization of paper documents for faster indexing and improved access. Today, leading e-discovery companies are handling terabytes or even petabytes of information that need to be retained and reanalyzed for the full course of a legal proceeding.

Additional information and large data sets can be found on social media sites such as Facebook, Foursquare, and Twitter. A number of new businesses are now building Big Data environments, based on scale-out clusters using power-efficient multicore processors that leverage consumers’ (conscious or unconscious) nearly continuous streams of data about themselves (e.g., likes, locations, and opinions).

Thanks to the network effect of successful sites, the total data generated can expand at an exponential rate. Some companies have collected and analyzed more than 4 billion data points (e.g., web site cut-and-paste operations) since they began gathering information, and within a year that total had grown to 20 billion data points.

DIVING DEEPER INTO BIG DATA SOURCES

A change in resolution is further driving the expansion of Big Data. Here additional data points are gathered from existing systems or with the installation of new sensors that deliver more pieces of information. Some examples of increased resolution can be found in the following areas:

  • Financial transactions. Thanks to the consolidation of global trading environments and the increased use of programmed trading, the volume of transactions being collected and analyzed is doubling or tripling. Transaction volumes also fluctuate faster, more widely, and more unpredictably. Competition among firms is creating more data, simply because sampling for trading decisions is occurring more frequently and at shorter intervals.
  • Smart instrumentation. The use of smart meters in energy grid systems, which shifts meter readings from monthly to every 15 minutes, can translate into a multithousandfold increase in data generated (see the arithmetic sketch following this list). Smart meter technology extends beyond just power usage and can measure heating, cooling, and other loads, which can serve as an indicator of household occupancy at any given moment.
  • Mobile telephony. With the advances in smartphones and connected PDAs, the primary data generated from these devices have grown beyond caller, receiver, and call length. Additional data are now being harvested at exponential rates, including elements such as geographic location, text messages, browsing history, and (thanks to the addition of accelerometers) even motions, as well as social network posts and application use.
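
The smart-meter arithmetic above is easy to verify; a few lines of Python show why one reading a month becoming one reading every 15 minutes is a multithousandfold increase:

    readings_per_day = 24 * 60 // 15            # 96 readings a day
    readings_per_month = readings_per_day * 30  # assuming a 30-day month
    print(readings_per_month)                   # 2880, versus 1 manual reading
                                                # a month: a ~2,900-fold increase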

A WEALTH OF PUBLIC INFORMATION

For those looking to sample what is available for Big Data analytics, a vast amount of data exists on the Web; some of it is free, and some of it is available for a fee. Much of it is simply there for the taking. If your goal is to start gathering data, it is pretty hard to beat many of the tools that are readily available on the market. For those looking for point-and-click simplicity, Extractiv (http://www.extractiv.com) and Mozenda (http://www.mozenda.com) offer the ability to acquire data from multiple sources and to search the Web for information.

Another candidate for processing data on the Web is Google Refine (http://code.google.com/p/google-refine), a tool set that can work with messy data, cleaning them up and then transforming them into different formats for analytics. 80Legs (http://www.80legs.com) specializes in gathering data from social networking sites as well as retail and business directories.
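
For readers who prefer scripting to point-and-click tools, the sketch below shows the underlying idea in plain Python using only the standard library: fetch a page and pull out its links. It is not how the commercial tools above work internally, and the URL is a stand-in for any public source.

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":  # collect every hyperlink target
                self.links.extend(v for k, v in attrs if k == "href" and v)

    with urlopen("http://example.com") as resp:
        collector = LinkCollector()
        collector.feed(resp.read().decode("utf-8", errors="replace"))

    print(collector.links)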

The tools just mentioned are excellent for mining data from the Web and transforming them for use in a Big Data analytics platform. However, gathering data is only the first of many steps. To garner value from the data, they must be analyzed and, better yet, visualized. Tools such as Grep (http://www.linfo.org/grep.html), Turk (http://www.mturk.com), and BigSheets (http://www-01.ibm.com/software/ebusiness/jstart/bigsheets) offer the ability to analyze data. For visualization, analysts can turn to tools such as Tableau Public (http://www.tableausoftware.com), OpenHeatMap (http://www.openheatmap.com), and Gephi (http://www.gephi.org).

Beyond the use of discovery tools, Big Data can be found through services and sites such as CrunchBase, the U.S. census, InfoChimps, Kaggle, Freebase, and Timetric. Many other services offer data sets directly for integration into Big Data processing.

The prices of some of these services are rather reasonable. For example, you can download a million Web pages through 80Legs for less than three dollars. Some of the top data sets can be found on commercial sites, yet for free. An example is the Common Crawl Corpus, which contains crawl data from about five billion Web pages and is available in the ARC file format from Amazon S3. Google Books Ngrams is another data set that Amazon S3 makes available for free, in a Hadoop-friendly format. For those who may be wondering, n-grams are fixed-length sequences of items; in this case, the items are words extracted from the Google Books corpus. The n specifies the number of items in the sequence, so a five-gram is a run of five consecutive words. A short sketch of the extraction follows.
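
Here is a minimal word n-gram extraction in Python, illustrating the structure behind the Ngrams data set; the sample sentence is an arbitrary stand-in.

    def ngrams(words, n):
        """Return every run of n consecutive words as a tuple."""
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    tokens = "to be or not to be".split()
    print(ngrams(tokens, 2))  # bigrams: ('to', 'be'), ('be', 'or'), ...
    print(ngrams(tokens, 5))  # five-grams: two of them in this short sentence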

Many more data sets are available from Amazon S3, and it is definitely worth visiting http://aws.amazon.com/publicdatasets/ to track these down. Another site to visit for a listing of public data sets is http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public, a treasure trove of links to data sets and information related to those data sets.

GETTING STARTED WITH BIG DATA ACQUISITION

Barriers to Big Data adoption are generally cultural rather than technological. In particular, many organizations fail to implement Big Data programs because they are unable to appreciate how data analytics can improve their core business. One of the most common triggers for Big Data development is a data explosion that makes existing data sets very large and increasingly difficult to manage with conventional database management tools.

As these data sets grow in size (typically ranging from several terabytes to multiple petabytes), businesses face the challenge of capturing, managing, and analyzing the data in an acceptable time frame. Getting started involves several steps, starting with training. Training is a prerequisite for understanding the paradigm shift that Big Data represents. Without that grounding, it becomes difficult to explain and communicate the value of the data, especially when the data are public in nature. Next on the list is the integration of the development and operations teams (known as DevOps), the people most likely to deal with the burdens of storing and transforming the data into something usable.

Much of the process of moving forward will lie with the business executives and decision makers, who will also need to be brought up to speed on the value of Big Data. The advantages must be explained in a fashion that makes sense to the business operations, which in turn means that IT pros are going to have to do some legwork. To get started, it helps to follow a few guiding principles:

  • Identify a problem that business leaders can understand and relate to and that commands their attention.
  • Do not focus exclusively on the technical data management challenge. Be sure to allocate resources to understand the uses for the data within the business.
  • Define the questions that must be answered to meet the business objective, and only then focus on discovering the necessary data.
  • Understand the tools available to merge the data and the business process so that the result of the data analysis is more actionable.
  • Build a scalable infrastructure that can handle growth of the data. Good analysis requires enough computing power to pull in and analyze the data; many people get discouraged because an underpowered platform makes the analytic process slow and laborious from the start.
  • Identify technologies that you can trust. A dizzying variety of open source Big Data software technologies are available, and many are likely to disappear within a few years. Find one that has professional vendor support, or be prepared to take on permanent maintenance of the technology as well as the solution in the long run. Hadoop seems to be attracting a lot of mainstream vendor support.
  • Choose a technology that fits the problem. Hadoop is best for large but relatively simple data set filtering, converting, sorting, and analysis, and it is also good for sifting through large volumes of text (a minimal example of such a job follows this list). It is not really useful for ongoing persistent data management, especially if structural consistency and transactional integrity are required.
  • Be aware of changing data formats and changing data needs. For instance, a common problem faced by organizations seeking to use BI solutions to manage marketing campaigns is that those campaigns can be very specifically focused, requiring an analysis of data structures that may be in play for only a month or two. Using conventional relational database management system techniques, it can take several weeks for database administrators to get a data warehouse ready to accept the changed data, by which time the campaign is nearly over. A MapReduce solution, such as one built on a Hadoop framework, can reduce those weeks to a day or two. Thus it is not just volume but also variety that can drive Big Data adoption.
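
As promised above, here is a minimal sketch of the kind of simple counting-and-filtering job Hadoop handles well, written in the Hadoop Streaming style, where the mapper and reducer are plain scripts connected by the framework. Cluster deployment details are omitted.

    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word.lower()}\t1")  # emit key<TAB>count pairs

    def reducer():
        # Hadoop delivers mapper output sorted by key, so equal words
        # arrive together and can be totaled in a single pass.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

The same pipeline can be simulated locally (file names hypothetical) with: cat input.txt | python job.py map | sort | python job.py reduce.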

ONGOING GROWTH, NO END IN SIGHT

Data creation is occurring at a record rate. In fact, the research firm IDC’s Digital Universe Study predicts that between 2009 and 2020, digital data will grow 44-fold, to 35 zettabytes per year. It is also important to recognize that much of this data explosion is the result of an explosion in devices located at the periphery of the network, including embedded sensors, smartphones, and tablet computers. All of these data create new opportunities for data analytics in human genomics, health care, oil and gas, search, surveillance, finance, and many other areas.