One of the biggest challenges for most organizations is finding data sources to use as part of their analytics processes. As the name implies, Big Data is large, but size is not the only concern. There are several other considerations when deciding how to locate and parse Big Data sets.
The first step is to identify usable data. While that may sound obvious, it is anything but simple. Locating the appropriate data to push through an analytics platform can be complex and frustrating, and each source must be evaluated to determine whether its data set is appropriate for use. In practice, that evaluation amounts to detective work or investigative reporting.
Considerations should include the following:
All of those elements and many others can affect the selection process and can have a dramatic effect on how the raw data are prepared (“scrubbed”) before the analytics process takes place.
In the IT realm, once a data source is located, the next step is to import the data into an appropriate platform. That process may be as simple as copying data onto a Hadoop cluster or as complicated as scrubbing, indexing, and importing the data into a large SQL-type table. That importation, or gathering of the data, is only one step in a multistep, sometimes complex process.
Once the importation (or real-time updating) has been performed, templates and scripts can be designed to ease further data-gathering chores; once designed, the process becomes far easier to execute the next time.
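As a minimal sketch of what such a reusable import script might look like, the following Python code scrubs a raw CSV extract and loads it into an indexed SQL-type table. SQLite stands in for a production database, and the file name, column names, and validation rules are illustrative assumptions, not a prescribed standard:

```python
# Minimal sketch of a reusable scrub-and-import script.
# File name, table schema, and validation rules are illustrative
# assumptions, not a prescribed standard.
import csv
import sqlite3

def import_sales_log(csv_path, db_path="analytics.db"):
    """Scrub a raw CSV extract and load it into a SQL-type table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(tx_id TEXT PRIMARY KEY, store TEXT, amount REAL)"
    )
    # An index speeds up the queries the analytics step will run later.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_store ON sales (store)")

    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Scrubbing: skip rows with missing keys or bad amounts.
            if not row.get("tx_id") or not row.get("store"):
                continue
            try:
                amount = float(row["amount"])
            except (KeyError, ValueError):
                continue
            conn.execute(
                "INSERT OR IGNORE INTO sales VALUES (?, ?, ?)",
                (row["tx_id"].strip(), row["store"].strip(), amount),
            )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    import_sales_log("daily_sales.csv")
```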
Building a Big Data set ultimately serves one strategic purpose: to mine the data, or dig for something of value. Mining data involves a lot more than just running algorithms against a particular data source. Usually, the data have to be first imported into a platform that can deal with the data in an appropriate fashion. This means the data have to be transformed into something accessible, queryable, and relatable. Mining starts with a mine or, in Big Data parlance, a platform. Ultimately, to have any value, that platform must be populated with usable information.
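Once populated, even a simple aggregate query starts to "dig." The sketch below queries the hypothetical sales table created by the import script above:

```python
# A first, trivial "dig" into the populated platform: total sales per
# store, using the hypothetical table created by the import sketch above.
import sqlite3

conn = sqlite3.connect("analytics.db")
for store, total in conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY 2 DESC"
):
    print(f"{store}: {total:.2f}")
conn.close()
```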
Finding data for Big Data analytics is part science, part investigative work, and part assumption. Some of the most obvious sources for data are electronic transactions, web site logs, and sensor information. Any data the organization gathers while doing business are included. The idea is to locate as many data sources as possible and bring the data into an analytics platform. Additional data can be gathered using network taps and data replication clients. Ideally, the more data that can be captured, the more data there will be to work with.
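Web server logs are among the most accessible of those internal sources. Below is a minimal sketch of extracting structured fields from an Apache-style combined-format access log; the file path and the exact log format are assumptions:

```python
# Sketch: extract structured fields from an Apache combined-format
# access log. The path and the exact format are assumptions.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_log(path):
    """Yield one dict per well-formed log line; skip anything malformed."""
    with open(path) as f:
        for line in f:
            m = LOG_LINE.match(line)
            if m:
                yield m.groupdict()

# Example: count hits per requested path.
hits = Counter(rec["path"] for rec in parse_access_log("access.log"))
print(hits.most_common(10))
```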
Finding the internal data is the easy part of Big Data. It gets more complicated once data considered unrelated, external, or unstructured are brought into the equation. With that in mind, the big question with Big Data now is, “Where do I get the data from?” This is not easily answered; it takes some research to separate the wheat from the chaff, knowing that the chaff may have some value as well.
Setting out to build a Big Data warehouse takes a concentrated effort to find the appropriate data. The first step is to determine what Big Data analytics is going to be used for. For example, is the business looking to analyze marketing trends, predict web traffic, gauge customer satisfaction, or achieve some other lofty goal that can be accomplished with the current technologies?
It is this knowledge that will determine where and how to gather Big Data. Perhaps the best way to build such knowledge is to better understand the business analytics (BA) and business intelligence (BI) processes to determine how large-scale data sets can be used to interact with internal data to garner actionable results.
Every project usually starts out with a goal and with objectives to reach that goal. Big Data analytics should be no different. However, defining the goal can be a difficult process, especially when the goal is vague and amounts to little more than something like “using the data better.” It is imperative to define the goal before hunting for data sources, and in many cases, proven examples of success can be the foundation for defining a goal.
Take, for example, a retail organization. The goal for Big Data analytics may be to increase sales, a chore that spans several business ideologies and departments, including marketing, pricing, inventory, advertising, and customer relations. Once there is a goal in mind, the next step is to define the objectives, the exact means by which to reach the goal.
For a project such as the retail example, it will be necessary to gather information from a multitude of sources, some internal and others external. Some of the data may have to be purchased, and some may be available in the public domain. The key is to start with the internal, structured data first, such as sales logs, inventory movement, registered transactions, customer information, pricing, and supplier interactions.
Next come the unstructured data, such as call center and support logs, customer feedback (perhaps e-mails and other communications), surveys, and data gathered by sensors (store traffic, parking lot usage). The list can include many other internally tracked elements; however, it is critical to be aware of diminishing returns on investment with the data sourced. In other words, some log information may not be worth the effort to gather, because it will not affect the analytics outcome.
Finally, external data must be taken into account. There is a vast wealth of external information that can be used to calculate everything from customer sentiments to geopolitical issues. The data that make up the public portion of the analytics process can come from government entities, research companies, social networking sites, and a multitude of other sources.
For example, a business may decide to mine Twitter, Facebook, the U.S. census, weather information, traffic pattern information, and news archives to build a complex source of rich data. Some controls need to be in place, and that may even include scrubbing the data before processing (i.e., removing spurious information or invalid elements).
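As a minimal sketch of what such a scrubbing pass might look like over harvested social media records, the following filters out duplicates, empty posts, and unprintable characters; the field names and rules are illustrative assumptions, not a fixed schema:

```python
# Sketch of a pre-processing "scrub": drop duplicates, records with
# missing fields, and spurious text. Field names and rules are
# illustrative assumptions, not a fixed schema.
def scrub(records):
    seen = set()
    for rec in records:
        text = (rec.get("text") or "").strip()
        # Remove invalid elements: empty posts and missing identifiers.
        if not text or not rec.get("id"):
            continue
        # Remove spurious information: exact duplicates (e.g., reposts).
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        # Strip control characters that break downstream parsers.
        rec["text"] = "".join(ch for ch in text if ch.isprintable())
        yield rec

raw = [
    {"id": "1", "text": "Great store!"},
    {"id": "1", "text": "Great store!"},  # duplicate -> dropped
    {"id": "2", "text": ""},              # empty -> dropped
]
print(list(scrub(raw)))                   # keeps only record "1"
```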
The richness of the data is the basis for predictive analytics. A company looking to increase sales may compare population trends, along with social sentiment, to customer feedback and satisfaction to identify where the sales process could be improved. The data warehouse can be used for much more after the initial processing, and real-time data could also be integrated to identify trends as they arise.
The retail situation is only one example; there are dozens of others, each of which may have a specific applicability to the task at hand.
Multiple sources are responsible for the growth in data applicable to Big Data technology. Some are entirely new data sources, while others reflect a change in the resolution of the data already being generated. Much of that growth can be attributed to the digitization of content across industries.
With companies now creating digital representations of existing data and capturing everything that is new, data growth rates over the last few years have been nearly infinite, simply because most of the businesses involved started from zero.
Many industries fall under the umbrella of new data creation and digitization of existing data, and most are becoming appropriate sources for Big Data resources. Those industries include the following:
For many businesses, the additional data can come from self-service marketplaces, which record the use of affinity cards and track the sites visited, and can be combined with social networks and location-based metadata. This creates a goldmine of actionable consumer data for retailers, distributors, and manufacturers of consumer packaged goods.
The legal profession is adding to the multitude of data sources, thanks to the discovery process, which is dealing more frequently with electronic records and requiring the digitization of paper documents for faster indexing and improved access. Today, leading e-discovery companies are handling terabytes or even petabytes of information that need to be retained and reanalyzed for the full course of a legal proceeding.
Additional information and large data sets can be found on social media sites such as Facebook, Foursquare, and Twitter. A number of new businesses are now building Big Data environments, based on scale-out clusters using power-efficient multicore processors that leverage consumers’ (conscious or unconscious) nearly continuous streams of data about themselves (e.g., likes, locations, and opinions).
Thanks to the network effect of successful sites, the total data generated can expand at an exponential rate. Some companies collected and analyzed more than 4 billion data points (e.g., web site cut-and-paste operations) after they began gathering information, and within a year that total had expanded to 20 billion data points.
A change in resolution is further driving the expansion of Big Data. Here additional data points are gathered from existing systems or with the installation of new sensors that deliver more pieces of information. Some examples of increased resolution can be found in the following areas:
For those looking to sample what is available for Big Data analytics, a vast amount of data exists on the Web; some of it is free, and some of it is available for a fee. Much of it is simply there for the taking. If your goal is to start gathering data, it is pretty hard to beat many of the tools that are readily available on the market. For those looking for point-and-click simplicity, Extractiv (http://www.extractiv.com) and Mozenda (http://www.mozenda.com) offer the ability to acquire data from multiple sources and to search the Web for information.
Another candidate for processing data on the Web is Google Refine (http://code.google.com/p/google-refine), a tool set that can work with messy data, cleaning them up and then transforming them into different formats for analytics. 80Legs (http://www.80legs.com) specializes in gathering data from social networking sites as well as retail and business directories.
The tools just mentioned are excellent examples of how to mine data from the Web and feed them into a Big Data analytics platform. However, gathering data is only the first of many steps. To garner value from the data, they must be analyzed and, better yet, visualized. Tools such as Grep (http://www.linfo.org/grep.html), Turk (http://www.mturk.com), and BigSheets (http://www-01.ibm.com/software/ebusiness/jstart/bigsheets) offer the ability to analyze data. For visualization, analysts can turn to tools such as Tableau Public (http://www.tableausoftware.com), OpenHeatMap (http://www.openheatmap.com), and Gephi (http://www.gephi.org).
Beyond the use of discovery tools, Big Data can be found through services and sites such as CrunchBase, the U.S. census, InfoChimps, Kaggle, Freebase, and Timetric. Many other services offer data sets directly for integration into Big Data processing.
The prices of some of these services are rather reasonable. For example, you can download a million Web pages through 80Legs for less than three dollars. Some of the top data sets can be found on commercial sites, yet for free. An example is the Common Crawl Corpus, which contains crawl data from about five billion Web pages and is available in the ARC file format from Amazon S3. Google Books Ngrams is another data set that Amazon S3 makes available for free, in a Hadoop-friendly format. For those who may be wondering, n-grams are fixed-length sequences of items; in this case, the items are words extracted from the Google Books corpus. The n specifies the number of elements in the sequence, so a five-gram contains five consecutive words.
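To make that definition concrete, here is a short illustrative sketch of extracting word-level n-grams from text; it is not the Google Books pipeline, just the general technique:

```python
# Illustration of the n-gram definition: fixed-length sequences of
# adjacent items (here, words). Not the Google Books pipeline.
def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "to be or not to be".split()
print(ngrams(words, 2))  # bigrams: ('to','be'), ('be','or'), ...
print(ngrams(words, 5))  # five-grams contain five consecutive words
```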
Many more data sets are available from Amazon S3, and it is definitely worth visiting http://aws.amazon.com/publicdatasets/ to track these down. Another site to visit for a listing of public data sets is http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public, a treasure trove of links to data sets and information related to those data sets.
Barriers to Big Data adoption are generally cultural rather than technological. In particular, many organizations fail to implement Big Data programs because they are unable to appreciate how data analytics can improve their core business. One of the most common triggers for Big Data development is a data explosion that makes existing data sets very large and increasingly difficult to manage with conventional database management tools.
As these data sets grow in size—typically ranging from several terabytes to multiple petabytes—businesses face the challenge of capturing, managing, and analyzing the data in an acceptable time frame. Getting started involves several steps, the first of which is training. Training is a prerequisite for understanding the paradigm shift that Big Data offers. Without that insider knowledge, it becomes difficult to explain and communicate the value of data, especially when the data are public in nature. Next on the list is the integration of development and operations teams (known as DevOps), the people most likely to deal with the burdens of storing and transforming the data into something usable.
Much of the process of moving forward will lie with the business executives and decision makers, who will also need to be brought up to speed on the value of Big Data. The advantages must be explained in a fashion that makes sense to the business operations, which in turn means that IT pros are going to have to do some legwork. To get started, it proves helpful to pursue a few ideologies:
Data creation is occurring at a record rate. In fact, the research firm IDC’s Digital Universe Study predicts that between 2009 and 2020, digital data will grow 44-fold, to 35 zettabytes per year (implying a 2009 baseline of roughly 0.8 zettabytes). It is also important to recognize that much of this data explosion is the result of an explosion in devices located at the periphery of the network, including embedded sensors, smartphones, and tablet computers. All of these data create new opportunities for data analytics in human genomics, health care, oil and gas, search, surveillance, finance, and many other areas.