Like any other technology or process, Big Data has best practices that can be applied to its problems. Best practices usually arise from years of testing and measuring results, giving them a solid foundation to build on. However, Big Data, as it is applied today, is relatively new, short-circuiting the tried-and-true methodology used in the past to derive best practices. Nevertheless, best practices are presenting themselves at a fairly accelerated rate, which means that we can still learn from the mistakes and successes of others to define what works best and what doesn’t.
The evolutionary aspect of Big Data tends to affect best practices, so what may be best today may not necessarily be best tomorrow. That said, there are still some core proven techniques that can be applied to Big Data analytics and that should withstand the test of time. With new terms, new skill sets, new products, and new providers, the world of Big Data analytics can seem unfamiliar, but tried-and-true data management best practices do hold up well in this still emerging discipline.
As with any business intelligence (BI) and/or data warehouse initiative, it is critical to have a clear understanding of an organization’s data management requirements and a well-defined strategy before venturing too far down the Big Data analytics path. Big Data analytics is widely hyped, and companies in all sectors are being flooded with new data sources and ever larger amounts of information. Yet making a big investment to attack the Big Data problem without first figuring out how doing so can really add value to the business is one of the most serious missteps for would-be users.
The trick is to start from a business perspective and not get too hung up on the technology, which may entail mediating conversations among the chief information officer (CIO), the data scientists, and other businesspeople to identify what the business objectives are and what value can be derived. Defining exactly what data are available and mapping out how an organization can best leverage the resources is a key part of that exercise.
CIOs, IT managers, and BI and data warehouse professionals need to examine what data are being retained, aggregated, and utilized and compare that with what data are being thrown away. It is also critical to consider external data sources that are currently not being tapped but that could be a compelling addition to the mix. Even if companies aren’t sure how and when they plan to jump into Big Data analytics, there are benefits to going through this kind of an evaluation sooner rather than later.
Beginning the process of accumulating data also makes you better prepared for the eventual leap to Big Data, even if you don’t know what you are going to use it for at the outset. The trick is to start accumulating the information as soon as possible. Otherwise you risk a missed opportunity: information can fall through the cracks, leaving you without that rich history of information to draw on when Big Data enters the picture.
When analyzing Big Data, it makes sense to define small, high-value opportunities and use those as a starting point. Ideally, those smaller tasks will build the expertise needed to deal with the larger questions an organization may have for the analytics process. As companies expand the data sources and types of information they are looking to analyze, and as they start to create the all-important analytical models that can help them uncover patterns and correlations in both structured and unstructured data, they need to be vigilant about homing in on the findings that are most important to their stated business objectives.
It is critical to avoid situations in which you end up with a process that identifies new patterns and data relationships that offer little value to the business process. That creates a dead spot in the analytics matrix where patterns, though new, may not be relevant to the questions being asked.
Successful Big Data projects tend to start with very targeted goals and focus on smaller data sets. Only then can that success be built upon to create a true Big Data analytics methodology, one that starts small and grows as the practice proves its worth to the enterprise, allowing value to be created with little up-front investment while preparing the company for the potential windfall of information that analytics can deliver.
That can be accomplished by starting with “small bites” (i.e., taking individual data flows and migrating those into different systems for converged processing). Over time, those small bites will turn into big bites, and Big Data will be born. The ability to scale will prove important—as data collection increases, the scale of the system will need to grow to accommodate the data.
Leveraging open source Hadoop technologies and emerging packaged analytics tools can make an open source environment more familiar to business analysts trained in using SQL. Ultimately, however, scale will become the primary factor when mapping out a Big Data analytics road map, and business analysts will need to move beyond familiar SQL habits to grasp the concept of distributed platforms that run on nodes and clusters.
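As a rough illustration of that transition, the following sketch assumes a running Spark cluster with PySpark available and a hypothetical events data set on HDFS; it shows how an analyst can keep writing familiar SQL while the query itself is executed across the cluster's nodes.

```python
# Minimal sketch: a SQL-trained analyst querying a distributed data set with PySpark.
# The cluster, the HDFS path, and the column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("familiar-sql-on-a-cluster").getOrCreate()

# The data live across many nodes, but the analyst still writes ordinary SQL.
events = spark.read.parquet("hdfs:///data/events.parquet")
events.createOrReplaceTempView("events")

daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")

daily_counts.show(10)
```

The familiar syntax is only a bridge; understanding where the data physically live and how work is partitioned across nodes remains part of the analyst's job.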
It is critical to consider what the buildup will look like. That can be accomplished by estimating how much data will need to be gathered six months from now and calculating how many more servers may be needed to handle it. You will also have to make sure that the software is up to the task of scaling. One big mistake is to ignore the potential growth of the solution and how popular it may become once it is rolled into production.
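That buildup calculation can start as simple arithmetic. The sketch below is a hypothetical back-of-the-envelope projection; the growth rate, per-server capacity, and node counts are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope capacity planning: how many more servers in six months?
# All figures below are illustrative assumptions.
current_data_tb = 40.0               # data under management today, in terabytes
monthly_growth_rate = 0.15           # assume 15% compound growth per month
months_ahead = 6
usable_capacity_per_node_tb = 8.0    # usable storage per server after replication overhead
current_nodes = 6

projected_data_tb = current_data_tb * (1 + monthly_growth_rate) ** months_ahead
nodes_required = int(-(-projected_data_tb // usable_capacity_per_node_tb))  # ceiling division
additional_nodes = max(0, nodes_required - current_nodes)

print(f"Projected data in {months_ahead} months: {projected_data_tb:.1f} TB")
print(f"Servers required: {nodes_required} (add {additional_nodes})")
```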
As analytics scales, data governance becomes increasingly important, a situation that is no different with Big Data than it is with any other large-scale network operation. The same can be said for information governance practices, which are just as important today with Big Data as they were yesterday with data warehousing. A critical caveat is to remember that information is a corporate asset and should be treated as such.
There are many potential reasons that Big Data analytics projects fall short of their goals and expectations, and in some cases it is better to know what not to do rather than knowing what to do. This leads us to the idea of identifying “worst practices,” so that you can avoid making the same mistakes that others have made in the past. It is better to learn from the errors of others than to make your own. Some worst practices to look out for are the following:
It is said that every journey begins with the first step, and the journey toward creating an effective Big Data analytics program holds true to that axiom. However, it takes more than one step to reach a destination of success. Organizations embarking on Big Data analytics programs require a strong implementation plan to make sure that the analytics process works for them. Choosing the technology that will be used is only half the battle when preparing for a Big Data initiative. Once a company identifies the right database software and analytics tools and begins to put the technology infrastructure in place, it is ready to move to the next level and develop a real strategy for success.
The importance of effective project management processes to creating a successful Big Data analytics program also cannot be overstated. The following tips offer advice on steps that businesses should take to help ensure a smooth deployment:
There is no one way to ensure Big Data analytics success. But following a set of frameworks and best practices, including the tips outlined here, can help organizations keep their Big Data initiatives on track. The technical details of a Big Data installation are intensive and need to be considered in depth. That isn’t enough, though: both the technical aspects and the business factors must be taken into account to make sure that organizations get the desired outcomes from their Big Data analytics investments.
There are people who believe that anomalies are something best ignored when processing Big Data, and they have created sophisticated scrubbing programs to discard what is considered an anomaly. That can be a sound practice when working with particular types of data, since anomalies can color the results. However, there are times when anomalies prove to be more valuable than the rest of the data in a particular context. The lesson to be learned is “Don’t discard data without further analysis.”
Take, for example, the world of high-end network security, where encryption is the norm, access is logged, and data are examined in real time. Here the ability to identify uncharacteristic movements of data is of the utmost importance; in other words, security problems are detected by looking for anomalies. That idea can be applied to almost any discipline, ranging from financial auditing to scientific inquiry to detecting cyber-threats, all critical services that are based on identifying something out of the ordinary.
In the world of Big Data, that “something out of the ordinary” may constitute a single log entry out of millions, which, on its own, may not be worth noticing. But when analyzed against traffic, access, and data flow, that single entry may have untold value and can be a key piece of forensic information. With computer security, seeking anomalies makes a great deal of sense. Nevertheless, many data scientists are reluctant to put much stock in anomalies for other tasks.
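As a minimal sketch of that kind of check, the following flags time intervals whose log-entry counts deviate sharply from a historical baseline; the counts and the z-score threshold are illustrative assumptions, not a production detector.

```python
# Flag intervals whose log-entry counts deviate sharply from the historical baseline.
# A simple z-score is used purely for illustration; real security analytics would
# correlate many more signals (traffic, access, data flow).
from statistics import mean, stdev

# Hypothetical per-hour log-entry counts: a quiet baseline, then new observations.
baseline_counts = [120, 118, 125, 119, 122, 121, 117, 123, 120, 124]
new_counts = {"hour 11": 121, "hour 12": 480, "hour 13": 119}

mu = mean(baseline_counts)
sigma = stdev(baseline_counts)

for label, count in new_counts.items():
    z = (count - mu) / sigma
    if abs(z) > 3:
        print(f"{label}: count={count}, z={z:.1f}  <- anomaly worth forensic review")
    else:
        print(f"{label}: count={count}, within normal range")
```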
Anomalies can actually be the harbingers of trends. Take online shopping, for example, in which many buying trends start off as isolated anomalies created by early adopters of products; these can then grow into a fad and ultimately turn a product into a top seller. That type of information, early trending, can make or break a sales cycle. Nowhere is this more true than on Wall Street, where anomalous stock trades can set off all sorts of alarms and create frenzies, all driven by the detection of a few small events uncovered in a pile of Big Data.
Given a large enough data set, anomalies commonly appear. One of the more interesting aspects of anomaly value comes from the realm of social networking, where posts, tweets, and updates are thrown into Big Data and then analyzed. Here businesses are looking at information such as customer sentiment, using a horizontal approach to compare anomalies across many different types of time series, the idea being that different dimensions could share similar anomaly patterns.
Retail shopping is a good example of that. A group of people may do grocery shopping relatively consistently throughout the year at Safeway, Trader Joe’s, or Whole Foods but then do holiday shopping at Best Buy and Toys“R”Us, leading to the expected year-end increases. A company like Apple might see a level pattern for most of the year, but when a new iPhone is released, the customers dutifully line up along with the rest of the world around that beautiful structure of glass and steel.
This information is the proverbial needle in a haystack that needs to be surfaced above other data elements. The concept is that for about 300 days of the year, the Apple store is a typical electronics retailer in terms of temporal buying patterns (if not profit margins). However, that all changes when an anomalous event (such as a new product launch) translates into two or three annual blockbuster events, and those events become the differentiating factor between an Apple store and other electronics retailers. Common trends among industries can be used to discount the expected seasonal variations in order to focus on the truly unique occurrences.
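A rough sketch of that discounting step appears below: a store's daily sales are compared against a shared seasonal index so that only retailer-specific spikes stand out. All of the numbers are invented for illustration.

```python
# Discount the expected seasonal pattern (shared across the industry) to expose
# events that are unique to one retailer. All series here are illustrative.
industry_seasonal_index = [1.0, 1.0, 1.1, 1.0, 1.0, 1.2, 2.0, 2.2]  # shared holiday lift
store_daily_sales       = [100, 98, 112, 310, 101, 118, 205, 224]   # one retailer's sales
baseline_daily_sales = 100  # the retailer's typical off-season day

for day, (sales, season) in enumerate(zip(store_daily_sales, industry_seasonal_index)):
    expected = baseline_daily_sales * season
    if sales / expected > 1.5:  # well above what seasonality alone explains
        print(f"Day {day}: sales={sales}, expected~{expected:.0f} -> retailer-specific event")
```

In this toy series, the day-3 spike is flagged (a launch-like event), while the year-end increase is absorbed by the shared seasonal index.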
For Twitter data, there are often big disparities among dimensions. Hashtags are typically associated with transient or irregular phenomena, as opposed to, for instance, the massive regularity of tweets emanating from a big country. Because of this greater degree of within-dimension similarity, we should treat the dimensions separately. The dimensional application of algorithms can identify situations in which hashtags and user names, rather than locations and time zones, dominate the list of anomalies, indicating that there is very little similarity among the items in each of these groups.
Given so many anomalies, making sense of them becomes a difficult task, raising questions such as the following: What could have caused the massive upsurges in the otherwise regular traffic? What domains are involved? Are URL shorteners and Twitter live video streaming services involved? Sorting by the magnitude of the anomaly yields a cursory and excessively restricted view; correlations of the anomalies often exist within and between dimensions. There can be a great deal of synergy among the anomalies, but it may take some sort of clustering procedure to uncover it.
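One way to attempt that kind of clustering is sketched below using scikit-learn's KMeans on a small, made-up table of anomaly features (magnitude, hour of day, and dimension). The features, cluster count, and data are assumptions for illustration only.

```python
# Group detected anomalies so that related ones (e.g., a hashtag spike and the
# URL-shortener traffic that accompanies it) surface together.
# Requires numpy and scikit-learn; in practice the features would be normalized
# and the cluster count chosen empirically.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one detected anomaly: [magnitude (z-score), hour of day, dimension id]
# dimension id: 0 = hashtag, 1 = user name, 2 = domain / URL shortener
anomalies = np.array([
    [12.0, 20, 0],
    [11.5, 20, 2],
    [10.8, 21, 0],
    [ 4.2,  3, 1],
    [ 3.9,  4, 1],
    [12.3, 20, 2],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(anomalies)
for row, label in zip(anomalies, labels):
    print(f"cluster {label}: magnitude={row[0]:.1f}, hour={int(row[1])}, dimension={int(row[2])}")
```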
In the past, Big Data analytics usually involved a compromise between performance and accuracy. This situation was caused by the fact that technology had to deal with large data sets that often required hours or days to analyze and run the appropriate algorithms on. Hadoop solved some of these problems by using clustered processing, and other technologies have been developed that have boosted performance. Yet real-time analytics has been mostly a dream for the typical organization, which has been constrained by budgetary limits for storage and processing power—two elements that Big Data devours at prodigious rates.
These constraints created a situation in which if you needed answers fast, you would be forced to look at smaller data sets, which could lead to less accurate results. Accuracy, in contrast, often required the opposite approach: working with larger data sets and taking more processing time.
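That trade-off can be made concrete with a quick sampling experiment: an estimate computed from a small random sample is available sooner but drifts further from the full-data answer. The sketch below uses synthetic data and is not a benchmark.

```python
# Illustrate the speed-versus-accuracy trade-off: smaller samples answer faster
# but drift further from the "true" full-data result. Purely synthetic example.
import random

random.seed(42)
full_data = [random.gauss(100, 25) for _ in range(1_000_000)]
true_mean = sum(full_data) / len(full_data)

for sample_size in (1_000, 10_000, 100_000, len(full_data)):
    sample = random.sample(full_data, sample_size)
    estimate = sum(sample) / sample_size
    error = abs(estimate - true_mean)
    print(f"sample={sample_size:>9,}  estimate={estimate:7.2f}  error={error:.3f}")
```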
As technology and innovation evolve, so do the available options. The industry is addressing the speed-versus-accuracy problem with in-memory processing technologies, in which data are processed in volatile memory instead of directly on disk. Data sets are loaded into a high-speed cache and the algorithms are applied there, reducing the input and output typically needed to read from and write to physical disk drives.
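A minimal sketch of that pattern, assuming a PySpark environment with hypothetical paths and column names: the data set is cached in memory once, and subsequent analytic passes run against the cached copy instead of rereading from disk.

```python
# Sketch: load a data set into the cluster's memory once, then run several
# analytic passes against the cached copy instead of re-reading from disk.
# The path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("in-memory-analytics-sketch").getOrCreate()

transactions = spark.read.parquet("hdfs:///data/transactions.parquet").cache()
transactions.count()  # materialize the cache with a first pass over the data

# Subsequent algorithms hit memory, not disk.
by_region = transactions.groupBy("region").agg(F.sum("amount").alias("revenue"))
by_product = transactions.groupBy("product_id").agg(F.avg("amount").alias("avg_ticket"))

by_region.show(5)
by_product.show(5)
```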
Organizations are realizing the value of analyzed data and are seeking ways to increase that value even further. For many, the path to more value comes in the form of faster processing. Discovering trends and applying algorithms to process information takes on additional value if that analysis can deliver real-time results.
However, the latency of disk-based clusters and wide area network connections makes it difficult to obtain instantaneous results from BI solutions. The question then is whether real-time processing can deliver enough value to offset the additional expenses of faster technologies. To answer this, one must determine what the ultimate goal of real-time processing is. Is it to speed up results for a particular business process? Is it to meet the needs of a retail transaction? Is it to gain a competitive edge?
The reasons can be many, yet the value gained is still dictated by the price feasibility of faster processing technologies. That is where in-memory processing comes into play. However, there are many other factors that will drive the move toward in-memory processing. For example, a study by The Economist estimated that humans created about 150 exabytes of information in the year 2005. Although that may sound like an enormous amount, it pales in comparison to the more than 1,200 exabytes created in 2011.
Furthermore, the research firm IDC (International Data Corporation) estimates that digital content doubles every 18 months. Further complicating the processing of data is the related growth of unstructured data. In fact, research outlet Gartner projects that as much as 80 percent of enterprise data will take on the form of unstructured elements, spanning traditional and nontraditional sources.
The type of data, the amount of data, and the expediency of accessing the data all influence the decision of whether to use in-memory processing. Nevertheless, these factors might not hold back the coming tide of advanced in-memory processing solutions simply because of the value that in-memory processing brings to businesses.
To understand the real-world advantages of in-memory processing, you have to look at how Big Data has been dealt with to date and understand the current physical limits of computing, which are dictated by the speed of accessing data from relational databases, processing instructions, and all of the other elements required to process large data sets.
Using disk-based processing meant that complex calculations that involved multiple data sets or algorithmic search processing could not happen in real time. Data scientists would have to wait a few hours to a few days for meaningful results—not the best solution for fast business processes and decisions.
Today businesses are demanding faster results that can drive quicker decisions, along with tools that help organizations access, analyze, govern, and share information. All of this brings increasing value to Big Data.
The use of in-memory technology brings that expediency to analytics, ultimately increasing the value, which is further accentuated by the falling prices of the technology. The availability and capacity per dollar of system memory have increased in the last few years, prompting a rethinking of how large amounts of data can be stored and acted upon.
Falling prices and increased capacity have created an environment where enterprises can now store a primary database in silicon-based main memory, resulting in an exponential improvement in performance and enabling the development of completely new applications. Physical hard drives are no longer the limiting element for expediency in processing.
When business decision makers are provided with information and analytics instantaneously, new insights can be developed and business processes executed in ways never thought possible. In-memory processing signals a significant paradigm shift for IT operations dealing with BI and business analytics as they apply to large data sets.
In-memory processing is poised to create a new era in business management in which managers can base their decisions on real-time analyses of complex business data. The primary advantages are as follows:
In-memory processing offers these advantages and many others by shifting the analytics process from a cluster of hard drives and independent CPUs to a single comprehensive database that can handle all the day-to-day transactions and updates, as well as analytical requests, in real time.
In-memory computing technology allows for the processing of massive quantities of transactional data in the main memory of the server, thereby providing immediate results from the analysis of these transactions.
Since in-memory technology allows data to be accessed directly from memory, query results come back much more quickly than they would from a traditional disk-based warehouse. The time it takes to update the database is also significantly reduced, and the system can handle more queries at a time.
With this vast improvement in process speed, query quality, and business insight, in-memory database management systems promise performance that is 10 to 20 times faster than traditional disk-based models.
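The effect is easy to demonstrate on a small scale with SQLite, which can run either against a file on disk or entirely in memory. The sketch below times the same load-and-query workload both ways; the figures it prints depend entirely on the machine and should not be read as the 10-to-20-times claim above.

```python
# Compare the same load-and-query workload against a disk-backed and an in-memory
# SQLite database. A toy illustration only; printed timings depend on the machine.
import os
import random
import sqlite3
import time

def build_and_query(conn):
    """Create a table, load rows, and run an aggregate query on the given connection."""
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [(random.choice("NSEW"), random.random() * 100) for _ in range(200_000)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

random.seed(0)
if os.path.exists("sales_demo.db"):
    os.remove("sales_demo.db")  # start from a clean disk file for the comparison

for label, conn in (("disk-backed", sqlite3.connect("sales_demo.db")),
                    ("in-memory  ", sqlite3.connect(":memory:"))):
    start = time.perf_counter()
    build_and_query(conn)
    print(f"{label} load + query: {time.perf_counter() - start:.3f} s")
    conn.close()
```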
The elements of in-memory computing are not new, but they have now been developed to a point where common adoption is possible. Recent improvements in hardware economics and innovations in software have now made it possible for massive amounts of data to be sifted, correlated, and updated in seconds with in-memory technology. Technological advances in main memory, multicore processing, and data management have combined to deliver dramatic increases in performance.
In-memory technology promises impressive benefits in many areas. The most significant are cost savings, enhanced efficiency, and greater immediate visibility of a sort that can enable improved decision making.
Businesses of all sizes and across all industries can benefit from the cost savings obtainable through in-memory technology. Database management currently accounts for more than 25 percent of most companies’ IT budgets. Since in-memory databases use hardware systems that require far less power than traditional database management systems, they dramatically reduce hardware and maintenance costs.
In-memory databases also reduce the burden on a company’s overall IT landscape, freeing up resources previously devoted to responding to requests for reports. And since in-memory solutions are based on proven mature technology, the implementations are nondisruptive, allowing companies to return to operations quickly and easily.
Any company with operations that depend on frequent data updates will be able to run more efficiently with in-memory technology. The conversion to in-memory technology allows an entire technological layer to be removed from a company’s IT architecture, reducing the complexity and infrastructure that traditional systems require. This reduced complexity allows data to be retrieved nearly instantaneously, making all of the teams in the business more efficient.
In-memory computing allows any business user to easily carve out subsets of BI for convenient departmental usage. Work groups can operate autonomously without affecting the workload imposed on a central data warehouse. And, perhaps most important, business users no longer have to call for IT support to gain relevant insight into business data.
These performance gains also allow business users on the road to retrieve more useful information via their mobile devices, an ability that is increasingly important as more businesses incorporate mobile technologies into their operations.
With that in mind, it becomes easy to see how in-memory technology allows organizations to compile a comprehensive overview of their business data, instead of being limited to subsets of data that have been compartmentalized in a data warehouse.
With those improvements to database visibility, enterprises are able to shift from after-event analysis (reactive) to real-time decision making (proactive) and then create business models that are predictive rather than response based. More value can be realized by combining easy-to-use analytic solutions from the start with the analytics platform. This allows anyone in the organization to build queries and dashboards with very little expertise, which in turn has the potential to create a pool of content experts who, without external support, can become more proactive in their actions.
In-memory technology further benefits enterprises because it allows for greater specificity of information, so that the data elements are personalized to both the customer’s and the business user’s individual needs. That allows a particular department or line of business to serve its own specific needs, with results that can trickle up or down the management chain, affecting account executives, supply chain management, and financial operations.
Customer teams can combine different sets of data quickly and easily to analyze a customer’s past and current business conditions using in-memory technology from almost any location, ranging from the office to the road, on their mobile devices. This allows business users to interact directly with customers using the most up-to-date information; it creates a collaborative situation in which business users can interact with the data directly. Business users can experiment with the data in real time to create more insightful sales and marketing campaigns. Sales teams have instant access to the information they need, leading to an entirely new level of customer insight that can maximize revenue growth by enabling more powerful up-selling and cross-selling.
With traditional disk-based systems, data are usually processed overnight, which may result in businesses being late to react to important supply alerts. In-memory technology can eliminate that problem by giving businesses full visibility of their supply-and-demand chains on a second-by-second basis. Businesses are able to gain insight in real time, allowing them to react to changing business conditions. For example, businesses may be able to create alerts, such as an early warning to restock a specific product, and can respond accordingly.
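A minimal sketch of such an early-warning restock alert is shown below; the inventory feed, product identifiers, and reorder thresholds are all hypothetical.

```python
# Emit a restock alert as soon as on-hand inventory or projected hours of cover
# falls below its threshold. Feed, SKUs, and thresholds are hypothetical.
reorder_points = {"SKU-1001": 50, "SKU-2002": 200, "SKU-3003": 75}

inventory_feed = [  # (sku, units_on_hand, units_sold_last_hour)
    ("SKU-1001", 240, 12),
    ("SKU-2002", 210, 95),
    ("SKU-3003", 60, 4),
]

for sku, on_hand, hourly_sales in inventory_feed:
    hours_of_cover = on_hand / hourly_sales if hourly_sales else float("inf")
    if on_hand < reorder_points[sku] or hours_of_cover < 4:
        print(f"ALERT: restock {sku} (on hand={on_hand}, ~{hours_of_cover:.1f}h of cover)")
```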
Financial controllers face increasing challenges brought on by increased data volumes, slow data processing, delayed analytics, and slow data-response times. These challenges can limit the controllers’ analysis time frames to several days rather than the more useful months or quarters. This can lead to a variety of delays, particularly at the closing of financial periods. However, in-memory technology, large-volume data analysis, and a flexible modeling environment can result in faster-closing financial quarters and better visibility of detailed finance data for extended periods.
In-memory technology has the potential to help businesses in any industry operate more efficiently, from consumer products and retailing to manufacturing and financial services. Consumer products companies can use in-memory technology to manage their suppliers, track and trace products, manage promotions, provide support in complying with Environmental Protection Agency standards, and perform analyses on defective and under-warranty products.
Retail companies can manage store operations in multiple locations, conduct point-of-sale analytics, perform multichannel pricing analyses, and track damaged, spoiled, and returned products. Manufacturing organizations can use in-memory technology to ensure operational performance management, conduct analytics on production and maintenance, and perform real-time asset utilization studies. Financial services companies can conduct hedge fund trading analyses, such as managing client exposures to currencies, equities, derivatives, and other instruments. Using information accessed from in-memory technology, they can conduct real-time systematic risk management and reporting based on market trading exposure.
As the popularity of Big Data analytics grows, in-memory processing is going to become the mainstay for many businesses looking for a competitive edge.