Chapter 6
The Nuts and Bolts of Big Data
Assembling a Big Data solution is a bit like putting together an erector set. Various pieces and elements must be put together in the proper fashion to make sure everything works properly, and there are almost endless configurations that can be built from the components at hand.
With Big Data, the components include platform pieces, servers, virtualization solutions, storage arrays, applications, sensors, and routing equipment. The right pieces must be picked and integrated in a fashion that offers the best performance, high efficiency, affordability, ease of management and use, and scalability.
THE STORAGE DILEMMA
Big Data consists of data sets that are too large to be acquired, handled, analyzed, or stored in an appropriate time frame using the traditional infrastructures. Big is a term relative to the size of the organization and, more important, to the scope of the IT infrastructure that’s in place. The scale of Big Data directly affects the storage platform that must be put in place, and those deploying storage solutions have to understand that Big Data uses storage resources differently than the typical enterprise application does.
These factors can make provisioning storage a complex endeavor, especially when one considers that Big Data also includes analysis, driven by the expectation that there is value in all of the information a business accumulates and that there is a way to draw that value out.
Originally driven by the idea that storage capacity is inexpensive and constantly dropping in price, businesses have been compelled to save more data, in the hope that business intelligence (BI) can leverage the mountains of new data created every day. Organizations are also retaining data that have already been analyzed, which can potentially be used for tracking trends against future data collections.
Aside from the ability to store more data than ever before, businesses also have access to more types of data. These data sources include Internet transactions, social networking activity, automated sensors, mobile devices, scientific instrumentation, voice over Internet protocol, and video elements. In addition to creating static data points, transactions add velocity to this data growth; the extraordinary growth of social media, for example, is constantly generating new transactions and records. But the availability of ever-expanding data sets doesn’t guarantee success in the search for business value.
As data sets continue to grow with both structured and unstructured data and data analysis becomes more diverse, traditional enterprise storage system designs are becoming less able to meet the needs of Big Data. This situation has driven storage vendors to design new storage platforms that incorporate block- and file-based systems to meet the needs of Big Data and associated analytics.
Meeting the challenges posed by Big Data means focusing on some key storage ideologies and understanding how those storage design elements interact with Big Data demands, including the following:
- Capacity. Big Data can mean petabytes of data. Big Data storage systems must therefore be able to quickly and easily change scale to meet the growth of data collections. These storage systems will need to add capacity in modules or arrays that are transparent to users, without taking systems down. Most Big Data environments are turning to scale-out storage (the ability to increase storage performance as capacity increases) technologies to meet that criterion. The clustered architecture of scale-out storage solutions features nodes of storage capacity with embedded processing power and connectivity that can grow seamlessly, avoiding the silos of storage that traditional systems can create.
Big Data also means many large and small files. Managing the accumulation of metadata for file systems holding many large and small files can reduce scalability and impact performance, a situation that can be a problem for traditional network-attached storage systems. Object-based storage architectures, in contrast, can allow Big Data storage systems to expand file counts into the billions without suffering the overhead problems that traditional file systems encounter. Object-based storage systems can also scale geographically, enabling large infrastructures to be spread across multiple locations.
- Security. Many types of data carry security standards that are driven by compliance laws and regulations. The data may be financial, medical, or government intelligence and may be part of an analytics set yet must still be protected. While those data may not be different from what current IT managers must accommodate, Big Data analytics may need to cross-reference data that have not been commingled in the past, and this can create some new security considerations. In turn, IT managers should consider the security footing of the data stored in an array used for Big Data analytics and the people who will access the data.
- Latency. In many cases, Big Data includes a real-time component, especially in use scenarios involving Web transactions or financial transactions. An example is tailoring Web advertising to each user’s browsing history, which demands real-time analytics to function. Storage systems must be able to grow rapidly and still maintain performance, because latency produces “stale” data. That is another case in which scale-out architectures solve problems: the technology enables the cluster of storage nodes to increase in processing power and connectivity as they grow in capacity. Object-based storage systems can also parallelize data streams, further improving throughput.
Most Big Data environments need to provide high input-output operations per second (IOPS) performance, especially those used in high-performance computing environments. Virtualization of server resources, which is a common methodology used to expand compute resources without the purchase of new hardware, drives high IOPS requirements, just as it does in traditional IT environments. Those high IOPS performance requirements can be met with solid-state storage devices, which can be implemented in many different formats, ranging from simple server-based cache to all-flash scalable storage systems.
- Access. As businesses get a better understanding of the potential of Big Data analysis, the need to compare different data sets increases, and with it, more people are brought into the data-sharing loop. The quest to create business value drives businesses to look at more ways to cross-reference different data objects from various platforms. Storage infrastructures that include global file systems can address this issue, since they allow multiple users on multiple hosts to access files from many different back-end storage systems in multiple locations.
- Flexibility. Big Data storage infrastructures can grow very large, and that growth should be treated as part of the design challenge: care must be taken so that the storage infrastructure can grow and evolve along with the analytics component of the mission. Big Data storage infrastructures also need to account for data migration challenges, at least during the start-up phase. Ideally, data migration will eventually become unnecessary in the world of Big Data, simply because the data are distributed across multiple locations.
- Persistence. Big Data applications often involve regulatory compliance requirements, which dictate that data must be saved for years or decades. Examples are medical information, which is often saved for the life of the patient, and financial information, which is typically saved for seven years. However, Big Data users are often saving data longer because they are part of a historical record or are used for time-based analysis. The requirement for longevity means that storage manufacturers need to include ongoing integrity checks and other long-term reliability features as well as address the need for data-in-place upgrades.
- Cost. Big Data can be expensive. Given the scale at which many organizations are operating their Big Data environments, cost containment is imperative. That means more efficiency as well as less expensive components. Storage deduplication has already entered the primary storage market and, depending on the data types involved, could bring some value for Big Data storage systems. The ability to reduce capacity consumption even by a few percentage points provides a significant return on investment as data sets grow. Other Big Data storage technologies that can improve efficiencies are thin provisioning, snapshots, and cloning.
- Thin provisioning operates by allocating disk storage space in a flexible manner among multiple users based on the minimum space required by each user at any given time.
- Snapshots streamline access to stored data and can speed up the process of data recovery. There are two main types of storage snapshot: copy-on-write (or low-capacity) snapshots and split-mirror snapshots. Utilities are available that can automatically generate either type; a minimal copy-on-write sketch follows this list.
- Disk cloning is copying the contents of a computer’s hard drive. The contents are typically saved as a disk image file and transferred to a storage medium, which could be another computer’s hard drive or removable media such as a DVD or a USB drive.
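To make the copy-on-write idea concrete, here is a minimal, hypothetical sketch in Python. It is not any vendor's implementation; it simply shows that a snapshot only stores the original version of blocks that change after the snapshot is taken, which is why copy-on-write snapshots consume so little capacity.

```python
# Minimal copy-on-write snapshot sketch (illustrative only, not a real storage stack).
class Volume:
    def __init__(self, blocks):
        self.blocks = dict(blocks)      # block_id -> data
        self.snapshots = []

    def snapshot(self):
        snap = {}                       # holds original data only for blocks changed later
        self.snapshots.append(snap)
        return snap

    def write(self, block_id, data):
        # Copy-on-write: preserve the old block in every open snapshot before overwriting.
        for snap in self.snapshots:
            snap.setdefault(block_id, self.blocks.get(block_id))
        self.blocks[block_id] = data

    def read_snapshot(self, snap, block_id):
        # A block absent from the snapshot map has not changed since the snapshot was taken.
        return snap.get(block_id, self.blocks.get(block_id))


vol = Volume({0: "alpha", 1: "beta"})
snap = vol.snapshot()
vol.write(1, "beta-v2")
print(vol.read_snapshot(snap, 1))       # -> "beta" (the pre-snapshot contents)
print(vol.blocks[1])                    # -> "beta-v2" (the live volume)
```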
Data storage systems have evolved to include an archive component, which is important for organizations that are dealing with historical trends or long-term retention requirements. From a capacity and dollar standpoint, tape is still the most economical storage medium. Today, systems that support multiterabyte cartridges are becoming the de facto standard in many of these environments.
The biggest effect on cost containment can be traced to the use of commodity hardware. This is a good thing, since the majority of Big Data infrastructures won’t be able to rely on the big iron enterprise systems of the past. Most of the first and largest Big Data users have built their own “white-box” systems on-site, which leverage a commodity-oriented, cost-saving strategy.
These examples and others have driven the trend of cost containment, and more storage products are arriving on the market that are software based and can be installed on existing systems or on common, off-the-shelf hardware. In addition, many of the same vendors are selling their software technologies as commodity appliances or partnering with hardware manufacturers to produce similar offerings. That all adds up to cost-saving strategies that bring Big Data within the reach of smaller and smaller businesses.
- Application awareness. Initially, Big Data implementations were designed around application-specific infrastructures, such as custom systems developed for government projects or the white-box systems engineered by large Internet service companies. Application awareness is becoming common in mainstream storage systems and should improve efficiency or performance, which fits right into the needs of a Big Data environment.
- Small and medium business. The value of Big Data and the associated analytics is trickling down to smaller organizations, which creates another challenge for those building Big Data storage infrastructures: creating smaller initial implementations that can scale yet fit into the budgets of smaller organizations.
BUILDING A PLATFORM
Like any application platform, a Big Data application platform must support the fundamentals expected of an application platform, including scalability, security, availability, and continuity.
Yet Big Data application platforms are unique; they need to be able to handle massive amounts of data across multiple data stores and initiate concurrent processing to save time. This means that a Big Data platform should include built-in support for technologies such as MapReduce, integration with external Not Only SQL (NoSQL) databases, parallel processing capabilities, and distributed data services. It should also make use of these new integration targets, at least from a development perspective.
Consequently, there are specific characteristics and features that a Big Data platform should offer to work effectively with Big Data analytics processes:
- Support for batch and real-time analytics. Most of the existing platforms for processing data were designed for handling transactional Web applications and have little support for business analytics applications. That situation has driven Hadoop to become the de facto standard for handling batch processing. However, real-time analytics is altogether different, requiring something more than Hadoop can offer: an event-processing framework needs to be in place as well. Fortunately, several technologies and processing alternatives exist on the market that can bring real-time analytics into Big Data platforms, and many major vendors, such as Oracle, HP, and IBM, offer the hardware and software to bring real-time processing to the forefront. However, for smaller businesses that may not be a viable option because of the cost; for now, real-time processing remains a function that smaller businesses consume as a service via the cloud. A minimal sketch of the event-processing idea follows.
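As a hedged illustration of the kind of logic an event-processing framework provides, the following Python sketch keeps a rolling per-key count of events inside a short time window. Real stream-processing platforms add distribution, fault tolerance, and back-pressure on top of this basic idea; the key and threshold here are hypothetical.

```python
# Toy event-processing sketch: count events per key over a sliding time window.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0
events = defaultdict(deque)   # key -> deque of event timestamps

def record(key, timestamp=None):
    """Record one event and return how many arrived for this key in the window."""
    now = timestamp if timestamp is not None else time.time()
    q = events[key]
    q.append(now)
    # Evict timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q)

# Example: flag a user who generates an unusual burst of activity.
for _ in range(5):
    count = record("user-42")
if count > 3:
    print("user-42 is unusually active:", count, "events in the last minute")
```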
- Alternative approaches. Transforming Big Data application development into something more mainstream may be the best way to leverage what Big Data offers. This means creating a built-in stack that integrates NoSQL databases, MapReduce frameworks such as Hadoop, and distributed processing. Development should also account for the transaction-processing and event-processing semantics that come with handling the real-time analytics that fit into the Big Data world.
Creating Big Data applications is very different from writing a typical “CRUD application” (create, retrieve, update, delete) against a centralized relational database. The primary difference lies in the design of the data domain model, as well as the API and query semantics used to access and process the data. Mapping is an effective approach wherever there is an impedance mismatch between different data models and sources, hence the success of MapReduce; an analogous example is the use of object-relational mapping tools such as Hibernate to bridge the mismatch between objects and relational tables. A minimal MapReduce-style sketch appears below.
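To show the MapReduce style of thinking, here is a minimal, single-machine word-count sketch in Python. It only imitates the map and reduce phases that a framework such as Hadoop would run in parallel across many nodes; the sample documents are invented for illustration.

```python
# Minimal MapReduce-style word count, run locally to illustrate the programming model.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs -- the classic "map" output.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # Group by key and sum the counts -- the classic "reduce" step.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["big data needs big storage", "big data needs analytics"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))   # e.g., {'big': 3, 'data': 2, 'needs': 2, ...}
```

In a real cluster, the map phase runs where the data live and the framework shuffles the intermediate pairs to the reducers; the programming model, however, is exactly this pair of functions.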
- Available Big Data mapping tools. Batch-processing projects are being serviced with frameworks such as Hive, which provides an SQL-like facade for handling complex batch processing with Hadoop. However, other tools are starting to show promise. An example is JPA, which provides a more standardized JEE abstraction that fits into real-time Big Data applications. Google App Engine uses DataNucleus along with Bigtable to achieve the same goal, while GigaSpaces uses OpenJPA’s JPA abstraction combined with an in-memory data grid. Red Hat takes a different approach and leverages Hibernate OGM (object/grid mapping) to map Big Data.
- Big Data abstraction tools. There are several choices available for abstracting data, ranging from open source tools to commercial distributions of specialized products. One to pay attention to is Spring Data from SpringSource, a high-level abstraction tool that can map data stores of many kinds into one common abstraction through annotations and a plug-in approach.
Of course, one of the primary capabilities offered by abstraction tools is the ability to normalize and interpret the data into a uniform structure that can then be processed further. The key here is to make sure that whatever abstraction technology is employed deals with current and future data sets efficiently.
- Business logic. A critical component of the Big Data analytics process is logic, especially business logic, which is responsible for processing the data. Currently, MapReduce reigns supreme in the realm of Big Data business logic. MapReduce was designed to handle the processing of massive amounts of data by moving the processing logic to the data and distributing that logic in parallel to all nodes. Another factor that adds to the appeal of MapReduce is that developing parallel-processing code by hand is very complex; the framework hides much of that complexity.
When designing a custom Big Data application platform, it is critical to make MapReduce and parallel execution simple. That can be accomplished by mapping the semantics onto existing programming models. An example is extending an existing model, such as SessionBean, to support the needed semantics, which makes parallel processing look like a standard invocation of a single job. A hedged sketch of the idea appears below.
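One way to picture "parallel processing that looks like a single-job invocation" is to hide the fan-out behind an ordinary function call. The SessionBean example above is Java-specific, so the following Python sketch, using the standard-library process pool, is only an illustration of the idea; the chunking and the per-chunk logic are placeholders.

```python
# Sketch: hide parallel fan-out behind a single, ordinary-looking function call.
from concurrent.futures import ProcessPoolExecutor

def analyze_chunk(chunk):
    # Placeholder business logic applied to one partition of the data.
    return sum(len(record) for record in chunk)

def run_job(chunks):
    """Looks like a single job invocation; internally runs the chunks in parallel."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(analyze_chunk, chunks))

if __name__ == "__main__":
    partitions = [["alpha", "beta"], ["gamma"], ["delta", "epsilon"]]
    print(run_job(partitions))   # one call, parallel execution underneath
```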
- Moving away from SQL. SQL is a great query language, but it is limited, at least in the realm of Big Data. The problem lies in the fact that SQL relies on a schema to work properly, and Big Data, especially when it is unstructured, does not fit schema-based queries well. It is the dynamic data structure of Big Data that confounds SQL’s schema-based processing. Big Data platforms must therefore support schema-less semantics, which in turn means that the data mapping layer needs to be extended to support document semantics. Examples are MongoDB, Couchbase, Cassandra, and the GigaSpaces document API; a brief document-store sketch appears below. The key here is to make sure that Big Data application platforms support more relaxed versions of those semantics, with a focus on providing flexibility in consistency, scalability, and performance.
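The schema-less, document-oriented semantics described above can be sketched with MongoDB's Python driver. This assumes a MongoDB instance is reachable at the default local address, and the database, collection, and field names are hypothetical.

```python
# Hedged sketch of schema-less document semantics (assumes a local MongoDB instance).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]          # hypothetical database/collection names

# Documents with different shapes coexist in the same collection; no schema migration needed.
events.insert_one({"user": "u1", "action": "click", "page": "/home"})
events.insert_one({"user": "u2", "action": "purchase", "amount": 19.99, "currency": "USD"})

# Query by content rather than by a fixed relational schema.
for doc in events.find({"action": "purchase"}):
    print(doc["user"], doc.get("amount"))
```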
- In-memory processing. If the goal is to deliver the best performance and reduce latency, then one must consider using RAM-based devices and performing processing in memory. However, for that to work effectively, Big Data platforms need to provide seamless integration between RAM and disk-based devices, in which data written to RAM are synchronized to disk asynchronously. The platforms also need to provide common abstractions that give users the same data access API for both devices, making it easier to choose the right tool for the job without changing application code. A toy write-behind sketch appears below.
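Here is a hedged sketch of the RAM-plus-asynchronous-disk idea: reads and writes hit an in-memory dictionary immediately, and a background thread flushes the data to disk later. Real in-memory data grids add replication, durability guarantees, and failure handling that this toy omits; the file path is arbitrary.

```python
# Toy write-behind store: reads/writes hit RAM; a background thread syncs to disk asynchronously.
import json
import queue
import threading

class WriteBehindStore:
    def __init__(self, path):
        self.path = path
        self.memory = {}                   # the "RAM device"
        self.pending = queue.Queue()       # keys waiting to be flushed
        threading.Thread(target=self._flusher, daemon=True).start()

    def put(self, key, value):
        self.memory[key] = value           # fast, in-memory write
        self.pending.put(key)              # disk sync happens later, off the request path

    def get(self, key):
        return self.memory.get(key)        # same access API regardless of backing store

    def _flusher(self):
        while True:
            self.pending.get()             # wait for a dirty key
            with open(self.path, "w") as f:
                json.dump(self.memory, f)  # naive full flush, fine for a sketch

store = WriteBehindStore("/tmp/store.json")
store.put("session:42", {"user": "u1", "cart": ["sku-9"]})
print(store.get("session:42"))
```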
- Built-in support for event-driven data distribution. Big Data applications (and platforms) must also be able to work with event-driven processes. With Big Data, this means incorporating data awareness, which makes it easy to route messages based on data affinity and the content of the message. There must also be controls that allow fine-grained semantics for triggering events based on data operations (such as add, delete, and update) and on content, as with complex event processing; a minimal routing sketch appears below.
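A minimal, hypothetical sketch of content-based routing: handlers register for an operation (add, delete, update) plus a predicate over the message content, and the router triggers only the handlers whose conditions match. The field names and threshold are invented for illustration.

```python
# Toy content-based event router: fine-grained triggers on operation type and message content.
handlers = []   # list of (operation, predicate, callback)

def subscribe(operation, predicate, callback):
    handlers.append((operation, predicate, callback))

def publish(operation, message):
    for op, predicate, callback in handlers:
        if op == operation and predicate(message):
            callback(message)

# Trigger only on "add" events for high-value orders (hypothetical fields).
subscribe("add", lambda m: m.get("total", 0) > 1000,
          lambda m: print("route to fraud review:", m["order_id"]))

publish("add", {"order_id": "A-17", "total": 2500})      # matches, handler fires
publish("add", {"order_id": "A-18", "total": 40})        # content does not match, ignored
publish("update", {"order_id": "A-17", "total": 2500})   # wrong operation, ignored
```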
- Support for public, private, and hybrid clouds. Big Data applications consume large amounts of compute and storage resources. This has led to the use of the cloud and its elastic capabilities for running Big Data applications, which in turn can offer a more economical approach to processing Big Data jobs. To take advantage of those economics, Big Data application platforms must include built-in support for public, private, and hybrid clouds, with seamless transitions between the various cloud platforms through integration with the available frameworks. Examples include the jclouds library and cloud bursting, a hybrid model that uses cloud resources as spare capacity to handle peak load.
- Consistent management. The typical Big Data application stack incorporates several layers, including the database itself, the Web tier, the processing tier, the caching layer, the data synchronization and distribution layer, and the reporting tools. A major disadvantage for those managing Big Data applications is that each of those layers comes with its own management, provisioning, monitoring, and troubleshooting tools. Add to that the inherent complexity of Big Data applications, and effective management, along with the associated maintenance, becomes difficult.
With that in mind, it becomes critical to choose a Big Data application platform that integrates the management stack with the application stack. An integrated management capability is one of the best productivity elements that can be incorporated into a Big Data platform.
Building a Big Data platform is no easy chore, especially when one considers that there may be a multitude of right ways and wrong ways to do it, a situation further complicated by the plethora of tools, technologies, and methodologies available. There is a bright side, however: because Big Data is constantly evolving, flexibility will rule, whether building a custom platform or choosing one off the shelf.
BRINGING STRUCTURE TO UNSTRUCTURED DATA
In its native format, a large pile of unstructured data has little value; it is simply a burden on the typical enterprise, especially one that has not adopted Big Data practices to extract that value.
However, extracting value can be akin to finding a needle in a haystack, and if that haystack is spread across several farms and the needle is in pieces, it becomes even more difficult. One of the primary jobs of Big Data analytics is to piece that needle back together and organize the haystack into a single entity to speed up the search. That can be a tall order with unstructured data, a type of data that is growing in volume and size as well as complexity.
Unstructured (or uncatalogued) data can take many forms, such as historical photograph collections, audio clips, research notes, genealogy materials, and other riches hidden in various data libraries. The Big Data movement has driven methodologies to create dynamic and meaningful links among these currently unstructured information sources.
For the most part, that has resulted in the creation of metadata and methods to bring structure to unstructured data. Currently, two dominant technical and structural approaches have emerged: (1) a reliance on search technologies, and (2) a trend toward automated data categorization. Many data categorization techniques are being applied across the landscape, including taxonomies, semantics, natural language recognition, auto-categorization, “what’s related” functionality, data visualization, and personalization. The idea is to provide the information that is needed to process an analytics function.
The importance of integrating structured and unstructured data cannot be overstated in the world of Big Data analytics. A few enabling technical strategies make it possible to sort the wheat from the chaff. One is SQL-NoSQL integration. Those using MapReduce and other schema-less frameworks have been struggling to incorporate structured data and analytics coming from the relational database management system (RDBMS) side. However, integrating the relational and nonrelational paradigms provides the most powerful analytics by bringing together the best of both worlds.
There are several technologies that enable this integration; some of them take advantage of the processing power of MapReduce frameworks like Hadoop to perform data transformation in place, rather than doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the computing capabilities of engineered machines and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the overarching principle is real-time data integration: data changes, whether originating from a MapReduce job or from a transactional system, are reflected instantly in the data warehouse, so that downstream analytics have an accurate, timely view of reality. Others are turning to linked data and semantics, in which data sets are created using linking methodologies that focus on the semantics of the data.
This fits well into the broader notion of pointing at external sources from within a data set, which has been around for quite a long time. The ability to point to unstructured data (whether residing in the file system or in some external source) simply extends that capability: storing and processing XML and XQuery natively within an RDBMS makes it possible to combine different degrees of structure while searching and analyzing the underlying data.
Newer semantics technologies can take this further by providing a set of formalized XML-based standards for storage, querying, and manipulation of data. Since these technologies have been focused on the Web, many businesses have not associated the process with Big Data solutions.
Most NoSQL technologies fall into the categories of key-value stores, graph databases, or document databases; the semantic resource description framework (RDF) triple store offers an alternative. It is not relational in the traditional sense, but it still maintains relationships between data elements, including external ones, and does so in a flexible, extensible fashion.
A record in an RDF store is a triple, consisting of a subject, a predicate, and an object. That structure does not impose a relational schema on the data, which allows new elements to be added without structural modifications to the store. In addition, the underlying system can resolve references by inferring new triples from the existing records using a rule set, as sketched below. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while also offering a more expressive way to model data than a key-value store.
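To illustrate the triple model and rule-based inference without tying the example to any particular RDF product, here is a small Python sketch that stores (subject, predicate, object) triples and applies one transitive rule to derive new ones. The entities and predicate names are invented.

```python
# Toy triple store: (subject, predicate, object) records plus one simple inference rule.
triples = {
    ("acme", "subsidiary_of", "globex"),
    ("globex", "subsidiary_of", "initech"),
    ("acme", "located_in", "berlin"),
}

def infer_transitive(store, predicate):
    """Infer new triples: if (a, p, b) and (b, p, c) exist, then (a, p, c) follows."""
    inferred = set()
    for s, p, o in store:
        if p != predicate:
            continue
        for s2, p2, o2 in store:
            if p2 == predicate and s2 == o:
                inferred.add((s, predicate, o2))
    return inferred - store

new_facts = infer_transitive(triples, "subsidiary_of")
print(new_facts)   # {('acme', 'subsidiary_of', 'initech')} -- derived, never stored explicitly
```

Adding a new kind of fact is just adding another triple; no table alteration or join redesign is required, which is the flexibility the text describes.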
One of the most powerful aspects of semantic technology comes from the world of linguistics and natural language processing, also known as entity extraction. This is a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deep analytics by bringing these worlds closer together.
Another method that brings structure to the unstructured is text analytics, which is improving daily as scientists come up with new ways of making algorithms understand written text more accurately. Today’s algorithms can detect the names of people, organizations, and locations within seconds simply by analyzing the context in which words are used, as sketched below. The trend is toward recognition of further useful entities, such as product names, brands, events, and skills.
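As a hedged example of this kind of entity extraction, the following Python sketch uses the spaCy library, assuming it and its small English model are installed; the sentence is invented, and the labels and accuracy depend entirely on the model used.

```python
# Hedged sketch of named-entity extraction with spaCy.
# Assumes: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Maria Lopez joined Initech in Berlin last March to lead the analytics team."

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is the entity type the model assigned (e.g., PERSON, ORG, GPE, DATE).
    print(ent.text, "->", ent.label_)
```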
Entity relation extraction is another important tool: a relation that consistently connects two entities across many documents is valuable information in science and enterprise alike, and entity relation extraction surfaces that new knowledge in Big Data. Other unstructured-data tools detect sentiment in social data, integrate multiple languages, and apply text analytics to audio and video transcripts. The volume of video keeps growing, and transcripts are even more unstructured than written text because they lack punctuation.
PROCESSING POWER
Analyzing Big Data can take massive amounts of processing power. There is a simple relationship between data analytics and processing power: the larger the data set and the faster that results are needed, the more processing power it takes. However, processing Big Data analytics is not a simple matter of throwing the latest and fastest processor at the problem; it is more about the ideology behind grid-computing technologies.
Big Data involves more than just distributed processing technologies, like Hadoop. It is also about faster processors, wider bandwidth communications, and larger and cheaper storage to achieve the goal of making the data consumable. That in turn drives the idea of data visualization and interface technologies, which make the results of analysis consumable by humans, and that is where the raw processing power comes to bear for analytics.
Intuitiveness comes from proper analysis, and proper analysis requires the appropriate horsepower and infrastructure to mine an appropriate data set from huge piles of data. To that end, distributed processing platforms such as Hadoop and MapReduce are gaining favor over big iron in the realm of Big Data analytics.
Perhaps the simplest argument for pursuing a distributed infrastructure is the flexibility of scale, in which more commodity hardware can just be thrown at a particular analysis project to increase performance and speed results. That distributed ideology plays well into grid processing and cloud-based services, which can be employed as needed to process data sets.
The primary thing to remember about building a processing platform for Big Data is how the processing can scale. For example, many businesses start off small, with a few commodity PCs running a Hadoop-based platform, but as the amount of data and the number of available sources grow exponentially, the ability to process the data can fall behind, meaning that designs must incorporate a look-ahead methodology. That is where IT professionals will need to consider available and future technologies to scale processing to their needs.
Cloud-based solutions that offer elastic-type services are a decent way to future-proof a Big Data analytics platform, simply because of the ability of a cloud service to instantly scale to the loads placed upon it.
There is no simple answer to how to process Big Data with the technology choices available today. Nevertheless, major vendors are looking to make the choices easier by providing canned solutions that are based on appliance models, while others are building complete cloud-based Big Data solutions to meet the elastic needs of small and medium businesses looking to leverage Big Data.
CHOOSING AMONG IN-HOUSE, OUTSOURCED, OR HYBRID APPROACHES
The world of Big Data is filled with choices—so many that most IT professionals can become overwhelmed with options, technologies, and platforms. It is almost at the point at which Big Data analytics is required to choose among the various Big Data ideologies, platforms, and tools.
However, the question remains of where to start with Big Data. The answer can be found in how Big Data systems evolve and grow. In the past, working with Big Data always meant working at the scale of a dedicated data center. However, commodity hardware running platforms like Hadoop, steadily decreasing storage prices, and open source applications have changed that dynamic and lowered the initial cost of entry. These new dynamics allow smaller businesses to experiment with Big Data and then expand their platforms as successes are built.
Once a pilot project has been constructed using open source software with commodity hardware and storage devices, IT managers can measure how well the pilot platform meets their needs. Only after the processing needs and the volume of data increase can an IT manager make a sound decision on where to head with a Big Data analytics platform: developing one in-house or turning to the cloud.