This section discusses the collection and storage of data for use in analysis and response. Effective security analysis requires collecting data from widely disparate sources, each of which provides part of a picture about a particular event taking place on a network.
To understand the need for hybrid data sources, consider that most modern bots are general-purpose software systems. A single bot may use multiple techniques to infiltrate and attack other hosts on a network. These attacks may include buffer overflows, spreading across network shares, and simple password cracking. A bot attacking an SSH server with a password attempt may be recorded in that host's SSH logfile, providing concrete evidence of an attack but no information on anything else the bot did. Network traffic might not be able to reconstruct the sessions, but it can tell you about other actions by the attacker, including, say, a successful long session with a host that never reported such a session taking place.
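To make that cross-check concrete, here is a minimal sketch of the idea, with invented record formats (real flow exporters and SSH logs carry far more fields): flag long-lived flows to the SSH port that the server's own log never mentions.

    # Hypothetical sketch: cross-checking network flow records against a host's
    # SSH log. The record formats here are invented for illustration; real
    # collectors (NetFlow/IPFIX exporters, syslog) differ in detail.
    from dataclasses import dataclass

    @dataclass
    class Flow:
        src: str          # client IP observed on the network
        dst: str          # server IP
        dst_port: int
        duration_s: float

    @dataclass
    class SSHLogSession:
        client: str       # client IP as reported by the server's own SSH log

    def unlogged_ssh_sessions(flows, ssh_sessions, min_duration_s=60.0):
        """Return long SSH flows that the server never reported in its log.

        A long-lived session to port 22 with no matching log entry is the
        kind of discrepancy that only appears when network and host data
        are collected together.
        """
        logged_clients = {s.client for s in ssh_sessions}
        return [
            f for f in flows
            if f.dst_port == 22
            and f.duration_s >= min_duration_s
            and f.src not in logged_clients
        ]

    # A 30-minute SSH session that the server's log never mentions:
    flows = [Flow("203.0.113.5", "192.0.2.10", 22, 1800.0)]
    print(unlogged_ssh_sessions(flows, ssh_sessions=[]))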
The core challenge in data-driven analysis is to collect sufficient data to reconstruct rare events without collecting so much data as to make queries impractical. Data collection is surprisingly easy, but making sense of what’s been collected is much harder. In security, this problem is complicated by the rarity of actual security threats.
Attacks are common; threats are rare. The majority of network traffic is innocuous and highly repetitive: mass emails, everyone watching the same YouTube video, file accesses. Interspersed among this traffic are attacks, but the majority of those attacks will be automated and unsubtle: scanning, spamming, and the like. Within those attacks will be a tiny subset representing actual threats.
That security is driven by rare, small threats means that almost all security analysis is I/O bound: to find phenomena, you have to search data, and the more data you collect, the more you have to search. To put some concrete numbers on this, consider a single OC-3 link, which can generate 5 terabytes of raw data per day. By comparison, an eSATA interface can read about 0.3 gigabytes per second, so one search across a day's data takes several hours, assuming that you're reading and writing across different disks. The need to collect data from multiple sources introduces redundancy, which costs additional disk space and increases query times. It is entirely possible to instrument yourself blind.
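As a rough check on those figures (using the approximations above, not measurements), the arithmetic works out as follows:

    # Back-of-the-envelope search time: one linear pass over a day of OC-3
    # traffic at eSATA read speeds, using the figures quoted above.
    DAILY_CAPTURE_GB = 5_000        # ~5 terabytes of raw data per day
    ESATA_READ_GB_PER_SEC = 0.3     # ~0.3 gigabytes per second sequential read

    seconds = DAILY_CAPTURE_GB / ESATA_READ_GB_PER_SEC
    print(f"One pass over a day of data: {seconds / 3600:.1f} hours")
    # -> One pass over a day of data: 4.6 hours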
A well-designed storage and query system enables analysts to conduct arbitrary queries on data and expect a response within a reasonable time frame. A poorly designed one takes longer to execute the query than it took to collect the data. Developing a good design requires understanding how different sensors collect data; how they complement, duplicate, and interfere with each other; and how to effectively store this data to empower analysis. This section is focused on these problems.
This section is divided into seven chapters. Chapter 1 is an introduction to the general process of sensing and data collection, and introduces vocabulary to describe how different sensors interact with each other. Chapter 2 discusses the collection of network data: its value, points of collection, and the impact of vantage on network data collection. Chapter 3 discusses sensors and their outputs. Chapter 4 focuses on service data collection and vantage. Chapter 5 covers the content of service data: logfile data, its format, and converting it into useful forms. Chapter 6 is concerned with host-based data, such as memory or filesystem state, and how that affects network data analysis. Chapter 7 discusses active domain data: scanning and probing to find out what a host is actually doing.