Chapter 6. Incident Response

As the name suggests, incident response is the set of processes and procedures that are initiated once a security incident has been declared. In modern-day computing, incidents range from a single compromised endpoint to complete network compromises resulting in massive data breaches. Data breaches and enterprisewide attacks are becoming more and more common, and thus incident response has grown in meaning beyond merely these processes and procedures to encompass an entire discipline within information security.

In this chapter we will discuss the various processes involved in incident response, tools and technology options, and the most common forms of technical analysis that you are likely to need to perform during an incident.

Processes

Incident response processes are an integral component of being able to react quickly in the event of an incident, determine that an event is not in fact an incident, operate efficiently during an incident, and improve after an incident. Having processes in place before an incident begins will pay dividends in the long run.

Pre-Incident Processes

The processes associated with incident response are not merely concerned with what happens during an incident. If there are no processes in place to recognize that an incident is taking place, to initiate the incident response process, and to notify those responsible for incident response, there is little point in having processes to deal with the incident, as they will never be called upon.

The pre-incident processes do not need to be complex; in fact, they should most definitely not be. The point of these processes is merely to determine if there is a potential incident, and to initiate the incident response process—that’s it!

Having been through multiple iterations of internal incident response, we can say that the most effective processes we have worked with include the following:

Leverage existing processes for dealing with events

Most organizations deal with outages, configuration issues, user-reported issues, and other events. Don’t try to set up a parallel set of processes, but leverage what is already there—in all likelihood, the same people who deal with these issues will be the first to hear of an issue anyway. Just modify or supplement existing processes to include calling the incident response contact in the event of a suspected incident, much like they already know to call the on-call Unix person when a Linux host fails in the middle of the night.

Define an incident

If you do not define what you class as an incident, you will either get called for every support call or not get called during a breach of four million records. If it is not simple to define what an incident is, you can opt for wording like, “once a manager has determined that an event is a security incident...” This way you have at least defined that any event will have already progressed beyond triage by first-line support and someone experienced enough to make the determination has made a judgment call.

The result of a pre-incident process is nearly always to initiate the IR process by declaring an incident and calling the contact for incident response.

Note

An incident that turns out to be nothing can always be downgraded to a standard operations event. It is better to be called for a suspected incident that transpires to be nothing than to not be called for fear of a false positive.

It is in everyone’s best interest to communicate clearly and early on. It not only saves time and effort fixing miscommunication and hearsay issues, but also allows those individuals fixing the downtime or incident the time to fully concentrate on the issue at hand. No downtime is too small for proper communication!

Incident Processes

The processes that take place during an incident, particularly from a technology perspective, cannot be too prescriptive. Incidents, like many operational problems, are far too varied and numerous to prescribe precise courses of action for all eventualities. However, there are some processes that are worth sticking to:

Define an incident manager

This does not have to be the same person for every incident, but should be someone who is senior enough to make decisions and empower others to complete tasks. The incident manager will run the response effort and make decisions.

Define internal communications

Clear communication between everyone working on the incident is key: it avoids duplication of work, promotes the sharing of information, and ensures that everyone is working toward a common goal. We would recommend the following:

  • Open a “war room.” That is, use an office or meeting room to perform the role of center of operations for anyone in the same physical location. This is used as the central point for coordination of efforts.
  • Keep a conference bridge open in the war room. This allows people who are remote to the physical location to check in, update those in the war room, and obtain feedback. If there is no physical war room, this will often serve as a virtual war room.
  • Hold regular update meetings. Regular updates allow people to move away, work in a more concentrated fashion, and report back at predictable intervals, rather than feeling watched over and reporting back haphazardly. Typically, meeting every hour works well until the situation is well understood.
  • Allocate the task of communicating internally to stakeholders. Management will typically want to be kept abreast of a larger incident. However, sporadic communication from a number of people can send mixed messages and be frustrating for both management and the incident response team. A single point of communication between the two allows stakeholders to receive frequent, measured updates.
Define external communications

In many cases, but not all, some external communication may be required. Typically, this is because customers or other departments will be affected by the incident in some way. This sort of communication should not be taken lightly as it affects the public image of the organization and internal technology department. If you are considering undertaking any external communications yourself, rather than allowing your corporate communication or PR team to do it, we would suggest you read Scott Roberts’ “Crisis Comms for IR” blog post on the topic.

Determine key goals

By determining the goals that you wish to achieve in the event of an incident, you can ensure that all actions are taken with these goals in mind. By goals we do not mean simply “fix it,” but considerations such as “preserve chain of custody for evidence” or “minimize downtime.” This is discussed in more depth in Chapter 7.

High-level technology processes

As mentioned before, it is difficult to account for all eventualities, so being prescriptive with technology-based remedies may be difficult; however, there are some high-level processes that may be in place. For example, there may be policies regarding taking snapshots of affected systems to preserve evidence, ensuring that staff stop logging in to affected systems, or a blackout on discussing incidents via email in case an attacker is reading internal email and will be tipped off.

Plan for the long haul

Many incidents are over in just a few hours, but many last substantially longer, often weeks. It is tempting to pull in all available resources in the hope of a timely conclusion, but if it becomes clear that the incident will not be resolved quickly, you should prepare for a longer-term course of action. Ensure people are sent away to get rest so that they can come back and cover the next shift, and keep those working fed and watered to prevent fatigue. Try not to burn everyone out, as this can be a game of endurance.

Post-Incident Processes

Once an incident is over, it is very valuable to hold a lessons-learned session. This allows for feedback regarding what worked well and what worked less well. It also allows you the chance to update processes, determine training requirements, change infrastructure, and generally improve based on what you learned from the incident.

It is recommended that this session be held a short time after the incident closes. This offers a few days for people to reflect on what happened, gather some perspective, and recover, without leaving it so long that memories fade or become distorted with time. Using this session to update documentation, policies, procedures, and standards will also allow for updated tabletops and drills.

Outsourcing

Many people do not wish to manage incidents internally, at least not beyond the initial triage point, and would rather bring in external subject matter expertise as required. This is an option that works well for many organizations. However, we would recommend that if this is the route that you decide to take, negotiate contracts, nondisclosure agreements, and service-level agreements before an incident happens. When you are elbow deep in an incident is not the time to be negotiating with a potential supplier about when they can spare someone and what rates you will have to pay.

Tools and Technology

It would be easy to list a large number of technologies that are typically used by incident response professionals, especially in the field of digital forensics. However, a lack of experience in this area can make it easy to misinterpret results, whether through unfamiliarity with the specific tools or through not fully understanding the context of what is being examined.

Fully understanding an environment, knowing what the various logs mean, knowing what should and should not be present, and learning how to use the tools that are already present can vastly increase the chances of successfully managing an in-progress incident. Mid-incident is not the time to learn how to conduct a forensic investigation; that is better left to someone who has prior experience in this field. That said, a high-level appreciation of what can happen during an incident can be achieved by reviewing some high-level topics. We also discuss some example tools that can be used to assess what is happening in an environment during an incident.

Log Analysis

The first port of call, as with any type of operational issue, is of course the humble logfile.  Application and operating system logfiles can hold a wealth of information and provide valuable pointers to what has happened.

If logs are stored on the host that generated them, you should remain cognizant of the fact that if someone compromises that host, they can easily modify the logs to remove evidence of what is happening. If possible, the logs stored on your Security Information and Event Management (SIEM) platform should be consulted, rather than the logs on the target device. This not only reduces the chances of log tampering but also provides the ability to query logs across the whole estate at once, permitting a more holistic view of the situation. A SIEM can also show whether a gap in the logs has occurred.
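The gap check mentioned above can also be approximated locally. The following is a minimal sketch that assumes the first field of each log line is an epoch timestamp; the sample data and the 300-second threshold are fabricated for illustration:

```shell
# Flag gaps in a log where consecutive entries are more than 300 seconds
# apart. The sample file below uses fabricated epoch timestamps.
cat > /tmp/events.log <<'EOF'
1700000000 service started
1700000060 heartbeat
1700000120 heartbeat
1700001200 heartbeat
1700001260 heartbeat
EOF

# Compare each timestamp against the previous line's and report any
# delta larger than the threshold.
awk 'NR > 1 && $1 - prev > 300 {
       printf "gap of %d seconds before line %d\n", $1 - prev, NR
     }
     { prev = $1 }' /tmp/events.log
```

In real use, a host whose agent was stopped by an attacker tends to show up as exactly this kind of silence between otherwise regular entries.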

When reviewing logs on a SIEM, it is likely that the SIEM’s own query tools and search language will need to be used. It may also be possible to access the data via an API, using commands such as curl or customized scripts.

If the logs are not being accessed via a SIEM, it is recommended to take a copy, if possible, and analyze it locally with your preferred tools. Personally, we opt for a combination of traditional Unix command-line tools such as grep, awk, sed, and cut, along with scripts written for specific use cases.
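As an illustration of that command-line approach, here is a minimal sketch that counts failed SSH logins per source IP; all log lines are fabricated samples in the style of a syslog auth log:

```shell
# Fabricated sample of an SSH auth log.
cat > /tmp/auth-sample.log <<'EOF'
Oct 12 03:11:01 web01 sshd[4210]: Failed password for root from 203.0.113.5 port 52144 ssh2
Oct 12 03:11:04 web01 sshd[4210]: Failed password for root from 203.0.113.5 port 52150 ssh2
Oct 12 03:12:19 web01 sshd[4233]: Accepted password for alice from 198.51.100.7 port 40112 ssh2
Oct 12 03:13:40 web01 sshd[4250]: Failed password for admin from 192.0.2.99 port 61020 ssh2
EOF

# grep isolates the events of interest; awk pulls the field after
# "from" (the source IP); sort | uniq -c ranks repeat offenders.
grep 'Failed password' /tmp/auth-sample.log \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "from") print $(i + 1)}' \
  | sort | uniq -c | sort -rn
```

The same pattern (filter, extract a field, count) adapts readily to web server logs, firewall logs, and most other line-oriented formats.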

Disk and File Analysis

Analysis of artifacts on storage devices can also provide clues as to what has happened during an incident. Typically, a disk image will yield more information than purely examining files, as this contains not only the files stored on the disk that are immediately visible, but also potentially fragments of deleted files that remain on disk, chunks of data left in slack space, and files that have been hidden via rootkits. Using a disk image also ensures that you do not accidentally modify the original disk, which preserves the integrity of the original should there be legal proceedings of some kind. Obtaining a disk image traditionally means taking a host down and using a tool such as dd or dcfldd, or a commercial equivalent, to take an image of the disk, which is saved to another drive and then examined offline. Unfortunately, this causes downtime.
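The imaging step itself can be sketched with dd. In this illustration a small file of random data stands in for the source device; on a real case, if= would name a block device such as /dev/sdb and the image would be written to a separate evidence drive:

```shell
# Stand-in for the source device: 2 KiB of random data.
dd if=/dev/urandom of=/tmp/evidence-src bs=512 count=4 2>/dev/null

# conv=noerror,sync carries on past read errors, padding failed blocks
# so that file offsets in the image stay aligned with the source.
dd if=/tmp/evidence-src of=/tmp/evidence.img bs=512 conv=noerror,sync 2>/dev/null

# Hash both source and image so the copy can later be shown to be
# faithful; dcfldd can compute such hashes on the fly while imaging.
sha256sum /tmp/evidence-src /tmp/evidence.img
```

Recording the hashes at imaging time, and again whenever the image changes hands, is what makes the copy defensible if chain of custody is ever questioned.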

Disk and File Analysis in Virtual Environments

In most virtualized and some cloud computing environments, taking downtime to image a disk is less of a problem because all the major vendors have various snapshot technologies that can be used to take an image of a guest operating system. However, these technologies will often compress disk images, discarding unused space and thereby losing much of the deleted-file and slack-space information described earlier.

Once a disk image has been obtained, various commercial tools can be used to analyze the filesystem to discover files of interest, construct timelines of events, and other related tasks. In the open source/free space, the old classics The Sleuth Kit and Autopsy remain favorites.

If a simple recovery of files is all that is desired, PhotoRec is a simple-to-use tool that yields surprisingly good results. Despite the name, it is not limited to photos.

Memory Analysis

Code that is executing, including malicious code, is resident in RAM. If you can obtain a memory dump from a compromised host—that is, a file that contains a byte-for-byte copy of the RAM—then analysis can be performed to discover malicious code, memory hooks, and other indicators of what has happened.

The most popular tool to analyze these RAM dumps is the Volatility Framework (see the wiki on GitHub).

Obtaining RAM dumps will vary from OS to OS and it is a constantly changing field, so we would recommend checking the Volatility documentation for the latest preferred method.

For virtualized platforms, however, there is no need to dump RAM using the OS, as the host can take an image of the virtual memory. Following are the three most common examples of how to achieve this:

QEMU
pmemsave 0 0x20000000 /tmp/dumpfile
Xen
sudo xm dump-core -L /tmp/dump-core-6 6
VMware ESX
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/get.summary vmid
vim-cmd vmsvc/snapshot.create vmid [Name] [Description]
  [includeMemory (1)] [quiesced]

PCAP Analysis

If you have any tools that sniff network traffic, either inline (such as an IDS/IPS or network monitoring device) or via a span port, there is every chance that you have packet capture (PCAP) files available. PCAP files contain copies of data as it appeared on the network and allow an analyst to attempt to reconstruct what was happening on the network at a particular point in time.

A vast number of tools can be used to perform PCAP analysis; however, for a first pass at understanding what is contained in the traffic we would recommend using IDS-like tools such as Snort or the Bro Network Security Monitor (now known as Zeek), configured to read from a PCAP file as opposed to a live network interface. This will catch obvious traffic that triggers their predefined signatures.

Some staples for conducting PCAP analysis include the following tools:

  • tcpdump produces header and summary information, hex dumps, and ASCII dumps of packets that are either sniffed from the wire or read from PCAP files. Because tcpdump is command line, it can be used with other tools such as sed and grep to quickly determine frequently occurring IP addresses, ports, and other details that could be used to spot abnormal traffic. tcpdump is also useful because it can apply filters to PCAP files and save the filtered output. These output files are themselves smaller PCAPs that can be fed into other tools that do not handle large PCAPs as gracefully as tcpdump does.
  • Wireshark is the de facto tool for analysis of PCAP data. It provides a full GUI that allows the user to perform functions such as filtering and tracking a single connection, providing protocol analysis and graphing certain features of the observed network traffic. Wireshark does not, however, handle large files very well, and so prefiltering with tcpdump is recommended.
  • tshark (bundled with Wireshark) is a command-line version of Wireshark. It is not quite as intuitive or easy to use, but being on the command line allows it to be used in conjunction with other tools such as grep, awk, and sed to perform rapid analysis.
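Because tcpdump emits plain text, its output can be reduced with the same command-line tools used for log analysis. The sketch below counts connections per destination port from fabricated tcpdump-style one-line summaries; real input would come from something like `tcpdump -n -r capture.pcap`:

```shell
# Fabricated tcpdump-style summary lines.
cat > /tmp/tcpdump-sample.txt <<'EOF'
12:00:01.000000 IP 192.0.2.10.51514 > 198.51.100.2.443: Flags [S], length 0
12:00:01.200000 IP 192.0.2.10.51515 > 198.51.100.2.443: Flags [S], length 0
12:00:02.000000 IP 192.0.2.11.40220 > 198.51.100.9.22: Flags [S], length 0
EOF

# Field 5 is the destination address; strip the trailing colon, then
# keep the text after the last dot (the port) before counting.
awk '{dst = $5; sub(/:$/, "", dst); n = split(dst, a, "."); print a[n]}' \
    /tmp/tcpdump-sample.txt | sort | uniq -c | sort -rn
```

A spike of connections to an unexpected port in a summary like this is often the quickest pointer to which flows deserve a closer look in Wireshark.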

All in One

If you are familiar with LiveCDs such as Kali in the penetration testing world, then an approximate equivalent for incident response is CAINE, a collection of free/open source tools provided on a single LiveCD or USB thumbdrive. It can be booted without prior installation for quick triage purposes.

Conclusion

Incident response is not a prescriptive process from beginning to end. However, there are some key areas that can be process driven, such as communication, roles and responsibilities, and high-level incident management. This allows incidents to be effectively controlled and managed without bogging down technical specialists with complex decision tree processes.

Incident response is an area of information security that most hope they will never have to be involved with; however, when the occasion comes you will be glad that you have prepared.