Chapter 13. Operations and OpSec

Agile teams often come up against a wall with operations, because they can deliver changes much faster than ops can handle. But the walls between dev and ops are coming down, as operations functions move to the cloud, and as operations teams start their own Agile transformations.

This is what the DevOps movement is all about: applying the ideas, engineering practices, and tools of Agile development, and Agile values, to operations, and finding ways to bring operations and developers closer together. These development teams don’t hand off work to operations and sustaining engineering and then move on to the next project. Instead, developers and operations share responsibility for getting the system into production and making sure that it is running correctly, for the life of the system.

Developers are getting directly involved in packaging and provisioning, runtime configuration and deployment, monitoring and incident response, and other operations functions. As operations moves infrastructure into code, they are adopting the same engineering practices and tools as developers, learning about refactoring and test-driven development and continuous integration. We are starting to see more demand for hybrid roles like site reliability engineers (SREs) patterned after Google: engineers who have a strong background in operations as well as strong software development skills.

But whether developers are working in DevOps teams or not, they need to understand how to deploy and set up their systems correctly and safely, how the system works under real-world load conditions and in failure situations, and what information and tools operations needs to effectively run and monitor and troubleshoot the system. And they should do everything that they can to reduce friction with ops and to improve feedback from production.

In this chapter we’ll look at the natural intersections between dev and ops and security, and how Agile development creates new challenges—and opportunities—for operations and security.

System Hardening: Setting Up Secure Systems

While much of the popular focus in security discussions is on the discovery of application-level bugs and flaws that can be exploited in subtle ways to gain advantage, the reality is that correct configuration and setup of the operating environment is essential to security. Designing and building your application to be secure is important, but it isn’t enough.

One of the first things that a pen tester—or an attacker—will do is to enumerate your environment to see what attack options may be available. The combination of systems, applications, their interrelationships, and the people who use them is often referred to as the attack surface, and it is these building blocks that will be used to map out potential routes to compromise.

An important point to note is that from an attacker’s perspective, people are just as important, and potentially vulnerable, as technology. So from a defender’s perspective, system hardening needs to be done with a clear understanding of how those systems will get used by people in reality and not just in an idealized scenario.

Firewalls that aren’t configured properly, ports that shouldn’t have been left open, default admin passwords, out-of-date operating systems, software packages with known exploits, passwords being shared on an internal wiki page or in a git repo, and other common and easily avoided mistakes all present opportunities that can be used to pick apart an environment in furtherance of an attacker’s goal.

Many of the tools discussed in this book that developers rely on to build software are designed to be as easy as possible to pick up and start using right away. As we’ll see in this chapter, as teams continue their Agile development journey and deploy code both more frequently and in a more automated fashion, this can create a serious set of security risks.

Runtime software is no better. Databases, web servers, and even the operating system and security tools are also packaged so that they are simple to set up and get running using their default configs. Attackers know this, and know how to take advantage of it.

This means that after you have provisioned a system and installed the necessary software packages and finally have things working correctly, you’ll need to go through extra steps to harden the system against attack. This applies to your production systems, and, as we’ll see, even extends into your build and test environments. Of course, any systems that have internet-facing components will require even greater levels of scrutiny.

System-hardening steps focus on reducing your attack surface. As Justin Schuh of Google stated succinctly on Twitter: “Security at its core is about reducing attack surface. You cover 90% of the job just by focusing on that. The other 10% is luck.”

Common hardening techniques include things like the following:

  • Disabling or deleting default accounts and changing default credentials

  • Creating effective role separation between human and system principals (i.e., don’t run * as root!)

  • Stripping down the installation: removing software packages, or disabling daemons from auto-starting, that are not needed for the system’s role

  • Disabling services and closing network ports that are not absolutely required

  • Making sure that the packages that you do need are up to date and patched

  • Locking down file and directory permissions

  • Setting system auditing and logging levels properly

  • Turning on host-based firewall rules as a defense-in-depth protection

  • Enabling file integrity monitoring tools such as OSSEC or Tripwire to catch unauthorized changes to software, configuration, or data

  • Ensuring the built-in security mechanisms for each platform, such as full disk encryption, SIP (System Integrity Protection) on macOS, or ASLR/DEP on Windows, are enabled and functioning correctly

As we’ll see, this is not a complete list—just a good starting point.
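
To make a few of these steps concrete, here is a minimal sketch in plain Ruby (the language used by several of the compliance tools later in this chapter) of scripted spot checks for some of the items above. It assumes a generic Linux host, and the specific files and expected settings are illustrative, not a complete baseline:

#!/usr/bin/env ruby
# Spot checks for a few common hardening items on a Linux host.

failures = []

# Locked-down file permissions: /etc/shadow must not be world-accessible
mode = File.stat('/etc/shadow').mode & 0o777
unless (mode & 0o007).zero?
  failures << "/etc/shadow is world-accessible (mode #{format('%o', mode)})"
end

# Default accounts: 'games' should not have an interactive shell
File.readlines('/etc/passwd').each do |line|
  user, _pw, _uid, _gid, _gecos, _home, shell = line.strip.split(':')
  if user == 'games' && shell !~ %r{/(nologin|false)\z}
    failures << "default account '#{user}' still has shell #{shell}"
  end
end

# Hardened sshd: remote root logins should be disabled
sshd = File.read('/etc/ssh/sshd_config') rescue ''
failures << 'PermitRootLogin is not set to no' unless sshd =~ /^\s*PermitRootLogin\s+no\b/i

if failures.empty?
  puts 'Hardening spot checks passed'
else
  failures.each { |f| warn "HARDENING: #{f}" }
  exit 1
end

Checks like these are no substitute for the hardening guides and scanners discussed later in this chapter, but they show how hardening expectations can be captured as executable checks rather than tribal knowledge.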

Tip

For a quick introduction on how to harden a Linux server, check out this excerpt from Defensive Security Handbook (O’Reilly) by Lee Brotherston and Amanda Berlin.

You will need to go through similar steps for all the layers that your application depends on: the network, the OS, VMs, containers, databases, web servers, and so on.

Detailed discussions of each of these areas are books in and of themselves, and continue to evolve over time as the software and systems you use change and have new features added. Ensuring you stay familiar with the best practices for hardening the technologies that make up your runtime stack is one of the key aspects of operational security overall, and is worth investing in as deeply as you are able.

Regulatory Requirements for Hardening

Some regulations such as PCI DSS lay out specific requirements for hardening systems. In Requirement 2.2, PCI DSS requires that you have documented policies for configuring systems that include procedures for:

  • Changing vendor-supplied defaults and removing unnecessary default accounts

  • Isolating the function of each server (using physical isolation, VM partitioning, and/or containers) so that you don’t have applications or services with different security requirements running together

  • Enabling only those processes and services that are required, and removing scripts, drivers, subsystems, and filesystems that aren’t needed

  • Configuring system parameters to prevent misuse

Understanding the specific hardening requirements of whatever regulations apply to you is a key part of being compliant with them, and something you will need to demonstrate during the auditing process.

Hardening Standards and Guidelines

Outside of regulations, there are hardening standards and guidelines for different platforms and technologies that explain what to do and how to do it. Some of the more detailed, standardized guidelines include the DISA STIGs (Security Technical Implementation Guides) and the checklists in NIST’s National Checklist Program.

There are also product-specific hardening guides published by vendors such as Red Hat, Cisco, Oracle, and others, as well as guides made freely available by practitioners in various parts of the security space.

The Center for Internet Security is a cross-industry organization that promotes best practices for securing systems. It publishes the Critical Controls, a list of 20 key practices for running a secure IT organization, and the CIS benchmarks, a set of consensus-based security hardening guidelines for Unix/Linux and Windows OS, mobile devices, network devices, cloud platforms, and common software packages. These guidelines are designed to meet the requirements of a range of different regulations.

The CIS checklists are provided free in PDF form. Each guide can run to hundreds of pages, with specific instructions on what to do and why. They are also available in XML format for members to use with automated tools that support the SCAP XCCDF specification.

Hardening guides like the CIS specifications are intended to be used as targets to aim for, not checklists that you must comply with. You don’t necessarily need to implement all of the steps that they describe, and you may not be able to, because removing a package or locking down access to files or user permissions could stop some software from working. But by using a reference like this you can at least make informed risk trade-offs.

Challenges with Hardening

System hardening is not an activity that occurs once and is then forgotten about. Hardening requirements continuously change as new software and features are released, as new attacks are discovered, and as new regulations are issued or updated.

Because many of these requirements are driven by governments and regulators, and inevitably involve committees and review boards, the process of publishing approved guidelines is bureaucratic, confusing, and s-l-o-w. It can take months or years to reach agreement on what should and shouldn’t be included in any given hardening specification before it can be published in a form that people can use. The unfortunate reality is that many such official guides cover releases of software that are already out of date and contain known vulnerabilities.

It’s a sad irony: you can find clear, approved guidance on how to harden software that has already been replaced with something newer, something that should be safer to use, if only you knew how to configure it safely. This means that you will often have to start hardening new software on your own, falling back on first principles of reducing the attack surface.

As mentioned, many hardening steps can cause things to break, from runtime errors in a particular circumstance, to preventing an application from starting altogether. Hardening changes need to be made iteratively and in small steps, testing across a range of cases along the way to see what breaks.

The problem of balancing hardening against functionality can feel as much an art as a science. This can be even more challenging on legacy systems that are no longer being directly supported by the vendor or author.

Even when using systems that are brand new and supported by their vendor, the process of hardening can leave you feeling very much alone. For example, one of us recently had to harden an enterprise video conferencing system. This required countless emails and phone calls with the vendor trying to get definitive information on the function of the various network ports that the documentation stated needed to be open. The end result: the customer had to explain to the vendor which subset of specific ports was actually needed for operation, in place of the large ranges that the vendor had requested. Regrettably, this is not an uncommon situation, so be prepared to have to put in some groundwork to get your attack surface as small as possible, even with commercial solutions.

Another trend that is making hardening more challenging still is the blurring of the boundary between operating system and online services. All the major general-purpose operating systems now come with a range of features that make use of online services and APIs out of the box. Depending on your environment, this cloudification of the OS can raise significant concerns in terms of both security and privacy, and it makes the task of building trust boundaries around endpoints increasingly difficult.

Frustratingly, system and application updates can also undermine your hardening efforts, with some updates changing configuration settings to new or default values. As such, your hardening approach needs to include testing and qualifying any updates before they are applied to hardened systems.

Hardening is highly technical and detail oriented, which means that it is expensive and hard to do right. But all the details matter. Attackers can take advantage of small mistakes in configuration to penetrate your system, and automated scanners (a common part of any attacker’s toolbox) can pick many of them up easily. You will need to be careful, and review closely to make sure that you didn’t miss anything important.

Automated Compliance Scanning

Most people learned a long time ago that you can’t effectively do all of this by hand: just like attackers, the people who build systems need good tools.

There are several automated auditing tools that can scan infrastructure configurations and report where they fall short of hardening guidelines:

  • CIS members can download and use a tool called CIS-CAT, which will scan and check for compliance with the CIS benchmarks.

  • OpenSCAP scans specific Linux platforms and other software against hardening policies based on PCI DSS, STIG, and USGCB, and helps with automatically correcting any deficiencies that are found.

  • Lynis is an open source scanner for Linux and Unix systems that will check configurations against CIS, NIST, and NSA hardening specs, as well as vendor-supplied guidelines and general best practices.

  • Freely contributed checkers for specific systems can also be found on the internet; examples include osx-config-check and Secure-Host-Baseline.

  • Some commercial vulnerability scanners, like Nessus and Nexpose and Qualys, have compliance modules that check against CIS benchmarks for different OSes, database platforms, and network gear.

Note

Compliance scanning is not the same as vulnerability scanning. Compliance scanners check against predefined rules and guidelines (good practices). Vulnerability scanners look for known vulnerabilities such as default credentials and missing patches (bad practices). Of course, there is overlap here, because many known vulnerabilities are the result of people not following good practices.

Scanners like Nessus and Nexpose scan for vulnerabilities as well as compliance with specific guidelines, using different policy rules or plug-ins. On Red Hat Linux only, OpenSCAP will scan for compliance violations, as well as for vulnerabilities.

Other scanners, like the OpenVAS project, only check for vulnerabilities. OpenVAS is a fork of Nessus from more than 10 years ago, before Nessus became closed source. It scans systems against a database of thousands of known weaknesses and exploits, although it has built up a bit of a reputation for flagging false positives, so be prepared to validate its results rather than taking them as truth.

If you are not scanning your systems on a regular basis as part of your build and deployment pipelines—and correcting problems as soon as the scanners pick them up—then your systems are not secure.
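
As a sketch of what this can look like, here is a small Ruby wrapper that runs a Lynis scan as a pipeline step and fails the build if the scan reports warnings. It assumes Lynis is installed and writes its report to /var/log/lynis-report.dat (its usual default); exit codes and report formats vary between versions, so check your documentation:

#!/usr/bin/env ruby
# Run a Lynis hardening scan and fail the pipeline on warnings.

REPORT = '/var/log/lynis-report.dat'

# Lynis exit codes vary by version and findings, so run the scan and
# then inspect the machine-readable report directly.
system('lynis audit system --quiet')
abort "no Lynis report found at #{REPORT}" unless File.exist?(REPORT)

warnings = File.readlines(REPORT).grep(/\Awarning\[\]=/)

unless warnings.empty?
  warnings.each { |w| warn w.strip }
  abort "Build failed: #{warnings.size} hardening warning(s) found"
end

puts 'Compliance scan passed'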

Approaches for Building Hardened Systems

Automation can also be used to build hardened system configurations from the start. There are two basic strategies for doing this:

Golden image

Bake a hardened base template or “golden image” that you will use to stand up your systems. Download a standardized operating system distribution, install it on a stripped-down machine, load the packages and patches that you need, and walk through hardening steps carefully until you are happy. Test it, then push this image out to your production systems.

These runtime system configurations should be considered immutable: once installed, they must not be changed. If you need to apply updates or patches or make runtime configuration changes, you must create a new base image, tear the machines down, and push a completely new runtime out, rebuilding the machines from scratch each time. Organizations like Amazon and Netflix manage their deployments this way because it ensures control at massive scale.

Tools like Netflix’s Aminator and HashiCorp’s Packer can be used to bake machine images. With Packer, you can configure a single system image and use the same template on multiple different platforms: Docker, VMware, and cloud platforms like EC2 or Azure or Google Compute Engine. This allows you to have your development, test, and production on different runtime platforms, and at the same time keep all of these environments in sync.

Automated configuration management

Take a stripped-down OS image, use it to boot up each device, and then build up the runtime following steps programmed into a configuration management tool like Ansible, Chef, Puppet, or Salt. The instructions to install packages and apply patches, and configure the runtime including the hardening steps, are checked in to repos like any other code, and can be tested before being applied.

Most of these tools will automatically synchronize the runtime configuration of each managed system with the rules that you have checked in: every 30 minutes or so, they compare these details and report variances or automatically correct them.

However, this approach is only reliable if all configuration changes are made in code and pushed out in the same way. If engineers or DBAs make ad hoc runtime changes to files or packages or users that are not under configuration management, these changes won’t be picked up and synchronized. Over time, configurations can drift, creating the risk that operational inconsistencies and vulnerabilities will go undetected and unresolved.
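
For example, here is a minimal sketch of hardening steps expressed as a Chef recipe (Chef recipes are plain Ruby). The package names, services, and paths are illustrative, and real hardening cookbooks, like the DevSec templates discussed below, go much further:

# Remove packages that are not needed for this server's role
%w(telnet rsh-client).each do |pkg|
  package pkg do
    action :remove
  end
end

# Stop and disable daemons that should not be running
service 'avahi-daemon' do
  action [:stop, :disable]
end

# Manage sshd_config from a template that is checked in to source
# control and reviewed like any other code change
template '/etc/ssh/sshd_config' do
  source 'sshd_config.erb'
  owner  'root'
  group  'root'
  mode   '0600'
  notifies :restart, 'service[ssh]'
end

service 'ssh' do
  action [:enable, :start]
end

Because the Chef client periodically re-converges each node against this recipe, an ad hoc change to sshd_config will be flagged or reverted on the next run, which is exactly the drift protection described above.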

Both of these approaches make system provisioning and configuration faster, safer, and more transparent. They enable you to respond to problems by pushing out patches quickly and with confidence, and, if necessary, to tear down and completely rebuild your infrastructure from scratch after a breach.

Automated Hardening Templates

With modern configuration management tools like Chef and Puppet, you can take advantage of hardening guidelines that are captured directly into code.

One of the best examples is DevSec, a set of open source hardening templates originally created at Deutsche Telekom, and now maintained by contributors from many organizations.

This framework implements practical hardening steps for common Linux base OS distributions and common runtime components including SSH, Apache, Nginx, MySQL, and PostgreSQL. A full set of hardening templates is provided for Chef and Puppet, as well as several playbooks for Ansible. All the templates contain configurable rules that you can extend or customize as required.

The hardening rules are based on recognized best practices, including Deutsche Telekom’s internal standards, BetterCrypto.org’s Applied Crypto Hardening guide, which explains how to safely use encryption, and various hardening guides.

Hardening specifications like these are self-documenting (at least for technical people) and testable. You can use automated testing tools like Serverspec, which we looked at in Chapter 12, to ensure that the configuration rules are applied correctly.

DevSec comes with automated compliance tests written using InSpec, an open source Ruby DSL. These tests ensure that the hardening rules written in Chef, Puppet, and Ansible all meet the same guidelines. The DevSec project also includes an InSpec compliance profile for the Docker CIS benchmark and an SSL benchmark test.

InSpec works like Serverspec in that it checks the configuration of a machine against expected results and fails when the results don’t match. You can use it for test-driven compliance, by writing hardening assertions that will fail until the hardening steps are applied, and you can also use these tests to verify that new systems are set up correctly.

InSpec tests are specifically designed to be shared between engineers and compliance auditors. Tests are written in simple English, and you can annotate each scenario to make the intention explicit to an auditor. For each testing rule, you can define the severity or risk level, and add descriptions that match up to compliance checklists or regulatory requirements. This makes it easy to walk through the test code with an auditor and then demonstrate the results.

Here are a couple of examples of compliance tests from the InSpec GitHub repository:

Only accept requests on secure ports - This test ensures that a web server is
only listening on well-secured ports.

describe port(80) do
  it { should_not be_listening }
end

describe port(443) do
  it { should be_listening }
  its('protocols') { should include 'tcp' }
end

Use approved strong ciphers - This test ensures that only
enterprise-compliant ciphers are used for SSH servers.

describe sshd_config do
  its('Ciphers') { should eq('chacha20-poly1305@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr') }
end

InSpec is also the basis of Chef’s Compliance Server product. Developers at Chef are working on translating automated hardening profiles like CIS from SCAP XML into InSpec, so that these rules can be automatically checked against your complete infrastructure.

Network as Code

In traditional systems development, network security and application security are mostly done independently of each other. Developers may need a firewall ACL opened or a routing change, or they might need to know how to deal with proxies in a DMZ. But otherwise these worlds don’t touch each other much.

This is changing as applications are moved to the cloud, where more network rules and management capabilities are exposed to developers—and where developers can no longer rely on perimeter defenses like network firewalls to protect their applications.

Network appliances such as firewalls, intrusion detection or prevention systems, routers, and switches are still often set up and configured (and hardened) by hand, using custom scripts or device consoles. But over the last few years, network vendors have added REST APIs and other programmable interfaces to support software-defined networking; and tools like Ansible, Chef, and Puppet have added support for programmatically configuring network devices, including switches and firewalls from providers like Cisco Systems, Juniper Networks, F5 Networks, and Arista Networks.

Taking advantage of this tooling, “network as code” could become the same kind of game changer as “infrastructure as code” has been for managing servers and storage and cloud service platforms. All the same advantages apply:

  • Network device configurations can be versioned and managed in source control, providing an audit trail for change management and for forensics.

  • Configuration rules for network devices can be expressed using high-level languages and standardized templates, making changes simpler and more transparent to everyone involved.

  • Network configuration changes can be reviewed and tested in advance, using some of the same tools and practices that we’ve described, instead of relying on ping and traceroute after the fact.

  • Changes can be automatically applied across many different devices to ensure consistency.

  • Network configuration changes can be coordinated with application changes and other infrastructure changes, and deployed together, reducing risk and friction.

In many organizations, the network operations group is a completely separate team, one that may well predate the systems operations team. Embracing concepts like these will therefore require the same kind of cultural and engineering changes that are taking place in the way server infrastructure is managed in DevOps teams. But it holds out the promise of making network configuration and management more open to collaboration, simpler to understand, and simpler and safer to change. All of which is good for security.

Monitoring and Intrusion Detection

It is a fact of modern life that you cannot bring up a computer or network device on the internet without it being port scanned within minutes. Your detection systems must be able to tell the difference between the background noise of failing attempts by bots scanning your system perimeter and successful breaches, as well as distinguishing between normal and attacker behavior: is that low and slow data exfiltration, or the periodic update of a stock ticker?

In large organizations, security monitoring is usually done by a specialized security operations center (SOC), staffed by analysts sifting through attack data, intrusion alerts, and threat feeds. In smaller organizations, security monitoring might be outsourced to a managed security services provider (MSSP) like SecureWorks or Symantec—or it might not be done at all. This kind of monitoring is mostly focused at the network level, examining network traffic for known bad signatures or unusual patterns in activity.

Operations engineers monitor servers, networks, storage, databases, and running applications using tools like Nagios and Zabbix, or services like New Relic, to check on system status and to watch for indications of slowdowns or runtime problems: hardware failures, disks filling up, or services crashing.

Application monitoring is done to check on the health of the application, and to capture metrics on use. How many transactions were executed today? Where is most of the traffic coming from? How much money did we make? How many new users tried out the system? And so on.

Monitoring to Drive Feedback Loops

Agile teams, and especially DevOps and Lean startup teams, use information about how the system is running and how it is being used to shape design, making data-driven decisions about which features are more usable or more useful, and which parts of the system have reliability or quality problems. They build feedback loops from testing and from production to understand where they need to focus, and where they need to improve. This is quite different from many traditional approaches to security, where security goals are defined in advance and delivered in the form of edicts that must be met.

DevOps teams can be metrics-obsessed, in the same way that many Agile teams are test-obsessed. Monitoring technologies like Etsy’s statsd+carbon+graphite stack, Prometheus, Graylog, or Elastic Stack, or cloud-based monitoring platforms like Datadog or Stackify, make it relatively cheap to create, collect, correlate, visualize, and analyze runtime metrics.

DevOps leaders like Etsy, Netflix, and PayPal track thousands of individual metrics across their systems and capture hundreds of thousands of metrics events every second. If someone thinks that a piece of information might be interesting, they make it a metric. Then they track it and graph it to see if anything interesting or useful jumps out from the rest of the background noise. At Etsy, they call this: “If it moves, graph it,” and are the first to admit that they “worship at the church of graph.”

Using Application Monitoring for Security

The same tools and patterns can also be used to detect security attacks and to understand better how to defend your system.

Runtime errors like HTTP 400/500 errors could indicate network errors or other operations problems—or they could be signs of an attacker trying to probe and spider the application to find a way in. SQL errors at the database level could be caused by programming bugs or a database problem, or by SQL injection attempts. Out of memory errors could be caused by configuration problems or coding mistakes—or by buffer overflow attacks. Login failures could be caused by forgetful users, or by bots trying to break in.

Developers can help operations and the security team by designing transparency and visibility into their applications, adding sensors to make operational and security-related information available to be analyzed and acted on.

Developers can also help to establish baselines of “normal” system behavior. Because they have insight into the rules of the application and how it was designed to be used, they should be able to see exceptions where other people may just see noise in logs or charts. Developers can help to set thresholds for alerts, and they will jump when an “impossible” exception happens, because they wrote the assert which said that this could never happen. For example, if a server-side validation fails and they know that the same check was also implemented at the client, it is a strong sign that someone is attacking the system using an intercepting proxy.
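
Here is a sketch of what such a sensor could look like in Ruby: a server-side validation failure that the client should have made impossible is counted as a security metric (using the plain statsd UDP wire format) and logged as a warning. The metric names, statsd address, and validation rule are all illustrative:

require 'socket'
require 'logger'

STATSD = UDPSocket.new
LOGGER = Logger.new($stdout)

def security_event(name, detail)
  # statsd counter wire format: "<metric>:<value>|c"
  STATSD.send("myapp.security.#{name}:1|c", 0, 'localhost', 8125)
  LOGGER.warn("SECURITY #{name}: #{detail}")
end

def update_email(user_id, email)
  unless email =~ /\A[^@\s]+@[^@\s]+\z/
    # The client already enforces this format, so reaching this branch
    # means the UI was bypassed -- likely with an intercepting proxy.
    security_event('validation_bypass', "user=#{user_id} field=email")
    raise ArgumentError, 'invalid email address'
  end
  # ... persist the change ...
end

begin
  update_email('u12345', "x' OR '1'='1")
rescue ArgumentError
  # respond to the client with a generic error
end

Graphing a counter like this over time also gives you the baseline: a handful of events a day may be noise, while a sudden spike is worth waking someone up for.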

Alerting and Analytics Tools

There are some cool alerting frameworks and toolsets available to help you implement real-time application operations and security alerting.

If you are using the Elasticsearch-Logstash-Kibana (ELK) stack for monitoring, you might want to look at using Yelp’s ElastAlert, which is built on top of Elasticsearch.

Etsy’s 411 alert management system is also based on ELK, and helps with setting up alert rules and handling alert workflows.

Airbnb has open sourced StreamAlert, a comprehensive serverless real-time data analysis framework.

You can use this information not only to catch attackers—hopefully before they get too far—but also to help you to prioritize your security work. Zane Lackey of Signal Sciences, formerly the head of Etsy’s security engineering team, calls this attack-driven defense. Watch what attackers are doing, learn how they work, and identify the attacks that they are trying. Then make sure that you protect your system against these attacks. A vulnerability found by a code scanner or in a threat modeling review might be important. But a vulnerability that you can see attackers actively trying to exploit in production is critical. This is where you need to immediately focus your testing and patching efforts.

Detection systems for compromise should be set up on isolated networks and devices; otherwise, the compromise of a machine can enable attackers to disable the detection systems themselves (although a good security system will notice a machine that fails to check in). Network taps and parallel monitoring and detection tools work particularly well in this regard.

It is also worth recognizing ahead of time that the use of SSL/TLS within your environment increases the cost and challenge not only for adversaries intercepting traffic, but also for your security and network operations teams who are trying to do the same, albeit with different intent. If you are switching from unencrypted internal network links to encrypted ones, enter the new architecture with an idea of the places in which you will require monitoring, and design accordingly.

To protect against insider attacks, make sure that monitoring tools and services are administered separately from your production systems, to prevent local system administrators from being able to tamper with security devices or policies, or disable logging. You should also pay special attention to any staff who can or do rotate between security and systems administration teams.

Auditing and Logging

Auditing and logging are fundamental to security monitoring, and are important for compliance, for fraud analysis, for forensics, and for nonrepudiation and attack attribution: defending against claims when a user denies responsibility for an action, or helping to make a case against an attacker.

Auditing is about maintaining a record of activity by system principals: everything that they did, when they did it, in what order. Audit records are intended for compliance and security analysts and fraud investigators. These records must be complete in order for them to be useful as evidence, so auditing code needs to be reviewed for gaps in coverage, and audit records need to be sequenced to detect when records are missing.
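
One simple way to do this is to chain records together cryptographically, as in this Ruby sketch: each record carries a sequence number and an HMAC computed over the previous record’s MAC plus the current record, so a missing or altered record breaks the chain. Key management is deliberately omitted here:

require 'openssl'
require 'json'
require 'time'

class AuditLog
  def initialize(io, key)
    @io, @key, @seq, @prev_mac = io, key, 0, ''
  end

  def record(principal, action)
    @seq += 1
    entry = { seq: @seq, at: Time.now.utc.iso8601, who: principal, what: action }
    mac = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), @key,
                                  @prev_mac + entry.to_json)
    @prev_mac = mac
    @io.puts(entry.merge(mac: mac).to_json)
  end
end

audit = AuditLog.new($stdout, 'key-from-your-secrets-store')
audit.record('ops-admin', 'DISABLE_FIREWALL_RULE id=443-inbound')
audit.record('ops-admin', 'ENABLE_FIREWALL_RULE id=443-inbound')

A verifier can walk the records, recompute each MAC, and flag any gap in the sequence or break in the chain.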

When designing an auditing system, you will have to carefully balance requirements to record enough information to provide a clear and complete audit trail against transactional overheads and the risk of overloading the backend auditing and analytics systems, especially if you are feeding this information into a SIEM or other security analytics platform.

But the guiding principle should be: as long as you can afford it, audit everything that a user does, every command or transaction, and especially every action taken by administrators and operations. With the cost of storage on a seemingly never-ending decline, the decision on how much audit data to keep can be based on privacy and data retention regulations, rather than budget.

In distributed systems, such as microservices architectures, audit logs should ideally be pooled from as many systems as possible to make it easy for auditors to join information together into a single coherent story.

Application logging is a more general-purpose tool. For operations, logging helps in tracking what is happening in a system, and identifying errors and exceptions. As a developer, logging is your friend in diagnosing and debugging problems, and your fallback when something fails, or when somebody asks, “How did this happen?” Logs can also be mined for business analytics information, and to drive alerting systems. And for security, logs are critical for event monitoring, incident response, and forensics.

One important decision to make in designing your logging system is whether the logs should be written first for people to read them, or for programs to parse them. Do you want to make log records friendlier and more verbose for a human reader, or more structured and efficient so that they can be more easily interpreted by a tool to drive alerting functions and IDS tools or dashboards?

With either approach, ensure that logging is done consistently within the system, and if possible, across all systems. Every record should have common header information, clearly identifying the following:

  • Who (userid and source IP, request ID).

  • Where (node, service ID, and version—version information is especially important when you are updating code frequently).

  • What type of event (INFO, WARN, ERROR, SECURITY).

  • When (in a synchronized time format).

Teams should agree on a logging framework that takes care of this by default, such as Apache Logging Services, which provides extensible logging libraries for Java, .NET, C++, and PHP.
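
As an illustration, here is a Ruby sketch of a log formatter that stamps every record with this common header as structured JSON; the field names and the request-context mechanism are illustrative rather than any standard:

require 'logger'
require 'json'
require 'time'
require 'socket'

VERSION = '4.2.1'  # stamped in at build time
NODE    = Socket.gethostname

logger = Logger.new($stdout)
logger.formatter = proc do |severity, time, _progname, message|
  {
    when:    time.utc.iso8601(3),
    where:   { node: NODE, service: 'payments', version: VERSION },
    type:    severity,  # INFO, WARN, ERROR, SECURITY...
    who:     Thread.current[:request_ctx],
    message: message
  }.to_json + "\n"
end

# Set once per request by middleware, then picked up by every log call
Thread.current[:request_ctx] = { user: 'u12345', source_ip: '203.0.113.9',
                                 request_id: 'req-8f3a' }
logger.info('payment authorized')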

As we outlined in Chapter 5, some regulations such as PCI DSS dictate what information must be, and should not be, logged, and how long certain logs need to be retained.

At a minimum, be sure to log:

  • System and service startup and shutdown events.

  • All calls to security functions (authentication, session management, access control, and crypto functions), and any errors or exceptions encountered in these functions.

  • Data validation errors at the server.

  • Runtime exceptions and failures.

  • Log management events (including rotation), and any errors or exceptions in logging.

Be careful with secrets and sensitive data. Passwords, authentication tokens, and session IDs should never be written to logs. If you put personally identifiable information (PII) or other sensitive data into your logs, you may be forcing your log files and log backup system into scope for data protection under regulations such as PCI DSS, GLBA, or HIPAA.
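
A simple defense is to scrub known-sensitive fields before a record is written, as in this Ruby sketch; the deny-list is illustrative, and yours should be derived from the regulations and data classifications that apply to you:

SENSITIVE = /\A(password|token|secret|session_id|card_number|ssn)\z/i

def scrub(payload)
  payload.map { |k, v| [k, k.to_s =~ SENSITIVE ? '[REDACTED]' : v] }.to_h
end

scrub(user: 'u12345', action: 'login', password: 'hunter2')
# => {:user=>"u12345", :action=>"login", :password=>"[REDACTED]"}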

Tip

For more information about how and what to log, and what not to log, from an application security perspective, check out OWASP’s Logging Cheat Sheet. Also make sure to read “How to do Application Logging Right” by Dr. Anton Chuvakin and Gunnar Peterson.

Another important consideration is log management.

Logging should be done to a secure central logging system so that if any individual host is compromised, the logs can’t be destroyed or tampered with, and so that attackers can’t read the logs to identify weaknesses or gaps in your monitoring controls. You can consolidate logs from different applications and nodes using a log shipper like rsyslog, Logstash, or Fluentd, or cloud-based logging services like Loggly or Papertrail.

Log rotation and retention have to be done properly to ensure that logs are kept for long enough to be useful for operations and for security forensic investigations, and to meet compliance requirements: months or even years, not hours or days.

Proactive Versus Reactive Detection

You won’t be able to find all security problems in near real time using immediate runtime detection or online analytics (despite what many vendors may claim!). Finding a needle in a haystack of background noise often requires sifting through big data to establish correlations and discover connections, especially in problems like fraud analysis, where you need to go back in time and build up models of behavior that account for multiple different variables.

When thinking about detection, many people only consider the ability to identify and respond to events as they happen, or how to catch and block a specific attack payload at runtime. But when you see a problem, how can you tell if it is an isolated incident, or something that has been going on for a while before you finally noticed it?

Many security breaches aren’t identified until several weeks or months after the system was actually compromised. Will your monitoring and detection systems help you to look back and find what happened, when it happened, who did it, and what you need to fix?

You also need to make cost and time trade-off decisions in monitoring. Flooding ops or security analysts with alerts in order to try to detect problems immediately isn’t effective or efficient. Taking extra time to sum up and filter out events can save people wasted time and prevent, or at least reduce, alert fatigue.

Catching Mistakes at Runtime

The more changes that you make to systems, the more chances that there are to make mistakes. This is why Agile developers write unit tests: to create a safety net to catch mistakes. Agile teams, and especially DevOps teams that want to deploy changes to test and production more often, will need to do the same thing for operations, by writing and running runtime checks to ensure that the application is always deployed correctly, configured correctly, and is running safely.

Netflix and its famous “Simian Army” show how this can be done. This is a set of tools which run automatically and continuously in production to make sure that their systems are always configured and running correctly:

  • Security Monkey automatically checks for insecure policies, and records the history of policy changes.

  • Conformity Monkey automatically checks configuration of a runtime instance against pre-defined rules and alerts the owner (and security team) of any violations.

  • Chaos Monkey and her bigger sisters Chaos Gorilla and Chaos Kong are famous for randomly injecting failures into production environments to make sure that the system recovers correctly.

Netflix’s monkeys were originally designed to work on Amazon AWS, and some of them (Chaos Monkey, and more recently, Security Monkey) run on other platforms today. You can extend them by implementing your own rules, or you can use the same ideas to create your own set of runtime checkers, using an automated test framework like Gauntlt, which we looked at in Chapter 12, or InSpec, or even with simpler test frameworks like JUnit or BATS.
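
As a trivial example of a Conformity Monkey-style check written in plain Ruby, the following asserts that root is the only UID 0 account on a host (what to do when a check like this fails is discussed below):

require 'etc'

suspicious = []
Etc.passwd { |u| suspicious << u.name if u.uid.zero? && u.name != 'root' }

unless suspicious.empty?
  warn "SECURITY: unexpected UID 0 accounts: #{suspicious.join(', ')}"
  exit 1
end
puts 'UID 0 check passed'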

These tests and checks can become part of your continuous compliance program, proving to auditors that your security policies are continuously being enforced. Besides preventing honest mistakes, they can also help to catch attacks in progress, or even discourage attackers.

An often overlooked benefit of having an environment that is in a constant and unpredictable state of flux (often referred to as embracing chaos engineering) is that it presents a moving target to an adversary, making it harder to enumerate the operational environment, as well as making it harder for the adversary to gain a persistent foothold. Neither is made impossible, but the cost of the attack is increased.

When writing these checks, you need to decide what to do when an assert fails:

  • Isolate the box so that you can investigate.

  • Immediately terminate the service or system instance.

  • Try to self-repair, if the steps to do so are clear and straightforward: enabling logging or a firewall rule, or disabling an unsafe default.

  • Alert the development team, operations, or security.

Safe SSL/TLS

Setting up SSL/TLS correctly is something that most people who build or operate systems today need to understand—and something that almost everybody gets wrong.

Ivan Ristic at Qualys SSL Labs provides detailed and up-to-date guidelines on how to correctly configure SSL/TLS.

SSL Labs also provides a free service that you can use to check the configuration of your website against these guidelines, and that gives you a report card. A big fat red F will force people to pick up the guidelines and figure out how to do things properly.

Making sure that SSL is set up correctly should be a standard part of your tests and operational checks. This is why both Gauntlt and BDD-Security include tests using sslyze, a standalone security tool. This is something that you need to do continuously: the cryptography underpinning SSL and TLS is under constant attack, and what got you an A last week might be only a D today.
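
As a small example of a continuous check using only Ruby’s standard library, this sketch connects to a site and fails if the negotiated protocol is older than TLS 1.2. It spot-checks a single property, while tools like sslyze and the SSL Labs service test far more, so treat it as a supplement rather than a replacement:

require 'socket'
require 'openssl'

host = 'www.example.com'

ctx = OpenSSL::SSL::SSLContext.new
ctx.verify_mode = OpenSSL::SSL::VERIFY_PEER
ctx.cert_store = OpenSSL::X509::Store.new.tap(&:set_default_paths)

tcp = TCPSocket.new(host, 443)
ssl = OpenSSL::SSL::SSLSocket.new(tcp, ctx)
ssl.hostname = host  # SNI, needed for most shared hosts
ssl.connect
ssl.post_connection_check(host)  # verify the certificate matches the host

version = ssl.ssl_version  # e.g., "TLSv1.2" or "TLSv1.3"
cipher  = ssl.cipher.first
ssl.close
tcp.close

abort "FAIL: #{host} negotiated #{version}" unless %w(TLSv1.2 TLSv1.3).include?(version)
puts "OK: #{host} negotiated #{version} with #{cipher}"

Run in the pipeline or on a schedule, a check like this turns “what got you an A last week” into something you will notice on the day it slips.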

Runtime Defense

In addition to monitoring for attacks and checking to make sure that the system is configured safely, you may want to—or need to—add additional protection at runtime.

Traditional network IPS solutions and signature-based Web Application Firewalls (WAFs) that sit somewhere in front of your application aren’t designed to keep up with rapid application and technology changes in Agile and DevOps. This is especially true for systems running in the cloud, where there is no clear network perimeter to firewall off, and where developers may be continuously deploying changes to hundreds or thousands of ephemeral runtime instances across public, private, and hybrid cloud environments.

Cloud Security Protection

Recognizing the risks that come with these newer operational approaches, a number of security startups offer runtime protection services for applications in the cloud, including automated attack analysis, centralized account management and policy enforcement, continuous file integrity monitoring and intrusion detection, automated vulnerability scanning, and micro-segmentation.

These services hook into cloud platform APIs and leverage their own analytics and threat detection capabilities to replace yesterday’s enterprise security “black boxes.”

Other startups like Signal Sciences now offer smarter next-generation web application firewalls (NGWAFs) that can be deployed with cloud apps. They use transparent anomaly detection, language dissection, continuous attack data analysis, and machine learning, instead of signature-based rules, to identify and block attack payloads.

A word of caution for all things machine learning! Big data and machine learning are the hottest topics in tech right now, and they are irresistible for vendors to include in their products, with wild proclamations that they are the silver bullet security has been waiting for. As with all vendor claims, approach them with skepticism: ask exactly which machine learning techniques are being employed, and for data to back up their relevance to the problem at hand, rather than just having another buzzword bingo box ticked. False positives can often be a killer, and the tuning periods required to train the models to the actual production environment have undermined more than one deployment of the next great security-panacea-in-a-box.

RASP

Another kind of defensive technology is runtime application self-protection (RASP), which instruments the application runtime environment to catch security problems as they occur. Like application firewalls, RASP can automatically identify and block attacks as they happen. And like application firewalls, you can use RASP to protect legacy apps or third-party apps for which you don’t have source code.

But unlike a firewall, RASP is not a perimeter-based defense. RASP directly instruments and monitors your application runtime code, and can identify and block attacks directly at the point of execution. RASP tools have visibility into application code and runtime context, and examine data variables and statements using data flow, control flow, and lexical analysis techniques to detect attacks while the code is running. They don’t just tell you about problems that could happen—they catch problems as they happen. This means that RASP tools generally have a much lower false positive (and false negative) rate than application firewalls or static code analysis tools.

Most of these tools can catch common attacks like SQL injection and some kinds of XSS, and include protection against specific vulnerabilities like Heartbleed using signature-based rules. You can use RASP as a runtime defense solution in blocking mode, or to automatically inject logging and auditing into legacy code, and provide insight into the running application and attacks against it.
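
To make the idea concrete, here is a toy illustration in Ruby, emphatically not a real RASP product: the database driver is wrapped at its point of execution, and each statement is inspected against a deliberately crude detection rule before it runs:

module SqlGuard
  TAUTOLOGY = /('|")\s*or\s+\d+\s*=\s*\d+/i

  def execute(sql, *args)
    if sql =~ TAUTOLOGY
      warn "RASP: blocked suspicious statement: #{sql.inspect}"
      raise SecurityError, 'statement blocked by runtime policy'
    end
    super
  end
end

class Database
  prepend SqlGuard

  def execute(sql, *args)
    puts "executing: #{sql}"
  end
end

db = Database.new
db.execute("SELECT * FROM users WHERE name = 'bob'")  # allowed
begin
  db.execute("SELECT * FROM users WHERE name = '' OR 1=1 --")  # blocked
rescue SecurityError => e
  puts "request rejected: #{e.message}"
end

Because the check runs with full knowledge of the actual statement being executed, it avoids much of the guesswork a perimeter firewall has to do; real RASP products apply far more sophisticated analysis at this same vantage point.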

There are only a handful of RASP solutions available today, mostly from small startups, and they have been limited to applications that run in the Java JVM and .NET CLR, although support for other platforms like Node.js, PHP, Python, and Ruby is now starting to emerge.

RASP can be a hard sell to a development team—and an even harder sell to the operations team. You have to convince team members to place their trust in the solution’s ability to accurately find and block attacks, and to accept the runtime overhead that it imposes. RASP also introduces new points of failure and operational challenges: who is responsible for setting it up, making sure that it is working correctly, and dealing with the results?

OWASP AppSensor: Roll Your Own RASP

OWASP’s AppSensor project provides a set of patterns and sample Java code to help you to implement application-layer intrusion detection and response directly inside your app.

AppSensor maps out common detection points in web applications: entry points in the code where you can add checks for conditions that should not happen under normal operations. Then it defines options for how to deal with these exceptions: when to log information, when you should consider blocking an attack, and how to do it. It shows you how to automatically detect—and protect against—many common attacks.

But RASP could provide a compelling “quick fix” for teams under pressure, especially teams trying to support insecure legacy or third-party systems, or that are using Web Application Firewalls to protect their apps and are fighting with the limitations of this technology.

Incident Response: Preparing for Breaches

Even if you do everything already discussed, you still need to be prepared to deal with a system failure or security breach if and when it happens. Create playbooks, call trees, and escalation ladders so that people know who has to be involved. Map out scenarios in advance so that when bad things happen, and everyone is under pressure, they know what to do and how to do it without panicking.

Security Breaches and Outages: Same, but Different

Operations teams and security incident response teams both need to act quickly and effectively when something goes wrong. They need to know how and when to escalate, and how to communicate with stakeholders. And they need to learn from what happened so that they can do a better job next time.

But their priorities are different. When you are dealing with an operational outage or serious performance problem, the team’s goal is to restore service as quickly as possible. Security incident response teams need to make sure that they understand the scope of the attack and contain its impact before they recover—and get snapshot runtime images and collect logs for forensic analysis before making changes.

Operational and security concerns can cross over, for example, in the case of a DDoS attack. The organization needs to decide which is more important, or more urgent: restoring service, or finding the source of the attack and containing it.

Get Your Exercise: Game Days and Red Teaming

Planning and writing playbooks isn’t enough. The only way to have confidence that you can successfully respond to an incident is to practice doing it.

Game Days

Amazon, Google, Etsy, and other online businesses regularly run game days, where they work through real-life, large-scale failures, such as shutting down an entire data center, to make sure that their failover procedures will work correctly—and that they are prepared for exceptions that could come up.

These exercises can involve (at Google, for example) hundreds of engineers working around the clock for several days, to test out disaster recovery cases and to assess how stress and exhaustion could impact the organization’s ability to deal with real accidents.

At Etsy, game days are run in production, even involving core functions such as payments handling. Of course, this raises the question, “Why not simulate this in a QA or staging environment?” Etsy’s response is, first, that any differences between those environments and production bring uncertainty to the exercise; and second, that a failure to recover carries no consequences during testing, which can let hidden assumptions creep into the fault tolerance design and into recovery. The goal is to reduce uncertainty, not increase it.

These exercises are carefully planned and tested in advance. The team brainstorms failure scenarios and prepares for them, running through failures first in test, and fixing any problems that come up. Then, it’s time to execute scenarios in production, with developers and operators watching closely, and ready to jump in and recover, especially if something goes unexpectedly wrong.

Less grand approaches to game days can also be pursued on a regular basis, with scenarios like those shared by the Tabletop Scenarios Twitter stream being a great source of “what if” conversations to promote brainstorming among team members over coffee.

Red Team/Blue Team

You can take many of the ideas from game days, which are intended to test the resilience of the system and the readiness of the DevOps team to handle system failures, and apply them to security attack scenarios through Red Teaming.

Organizations like Microsoft, Intuit, Salesforce, and several big banks have standing Red Teams that continuously attack their live production systems. Other organizations periodically run unannounced Red Team exercises, using internal teams or consultants, to test their operations teams and security defenses, as we outlined in Chapter 12.

Red Teaming is based on military Capture the Flag exercises. The Red Team—a small group of attackers—tries to break into the system (without breaking the system), while a Blue Team (developers and operations and security engineers) tries to catch them and stop them. Red Team exercises try to follow real-world attack examples so that the organization can learn how attacks actually happen and how to deal with them.

Some of these exercises may only run for a few hours, while others can go on for days or weeks to simulate an advanced, persistent threat attack.

The Red Team’s success is measured by how many serious problems it finds, how fast it can exploit them (mean time to exploit), and how long it can stay undetected.

The Blue Team may know that an attack is scheduled and what systems will be targeted, but it won’t know the details of the attack scenarios. Blue Teams are measured by mean time to detect (MTTD) and mean time to respond (MTTR): how fast they detect and identify the attack, and how quickly they stop it, or contain and recover from it.

Red Teaming gives you the chance to see how your system and your ops team behave and respond when under attack. You learn what attacks look like, train your team to recognize and respond to them, and, by exercising regularly, get better and faster at doing this. And you get a chance to change how people think, so that attacks stop being abstract and hypothetical and become tangible and immediate.

Over time, as the Blue Team gains experience and improves, as it learns to identify and defend against attacks, the Red Team will be forced to work harder, to look deeper for problems, to be more subtle and creative. As this competition escalates, your system—and your security capability—will get stronger.

Intuit, for example, runs Red Team exercises the first day of every week (they call this Red Team Mondays). The Red Team identifies target systems and builds up its attack plans throughout the week, and publishes its targets internally each Friday. The Blue Teams for those systems will often work over the weekend to prepare, and to find and fix vulnerabilities on their own, to make the Red Team’s job harder. After the Red Team Monday exercises are over, the teams get together to debrief, review the results, and build action plans. And then it starts again.

Most organizations won’t be able to build this kind of capability in-house. It takes a serious commitment and serious skills. As we discussed in Chapter 12, you may need to bring in outside help to run Red Team exercises.

Blameless Postmortems: Learning from Security Failures

Game days and Red Team exercises are important learning opportunities. But it’s even more important to learn as much as you can when something actually goes wrong in production, when you have an operational failure or security breach. When this happens, bring the team together and walk through a postmortem exercise to understand what happened, why, and how to prevent problems like this from happening again.

Postmortems build on some of the ideas of Agile retrospectives, where the team meets to look at what it’s done, what went well, and how it can get better. In a postmortem, the team starts by going over the facts of an event: what happened, when it happened, how people reacted, and then what happened next. Included are dates and times, information available at the time, decisions that were made based on that information, and the results of those decisions.

By focusing calmly and objectively on understanding the facts and on the problems that came up, the team members can learn more about the system and about themselves and how they work, and they can begin to understand what they need to change.

Note

There are several sources of information that you can use to build up a picture of what happened, why, and when, in your postmortem analysis. This includes logs, emails, records of chat activity, and bug reports. Etsy has open-sourced Morgue, an online postmortem analysis tool which has plug-ins to pull information from sources like IRC and Jira, as well as logs and monitor snapshots, to help create postmortem reports.

Facts are concrete, understandable, and safe. Once the facts are on the table, the team can start to ask why errors happened and why people made the decisions that they made, and then explore alternatives and find better ways of working:

  • How can we improve the design of the system to make it safer or simpler?

  • What problems can we catch in testing or reviews?

  • How can we help people to identify problems earlier?

  • How can we make it easier for people to respond to problems, to simplify decision making, and reduce stress—through better information and tools, or training or playbooks?

For this to work, the people involved need to be convinced that the real goals of the postmortem review are to learn—and not to find a scapegoat to fire. They need to feel safe to share information, be honest and truthful and transparent, and to think critically without being criticized or blamed. We’ll talk more about how to create a blameless and trusting working environment, and look more at issues around trust and learning in postmortems, in Chapter 15, Security Culture.

Operations failures are in many ways easier to deal with than security breaches. They are more visible, and it is easier to understand them and to come up with solutions. Engineers can see the chain of events, or they can at least see where they have gaps and where they can improve procedures or the system itself.

Security breaches can take longer to understand, and the chain of causality is not always that clear. You may need to involve forensics analysts to sift through logs and fill in the story enough for the team to understand what the vulnerabilities were and how they were exploited, and there is often a significant time lag between when the breach occurred and when it was detected. Skilled attackers will often try to destroy or tamper with evidence needed to reconstruct the breach, which is why it is so important to protect and archive your logs.

Securing Your Build Pipeline

Automating your build, integration, and testing processes into a pipeline that pulls from your repositories on every commit is a key milestone in the Agile life cycle. It does, however, come with a new set of responsibilities—you now have to secure the thing!

Given the tasks you delegate to your build pipeline and the privileges it will inevitably require, it rapidly becomes a critical piece of infrastructure and therefore a very attractive target of attack.

This attractiveness increases again if you are fully embracing continuous deployment, where each change is automatically deployed into production after it passes through testing.

Build and deployment automation effectively extends the attack surface of your production system to now include your build environment and toolchains. If your repositories, build servers, or configuration management systems are compromised, things get very serious, very quickly.

If such a compromise provides read access, then data, source code, and secrets such as passwords and API keys are all open to being stolen. If the compromise provides write or execution privileges, then the gloves really come off, with the possibility for backdooring applications, injection of malware, redirection or interception of production traffic, or the destruction of your production systems.

Even if a test system is compromised, it could provide an attacker with enough of a path back into the automated pipeline to cause damage.

If you lose control of the build pipeline itself, you also lose your capability to respond to an attack, preventing you from pushing out patches or hot fixes.

In addition to protecting your pipeline from outside attackers, you need to protect it from being compromised by insiders, by ensuring that all changes are fully authorized and transparent and traceable from end to end so that a malicious and informed insider cannot bypass controls and make changes without being detected.

Do a threat model of your build pipeline. Look for weaknesses in the setup and controls, and gaps in auditing or logging. Then, take these steps to secure your configuration management environment and build pipeline:

  1. Harden the runtime for your configuration management, build, and test environments.

  2. Understand and control what steps are done in the cloud.

  3. Harden your build, continuous integration, and continuous delivery toolchains.

  4. Lock down access to your configuration management tools.

  5. Protect keys and secrets.

  6. Lock down source and binary repos.

  7. Secure chat platforms (especially if you are following ChatOps).

  8. Regularly review logs for your configuration management, build, and test environments.

  9. Use Phoenix Servers for build slaves and testing—create these environments from scratch each time you need them, and tear them down when you are done.

  10. Monitor your build pipelines the same way that you do production.

That’s a lot. Let’s explore each of these steps in more detail.

Harden Your Build Infrastructure

Harden the systems that host your source and build artifact repositories, the continuous integration and continuous delivery server(s), and the systems that host configuration management, build, deployment, and release tools. Treat them the same way that you treat your most sensitive production systems.

Review firewall rules and network segmentation to ensure that these systems, as well as your development and test systems, are not accidentally exposed to the public internet, and that development and production environments are strictly separated. Take advantage of containers and virtual machines to provide additional runtime isolation.
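
One way to catch accidental exposure is to check your build hosts from the outside, on a schedule. Here is a minimal sketch, assuming hypothetical host names and allowed-port lists; a real scan would use a tool like nmap and cover the full port range:

    import socket

    # Placeholder hosts and the ports they are supposed to expose.
    EXPECTED = {
        "ci.internal.example.com": {22, 443},
        "repo.internal.example.com": {22, 443},
    }
    COMMON_PORTS = [21, 22, 23, 25, 80, 443, 3306, 5432, 8080, 8443]

    for host, allowed in EXPECTED.items():
        for port in COMMON_PORTS:
            try:
                # A successful TCP connect means the port is reachable.
                with socket.create_connection((host, port), timeout=1):
                    reachable = True
            except OSError:
                reachable = False
            if reachable and port not in allowed:
                print(f"ALERT: unexpected open port {port} on {host}")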

Understand What’s in the Cloud

Scaling build and testing using cloud services is easy and attractive. Ensure that you clearly understand—and control—what parts of your build and test pipelines are done on-premises and what is performed in the cloud. The use of services in the cloud has many advantages, but can introduce trust issues and significantly expand the attack surface that you need to manage.

Holding your code repos in the cloud using GitHub, GitLab, or Bitbucket makes perfect sense for open source projects (of course), and for startups and small teams. These services provide a lot of value for little or no investment. But they are also juicy targets for attackers, increasing the risk that your code, and whatever you store in your code (like secrets), could be compromised in a targeted attack, or scooped up in a widespread breach.

Cloud-based code repository managers like GitHub are under constant threat, because attackers know where to look and what to look for. They know that developers aren’t always as careful as they should be about using strong passwords, that they often resist multifactor authentication and other protections, and that they sometimes mistakenly store proprietary code in public repos. These mistakes, and coding or operational mistakes by the service providers, have led to a number of high-profile breaches over the past few years.

If your developers are going to store proprietary code in the cloud, take appropriate steps to protect the code:

  1. Ensure that developers use strong authentication (including multifactor authentication).

  2. Carefully check that private repos are, in fact, private (see the sketch after this list).

  3. Monitor your GitHub repos using a tool like GitMonitor.

  4. Regularly scan or review code to make sure that it does not contain credentials, before it gets committed.
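
As a starting point for steps 1 and 2, here is a minimal sketch using GitHub’s REST API. The organization name is a placeholder, the token must belong to an organization owner for the 2FA query to work, and a real audit would page through the full result set:

    import os
    import requests

    ORG = "example-org"  # placeholder organization name
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

    # Step 2: flag any repo in the organization that is not private.
    repos = requests.get(f"https://api.github.com/orgs/{ORG}/repos",
                         headers=headers,
                         params={"type": "all", "per_page": 100}).json()
    for repo in repos:
        if not repo["private"]:
            print(f"PUBLIC repo: {repo['full_name']}")

    # Step 1: GitHub can filter the member list to those without 2FA.
    no_2fa = requests.get(f"https://api.github.com/orgs/{ORG}/members",
                          headers=headers,
                          params={"filter": "2fa_disabled"}).json()
    for member in no_2fa:
        print(f"No 2FA: {member['login']}")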

The potential impact here extends beyond your organization’s repositories and into the personal repositories of your developers. The propensity of developers to share their dotfiles with the world has been the initial entry point for many an internet-to-internal-network compromise.

If you are using hosted continuous integration and build services like Travis CI or Codeship, check to make sure that access is set up correctly, and make sure that you understand their security and privacy policies.

Finally, as with any SaaS solution, it is imperative that you control access to these systems tightly and revoke access permissions when people switch roles or leave the company. If your investment in cloud-based services is nontrivial, consider using a single sign-on (SSO) solution to control access to those applications through a single identity. Maintaining multiple, distinct accounts across different systems is an invitation for things to slip through the cracks during off-boarding. And given the global accessibility (by design) of cloud-based applications, multifactor authentication (MFA) to protect against the risks of credential theft becomes a baseline requirement.

Harden Your CI/CD Tools

Harden your automated build toolchain. Most of these tools are designed to be easy for developers to get set up and running quickly, which means that they are not secure by default—and it may be difficult to make them safe at all.

Jenkins, one of the most popular automated build tools, is a good example. Although the latest version includes more security capabilities, most of them are turned off out of the box, including basic authentication and access control, as the website explains:1

In the default configuration, Jenkins does not perform any security checks. This means the ability of Jenkins to launch processes and access local files are available to anyone who can access Jenkins web UI and some more.

Take time to understand the tool’s authorization model and what you can do to lock down access. Set up trust relationships between build masters and servers, and enable whatever other security protection is available. Then logically separate build pipelines for different teams so that if one is compromised, all of them won’t be breached.
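
A quick smoke test can tell you whether a Jenkins server is wide open: if an unauthenticated request to its JSON API succeeds, then anyone who can reach the server can use it. A minimal sketch, with a placeholder server URL:

    import requests

    JENKINS_URL = "https://ci.internal.example.com"  # placeholder

    # No credentials are supplied, so a 200 here means anonymous access works.
    resp = requests.get(f"{JENKINS_URL}/api/json", timeout=5)
    if resp.status_code == 200:
        print("ALERT: Jenkins answers anonymous API requests; lock it down")
    elif resp.status_code in (401, 403):
        print("OK: Jenkins requires authentication")
    else:
        print(f"Unexpected response: {resp.status_code}")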

After you finish locking down access to the tools and enabling security controls, you also need to keep up with fixes and updates to the tools and any required plug-ins. You can’t always rely on security advisories for these tools, even the most popular ones, so watch for new patches and install them when they are made available. But always test first to make sure that the patch is stable and that you don’t have dependency problems: build chains can become highly customized and fragile over time.

Continuous Integration Tools Are a Hacker’s Best Friend

“Running a poorly configured CI tool is like providing a ready-to-use botnet for anyone to exploit.”

Check out security researcher Nikhil Mittal’s presentation from Black Hat Europe 2015, where he walks through serious vulnerabilities that he found in different CI/CD tools, including Jenkins, Go, and TeamCity.2

Lock Down Configuration Managers

If you are using configuration management tools like Chef or Puppet, you must lock them down. Anyone with access to these tools can add accounts, change file permissions or auditing policies, install compromised software, and alter firewall rules. Absent controls, it’s like granting someone root on all boxes under configuration control. Who needs the root password when someone else will kindly type in all the commands for you?

Configure the tools safely, and restrict access to only a small, trusted group of people—and audit everything that they do. For an example, see the article from Learn Chef Rally, “How to be a secure Chef”.

Protect Keys and Secrets

A continuous delivery pipeline needs keys and other credentials to automatically provision servers and deploy code. And the system itself needs credentials in order to start up and run. Make sure that these secrets aren’t hardcoded in scripts or plain-text config files or in code. We’ll look at how to manage secrets safely in the next section.

Lock Down Repos

Lock down source and binary repos, and audit access to them. Prevent unauthenticated or anonymous or shared user access to repos, and implement access control rules.

Source repos hold your application code, your tests and sample test data, your configuration recipes, and if you are not careful, credentials and other things that you don’t want to share with attackers. Read access to source repos should be limited to people on the team, and to other teams who work with them.

Anybody with write access to your code repos can check in a malicious change. Make sure that this access is controlled and that check-ins are continuously monitored.

Binary repos hold a cache of third-party libraries downloaded from public repos or vendors, and the latest builds of your code in the pipeline. Anybody with write access to these repos can inject malcode into your build environment—and eventually into production.

You should verify signatures for any third-party components that you cache for internal use, both at the time of download, and when the components are bundled into the build. And change the build steps to sign binaries and other build artifacts to prevent tampering.
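
Full verification means checking the publisher’s GPG signature, but the minimal version of the idea is to pin each cached component to a known-good checksum and verify it at both download and build time. A sketch, with a hypothetical artifact name and digest:

    import hashlib
    import sys

    # Digests recorded when each component was first vetted and cached.
    # The artifact name and digest below are placeholders.
    PINNED = {
        "libwidget-1.4.2.tar.gz":
            "9b74c9897bac770ffc029102a200c5de1fddc2cb1e9e3a9e9e1b4f1f5f0f3c2a",
    }

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    for name, digest in PINNED.items():
        if sha256_of(name) != digest:
            sys.exit(f"FATAL: checksum mismatch for {name}: possible tampering")
        print(f"{name}: checksum OK")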

Secure Chat

If you are using collaborative ChatOps tools like Slack or Mattermost and GitHub’s Hubot to help automate build, test, release, and deployment functions, you could be opening up another set of security risks and issues.

Collaborative chat tools like Slack and HipChat provide a simple, natural way for development and operations teams and other people to share information. Chatbots can automatically monitor and report the status of a system, or track build and deployment pipelines, and post information back to message rooms or channels. They can also be used to automatically set up and manage build, test, release, and deployment tasks and other operations functions, through simple commands entered into chat conversations.

Which rooms or channels are public, and which are private? Who has access to these channels? Are any of them open to customers, to other third parties, or to the public? What information is available in them?

What are bots set up to do? Who has access to them? Where do the bots run? Where are the scripts?
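
For Slack, the first set of questions can be answered mechanically. Here is a minimal sketch using Slack’s official Python SDK (slack_sdk); it assumes a bot token with the channels:read and groups:read scopes, and a real audit would page through the full channel list:

    import os
    from slack_sdk import WebClient

    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    # List every channel the bot can see and flag the risky ones.
    resp = client.conversations_list(types="public_channel,private_channel")
    for ch in resp["channels"]:
        visibility = "private" if ch["is_private"] else "PUBLIC"
        shared = "  <-- shared with external orgs" if ch.get("is_ext_shared") else ""
        print(f"#{ch['name']}: {visibility}{shared}")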

You’ll need to take appropriate steps to secure and lock down the chain of tools involved: the chat tool or chat platform, the bots, and the scripts and plug-ins that team members want to use. Treat chat automation scripts like other operations code: make sure that they are reviewed and checked in to a repo.

Many of these collaboration tools are now in the cloud, which means that information that team members post is hosted by someone outside of your organization. You must review the chat service provider’s security controls, and security and privacy policies. Make sure that you track the provider’s security announcements and that you are prepared to deal with outages and breaches.

And make sure that teams understand what information should and should not be posted to message rooms or channels. Sharing passwords or other confidential information in messages should not be allowed.

Control and audit access to chat tools, including using MFA if it is supported. If your team is big enough, you may also need to set up permission schemes for working with bots, using something like hubot-auth.

Remember that bots need credentials—to the chat system and to whatever other systems or tools that they interact with. These secrets need to be protected, as we will see later in this chapter.

Review the Logs

Your build system logs need to be part of the same operations workflows as your production systems. Periodically review the logs for the tools involved to ensure that they are complete and that you can trace a change through from check-in to deployment. Ensure that the logs are immutable, that they cannot be erased or forged. And make sure that they are regularly rotated and backed up.

Use Phoenix Servers for Build and Test

Use automated configuration management and provisioning tools like Chef or Puppet, Docker (especially), Vagrant, and Terraform to automatically stand up, set up, patch, and tear down build slaves and test servers as and when you need them.

Try to treat your build and test boxes as disposable, ephemeral “Phoenix Servers” that only exist for the life of the test run. This reduces your attack surface, regularly tests your configuration management workflows, and gives you more confidence when you deploy to a hardened production environment.
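
A minimal sketch of the idea using the Docker SDK for Python: build the test image fresh from your checked-in Dockerfile, run the suite in a throwaway container, and tear it down when the run ends. The image tag and test command are placeholders:

    import docker

    client = docker.from_env()

    # Build the test image fresh from the checked-in Dockerfile each run,
    # rather than reusing a long-lived (and possibly tampered-with) agent.
    image, _ = client.images.build(path=".", tag="app-tests:ephemeral")

    # Run the suite in a throwaway container; remove=True tears it down
    # when the run ends. A failing suite raises docker.errors.ContainerError.
    logs = client.containers.run(image.id, "pytest -q", remove=True)
    print(logs.decode())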

Don’t Let Your Test Data Get You into Trouble

Because it is difficult to create good synthetic test data, many shops take a snapshot or a subset of production data and then anonymize or mask certain fields. If this isn’t done properly, it can lead to data breaches or other violations of privacy laws and compliance regulations.

You need strong controls over how the snapshots are handled, and reliable (and carefully reviewed) methods to ensure that identifying information such as names, addresses, phone numbers, and email addresses, as well as passwords, credentials, and other sensitive information, is removed and replaced with random data or otherwise scrubbed before the data can be used in test.
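
Here is a minimal sketch of that kind of scrubbing over a CSV snapshot. The file and column names are hypothetical; the deterministic pseudonyms preserve referential integrity, so the same customer masks to the same value across tables:

    import csv
    import hashlib
    import random

    def pseudonym(value, field):
        # Deterministic fake value: the same input always maps to the same
        # output, preserving referential integrity across tables.
        digest = hashlib.sha256(f"{field}:{value}".encode()).hexdigest()[:8]
        return f"{field}_{digest}"

    with open("customers_prod.csv") as src, \
         open("customers_test.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["name"] = pseudonym(row["name"], "name")
            row["email"] = pseudonym(row["email"], "email") + "@example.test"
            row["phone"] = "555-%04d" % random.randint(0, 9999)
            row["password_hash"] = ""  # never carry credentials into test
            writer.writerow(row)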

Make sure that all the steps involved are audited and reviewed, so that you can prove to customers and compliance auditors that this information was protected.

Monitor Your Build and Test Systems

Ensure that all of these systems are monitored as part of the production environment. Operational security needs to be extended to the tools and the infrastructure that they run on, including vulnerability scanning, IDS/IPS, and runtime monitoring. Use file integrity checking to watch for unexpected or unauthorized changes to configurations and data.
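
Purpose-built tools like AIDE or OSSEC do this properly, but the core of file integrity checking is simple: hash the watched files and compare against a baseline recorded when the system was known to be good. A minimal sketch, with placeholder paths:

    import hashlib
    import json
    import pathlib

    WATCHED = pathlib.Path("/etc/myapp")                        # placeholder
    BASELINE = pathlib.Path("/var/lib/integrity/baseline.json")  # placeholder

    current = {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in WATCHED.rglob("*") if p.is_file()
    }

    baseline = json.loads(BASELINE.read_text())
    for path, digest in current.items():
        if path not in baseline:
            print(f"NEW file: {path}")
        elif baseline[path] != digest:
            print(f"CHANGED: {path}")
    for path in set(baseline) - set(current):
        print(f"DELETED: {path}")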

Shh…Keeping Secrets Secret

Keeping secrets secret is a problem in every system. The application needs keys, passwords and user IDs, connection strings, AWS keys, code signing keys, API tokens, and other secrets that need to be protected. Operations engineers and administrators need access to these secrets—and so do their tools.

As you automate more of the work of configuring, testing, deploying, and managing systems, the problem of managing secrets gets harder. You can no longer just share secrets among a handful of people. And you don’t want to store secrets in scripts or plain-text configuration files—or in source code.

Storing secrets in code is a bad idea. Code is widely accessible, especially if you are using a DVCS like Git, where every developer has her own copy, which means that every developer has access to the system secrets. If you need to change secrets, you will have to make a code change and re-deploy. And code has a way of getting out, exposing your secrets to the outside.

You Can’t Keep Secrets Secret on GitHub

There have been several cases where people in high-profile organizations have been found posting passwords, security keys, and other secrets in searchable, public repos on GitHub.3

This includes several well-intentioned people who uploaded Slackbot code to public GitHub repos, accidentally including their private Slack API tokens, which can be harvested and used to eavesdrop on Slack communications or impersonate Slack users.

In a famous recent example, Uber confirmed that an attacker compromised its driver database using a database key which had been accidentally posted on GitHub.4

Regularly scan GitHub using Gitrob or Truffle Hog to check for files in public repos that could contain sensitive information from your organization.
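
The core technique behind tools like Truffle Hog is entropy analysis: real keys and tokens look like random noise, and random noise has measurably higher Shannon entropy than ordinary code or prose. A toy version of the idea (the threshold needs tuning for your codebase):

    import math
    import re
    import sys

    def shannon_entropy(s):
        # Entropy over character frequencies; random keys score far higher
        # than English words or ordinary identifiers.
        return -sum((n / len(s)) * math.log2(n / len(s))
                    for n in (s.count(c) for c in set(s)))

    TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{20,}")  # long base64/hex-ish runs

    for path in sys.argv[1:]:
        with open(path, errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                for candidate in TOKEN.findall(line):
                    if shannon_entropy(candidate) > 4.5:  # tune for noise
                        print(f"{path}:{lineno}: possible secret: {candidate[:12]}...")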

You could use tools like git-secret or StackExchange’s BlackBox or git-crypt to transparently encrypt secrets and other configuration information and confidential code, as you check them into a repo.

But this still exposes you to risks. What if someone forgets to encrypt a sensitive file?

You need to scan or review code to make sure that credentials aren’t checked in to repos, or take advantage of pre-commit hooks to add checks for passwords and keys. Tools like Talisman (an open source project from ThoughtWorks), Git Hound, and git-secrets can automatically check for secrets in code and block them from being checked in; a minimal version of this kind of hook is sketched below.
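
As a flavor of what these tools do, here is a minimal pre-commit hook that scans the staged diff for a few obvious credential patterns and blocks the commit on a match. Real tools cover far more cases; the patterns here are only samples:

    #!/usr/bin/env python3
    # Save as .git/hooks/pre-commit and make it executable.
    import re
    import subprocess
    import sys

    PATTERNS = [
        re.compile(r"AKIA[0-9A-Z]{16}"),        # AWS access key ID format
        re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
        re.compile(r"(?i)(password|secret|api_key)\s*[:=]\s*['\"][^'\"]{8,}"),
    ]

    # Scan only the lines being added in this commit.
    diff = subprocess.run(["git", "diff", "--cached", "-U0"],
                          capture_output=True, text=True).stdout

    for line in diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            for pattern in PATTERNS:
                if pattern.search(line):
                    sys.exit(f"Commit blocked, possible secret: {line[:60]}")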

Continuous integration and continuous delivery servers need access to secrets. Deployment scripts and release automation tools need secrets. Your configuration management tools like Ansible, Chef, or Puppet need credentials in order to set up other credentials. Each of these tools has its own built-in mechanism for managing secrets, such as Ansible Vault, Chef’s encrypted data bags, or hiera-eyaml for Puppet.

But this only solves one piece of the problem.

A much better approach is to use a general-purpose secrets manager across all of your tools as well as for your applications. Secrets managers do the following:

  1. Safely store and encrypt passwords, keys, and other credentials at rest.

  2. Restrict and audit access to secrets, enforcing authentication and fine-grained access control rules.

  3. Provide secure access through APIs.

  4. Handle failover so that secrets are always available to system users.

Some open source secrets managers include the following:

  1. Keywhiz, from Square.

  2. Knox, the secrets keeper used at Pinterest.

  3. Confidant, from Lyft, to manage secrets on AWS.

  4. CredStash, a simple secrets keeper on AWS.

  5. HashiCorp Vault, arguably the most complete and operationally ready of the open source secrets management tools (see the usage sketch after this list).
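
As an example of what using a secrets manager looks like from the application side, here is a minimal sketch of reading a database credential from Vault with the hvac Python client. The server URL and secret path are placeholders, and in production you would authenticate with an AppRole or a platform identity rather than a raw token:

    import os
    import hvac

    client = hvac.Client(
        url=os.environ.get("VAULT_ADDR", "https://vault.example.com:8200"),
        token=os.environ["VAULT_TOKEN"],
    )

    # Read the current credential from Vault's key/value store at startup;
    # "myapp/db" is a placeholder path.
    secret = client.secrets.kv.v2.read_secret_version(path="myapp/db")
    db_password = secret["data"]["data"]["password"]
    # Use the credential, never log it, and let Vault handle rotation and audit.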

Key Takeaways

Running systems securely and reliably presents challenging problems, and there is a bewildering choice of tools and new engineering techniques that you could consider using to solve them.

Where should you start? Where can you get the best ROI, the most bang for your buck?

  • Your people are a key component of your build pipeline and should be considered as part of its attack surface in addition to the technology.

  • In rapidly changing environments, vulnerability scanning needs to be done on an almost continuous basis. Scanning once a year or once a quarter won’t cut it.

  • Security hardening must be built into system provisioning and configuration processes—not done as an afterthought.

  • Automating system provisioning and configuration management using programmable tools like Ansible or Chef, containers such as Docker, imaging tools like Packer, and cloud templating technologies, should be a foundation of your operations and security strategy.

    With these technologies you can ensure that every system is set up correctly and consistently across development, test, and production. You can make changes quickly and consistently across hundreds or even thousands of systems in a safe and testable way, with full transparency and traceability into every change. You can automatically define and enforce hardening and compliance policies.

  • Getting operations configuration into code, and getting operations and development using the same tools and build pipelines, makes it possible to apply the same security controls and checks to all changes, including code reviews, static analysis checks, and automated testing.

  • Your automated build pipeline presents a dangerous attack surface. It should be treated as part of your production environment, and managed in the same way as your most security-sensitive systems.

  • Secrets have to be kept secret. Private keys, API tokens, and other credentials are needed by tools and often need to be used across trust boundaries. Don’t store these secrets in scripts, config files, or source code. Get them into a secure secrets manager.

  • Security that is built into application monitoring feedback loops makes security issues more transparent to developers as well as to operations.

  • Prepare for incidents—for serious operational and security problems. Bad things will happen. Make sure that operations and developers understand what they need to do and how they can help.

    Practice, practice, practice. Run regular game days, or Red Team/Blue Team games or other exercises to work out problems and build organizational muscles.

  • Postmortem reviews after operational problems and security incidents provide the team with an opportunity to learn, and, if done properly, in an open and sincere and blameless way, to build connections and trust between teams.

1 Jenkins, “Securing Jenkins”, Apr 15, 2016.

2 Nikhil Mittal, “Continuous Intrusion: Why CI tools are an attacker’s best friend”, presentation at Black Hat Europe 2015.

3 Dan Goodin, “PSA: Don’t upload your important passwords to GitHub”, Ars Technica, Jan 24, 2013.

4 Dan Goodin, “In major goof, Uber stored sensitive database key on public GitHub page”, Ars Technica, Mar 2, 2015.