Rapid Cybersecurity Ops

Attack, Defend, and Analyze with bash

Paul Troncone and Carl Albing

Rapid Cybersecurity Ops

by Paul Troncone and Carl Albing

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Editor: Virginia Wilson
  • Production Editor: Justin Billing
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Rebecca Demarest
  • May 2019: First Edition

Revision History for the First Early Release

  • 2018-10-09: First Release
  • 2018-11-27: Second Release
  • 2019-01-28: Third Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492041313 for release details.

Chapter 1. Command Line Primer

The command line is one of the oldest interfaces used to interact with a computer. The command line has evolved over several decades of use and development, and is still an extremely useful and powerful way to interface with a computer. In many cases, the command line can be faster and more efficient than a Graphical User Interface (GUI) at accomplishing a task.

The bash shell and command language will be used for demonstrations throughout this book, due to its wide-scale adoption across multiple computing platforms and its rich command set.

Commands and Arguments

The basic operation of bash is to execute a command, that is, to run another program. When several words appear on the command line, bash assumes that the first word is the name of the program to run and the remaining words are the arguments to that command. For example:

mkdir -p /tmp/scratch/garble

will have bash run the command called mkdir, passing it two arguments: -p and /tmp/scratch/garble. By convention, programs generally put their options first and have them begin with a leading "-", as is the case here with the -p option. This particular command will create a directory called /tmp/scratch/garble. The -p option means that no errors will be reported and that any intervening directories will be created (or attempted) as needed (e.g., if only /tmp exists, it will create /tmp/scratch before attempting to create /tmp/scratch/garble).

Standard Input/Output/Error

A running program is called a process, and every process in the Unix/Linux/POSIX (and thus Windows) environment has three distinct input/output file descriptors. These three are called “standard input” (or stdin, for short), “standard output” (stdout), and “standard error” (stderr).

As you might guess from its name, stdin is the default source of input for a program, by default the characters coming from the keyboard. When your script reads from stdin it is reading characters typed on the keyboard, or (as we shall see shortly) it can be changed to read from a file. Stdout is the default destination for output from a program. By default that output appears in the window which is running your shell or shell script. Stderr is also a destination for output from a program, but it is (or should be) where error messages are written. It’s up to the person writing the program to direct any given output to either stdout or stderr. So be conscientious when writing your scripts and send error messages not to stdout but to stderr (as shown below).
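For example, here is a minimal sketch of a script fragment (the messages are only placeholders) that sends normal output to stdout and an error message to stderr by redirecting echo’s output to file descriptor 2:

echo "processing started"                # goes to stdout
echo "error: input file not found" >&2   # goes to stderr

The >&2 notation will make more sense after the discussion of redirection that follows.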

Redirection and Piping

One of the great innovations of the shell was that it gave you a mechanism whereby you could take a running program and change where it got its input and/or change where it sent its output without modifying the program itself. If you have a program called handywork and it reads its input from stdin and writes its results to stdout, then you can change its behavior as simply as this:

handywork < data.in  > results.out

which will run handywork but will have the input come not from the keyboard but instead from the data file called data.in (assuming such a file exists and has input in the format we want). Similarly the output is being sent not to the screen but into a file called results.out (which will be created if it doesn’t exist and overwritten if it does). This technique is called “redirection” because we are re-directing input to come from a different place and re-directing output to go somewhere other than the screen.

What about stderr? The syntax is similar. We have to distinguish between stdout and stderr when redirecting output from the program, and we make this distinction through the use of file descriptor numbers. Stdin is file descriptor 0, stdout is file descriptor 1, and stderr is file descriptor 2, so we can redirect error messages this way:

handywork 2> err.msgs

which will redirect only stderr and send any such error message output to a file we called err.msgs (for obvious reasons).

Of course we can do all three on the same line:

handywork < data.in  > results.out  2> err.msgs

Sometimes we want the error messages combined with the normal output (as it does by default when both are written to the screen). We can do this with the following syntax:

handywork < data.in  > results.out 2>&1

which says to send stderr (2) to the same location as file descriptor 1 ("&1"). Note that without the ampersand, the error messages would just be sent to a file named "1". This combining of stdout and stderr is so common that there is a useful shorthand notation:

handywork < data.in  &> results.out

If you want to discard standard output you can redirect it to a special file called /dev/null as follows:

handywork < data.in > /dev/null

To view output on the command line and simultaneously redirect that same output to a file, use the tee command. The following will display the output of handywork to the screen and also save it to results.out:

handywork < data.in | tee results.out

A file will be created or truncated (i.e., contents discarded) when output is redirected. If you want to preserve the file’s existing content you can, instead, append to the file using a double greater-than sign, like this:

handywork < data.in  >> results.out

This will execute handywork and then any output from stdout will be appended to the file results.out rather than overwriting its existing content.

Similarly this command line:

handywork < data.in  &>> results.out

will execute handywork and then append both stdout and stderr to the file results.out rather than overwriting its existing content.

Running Commands in the Background

Throughout this book we will be going beyond one-line commands and will be building complex scripts. Some of these scripts can take a significant amount of time to execute, so much so that you may not want to spend time waiting for them to complete. Instead, you can run any command or script in the background using the & operator. The script will continue to run, but you can issue other commands and/or run other scripts. For example, to run ping in the background and redirect standard output to a file:

ping 192.168.10.56 > ping.log &

You will likely want to redirect standard output and/or standard error to a file when sending tasks to the background, or the task will continue to print to the screen and interrupt other activities you are performing.

Warning

Be cautious not to confuse &, which is used to send a task to the background, and &> which is used to perform a combined redirect of standard output and standard error.

You can use the jobs command to list any tasks currently running in the background.

$ jobs
[1]+  Running                 ping 192.168.10.56 > ping.log &

Use the fg command and the corresponding job number to bring the task back into the foreground.

$ fg 1
ping 192.168.10.56 > ping.log

If your task is currently executing in the foreground you can use CTRL-Z to suspend the process and then bg to continue the process in the background. From there you can use jobs and fg as described above.
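For example, suppose a long-running script (here a hypothetical longtask.sh) was started in the foreground; pressing CTRL-Z suspends it and bg resumes it in the background. The exact job numbers and messages will vary from system to system:

$ ./longtask.sh
^Z
[1]+  Stopped                 ./longtask.sh
$ bg
[1]+ ./longtask.sh &
$ jobs
[1]+  Running                 ./longtask.sh &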

From Command Line to Script

A shell script is just a file that contains the same commands that you could type on the command line. Put one or more commands into a file and you have a shell script. If you called your file myscript, you can run that script by typing bash myscript, or you can give it “execute permission” (e.g., chmod 755 myscript) and then invoke it directly: ./myscript. We often include the following line as the first line of the script, which tells the operating system which scripting language we are using:

#!/bin/bash -

Of course this assumes that bash is located in the /bin directory. If your script needs to be more portable, you could use this approach instead:

#!/usr/bin/env bash

It uses the env command to look up the location of bash and is considered the standard way to address the portability problem. It makes the assumption, however, that the env command is to be found in /usr/bin.

Summary

In this chapter you saw how to run single commands and redirect input and output. In the next chapter we will discuss the real power of scripting, which comes from being able to run commands repeatedly, make decisions in the script, and loop over a variety of inputs.

Exercises

  1. Write a command that executes ifconfig and redirects standard output to a file named ipaddress.txt.

  2. Write a command that executes ifconfig and redirects standard output and appends it to a file named ipaddress.txt.

  3. Write a command that copies all of the files in the directory /etc/a to the directory /etc/b and redirects standard error to the file copyerror.log.

  4. Write a command that performs a directory listing (ls) on the root file directory and pipes the output into the more command.

  5. Write a command that executes mytask.sh and sends it to the background.

  6. Given the job list below, write the command that brings the Amazon ping task to the foreground.

    [1]   Running                 ping www.google.com > /dev/null &
    [2]-  Running                 ping www.amazon.com > /dev/null &
    [3]+  Running                 ping www.oreilly.com > /dev/null &

Chapter 2. Bash Primer

Bash should be thought of as a programming language whose default operation is to launch other programs. Here is a brief look at some of the features that make bash a powerful programming language, especially for scripting.

Output

As with any programming language, bash has the ability to output information to the screen. Output can be achieved using the echo command.

$ echo "Hello World"

Hello World

You may also use the printf command, which allows for some additional formatting.

$ printf "Hello World"

Hello World

Variables

Bash variables begin with an alphabetic character or underscore followed by alphanumeric characters. They are string variables unless declared otherwise. To assign a value to the variable, you write something like this:

MYVAR=textforavalue

To retrieve the value of that variable, for example to print out the value using the echo command, you use the $ in front of the variable name, like this:

echo $MYVAR

If you want to assign a series of words to the variable, that is, to preserve any whitespace, use quotation marks around the value, as in:

MYVAR='here is a longer set of words'
OTHRV="either double or single quotes will work"

The use of double quotes will allow other substitutions to occur inside the string. For example:

firstvar=beginning
secondvr="this is just the $firstvar"
echo $secondvr

This will result in the output: this is just the beginning

There are a variety of substitutions that can occur when retrieving the value of a variable; we will show those as we use them in the scripts to follow.

Warning

Remember that by using double quotes (") any substitutions that begin with the $ will still be made, whereas inside single quotes (') no substitutions of any sort are made.

You can also store the output of a shell command using $( ) as follows:

CMDOUT=$(pwd)

That will execute the command pwd in a sub-shell and, rather than printing the result to stdout, store the output of the command in the variable CMDOUT. You can also pipe together multiple commands within the $( ).
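For example, here is a brief sketch that pipes ls into grep inside the command substitution to count how many filenames in the current directory contain the string pdf (the -c option to grep counts matching lines; the variable name is our own):

NUMPDF=$(ls | grep -c pdf)
echo "found $NUMPDF files with pdf in their names"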

Positional Parameters

It is common when using command line tools to pass data into the commands using arguments or parameters. Each parameter is separated by the space character and is accessed inside of bash using a special set of identifiers. In a bash script, the first parameter passed into the script can be accessed using $1, the second using $2, and so on. $0 is a special parameter that holds the name of the script, and $# returns the total number of parameters. Take the following script:

Example 2-1. echoparams.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# echoparams.sh
#
# Description:
# Demonstrates accessing parameters in bash
#
# Usage:
# ./echoparms.sh <param 1> <param 2> <param 3>
#

echo $#
echo $0
echo $1
echo $2
echo $3

This script first prints out the number of parameters ($#), then the name of the script ($0), and then the first three parameters. Here is the output:

$ ./echoparams.sh bash is fun

3
./echoparams.sh
bash
is
fun

Input

User input is received in bash using the read command. The read command will obtain user input from the command line and store it in a specified variable. The script below reads user input into the MYVAR variable and then prints it to the screen.

read MYVAR
echo "$MYVAR"
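The read command also supports a -p option to display a prompt before waiting for input. A brief sketch:

read -p 'Enter a filename: ' FILENAME
echo "you entered: $FILENAME"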

Conditionals

Bash has a rich variety of conditionals. Many, but not all, begin with the keyword if.

Any command or program that you invoke in bash may do some output but it will also always return a success or fail value. In the shell this value can be found in the $? variable immediately after a command has run. A return value of 0 is considered “success” or “true”; any non-zero value is considered “error” or “false”. The simplest form of the if statement uses this fact. It takes the form:

if cmd
then
   other cmds
fi

For example, the script below attempts to change directories to /tmp. If that command is successful (returns 0) the body of the if statement will execute.

if  cd /tmp
then
    echo "here is what is in /tmp:"
    ls -l
fi

Bash can even handle a pipeline of commands in a similar fashion:

if ls | grep pdf
then
    echo found one or more pdf files here
fi

With a pipeline, it is the success/failure of the last command in the pipeline that determines if the “true” branch is taken. Here is an example where that fact matters:

ls | grep pdf | wc

This series of commands will be “true” even if no pdf string is found by the grep command. That is because the wc command (a word count of the input) will print:

0       0       0

That output indicates 0 characters, 0 words, and 0 lines when no output comes from the grep command. That is still a successful (or true) result, not an error or failure. It counted as many lines as it was given, even if it was given zero lines to count.

A more typical form of if used for comparisons makes use of the compound command [[ or the shell built-in commands [ or test. Use these to test file attributes or to make comparisons of value.

To test if a file exists on the file system:

if [[ -e $FILENAME ]]
then
    echo $FILENAME exists
fi

Table 2-1 lists additional tests that can be done on files using if comparisons.

Table 2-1. File Test Operators

File Test Operator   Use
-d                   Test if a directory exists
-e                   Test if a file exists
-r                   Test if a file exists and is readable
-w                   Test if a file exists and is writable
-x                   Test if a file exists and is executable

To test if the variable $VAL is less than the variable $MIN:

if [[ $VAL -lt $MIN ]]
then
    echo "value is too small"
fi

Table 2-2 lists additional numeric tests that can be done using if comparisons.

Table 2-2. Numeric Test Operators

Numeric Test Operator   Use
-eq                     Test for equality between numbers
-gt                     Test if one number is greater than another
-lt                     Test if one number is less than another

Warning

Be cautious of using the < symbol. Take the following code:

if [[ $VAL < $OTHR ]]

This operator is a less-than, but in this context it uses lexical (alphabetical) ordering. That means that 12 is less than 2, since they sort alphabetically in that order (just as a < b and 1 < 2, so too 12 < 2, because the strings are compared character by character).

If you want to do numerical comparisons with the less-than sign, use the double parentheses construct. It assumes that the variables are all numerical and will evaluate them as such. Empty or unset variables are evaluated as 0. Inside the parentheses you don’t need the $ operator to retrieve a value, except for positional parameters like $1 and $2 (so as not to confuse them with the constants 1 and 2). For example:

if (( VAL < 12 ))
then
    echo "value $VAL is too small"
fi

In bash you can even make branching decisions without an explicit if/then construct. Commands are typically separated by a newline - that is, they appear one per line. You can get the same effect by separating them with a semicolon. If you write cd $DIR ; ls then bash will perform the cd and then the ls.

Two commands can also be separated by either && or || symbols. If you write cd $DIR && ls then the ls command will run only if the cd command succeeds. Similarly if you write cd $DIR || echo cd failed the message will be printed only if the cd fails.

You can use the [[ syntax to make various tests, even without an explicit if.

[[ -d $DIR ]] && ls "$DIR"

means the same as if you had written

if [[ -d $DIR ]]
then
  ls "$DIR"
fi
Warning

When using && or || you will need to group multiple statements if you want more than one action within the “then” clause. For example:

[[ -d $DIR ]] || echo "error: no such directory: $DIR" ; exit

will always exit, whether or not $DIR is a directory.

What you probably want is this:

[[ -d $DIR ]] || { echo "error: no such directory: $DIR" ; exit ; }

where the braces will group both statements together.

Looping

Looping with a while statement is similar to the if construct in that it can take a single command or a pipeline of commands for the decision of true or false. It can also make use of the brackets or parentheses as in the if examples, above.

In some languages, braces ( { } ) are used to group together the statements that make up the body of the while loop. In others, like Python, indentation indicates which statements form the loop body. In bash, however, the statements are grouped between two keywords: do and done.

Here is a simple while loop:

i=0
while (( i < 1000 ))
do
    echo $i
    let i++
done

The loop above will execute while the variable i is less than 1000. Each time the body of the loop executes it will print the value of i to the screen. It then uses the let command to execute i++ as an arithmetic expression, thus incrementing i by 1 each time.

Here is a more complicated while loop that executes commands as part of its condition.

while ls | grep -q pdf
do
    echo -n 'there is a file with pdf in its name here: '
    pwd
    cd ..
done

A for loop is also available in bash - in three variations.

Simple numerical looping can be done using the double parentheses construct. It looks much like the for loop in C or Java, but with double parentheses and with do and done instead of braces:

for ((i=0; i < 100; i++))
do
    echo $i
done

Another useful form of the for loop is used to iterate through all the parameters that are passed to a shell script (or function within the script), that is, $1, $2, $3, and so on. Note that ARG in args.sh can be replaced with any variable name of your choice.

for ARG
do
    echo here is an argument: $ARG
done

Here is the output of args.sh when three parameters are passed in.

$ ./args.sh bash is fun

here is an argument: bash
here is an argument: is
here is an argument: fun

Finally, for an arbitrary list of values, use a similar form of the for statement simply naming each of the values you want for each iteration of the loop. That list can be explicitly written out, like this:

for VAL in 20 3 dog peach 7 vanilla
do
    echo $VAL
done

The values used in the for loop can also be generated by calling other programs or using other shell features:

for VAL in $(ls | grep pdf) {0..5}
do
    echo $VAL
done

Here the variable VAL will take on, in turn, the value of each of the filenames that ls piped into grep finds with the letters pdf in their names (e.g., “doc.pdf” or “notapdfile.txt”), and then each of the numbers 0 through 5. It may not be that sensible to have the variable VAL be a filename sometimes and a single digit another time, but this shows you that it can be done.

Functions

Define a function with syntax like this:

function myfun ()
{
  # body of the function goes here
}

Not all that syntax is necessary - you can use either "function" or "()" - you don’t need both. We recommend, and will be using, both - mostly for readability.

There are a few important considerations to keep in mind with bash functions:

  • Unless declared with the local builtin command inside the function, variables are global in scope. A for loop which sets and increments i could be messing with the value of i used elsewhere in your code.

  • The braces are the most commonly used grouping for the function body, but any of the shell’s compound command syntax is allowed - though why, e.g., would you want the function to run in a sub-shell?

  • Redirecting I/O on the braces does so for all the statements inside the function. Examples of this will be seen in upcoming chapters.

  • No parameters are declared in the function definition. Whatever and however many arguments are supplied on the invocation of the function are passed to it.

The function is called (invoked) just like any command is called in the shell. Having defined myfun as a function you can call it like this:

myfun 2 /arb "14 years"

which calls the function myfun supplying it with 3 arguments.

Function Arguments

Inside the function definition arguments are referred to in the same way as parameters to the shell script — as $1, $2, etc. Realize that this means that they “hide” the parameters originally passed to the script. If you want access to the script’s first parameter, you need to store $1 into a variable before you call the function (or pass it as a parameter to the function).

Other variables are set accordingly, too. $# gives the number of arguments passed to the function, whereas normally it gives the number of arguments passed to the script itself. The one exception to this is $0 - it doesn’t change inside the function. It retains its value as the name of the script (and not of the function).
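Here is a minimal sketch (the script and function names are our own) that demonstrates this behavior:

#!/bin/bash -
# showargs.sh - hypothetical example of function arguments vs. script parameters

function showargs ()
{
    # inside the function $1 and $# describe the function's own arguments
    echo "in the function: \$0 is $0, \$1 is $1, \$# is $#"
}

# here $1 and $# still describe the script's arguments
echo "in the script:   \$0 is $0, \$1 is $1, \$# is $#"
showargs foo bar

If invoked as ./showargs.sh alpha, the script line reports alpha and 1, the function line reports foo and 2, and $0 is the script name in both places.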

Returning Values

Functions, like commands, should return a status - a 0 if all goes well and a non-zero value if some error has occurred. To return some other kind of value - a pathname or a computed value, for example - you can either set a variable to hold that value (those variables are global unless declared local within the function), or you can send the result to stdout, that is, print the answer. Just don’t try to do both.

Warning

If you print the answer you’ll typically use that output as part of a pipeline of commands (e.g., myfunc args | next step | etc ) or you’ll capture the output like this: RESVAL=$( myfunc args ). In both cases the function will be run in a sub-shell and not in the current shell. Thus changes to any global variables will only be effective in that sub-shell and not in the main shell instance. They are effectively lost.
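As a minimal sketch of the print-the-answer approach (the function and variable names are just for illustration), a function can write its result to stdout and the caller can capture it with command substitution:

function tmpdirname ()
{
    # print the computed value; the caller captures it
    echo "/tmp/logs.$$.$RANDOM"
}

RESULT=$(tmpdirname)
echo "will write to $RESULT"

Because of the command substitution, tmpdirname runs in a sub-shell here, which is exactly the situation the warning above describes.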

Pattern Matching in bash

When you need to name a lot of files on a command line, you don’t need to type each and every name. Bash provides pattern matching (sometimes called “wildcarding”) to allow you to specify a set of files with a pattern. The easiest one is simply an asterisk * (or “star”) which will match any number of any characters. When used by itself, therefore, it matches all files in the current directory. The asterisk can be used in conjunction with other characters. For example *.txt matches all the files in the current directory which end with the four characters .txt. The pattern /usr/bin/g* will match all the files in /usr/bin that begin with the letter g.

Another special character in pattern matching is ? the question mark, which matches a single character. For example, source.? will match source.c or source.o but not source.py or source.cpp.

The last of the three special pattern matching characters are [ ], the square brackets. A match can be made with any one of the characters listed inside the square brackets, so the pattern x[abc]y matches any or all of the files named xay, xby, or xcy, assuming they exist. You can specify a range within the square brackets, like [0-9] for all digits. If the first character within the brackets is either a ! or a ^ then the pattern means anything other than the remaining characters in the brackets. For example, [aeiou] would match a vowel whereas [^aeiou] would match any character except the vowels (including digits and punctuation characters).

Similar to ranges, you can specify character classes within the square brackets. Table 2-3 lists the character classes and their descriptions.

Table 2-3. Pattern Matching Character Classes

Character Class   Description
[:alnum:]         Alphanumeric
[:alpha:]         Alphabetic
[:ascii:]         ASCII
[:blank:]         Space and Tab
[:cntrl:]         Control Characters
[:digit:]         Number
[:graph:]         Anything Other Than Control Characters and Space
[:lower:]         Lowercase
[:print:]         Anything Other Than Control Characters
[:punct:]         Punctuation
[:space:]         Whitespace Including Line Breaks
[:upper:]         Uppercase
[:word:]          Letters, Numbers, and Underscore
[:xdigit:]        Hexadecimal

Character classes are specified like this: [:cntrl:] within square brackets (so you have two sets of [ ]). For example, this pattern: *[[:punct:]]jpg will match any filename that has any number of any characters followed by a punctuation character followed by the letters jpg. So it would match files named wow!jpg or some,jpg or photo.jpg but not a file named this.is.myjpg since there is no punctuation character right before the jpg.

There are more complex aspects of pattern matching if you turn on the shell option extglob (like this: shopt -s extglob) so that you can repeat patterns or negate patterns. We won’t need these in our example scripts but we encourage you to learn about them (e.g., via the bash man page).

There are a few things to keep in mind when using shell pattern matching:

  • Patterns aren’t regular expressions (discussed later); don’t confuse the two.

  • Patterns are matched against files in the file system; if the pattern begins with a pathname (e.g., /usr/lib ) then the matching will be done against files in that directory.

  • If no pattern is matched, the shell will use the special pattern matching characters as literal characters of the filename; for example, if your script says echo data > /tmp/*.out but there is no file in /tmp that ends in .out then the shell will create a file called *.out in the /tmp directory. Remove it like this: rm /tmp/\*.out by using the backslash to tell the shell not to pattern match with the asterisk.

  • No pattern matching occurs inside of quotes (either double or single quotes), so if your script says echo data > "/tmp/*.out" it will create a file called /tmp/*.out (which we recommend you avoid doing).

Note

The dot, or period, is just an ordinary character and has no special meaning in shell pattern matching - unlike in regular expressions which will be discussed later.

Writing Your First Script - Detecting Operating System Type

Now that we have gone over the fundamentals of the command line and bash you are ready to write your first script. The bash shell is available on a variety of platforms including Linux, macOS, and Windows (via Git Bash). As you write more complex scripts in the future it is imperative that you know what operating system you are interacting with, as each one has a slightly different set of commands available. The osdetect.sh script helps you in making that determination.

The general idea of the script is that it will look for a command that is unique to a particular operating system. The limitation is that on any given system an administrator may have created and added a command with that name, so this is not foolproof.

Example 2-2. osdetect.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# osdetect.sh
#
# Description:
# Distinguish between MS-Windows/Linux/MacOS
#
# Usage:
# Output will be one of: Linux MSWin macOS
#

if type -t wevtutil &> /dev/null           1
then
    OS=MSWin
elif type -t scutil &> /dev/null           2
then
    OS=macOS
else
    OS=Linux
fi
echo $OS
1

We use the type built-in in bash to tell us what kind of a command (alias, keyword, function, built-in, or file) its arguments are. The -t option tells it to print nothing if the command isn’t found. The command returns as “false” in that case. We redirect all the output (both stdout and stderr) to /dev/null thereby throwing it away, as we only want to know if the wevtutil command was found.

2

Again we use the type built-in but this time we are looking for the scutil command which is available on macOS systems.

Summary

The bash shell can be seen as a programming language, one with variables and if/then/else statements, loops, and functions. It has its own syntax, similar in many ways to other programming languages, but just different enough to catch you if you’re not careful.

It has its strengths - like easily invoking other programs or connecting sequences of other programs - and it has its weaknesses: it doesn’t have floating point arithmetic or much support (though some) for complex data structures.

In the chapters ahead we will describe and use many bash features and OS commands in the context of cybersecurity operations. We will further explore some of the features we have touched on here, as well as other more advanced or obscure features. Keep your eyes out for those features, and practice using them in your own scripting.

Exercises

  1. Experiment with the uname command, seeing what it prints on the various operating systems. Re-write the osdetect.sh script to use the uname command, possibly with one of its options. Caution: not all options are available on every operating system.

  2. Modify the osdetect.sh script to use a function. Put the if/then/else logic inside the function and then call it from the script. Don’t have the function itself do any output. Make the output come from the main part of the script.

  3. Set the permissions on the osdetect.sh script to be executable (see man chmod) so that you can run the script without using bash as the first word on the command line. How do you now invoke the script?

  4. Write a script called argcnt.sh that tells how many arguments are supplied to the script.

    1. Modify your script to have it also echo each argument one per line.

    2. Modify your script further to label each argument like this:

      $ bash argcnt.sh this is a "real live" test
      there are 5 arguments
      arg1: this
      arg2: is
      arg3: a
      arg4: real live
      arg5: test
      $
  5. Modify argcnt.sh so it only lists the even arguments.

Chapter 3. Regular Expressions

Regular expressions (regex) are a powerful method for describing a text pattern to be matched by various tools. There is only one place in bash where regular expressions are valid: using the =~ comparison in the [[ compound command, as in an if statement. However, regular expressions are a crucial part of the larger toolkit for commands like grep, awk, and sed in particular. They are very powerful and thus worth knowing. Once mastered, you’ll wonder how you ever got along without them.
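For example, here is a brief sketch of that one bash usage, testing whether a script’s first argument consists only of digits (the pattern syntax is explained over the course of this chapter):

if [[ "$1" =~ ^[0-9]+$ ]]
then
    echo "$1 is a whole number"
else
    echo "$1 is not a whole number" >&2
fi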

For many of the examples in this chapter we will be using the file frost.txt with its seven, yes seven, lines of text.

Example 3-1. frost.txt
1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost

The content of frost.txt will be used to demonstrate the power of regular expressions to process text data. This text was chosen because it requires no prior knowledge to understand.

Commands in Use

We introduce the grep family of commands to demonstrate the basic regex patterns.

grep

The grep command searches the content of the files for a given pattern and prints any line where the pattern is matched. To use grep, you need to provide it with a pattern and one or more filenames (or piped data).

Common Options

-c

Count the number of lines that match the pattern.

-E

Enable extended regular expressions

-f

Read the search pattern from a provided file. A file can contain more than one pattern, with each line containing a single pattern.

-i

Ignore character case.

-l

Only print the file name and path where the pattern was found.

-n

Print the line number of the file where the pattern was found.

-P

Enables the Perl regular expression engine.

-R, -r

Recursively search sub-directories.

Command Example

In general, the way grep is used is like this: grep options pattern filenames

To search the /home directory and all sub-directories for files containing the word password irrespective of uppercase/lowercase distinctions:

grep -R -i 'password' /home

grep and egrep

The grep command supports some variations, notably an extended syntax for the regex patterns (we’ll discuss the regex patterns next). There are three different ways to tell grep that you want special meaning on certain characters: 1) by preceding those characters with a backslash; or 2) by telling grep that you want the special syntax (without the need for backslash) by using the -E option when you invoke grep; or 3) by using the command named egrep which is just a script that simply invokes grep as grep -E so you don’t have to.

The only characters that are affected by the extended syntax are: ? + { | ( and ). In the examples that follow we will use grep and egrep interchangeably - they are the same binary underneath. We will choose the one to use that seems most appropriate based on what special characters we need. The special, or meta-, characters are what make grep so powerful. Here is what you need to know about the most powerful and frequently used metacharacters.

Regular Expression Metacharacters

Regular expressions are patterns that are created using a series of characters and metacharacters. Metacharacters such as "?" and "*" have special meaning beyond their literal meaning in regex.

The “.” Metacharacter

In regex, the “.” represents a single wildcard character. It will match on any single character except for a newline. As can be seen in the example below, if we try to match on the pattern T.o the first line of the frost.txt file is returned because it contains the word Two.

$ grep 'T.o' frost.txt

1    Two roads diverged in a yellow wood,

Note that line 5 is not returned even though it contains the word To. This pattern allows any character to appear between the T and o, but as written there must be a character in between. Regex patterns are also case sensitive, which is why line 3 of the file was not returned even though it contains the string too. If you want to treat "." as a period character rather than a wildcard, precede it with a backslash "\." to escape its special meaning.

The “?” Metacharacter

In regex, the “?” character makes any item that precedes it optional; it matches it zero or one time. By adding this metacharacter to the previous example we can see that the output is different.

$ egrep 'T.?o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;

This time we see that both lines 1 and 5 are returned. This is because the metacharacter "." is optional due to the "?" metacharacter that follows it. This pattern will match on any three-character sequence that begins with T and ends with o as well as the two-character sequence To.

Notice that we are using egrep here. We could have used grep -E or we could have used “plain” grep with a slightly different pattern: T.\?o putting the backslash on the question mark to give it the extended meaning.

The “*” Metacharacter

In regex, the "*" is a special character that matches the preceding item zero or more times. It is similar to the "?", the main difference being that the previous item may appear more than once.

$ grep 'T.*o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The ".*" in the pattern above allows any number of any character to appear in between the T and o. Thus the last line also matches because it contains the pattern The Ro.

The “+” Metacharacter

The "+" metacharacter is the same as the "*" except it requires the preceding item to appear at least once. In other words it matches the preceding item one or more times.

$ egrep 'T.+o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The pattern above specifies one or more of any character to appear in between the T and o. The first line of text matches because of Two - the w is 1 character between the T and the o. The second line doesn’t match the To, as in the previous example; rather, the pattern matches a much larger string — all the way to the o in undergrowth. The last line also matches because it contains the pattern The Ro.

Grouping

We can use parentheses to group together characters. Among other things, this allows us to treat the characters appearing inside the parenthesis as a single item which we can later reference.

$ egrep 'And be one (stranger|traveler), long I stood' frost.txt

3    And be one traveler, long I stood

In the example above we use parentheses and the Boolean OR operator "|" to create a pattern that will match on line 3. Line 3 as written has the word traveler in it, but this pattern would match even if traveler were replaced by the word stranger.

Brackets and Character Classes

In regex the square brackets, [ ], are used to define character classes and lists of acceptable characters. Using this construct you can list exactly which characters are matched at this position in the pattern. This is particularly useful when trying to perform user input validation. As a shorthand you can specify ranges with a dash such as [a-j]. These ranges are in your locale’s collating sequence and alphabet. For the C locale, the pattern [a-j] will match one of the letters a through j. Table 3-1 provides a list of common examples when using character classes and ranges.

Table 3-1. Regex character ranges

Example       Meaning
[abc]         Match only the character a or b or c
[1-5]         Match on digits in the range 1 to 5
[a-zA-Z]      Match any lowercase or uppercase a to z
[0-9+*/-]     Match on numbers or these 4 mathematical symbols (the - is listed last so it is taken literally rather than as a range)
[0-9a-fA-F]   Match a hexadecimal digit

Warning

Be careful when defining a range for digits; the range can at most go from 0 to 9. For example, the pattern [1-475] does not match on numbers between 1 and 475, it matches on any one of the digits (characters) in the range 1-4 or the character 7 or the character 5.

There are also predefined character classes known as shortcuts. These can be used to indicate common character classes such as numbers or letters. See Table 3-2 for a list of shortcuts.

Table 3-2. Regex shortcuts

Shortcut   Meaning
\s         Whitespace
\S         Not Whitespace
\d         Digit
\D         Not Digit
\w         Word
\W         Not Word
\x         Hexadecimal Number (e.g. 0x5F)

Note that the above shortcuts are not supported by egrep. In order to use them you must use grep with the -P option. That option enables the Perl regular expression engine to support the shortcuts. For example, to find any numbers in frost.txt:

$ grep -P '\d' frost.txt

1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost

There are other character classes (with a more verbose syntax) that are valid only within the bracket syntax, as seen in Table 3-3. They match a single character, so if you need to match many in a row, use the star or plus to get the repetition you need.

Table 3-3. Regex character classes in brackets

Character Class   Meaning
[:alnum:]         any alphanumeric character
[:alpha:]         any alphabetic character
[:cntrl:]         any control character
[:digit:]         any digit
[:graph:]         any graphical character
[:lower:]         any lowercase character
[:print:]         any printable character
[:punct:]         any punctuation
[:space:]         any whitespace
[:upper:]         any uppercase character
[:xdigit:]        any hex digit

To use one of these classes it has to be inside the brackets, so you end up with two sets of brackets. For example: grep '[[:cntrl:]]' large.data will look for lines containing control characters (ASCII 0-31 and 127). Here is another example:

grep 'X[[:upper:][:digit:]]' idlist.txt

will match any line with an X followed by any uppercase letter or digit. It would match these lines:

User: XTjohnson
an XWing model 7
an X7wing model

They each have an uppercase X followed immediately by either another uppercase letter or by a digit.

Back References

Regex back references are one of the most powerful and often confusing regex operations. Consider the following file, tags.txt:

1    Command
2    <i>line</i>
3    is
4    <div>great</div>
5    <u>!</u>

Suppose you want to write a regular expression that will extract any line that contains a matching pair of complete HTML tags. The start tag has an HTML tag name; the ending tag has the same tag name but with a leading slash. <div> and </div> are a matching pair. You could search for these by writing a lengthy regex that contains all possible HTML tag values, or you can focus on the format of an HTML tag and use a regex back reference.

$ egrep '<([A-Za-z]*)>.*</\1>' tags.txt

2    <i>line</i>
4    <div>great</div>
5    <u>!</u>

In this example, the back reference is the \1 appearing in the latter part of the regular expression. It is referring back to the expression enclosed in the first set of parentheses, [A-Za-z]*, which has two parts. The letter range in brackets denotes a choice of any letter, uppercase or lowercase. The asterisk (or star) that follows it means to repeat that zero or more times. Therefore the \1 refers to whatever was matched by that pattern in parentheses. If [A-Za-z]* matches div then the \1 also refers to the pattern div.

The overall regular expression, then, can be described as matching a < sign (that literal character is the first one in the regex) followed by zero or more letters, then a > sign, and then zero or more of any character (“.” for any character, “*” for zero or more of the previous item), followed by another < and a slash, then the sequence matched by the expression within the parentheses, and finally a > character. If this sequence matches any part of a line from our text file then egrep will print that line out.

You can have more than one back reference in an expression and refer to each with a \1 or \2 or \3 depending on its order in the regular expression. A \1 refers to the first set of parentheses, \2 to the second, and so on. Note that the parentheses are metacharacters - they have a special meaning. If you just want to match a literal parenthesis you need to escape its special meaning by preceding it with a backslash, as in: sin\([0-9.]*\) to match expressions like: sin(6.2) or sin(3.14159).

Note

Valid HTML doesn’t have to be all on one line; the end tag can be several lines away from the start tag. Moreover, some tags can both start and end in a single tag, such as <br/> for a break, or <p/> for an empty paragraph. We would need a more sophisticated approach to include such things in our search.

Quantifiers

Quantifiers specify the number of times an item must appear in a string. Quantifiers are defined by the curly brackets { }. For example, the pattern T{5} means that the letter T must appear consecutively exactly 5 times. The pattern T{3,6} means that the letter T must appear consecutively 3 to 6 times. The pattern T{5,} means that the letter T must appear 5 or more times.
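Curly-bracket quantifiers written this way belong to the extended syntax, so use egrep (or grep -E, or escape the braces with plain grep). For example, to find the lines of frost.txt where the letter o appears exactly twice in a row:

$ egrep 'o{2}' frost.txt

1    Two roads diverged in a yellow wood,
3    And be one traveler, long I stood
4    And looked down one as far as I could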

Anchors and Word Boundaries

You can use anchors to specify that a pattern must exist at the beginning or the end of a string. The ^ character is used to anchor a pattern to the beginning of a string. For example ^[1-5] means that a matching string must start with one of the digits 1 through 5 as the first character on the line. The $ character is used to anchor a pattern to the end of a string or line. For example [1-5]$ means that a string must end with one of the digits 1 through 5.

In addition, you can use \b to identify a word boundary (the transition between a word character and a non-word character such as a space or punctuation, or the start or end of a line). The pattern \b[1-5]\b will match on any of the digits 1 through 5 where the digit appears as its own word.
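For example, using frost.txt again, we can anchor a pattern to the beginning of a line and then look for one as a standalone word (assuming GNU grep, which supports \b):

$ grep '^[1-3]' frost.txt

1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood

$ grep '\bone\b' frost.txt

3    And be one traveler, long I stood
4    And looked down one as far as I could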

Summary

Regular expressions are extremely powerful for describing patterns and can be used in coordination with other tools to search and process data.

The uses and full syntax of regex far exceed the scope of this book; many additional resources and utilities related to regex are available online.

In the next chapter we will discuss common data types relevant to security operations and how they can be gathered.

Exercises

  1. Write a regular expression that matches a floating point number (a number with a decimal point) such as 3.14. There can be digits on either side of the decimal point but there need not be any on one side or the other. Allow it to match just a decimal point by itself, too.

  2. Use a back reference in a regular expression to match a number that appears on both sides of an equal sign. For example, it should match “314 is = to 314” but not “6 = 7”

  3. Write a regular expression that looks for a line that begins with a digit and ends with a digit, with anything occurring in between.

  4. Write a regular expression that uses grouping to match on the following 2 IP addresses: 10.0.0.25 and 10.0.0.134.

  5. Write a regular expression that will match if the hexadecimal string 0x90 occurs more than 3 times in a row (i.e. 0x90 0x90 0x90).

Chapter 4. Data Collection

Data is the lifeblood of nearly every defensive security operation. Data tells you the current state of the system, what has happened in the past, and even what might happen in the future. Data is needed for forensic investigations, verifying compliance, and detecting malicious activity. Table 4-1 describes data that is commonly relevant to defensive operations and where it is typically located.

Table 4-1. Data of Interest

Log Files
  Description: Details on historical system activity and state. Interesting log files include web and DNS server logs, router, firewall, and intrusion detection system logs, and application logs.
  Location: In Linux most log files are located in the /var/log directory. In a Windows system logs are found in the Event Log.

Command History
  Description: List of recently executed commands.
  Location: In Linux the location of the history file can be found by executing echo $HISTFILE, and is typically located in the user’s home directory in .bash_history.

Temporary Files
  Description: Various user and system files that were recently accessed, saved, or processed.
  Location: In Windows, temp files can be found in c:\windows\temp and %USERPROFILE%\AppData\Local\. In Linux temp files are typically located in /tmp and /var/tmp. The Linux temporary directory can also be found by using the command echo $TMPDIR.

User Data
  Description: Documents, pictures, and other user-created files.
  Location: User files are typically located in /home/ in Linux and c:\Users\ in Windows.

Browser History
  Description: Web pages recently accessed by the user.
  Location: Varies widely based on operating system and browser.

Windows Registry
  Description: Hierarchical database that stores settings and other data that is critical to the operation of Windows and applications.
  Location: Windows Registry.

Throughout this chapter we will explore various methods to gather data, locally and remotely, from both Linux and Windows systems.

Commands in Use

We introduce cut, file, and head, and for Windows systems reg and wevtutil, to gather and select data of interest from local and remote systems.

cut

cut is a command used to extract select portions of a file. It reads a supplied input file line-by-line and parses the line based on a specified delimiter. If no delimiter is specified cut will use a TAB character by default. The delimiter characters divide each line of a file into fields. You can use either the field number or character position number to extract parts of the file. Fields and characters start at position 1.

Common Command Options

-c

Specify the character(s) to extract.

-d

Specifies the character used as a field delimiter. By default the delimiter is the TAB character.

-f

Specify the field(s) to extract.

Command Example

Example 4-1. cutfile.txt
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html

In cutfile.txt each field is delimited using a space. To extract the IP address (field position 2) you can use the following command:

$ cut -d' ' -f2 cutfile.txt

192.168.10.14
192.168.10.185

The -d' ' option specifies the space as the field delimiter. The -f2 option tells cut to return the second field, in this case, the IP address.
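Because every date in cutfile.txt is exactly 10 characters long, the same kind of extraction can also be done by character position using the -c option. A brief sketch:

$ cut -c1-10 cutfile.txt

12/05/2017
12/30/2017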

Warning

The cut command considers each delimiter character as separating a field. It doesn’t collapse white space. Consider the following example:

Pat   25
Pete  12

If we use cut on this file we would define the delimiter to be a space. In the first record there are 3 spaces between the name (Pat) and the number (25). Thus the number is in field #4. However, for the next line the number (12) is in field #3, since there are only two space characters between the name and the number. For a data file like this, it would be better to separate the name from the number with a single tab character and use that as the delimiter for cut.

file

The file command is used to help identify a given file’s type. This is particularly useful in Linux, as most files are not required to have an extension that identifies their type (cf. .exe in Windows). The file command looks deeper than the filename by reading and analyzing the first block of data, also known as the magic number. Even if you rename a .png image file to end with .jpg, the file command is smart enough to figure that out and tell you the correct file type (in this case, a PNG image file).

Common Command Options

-f

Read the list of files to analyze from a given file

-k

Do not stop on the first match, list all matches for the file type

-z

Look inside compressed files

Command Example

To identify the file type just pass the filename to the file command.

$ file unknownfile

unknownfile: Microsoft Word 2007+

head

The head command displays the first few lines or bytes of a file. By default head displays the first 10 lines.

Common Command Options

-n

Specify the number of lines to output. To show 15 lines you can specify it as -n 15 or -15.

-c

Specify the number of bytes to output.
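For example, to preview the first 5 lines of a log file, or the first 64 bytes of a file of unknown type (the filenames are only illustrative):

head -5 /var/log/messages
head -c 64 unknownfile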

reg

The reg command is used to manipulate the Windows Registry and is available in Windows XP and later.

Common Command Parameters

add

Adds an entry to the registry.

export

Copies the specified registry entries to a file.

query

Returns a list of subkeys below the specified path.

Command Example

To list all of the root keys in the HKEY_LOCAL_MACHINE hive:

$ reg query HKEY_LOCAL_MACHINE

HKEY_LOCAL_MACHINE\BCD00000000
HKEY_LOCAL_MACHINE\HARDWARE
HKEY_LOCAL_MACHINE\SAM
HKEY_LOCAL_MACHINE\SECURITY
HKEY_LOCAL_MACHINE\SOFTWARE
HKEY_LOCAL_MACHINE\SYSTEM

wevtutil

Wevtutil is a command line utility to view and manage system logs in the Windows environment. It is available in most modern versions of Windows and is callable from Git Bash.

Common Command Parameters

el

Enumerate available logs

qe

Query a log’s events

Common Command Options

/c

Specify the maximum number of events to read

/f

Format the output as text or XML

/rd

Read direction, if set to true it will read the most recent logs first

Warning

In the Windows command prompt only a single / is needed before command options. In the Git Bash terminal two slashes are needed (e.g., //c) due to the way commands are processed.

Command Example

To list all of the available logs:

wevtutil el

To view the most recent event in the System log using Git Bash:

wevtutil qe System //c:1 //rd:true
Tip

For additional information see Microsoft’s documentation at https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/wevtutil

Gathering System Information

One of the first steps in defending a system is understanding the state of the system and what it is doing. To accomplish this you need to gather data, either locally or remotely, for analysis.

Executing a Command Remotely Using SSH

The data you want may not always be available locally. You may need to connect to a remote system such as a web, File Transfer Protocol (FTP), or Secure Shell (SSH) server to obtain the desired data.

Commands can be executed remotely and securely using the Secure Shell (SSH) if the remote system is running the SSH service. In its basic form (no options) you can just add ssh and a hostname in front of any shell command to run that command on the specified host. For example, ssh myserver who will run the who command on the remote machine myserver. If you need to specify a different username, ssh username@myserver who and ssh -l username myserver who both do the same thing; just replace username with the username you would like to use to log in. You can redirect the output to a file on your local system, or to a file on the remote system.

To run a command on a remote system and redirect the output to a file on your local system:

ssh myserver ps > /tmp/ps.out

To run a command on a remote system and redirect the output to a file on the remote system:

ssh myserver ps \> /tmp/ps.out

The backslash will escape the special meaning of the redirect (in the current shell) and simply pass the redirect character as the second word of the three words sent to myserver. When executed on the remote system it will be interpreted by that shell and redirect the output on the remote machine (myserver) and leave it there.

In addition you can take scripts that reside on your local system and run them on a remote system using SSH. To run the osdetect.sh script remotely:

ssh myserver bash < ./osdetect.sh

This runs the bash command on the remote system, but passes into it the lines of the osdetect.sh script directly from your local system. This avoids the need for a two-step process of, first, transferring the script to the remote system and then running that copied script. Output from running the script comes back to your local system and can be captured by re-directing stdout as we have shown with many other commands.

Gathering Linux Log Files

Log files for a Linux system are normally stored in the /var/log/ directory. To easily collect the log files into a single file use the tar command:

tar -czf ${HOSTNAME}_logs.tar.gz /var/log/

The option -c is used to create an archive file, -z to zip the file, and -f to specify a name for the output file. The HOSTNAME variable is a bash variable that is automatically set by the shell to the name of the current host. We include it in our filename so the output file will be given the same name as the system, which will help later with organization if logs are collected from multiple systems. Note that you will need to be logged in as a privileged user or use sudo in order to successfully copy the log files.
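To confirm what was gathered, you can list the archive’s contents without extracting it by replacing the -c (create) option with -t (list):

tar -tzf ${HOSTNAME}_logs.tar.gz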

Table 4-2 lists some important and common Linux logs and their standard location.

Table 4-2. Linux Log Files

Log Location        Description
/var/log/apache2/   Access and error logs for the Apache web server
/var/log/auth.log   Information on user logins, privileged access, and remote authentication
/var/log/kern.log   Kernel logs
/var/log/messages   General non-critical system information
/var/log/syslog     General system logs

To find more information on where log files are being stored for a given system refer to /etc/syslog.conf or /etc/rsyslog.conf on most Linux distributions.

Gathering Windows Log Files

In the Windows environment wevtutil can be used to manipulate and gather log files. Luckily this command is callable from Git Bash. The winlogs.sh script uses the wevtutil el parameter to list all available logs, and then the epl parameter to export each log to a file.

Example 4-2. winlogs.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# winlogs.sh
#
# Description:
# Gather copies of Windows log files
#
# Usage:
# winlogs.sh [-z]
#   -z Tar and zip the output
#

TGZ=0
if (( $# > 0 ))						1
then
    if [[ ${1:0:2} == '-z' ]]				2
    then
	TGZ=1	# tgz flag to tar/zip the log files
	shift	# discard the -z argument now that it has been noted
    fi
fi
SYSNAM=$(hostname)
LOGDIR=${1:-/tmp/${SYSNAM}_logs}			3

mkdir -p $LOGDIR					4

wevtutil el | while read ALOG				5
do
    ALOG="${ALOG%$'\r'}"				6
    echo "${ALOG}:"					7
    wevtutil epl "$ALOG"  "${LOGDIR}/${SYSNAM}_${ALOG// /_}.evtx"  8
done

if (( TGZ == 1 ))					  9
then
    cd ${LOGDIR} && tar -czvf ${SYSNAM}_logs.tgz *.evtx   10
fi
1

The script begins with a simple initialization and then an if statement, one that checks to see if any arguments were provided to the script. The $# is a special shell variable whose value is the number of arguments supplied on the command line when this script is invoked. This conditional for the if is an arithmetic expression, because of the double parentheses. Therefore the comparison can use the greater-than character > and it will do a numerical comparison. If that symbol is used in an if expression with square brackets rather than double parentheses, the greater-than character > does a comparison of lexical ordering — alphabetical order. You would need to use -gt for a numerical comparison inside square brackets.

For this script the only argument we are supporting is a -z option to indicate that the log files should all be zipped up into a single tar file when it's done collecting log files. This also means that we can use a simplistic type of argument parsing. We will use a more sophisticated argument parser (getopts) in an upcoming script.

2

This check takes a substring of the first argument ($1) starting at the beginning of the string (an offset of zero), two characters long. If the argument is, in fact, a -z then we set a flag. The script also does a shift to remove that argument. What was the second argument, if any, is now the first. The third, if any, becomes the second, and so on.
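
Here is a small interactive sketch of the same substring test and shift, run with hypothetical arguments:

$ set -- -z /tmp/mylogs     # simulate two command-line arguments
$ echo "${1:0:2}"           # two characters starting at offset zero of $1
-z
$ shift                     # discard -z; what was $2 is now $1
$ echo "$1"
/tmp/mylogs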

3

If the user wants to specify a location for the logs it can be specified as an argument to the script. The optional -z argument, if supplied, has already been shift-ed out of the way, so any user-supplied path would now be the first argument. If no value was supplied on the command line then the expression inside the braces will return a default value as indicated to the right of the minus sign. We use the braces around SYSNAM because the _logs would otherwise be considered part of the variable name.

4

The -p option to mkdir will create the directory and any intervening directories. It will also not give an error message if the directory exists.

5

Here we invoke wevtutil el to list all the possible log files. The output is piped into a while loop which will read one line, that is, one log filename, at a time.

6

Since this is running on an MSWindows system each line printed by wevtutil will end with both a newline (\n) and a return (\r) character. We remove the return character from the right-hand side of the string using the % operator. To specify the (nonprinting) return character, we use the $'string' construct, which substitutes certain backslash-escaped characters with nonprinting characters (as defined in the ANSI C standard). So the two characters of \r are replaced with an ASCII 13 character, the return character.

7

We echo the filename to provide an indication to the user of progress being made and which log is currently being fetched.

8

The fourth word on this line is the filename into which we want wevtutil to store the log file it is producing. Since the name of the log as provided may have blanks we replace any blank with an underscore character. While not strictly necessary, it avoids requiring quotes when using the filename. The syntax, in general, is ${VAR/old/new} to retrieve the value of VAR with a substitution: replacing old with new. Using a double slash, ${VAR//old/new} replaces all occurrences, not just the first.

Warning

A common mistake is to type ${VAR/old/new/} but the trailing slash is not part of the syntax and will simply be added to the resulting string if a substitution is made. For example, if VAR=embolden then ${VAR/old/new/} would return embnew/en.

9

This is another arithmetic expression, enclosed in double parentheses. Within those expressions bash doesn’t require the $ in front of most variable names. It would still be needed for positional parameters like $1 to avoid confusion with the integer 1.

10

Here we separate two commands with a double ampersand && which tells the shell to execute the second command only if the first command succeeds. That way the tar doesn’t happen unless the cd is successful.

Gathering System Information

If you are able to arbitrarily execute commands on a system you can use standard OS commands to collect a variety of information about the system. The exact commands you use will vary based on the operating system you are interfacing with. Table 4-3 shows common commands that can yield a great deal of information from a system. Note that the command may be different depending on whether it is run within the Linux or Windows environment.

Table 4-3. Local Data Gathering Commands
Linux Command       Windows Git Bash Equivalent   Purpose
uname -a            uname -a                      Operating system version information
cat /proc/cpuinfo   systeminfo                    Display system hardware and related info
ifconfig            ipconfig                      Network interface information
route               route print                   Display routing table
arp -a              arp -a                        Display Address Resolution Protocol (ARP) table
netstat -a          netstat -a                    Display network connections
mount               net share                     Display file systems
ps -e               tasklist                      Display running processes

The script getlocal.sh, below, is designed to identify the operating system type using osdetect.sh, run the various commands appropriate for that operating system type, and record the results to a file. The output from each command is stored in Extensible Markup Language (XML) format, i.e., delimited with XML tags, for easier processing later on. Invoke the script like this: bash getlocal.sh < cmds.txt, where the file cmds.txt contains a list of commands similar to that shown in Table 4-3. The format it expects is those fields, separated by vertical bars, plus an additional field: the XML tag with which to mark the output of the command. (Also, lines beginning with a # are considered comments and will be ignored.)

Here is what a cmds.txt file might look like:

# Linux Command  |MSWin  Bash |XML tag    |Purpose
#----------------+------------+-----------+------------------------------
uname -a         |uname -a    |uname      |O.S. version etc
cat /proc/cpuinfo|systeminfo  |sysinfo    |system hardware and related info
ifconfig         |ipconfig    |nwinterface|Network interface information
route            |route print |nwroute    |routing table
arp -a           |arp -a      |nwarp      |ARP table
netstat -a       |netstat -a  |netstat    |network connections
mount            |net share   |diskinfo   |mounted disks
ps -e            |tasklist    |processes  |running processes

Here is the source for the script.

Example 4-3. getlocal.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# getlocal.sh
#
# Description:
# Gathers general system information and dumps it to a file
#
# Usage:
# bash getlocal.sh < cmds.txt
#   cmds.txt is a file with list of commands to run
#

# SepCmds - separate the commands from the line of input
function SepCmds()
{
      LCMD=${ALINE%%|*}                   11
      REST=${ALINE#*|}                    12
      WCMD=${REST%%|*}                    13
      REST=${REST#*|}
      TAG=${REST%%|*}                     14

      if [[ $OSTYPE == "MSWin" ]]
      then
         CMD="$WCMD"
      else
         CMD="$LCMD"
      fi
}

function DumpInfo ()
{                                                              5
    printf '<systeminfo host="%s" type="%s"' "$HOSTNAME" "$OSTYPE"
    printf ' date="%s" time="%s">\n' "$(date '+%F')" "$(date '+%T')"
    readarray CMDS                           6
    for ALINE in "${CMDS[@]}"                7
    do
       # ignore comments
       if [[ ${ALINE:0:1} == '#' ]] ; then continue ; fi     8

      SepCmds

      if [[ ${CMD:0:3} == N/A ]]             9
      then
          continue
      else
          printf "<%s>\n" $TAG               10
          $CMD
          printf "</%s>\n" $TAG
      fi
    done
    printf "</systeminfo>\n"
}

OSTYPE=$(./osdetect.sh)                     1
HOSTNM=$(hostname)                          2
TMPFILE="${HOSTNM}.info"                    3

# gather the info into the tmp file; errors, too
DumpInfo  > $TMPFILE  2>&1                  4
1

After the two function definitions the script begins here, invoking our osdetect.sh script (from a previous chapter). We’ve specified the current directory as its location. You could put it elsewhere but then be sure to change the specified path from ./ to wherever you put it and/or add that location to your PATH variable.

Note

To make things more efficient you can include the code from osdetect.sh directly in getlocal.sh.

2

Next we run the hostname program in a subshell to retrieve the name of this system for use in the next line but also later in the DumpInfo function.

3

We use the hostname as part of the temporary filename where we will put all our output.

4

Here is where we invoke the function that will do most of the work of this script. We redirect both stdout and stderr (to the same file) when invoking the function so that the function doesn’t have to put redirects on any of its output statements; it can write to stdout and this invocation will redirect all the output as needed. Another way to do this would have been to put the redirect on the closing brace of the DumpInfo function definition. Redirecting stdout might instead be left to the user who invokes this script; it would simply write to stdout by default. But if the user wants the output in a file, the user has to create a tempfile name and has to remember to redirect stderr as well. Our approach is suitable for a less experienced user.
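
Here is a minimal sketch of that alternative, with the redirect attached to the function definition's closing brace; the redirect is then performed every time the function is called:

function DumpInfo ()
{
    printf '<systeminfo host="%s">\n' "$HOSTNAME"
    # ... remaining output statements unchanged ...
    printf '</systeminfo>\n'
} > "$TMPFILE" 2>&1        # the redirect now happens on every call to DumpInfo

TMPFILE="demo.info"        # must be set before DumpInfo is called
DumpInfo                   # no redirect needed at the call site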

5

Here is where the “guts” of the script begins. This function begins by outputting an XML tag called <systeminfo>, which will have its closing tag written out at the end of this function.

6

The readarray command in bash will read all the lines of input (until end-of-file or on keyboard input until control-D). Each line will be its own entry in the array named, in this case, CMDS.

7

This for loop will loop over the values of the CMDS array, that is, over each line, one at a time.

8

This line uses the substring operation to take the character at position 0, of length 1, from the variable ALINE. The hashtag (or pound sign) is in quotes so that the shell doesn’t interpret it as the start of the script’s own comment.

If the line is not a comment, the script will call the SepCmds function. More about that function later; it separates the line of input into CMD and TAG, where CMD will be the appropriate command for a Linux or MSWindows system depending on where we run the script.

9

Here again we use the substring operation from the start of the string (position 0), of length 3, to look for the string that indicates that there is no appropriate command on this particular operating system for the desired information. The continue statement tells bash to skip to the next iteration of the loop.

10

If we do have an appropriate action to take, this section of code will print the specified XML tag on either side of the invocation of the specified command. Notice that we just invoke the command by retrieving the value of the variable CMD.

11

Here we isolate the Linux command from a line of our input file by removing all the characters to the right of the vertical bar, including the bar itself. The %% says to make the longest match possible on the right side of the variable’s value and remove it from the value it returns (i.e., ALINE isn’t changed).

12

Here the # removes the shortest match from the left-hand side of the variable’s value. Thus, it removes the Linux command that was just put in LCMD.

13

Again we remove everything to the right of the vertical bar but this time we are working with REST, modified in the previous statement. This gives us the MSWindows command.

14

Here we extract the XML tag using the same substitution operations we’ve seen twice already.

All that’s left in this function is the decision, based on the operating system type, as to which value to return as the value in CMD. All variables are “global” unless explicitly declared as local within a function. None of ours are local, so they can be set, changed, or read throughout the script.
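
To see those substitution operators in isolation, here is a short interactive sketch using a simplified (unpadded) line in the cmds.txt format:

$ ALINE='route|route print|nwroute|routing table'
$ echo "${ALINE%%|*}"       # remove longest match of '|*' from the right: the Linux command
route
$ REST=${ALINE#*|}          # remove shortest match of '*|' from the left
$ echo "${REST%%|*}"        # the MSWindows command
route print
$ REST=${REST#*|}
$ echo "${REST%%|*}"        # the XML tag
nwroute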

When running this script you can use the cmds.txt file as shown or change its values to get whatever set of information you want to collect. You can also run it without redirecting the input from a file; simply type (or copy/paste) the input once the script is invoked.

Gathering the Windows Registry

The Windows Registry is a vast repository of settings that define how the system and applications will behave. Specific registry key values can often be used to identify the presence of malware and other intrusions. Because of that a copy of the registry is useful when later performing analysis of the system.

To export the entire Windows Registry to a file:

regedit //E ${HOSTNAME}_reg.bak

Note that two forward slashes are used before the E option because we are calling regedit from Git Bash; only one would be needed if using the Windows command prompt. We use ${HOSTNAME} as part of the output filename to make it easier to organize later on.

If needed, the reg command can also be used to export sections of the registry or individual subkeys. To export the HKEY_LOCAL_MACHINE hive:

reg export HKEY_LOCAL_MACHINE $(uname -n)_hklm.bak
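
For example, attackers often establish persistence via the Run key, so a copy of just that subkey can be handy (the key path shown is simply a common example):

reg export 'HKLM\Software\Microsoft\Windows\CurrentVersion\Run' $(uname -n)_run.bak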

Searching the File System

The ability to search the system is critical for everything from organizing files, to incident response, to forensic investigation. The find and grep commands are extremely powerful and can be used to perform a variety of search functions.

Searching by Filename

Searching by filename is one of the most basic search methods. This is useful if the exact filename is known, or a portion of the filename is known. To search the /home directory and subdirectories for filenames containing the word password:

find /home -name '*password*'

Note that the * character at the beginning and end of the search string designates a wildcard, meaning it will match any (or no) characters. This is a shell pattern and is not the same as a regular expression. Additionally, you can use the -iname option instead of -name to make the search case-insensitive.

Tip

If you want to suppress errors, such as Permission Denied, when using find you can do so by redirecting stderr to /dev/null or to a log file.

find /home -name '*password*' 2>/dev/null

Searching for Hidden Files

Hidden files are often interesting as they can be used by people or malware looking to avoid detection. In Linux, names of hidden files begin with a period. To find hidden files in the /home directory and subdirectories:

find /home -name '.*'
Tip

The .* in the example above is a shell pattern which is not the same as a regular expression. In the context of find the pattern provided will match on any file that begins with a period and is followed by any number of additional characters (denoted by the * wildcard character).

In Windows, hidden files are designated by a file attribute, not the filename. From the Windows command prompt you can identify hidden files on the c:\ drive by:

dir c:\ /S /A:H

The /S option tells dir to recursively traverse subdirectories and the /A:H displays files with the hidden attribute. Unfortunately Git Bash intercepts the dir command and instead executes ls, which means it cannot easily be run from bash. This can be solved by using the find command’s -exec option coupled with the Windows attrib command.

The find command has the ability to run a specified command for each file that is found. To do that you can use the -exec option after specifying your search criteria. The -exec option replaces any curly brackets ({}) with the pathname of the file that was found. The semicolon terminates the command expression.

$ find /c -exec attrib '{}' \; | egrep '^.{4}H.*'

A   H                C:\Users\Bob\scripts\hist.txt
A   HR               C:\Users\Bob\scripts\winlogs.sh

The find command will execute the Windows attrib command for each file it identifies on the c:\ drive (denoted as /c), thereby printing out each file’s attributes. The egrep command is then used with a regular expression to identify lines where the 5th character is the letter H, which will be true if the file’s hidden attribute is set.

If you want to clean up the output further and only display the file path you can do so by piping the output of egrep into the cut command.

$ find . -exec attrib '{}' \; | egrep '^.{4}H.*' | cut -c22-

C:\Users\Bob\scripts\hist.txt
C:\Users\Bob\scripts\winlogs.sh

The -c option tells cut to use character position numbers for slicing. 22- tells cut to begin at character 22, which is the beginning of the file path, and continue to the end of the line (-). This can be useful if you want to pipe the file path into another command for further processing.

Searching by File Size

The find command’s -size option can be used to find files based on file size. This can be useful to help identify unusually large files, or to identify the largest or smallest files on a system.

To search for files greater than 5 GB in size in the /home directory and subdirectories:

find /home -size +5G

To identify the largest files in the system you can combine find with a few other commands:

find / -type f -exec ls -s '{}' \; | sort -n -r | head -5

First we use find / -type f to list all of the files in and under the root directory. Each file is passed to ls -s which will identify its size in blocks (not bytes). The list is then sorted from highest to lowest, and the top five are displayed using head. To see the smallest files in the system tail can be used in place of head, or you can remove the reverse (-r) option from sort.

Tip

In the shell you can use !! to represent the last command that was executed. You can use this to execute a command again, or include it in a series of piped commands. For example, suppose you just ran the following command:

find / -type f -exec ls -s '{}' \;

You can then use !! to run that command again or feed it into a pipeline.

!! | sort -n -r | head -5

The shell will automatically replace !! with the last command that was executed.

Give it a try!

You can also use the ls command directly to find the largest files and completely eliminate the use of find, which is significantly more efficient. To do that just add the -R option to ls, which will cause it to recursively list the files under the specified directory.

ls / -R -s | sort -n -r | head -5

Searching by Time

The file system can also be searched based on when files were last accessed or modified. This can be useful when investigating incidents to identify recent system activity. It can also be useful for malware analysis to identify files that have been accessed or modified during program execution.

To search for files in the /home directory and subdirectories modified less than 5 minutes ago:

find /home -mmin -5

To search for files modified less than 24 hours ago:

find /home -mtime -1

The number specified with the mtime option is a multiple of 24 hours, so 1 means 24 hours, 2 means 48 hours, etc. A negative number here means “less than” the number specified, a positive number means “greater than”, and an unsigned number means “exactly”.

To search for files modified more than 2 days, i.e., 48 hours, ago:

find /home -mtime +2

To search for files accessed less than 24 hours ago use the -atime option:

find /home -atime -1

To search for files in the /home directory accessed less than 24 hours ago and copy (cp) each file to the current working directory (./):

find /home -type f -atime -1 -exec cp '{}' ./ \;

The use of -type f tells find to match only ordinary files, ignoring directories and other special file types. You may also copy the files to any directory of your choosing by replacing the ./ with an absolute or relative path.

Warning

Be sure that your current working directory is not somewhere in the /home hierarchy or you will have the copies found and thus copied again.

Searching for Content

The grep command can be used to search for content inside of files. To search for files in the /home directory and subdirectories that contain the string password:

grep -r -i /home -e 'password'

The -r option recursively searches all directories below /home, -i specifies a case-insensitive search, and -e specifies the regex pattern string to search for.

Tip

The -n option can be used to identify the line number on which the search string is found, and -w can be used to match only whole words.
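
For example, to show the matching line numbers and match only the whole word password:

grep -r -i -n -w /home -e 'password'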

You can combine grep with find to easily copy matching files to your current working directory (or any specified directory):

find /home -type f -exec grep '{}' -e 'password' \; -exec cp '{}' ./ \;

First we use find /home -type f to identify all of the files in and below the /home directory. Each file found is passed to grep to search for password within its content. Each file matching the grep criteria is then passed to the cp command to copy the file to the current working directory (./). This combination of commands may take a considerable amount of time to execute and is a good candidate to run as a background task.

Searching by File Type

Searching a system for specific file types can be challenging. You cannot rely on the file extension, if one even exists, as that can be manipulated by the user. Thankfully the file command can help identify types by comparing the contents of a file to known patterns called Magic Numbers. Table 4-4 lists common Magic Numbers and their starting location inside of files.

Table 4-4. Magic Numbers
File Type                        Magic Number Pattern (Hex)   Magic Number Pattern (ASCII)   File Offset (Bytes)
JPEG                             FF D8 FF DB                  ÿØÿÛ                           0
DOS Executable                   4D 5A                        MZ                             0
Executable and Linkable Format   7F 45 4C 46                  .ELF                           0
Zip File                         50 4B 03 04                  PK..                           0

To begin you need to identify the type of file for which you want to search. Let's assume you want to find all of the PNG image files on the system. First you would take a known-good file such as Title.png, run it through the file command, and examine the output.

$ file Title.png

Title.png: PNG image data, 366 x 84, 8-bit/color RGBA, non-interlaced

As expected file identifies the known good Title.png file as PNG image data and also provides the dimensions and various other attributes. Based on this information you need to determine what part of the file command output to use for the search, and generate the appropriate regular expression. In many cases, such as with forensic discovery, you are likely better off gathering more information than less; you can always further filter the data later. To do that you will use a very broad regular expression that will simply search for the word PNG in the output from the file command.

.*PNG.*

You can of course make more advanced regular expressions to identify specific files. For example, if you wanted to find PNG files that have a dimension of 100 x 100:

.*PNG.* 100 x 100.*

If you want to find PNG and JPEG files:

.*(PNG|JPEG).*

Once you have the regular expression you can write a script to run the file command against every file on the system looking for a match. When a match is found typesearch.sh will print the file path to standard out.

Example 4-4. typesearch.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# typesearch.sh
#
# Description:
# Search the file system for a given file type. It prints out the
# pathname when found.
#
# Usage:
# typesearch.sh [-c dir] [-i] [-R|r] <pattern> <path>
#   -c Copy files found to dir
#   -i Ignore case
#   -R|r Recursively search subdirectories
#   <pattern> File type pattern to search for
#   <path> Path to start search
#

DEEPORNOT="-maxdepth 1"		# just the current dir; default

# PARSE option arguments:
while getopts 'c:irR' opt; do                         1
  case "${opt}" in                                    2
    c) # copy found files to specified directory
	       COPY=YES
	       DESTDIR="$OPTARG"                             3
	       ;;
    i) # ignore u/l case differences in search
	       CASEMATCH='-i'
	       ;;
    [Rr]) # recursive                                 4
        unset DEEPORNOT;;                             5
    *)  # unknown/unsupported option                  6
        # error mesg will come from getopts, so just exit
        exit 2 ;;
  esac
done
shift $((OPTIND - 1))                                 7


PATTERN=${1:-PDF document}                            8
STARTDIR=${2:-.}	# by default start here

find $STARTDIR $DEEPORNOT -type f | while read FN     9
do
    file $FN | grep -q $CASEMATCH "$PATTERN"          10
    if (( $? == 0 ))   # found one                    11
    then
	        echo $FN
	        if [[ $COPY ]]                               12
	        then
	            cp -p $FN $DESTDIR                       13
	        fi
    fi
done
1

This script supports options that alter its behavior, as described in the opening comments of the script. The script needs to parse these options to tell which ones have been provided and which are omitted. For anything more than a single option or two it makes sense to use the getopts shell built-in. With the while loop we keep calling getopts until it returns a non-zero value, telling us that there are no more options. The options we want to look for are provided in the string c:irR. Whichever option is found is returned in opt, the variable name we supplied.

2

We are using a case statement here, which is a multiway branch; it will take the branch whose pattern, given before the closing parenthesis, matches. We could have used an if/elif/else construct, but this reads well and makes the options clearly visible.

3

The c option has a : after it in the list of supported options which indicates to getopts that the user will also supply an argument for that option. For this script that optional argument is the directory into which copies will be made. When getopts parses an option with an argument like this it puts the argument in the variable named OPTARG and we save it in DESTDIR because another call to getopts may change OPTARG.

4

The script supports either an uppercase R or lowercase r for this option. Case statements specify a pattern to be matched, not just a simple literal, so we wrote [Rr]) for this case, using the brackets construct to indicate that either letter is considered a match.

5

The other options set variables to cause their action to occur. In this case we unset the previously set variable. When that variable is referenced later as $DEEPORNOT it will have no value so it will effectively disappear from the command line where it is used.

6

Here is another pattern, the asterisk, which matches anything. If no other pattern has been matched, this case will be executed. It is, in effect, an “else” clause for the case statement.

7

When we’re done parsing the options we can get rid of the ones we’ve already processed with a shift. A single shift gets rid of a single argument, so that the second argument becomes the first, the third becomes the second, and so on. Specifying a number like shift 5 will get rid of the first 5 arguments so that $6 becomes $1, $7 becomes $2, and so on. Calls to getopts keep track of which arguments to process in the shell variable OPTIND. It refers to the next argument to be processed. By shifting by this amount we get rid of any and all of the options that we parsed. After this shift, $1 will refer to the first non-option argument, whether or not any options were supplied when the user invoked the script.
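
Here is a minimal, standalone sketch of this getopts pattern; the option letters and variable names are purely illustrative:

#!/bin/bash -
# getopts demo: supports -v (a flag) and -o <file> (an option with an argument)
while getopts 'vo:' opt
do
    case "${opt}" in
        v) VERBOSE=YES ;;
        o) OUTFILE="$OPTARG" ;;
        *) exit 2 ;;
    esac
done
shift $((OPTIND - 1))                 # discard the options just parsed
echo "verbose=${VERBOSE:-NO} outfile=${OUTFILE:-none} first arg=${1:-none}"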

8

The two possible arguments that aren’t -option format are the pattern we’re searching for and the directory where we want to start our search. When we refer to a bash variable we can add a :- to say “if that value is empty or unset then return this default value instead”. We give a default value for PATTERN as PDF document and the default for STARTDIR is . which refers to the current directory.

9

We invoke the find command, telling it to start its search in $STARTDIR. Remember that $DEEPORNOT may be unset and thus add nothing to the command line, or it may be the default -maxdepth 1, telling find not to go any deeper than this directory. We’ve added a -type f so that we only find plain files (not directories, special device files, or FIFOs). That isn’t strictly necessary and you could remove it if you want to be able to search for those kinds of files. The names of the files found are piped into the while loop, which will read them one at a time into the variable FN.

10

The -q option to grep tells it to be quiet and not output anything. We don’t need to see what phrase it found, only that it found it.

11

The $? construct is the value returned by the previous command. A successful result means that grep found the pattern supplied.

12

This checks to see if COPY has a value. If it is null the if will be false.

13

The -p option to the cp command will preserve the mode, ownership and timestamps of the file, in case that information is important to your analysis.

If you are looking for a lighter weight but less capable solution you can perform a similar search using the find command’s exec option as seen in the example below.

find / -type f -exec file '{}' \; | egrep '.*PNG.*' | cut -d' ' -f1

Here we send each item found by the find command into file to identify its type. We then pipe the output of file into egrep and filter it looking for the PNG keyword. The use of cut is simply to clean up the output and make it more readable.

Warning

Be cautious if using the file command on an untrusted system. The file command uses the magic pattern file located at /usr/share/misc/. A malicious user could modify this file such that certain file types would not be identified. A better option is to mount the suspect drive to a known-good system and search from there.

Searching by Message Digest Value

A cryptographic hash function is a one-way function that transforms an input message of arbitrary length into a fixed length message digest. Common hash algorithms include MD5, SHA-1, and SHA-256. Take the following two files:

Example 4-5. hashfilea.txt
This is hash file A
Example 4-6. hashfileb.txt
This is hash file B

Notice that the files are identical except for the last letter in the sentence. You can use the sha1sum command to compute the SHA-1 message digest of each file.

$ sha1sum hashfilea.txt hashfileb.txt

6a07fe595f9b5b717ed7daf97b360ab231e7bbe8 *hashfilea.txt
2959e3362166c89b38d900661f5265226331782b *hashfileb.txt

Even though there was only a small difference between the two files, they generated completely different message digests. Had the files been the same, the message digests would also have been the same. You can use this property of hashing to search the system for a specific file if you know its digest. The advantage is that the search will not be influenced by the filename, location, or any other attributes; the disadvantage is that the files need to be exactly the same. If the file contents have changed in any way the search will fail.

Example 4-7. hashsearch.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# hashsearch.sh
#
# Description:
# Recursively search a given directory for a file that
# matches a given SHA-1 hash
#
# Usage:
# hashsearch.sh <hash> <directory>
#   hash - SHA-1 hash value of the file to find
#   directory - Top directory to start search
#

HASH=$1
DIR=${2:-.}	# default is here, cwd

# convert pathname into an absolute path
function mkabspath ()				6
{
    if [[ $1 == /* ]]				7
    then
    	ABS=$1
    else
    	ABS="$PWD/$1"				8
    fi
}

find $DIR -type f |				1
while read fn
do
    THISONE=$(sha1sum "$fn")			2
    THISONE=${THISONE%% *}			3
    if [[ $THISONE == $HASH ]]
    then
	mkabspath "$fn"				4
	echo $ABS				5
    fi
done
1

We’ll look for any plain file for our hash. We need to avoid special files - reading a FIFO would cause our program to hang as it waited for someone to write into the FIFO. Reading a block special or character special file would also not be a good idea. The -type f assures that we only get plain files. It prints those filenames, one per line, to stdout which we redirect via a pipe into the while read commands.

2

This computes the hash value in a subshell and captures its output (i.e., whatever it writes to stdout) and assigns it to the variable. The quotes are needed in case the filename has spaces in its name.

3

This reassignment removes from the right hand side the largest substring beginning with a space. The output from sha1sum is both the computed hash and the filename. We only want the hash value, so we remove the filename with this substitution.

4

We call the mkabspath function putting the filename in quotes. The quotes make sure that the entire filename shows up as a single argument to the function, even if the filename has one or more spaces in the name.

5

Remember that shell variables are global unless declared to be local within a function. Therefore the value of ABS that was set in the call to mkabspath is available to us here.

6

This is our declaration of the function. When declaring a function you can omit either the keyword function or the parentheses but not both.

7

For the comparison we are using shell pattern matching on the right hand side. This will check to see if the first parameter begins with a slash. If it does, then this is already an absolute pathname and we need do nothing further.

8

When the parameter is only a relative path, it is relative to the current location so we pre-pend the current working directory thereby making it absolute. The variable PWD is a shell variable that is set to the current directory via the cd command.

Transferring Data

Once you have gathered all of the desired data, the next step is to move it off of the origin system for further analysis. To do that you can copy the data to a removable device or upload it to a centralized server. If you are going to upload the data be sure to do so using a secure method such as Secure Copy (SCP). The example below uses scp to upload the file some_system.tar.gz to the home directory of user bob on remote system 10.0.0.45.

scp some_system.tar.gz bob@10.0.0.45:/home/bob/some_system.tar.gz

For convenience you can add a line at the end of your collection scripts to automatically use scp to upload data to a specified host. Remember to give your files unique names so as not to overwrite existing files, and also to make analysis easier later on.
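
For example, a line like this at the end of a collection script would upload the archive built earlier in this chapter; the username and IP address are placeholders, and the date is added to the remote filename to help keep it unique:

scp ${HOSTNAME}_logs.tar.gz bob@10.0.0.45:/home/bob/${HOSTNAME}_$(date '+%F')_logs.tar.gz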

Warning

Be cautious of how you perform SSH or SCP authentication within scripts. It is not recommended that you include passwords in your scripts. The preferred method is to use SSH certificates. The keys and certificates can be generated using the ssh-keygen command.
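
A minimal sketch of setting up key-based authentication, assuming ssh-copy-id is available and using placeholder filenames and hosts:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/collection_key      # generate a key pair
ssh-copy-id -i ~/.ssh/collection_key.pub bob@10.0.0.45  # install the public key on the server
scp -i ~/.ssh/collection_key some_system.tar.gz bob@10.0.0.45:/home/bob/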

Summary

Gathering data is an important step in defensive security operations. When collecting data be sure to transfer and store it using secure methods (i.e. encrypted). As a general rule, gather all data that you think is relevant; you can easily delete data later, but you cannot analyze data you did not collect. Before collecting data, first confirm you have permission and/or legal authority to do so.

Also be aware that when dealing with adversaries, they will often try to hide their presence by deleting or obfuscating data. To counter that be sure to use multiple methods when searching for files (name, hash, contents, etc).

In the next chapter we will explore techniques for processing data and preparing it for analysis.

Exercises

  1. Write the command to search the file system for any file named dog.png.

  2. Write the command to search the file system for any file containing the text confidential.

  3. Write the command to search the file system for any file containing the text secret or confidential and copy the file to your current working directory.

  4. Write the command to execute ls -R / on the remote system 192.168.10.32 and write the output to a file named filelist.txt on your local system.

  5. Modify getlocal.sh to automatically upload the results to a specified server using SCP.

  6. Modify hashsearch.sh to have an option (-1) to quit after finding a match. If the option is not specified, it will keep searching for additional matches.

  7. Modify hashsearch.sh to simplify the full pathname that it prints out.

    1. If the string it output was /home/usr07/subdir/./misc/x.data modify it to remove the redundant ./ before printing it out.

    2. If the string was /home/usr/07/subdir/../misc/x.data modify it to remove the ../ and also the subdir/ before printing it out.

  8. Modify winlogs.sh to indicate its progress by printing the logfile name over the top of the previous logfile name. (Hint: use a return character rather than a newline)

  9. Modify winlogs.sh to show a simple progress bar of plus signs building from left to right. Use a separate invocation of wevtutil el to get the count of the number of logs and scale this to, say, a width of 60.

Chapter 5. Data Processing

In the previous chapter you gathered lots of data. Likely that data is in a variety of formats including free-form text, comma separated values (CSV), and Extensible Markup Language (XML). In this chapter we show you how to parse and manipulate that data so you can extract key elements for analysis.

Commands in Use

We introduce awk, join, sed, tail, and tr to prepare data for analysis.

awk

Awk is not just a command, but actually a programming language designed for processing text. There are entire books dedicated to this subject. Awk will be explained in more detail throughout this book, but here we provide just a brief example of its usage.

Common Command Options

-f

Read in the awk program from a specified file

Command Example

Take the file awkusers.txt:

Example 5-1. awkusers.txt
Mike Jones
John Smith
Kathy Jones
Jane Kennedy
Tim Scott

You can use awk to print each line where the user’s last name is Jones.

$ awk '$2 == "Jones" {print $0}' awkusers.txt

Mike Jones
Kathy Jones

Awk will iterate through each line of the input file reading in each word (separated by whitespace by default) into fields. Field $0 represents the entire line, $1 the first word, $2 the second word, etc. An awk program consists of patterns and corresponding code to be executed when that pattern is matched. In this example there is only one pattern. We test $2 to see if that field is equal to Jones. If it is, awk will run the code in the braces which, in this case, will print the entire line.

Note

If we left off the explicit comparison and instead wrote awk ' /Jones/ {print $0}' then the string inside the slashes is a regular expression to match anywhere in the input line. It would print all the names as before, but it would also find lines where Jones might be the first name or part of a longer name (such as “Jonestown”).
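
The -f option mentioned above lets you keep the awk program in its own file, which is convenient for longer programs. A sketch, assuming the pattern/action above is saved in a file named jones.awk:

$ cat jones.awk
$2 == "Jones" {print $0}
$ awk -f jones.awk awkusers.txt
Mike Jones
Kathy Jones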

join

Join combines the lines of two files that share a common field. In order for join to function properly the input files must be sorted.

Common Command Options

-j

Join using the specified field number. Fields start at 1.

-t

Specify the character to use as the field separator. Space is the default field separator.

--header

Use the first line of each file as a header.

Command Example

Take the following files:

Example 5-2. usernames.txt
1,jdoe
2,puser
3,jsmith
Example 5-3. accesstime.txt
0745,file1.txt,1
0830,file4.txt,2
0830,file5.txt,3

Both files share a common field of data, which is the user ID. In accesstime.txt the user ID is in the third column. In usernames.txt the user ID is in the first column. You can merge these two files using join as follows:

$ join -1 3 -2 1 -t, accesstime.txt usernames.txt

1,0745,file1.txt,jdoe
2,0830,file4.txt,puser
3,0830,file5.txt,jsmith

The -1 3 option tells join to use the third column in the first file (accesstime.txt), and -2 1 specifies the first column in the second file (usernames.txt) for use when merging the files. The -t, option specifies the comma character as the field delimiter.

sed

Sed allows you to perform edits, such as replacing characters, on a stream of data.

Common Command Options

-i

Edit the specified file and overwrite in place

Command Example

The sed command is quite powerful and can be used for a variety of functions; however, replacing characters or sequences of characters is one of the most common. Take the file ips.txt:

Example 5-4. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS

You can use sed to replace all of the instances of the 10.0.4.35 IP address with 10.0.4.27.

$ sed 's/10\.0\.4\.35/10.0.4.27/g' ips.txt

ip,OS
10.0.4.2,Windows 8
10.0.4.27,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS

In this example, sed uses the following format with each component separated by a forward slash:

s/<regular expression>/<replace with>/<flags>

The first part of the command (s) tells sed to substitute. The second part of the command (10\.0\.4\.35) is a regular expression pattern. The third part (10.0.4.27) is the value with which to replace the regex pattern matches. The fourth part is optional flags; in this case the g flag (for global) tells sed to replace all instances on a line (not just the first) that match the regex pattern.

tail

The tail command is used to output the last lines of a file. By default tail will output the last 10 lines of a file.

Common Command Options

-f

Continuously monitor the file and output lines as they are added

-n

Output the number of lines specified

Command Example

To output the last line in the somefile.txt file:

$ tail -n 1 somefile.txt

12/30/2017 192.168.10.185 login.html
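
The -f option is useful for watching a file as it grows, such as a live web server log; press Ctrl-C to stop (the path shown is just an example):

tail -f /var/log/apache2/access.log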

tr

The tr command is used to translate or map from one character to another. It is also often used to delete unwanted or extraneous characters. It only reads from stdin and writes to stdout so you typically see it with redirects for the input and output files.

Common Command Options

-d

delete the specified characters from the input stream

-s

squeeze, that is, replace repeated instances of a character with a single instance

Command Example

You can translate all the backslashes into forward slashes and all the colons to vertical bars with the tr command:

tr '\\:'  '/|' < infile.txt  > outfile.txt

If the contents of infile.txt looked like this:

drive:path\name
c:\Users\Default\file.txt

then after running the tr command, outfile.txt would contain this:

drive|path/name
c|/Users/Default/file.txt

The characters from the first argument are mapped to the corresponding characters in the second argument. Two backslashes are needed to specify a single backslash character because the backslash has a special meaning to tr; it is used to indicate special characters like newline (\n), return (\r), or tab (\t). You use single quotes around the arguments to avoid any special interpretation by bash.
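
The -s option can be sketched just as briefly; here it collapses runs of spaces into a single space, which is often useful before splitting fields on a space delimiter (the filenames are placeholders):

tr -s ' ' < spaced.txt > squeezed.txt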

Tip

Files from Windows systems often come with both a Carriage Return and a Line Feed (CR & LF) character at the end of each line. Linux and macOS systems will have only the newline character to end a line. If you transfer a file to Linux and want to get rid of those extra return characters, here is how you might do that with the tr command:

tr -d '\r' < fileWind.txt  > fileFixed.txt

Conversely, you can convert Linux line endings to Windows line endings using sed:

$ sed -i 's/$/\r/' fileLinux.txt

The -i option makes the changes in place and writes them back to the input file.

Processing Delimited Files

Many of the files you will collect and process are likely to contain text, which makes the ability to manipulate text from the command line a critical skill. Text files are often broken into fields using a delimiter such as a space, tab, or comma. One of the more common formats is known as Comma Separated Values (CSV). As the name indicates, CSV files are delimited using commas, and fields may or may not be surrounded in double quotes ("). The first line of a CSV file is often the field headers. Here is an example:

Example 5-5. csvex.txt
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4

To extract just the name from the file you can use cut by specifying the field delimiter as a comma and the field number you would like returned.

$ cut -d',' -f1 csvex.txt

"name"
"John Smith"
"Jane Smith"
"Bill Jones"

Note that the field values are still enclosed in double quotations. This may not be desirable for certain applications. To remove the quotations you can simply pipe the output into tr with its -d option.

$ cut -d',' -f1 csvex.txt | tr -d '"'

name
John Smith
Jane Smith
Bill Jones

You can further process the data by removing the field header using the tail command’s -n option.

$ cut -d',' -f1 csvex.txt | tr -d '"' | tail -n +2

John Smith
Jane Smith
Bill Jones

The -n +2 option tells tail to output the contents of the file starting at line number 2, thus removing the field header.

Tip

You can also give cut a list of fields to extract, such as -f1-3 to extract fields 1 through 3, or a list such as -f1,4 to extract fields 1 and 4.

Iterating Through Delimited Data

While you can use cut to extract entire columns of data, there are instances where you will want to process the file and extract fields line-by-line; in this case you are better off using awk.

Let’s suppose you want to check each user’s password hash in csvex.txt against the dictionary file of known passwords passwords.txt.

Example 5-6. csvex.txt
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4
Example 5-7. passwords.txt
password,md5hash
123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
welcome,40be4e59b9a2a2b5dffb918c0e86b3d7
ninja,3899dcbab79f92af727c2190bbd8abc5
abc123,e99a18c428cb38d5f260853678922e03
123456789,25f9e794323b453885f5181f1b624d0b
12345678,25d55ad283aa400af464c76d713c07ad
sunshine,0571749e2ac330a7455809c6b0e7af90
princess,8afa847f50a716e64932d995c8e7435a
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4

You can extract each user’s hash from csvex.txt using awk as follows:

$ awk -F "," '{print $4}' csvex.txt

"password hash"
5f4dcc3b5aa765d61d8327deb882cf99
e10adc3949ba59abbe56e057f20f883e
d8578edf8458ce06fbc5bb76a58c5ca4

By default awk uses the space character as a field delimiter, so the -F option is used to specify a custom field delimiter (,) and then print out the fourth field ($4), which is the password hash. You can then use grep to take the output from awk, one line at a time, and search for it in the passwords.txt dictionary file, outputting any matches.

$ grep "$(awk -F "," '{print $4}' csvex.txt)" passwords.txt

123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4

Processing by Character Position

If a file has fixed-width field sizes you can use the cut command’s -c option to extract data by character position. In csvex.txt the (U.S. 10-digit) phone number is an example of a fixed-width field.

$ cut -d',' -f3 csvex.txt | cut -c2-13 | tail -n +2

555-555-1212
555-555-1234
555-555-6789

Here you first use cut in delimited mode to extract the phone number at field 3. Since each phone number is the same number of characters you can use the cut character position option (-c) to extract the characters in between the quotations. Finally, tail is used to remove the file header.

Processing XML

Extensible Markup Language (XML) allows you to arbitrarily create tags and elements that describe data. Below is an example XML document.

Example 5-8. book.xml
<book title="Rapid Cybersecurity Ops" edition="1">
  <author>
    <firstName>Paul</firstName>
    <lastName>Troncone</lastName>
  </author>
  <author>
    <firstName>Carl</firstName>
    <lastName>Albing</lastName>
  </author>
</book>
1

This is a start tag that contains two attributes, also known as name/value pairs. Attribute values must always be quoted.

2

This is a start tag.

3

This is an element that has content.

4

This is an end tag.

For useful processing, you must be able to search through the XML and extract data from within the tags, which can be done using grep. Let's find all of the firstName elements. The -o option is used so only the text that matches the regex pattern will be returned, rather than the entire line.

$ grep -o '<firstName>.*<\/firstName>' book.xml

<firstName>Paul</firstName>
<firstName>Carl</firstName>

Note that the regex pattern above will only find the XML element if the start and end tags are on the same line. To find the pattern across multiple lines you need to make use of two special features. First, add the -z option to grep, which treats newlines like any ordinary character in its searching and adds a null (ASCII 0) at the end of each string it finds. Then add the -P option and (?s) to the regex pattern, which is a Perl-specific pattern match modifier. It modifies the . metacharacter to also match on the newline character.

$ grep -Pzo '(?s)<author>.*?<\/author>' book.xml

<author>
  <firstName>Paul</firstName>
  <lastName>Troncone</lastName>
</author><author>
  <firstName>Carl</firstName>
  <lastName>Albing</lastName>
</author>
Warning

The -P option is not available in all versions of grep, including the version included with macOS.

To strip the XML start and end tags and extract the content you can pipe your output into sed.

$ grep -Po '<firstName>.*?<\/firstName>' book.xml | sed 's/<[^>]*>//g'

Paul
Carl

The sed expression can be described as s/expr/other/ to replace (or substitute) some expression (expr) with something else (other). The expression can be just literal characters or a more complex regex. If an expression has no “other” portion, such as s/expr//, then it replaces anything that matches the regular expression with nothing, essentially removing it. The regex pattern we use in the above example, namely the <[^>]*> expression, is a little confusing, so let's break it down.

< - The pattern begins with a literal less-than character <

[^>]* - Zero or more (indicated by the asterisk) characters from the set of characters inside the brackets; the first character is a ^ which means “not” any of the remaining characters listed. Here that’s just the solitary greater-than character, so [^>] matches any character that is not >

> - The pattern ends with a literal >

This should match a single XML tag, from its opening less-than to its closing greater-than character, but not more than that.

Processing JSON

JavaScript Object Notation (JSON) is another popular file format, particularly for exchanging data through Application Programming Interfaces (APIs). JSON is a simple format that consists of objects, arrays, and name/value pairs. Here is a sample JSON file:

Example 5-9. book.json
{
  "title": "Rapid Cybersecurity Ops",
  "edition": 1,
  "authors": [
    {
      "firstName": "Paul",
      "lastName": "Troncone"
    },
    {
      "firstName": "Carl",
      "lastName": "Albing"
    }
  ]
}
1

This is an object. Objects begin with { and end with }.

2

This is a name/value pair. Values can be a string, number, array, boolean, or null.

3

This is an array. Arrays begin with [ and end with ].

Tip

For more information on the JSON format visit http://json.org/

When processing JSON you are likely going to want to extract key/value pairs. To do that you can use grep. Let's extract the firstName key/value pair from book.json.

$ grep -o '"firstName": ".*"' book.json

"firstName": "Paul"
"firstName": "Carl"

Again, the -o option is used to return only the characters that match the pattern rather than the entire line of the file.

If you want to remove the key and only display the value you can do so by piping the output into cut, extracting the second field, and removing the quotations with tr.

$ grep -o '"firstName": ".*"' book.json | cut -d " " -f2 | tr -d '\"'

Paul
Carl

We will perform more advanced processing of JSON in a later chapter.

Aggregating Data

Data is often collected from a variety of sources, and in a variety of files and formats. Before you can analyze the data you must get it all into the same place and in a format that is conducive to analysis.

Suppose you want to search a treasure trove of data files for any system named ProductionWebServer. Recall that in previous scripts we wrapped our collected data in XML tags with the format <systeminfo host="">. During collection we also named our files using the hostname. You can now use either of those attributes to find and aggregate the data into a single location.

find /data -type f -exec grep '{}' -e 'ProductionWebServer' \; \
-exec cat '{}' >> ProductionWebServerAgg.txt \;

The command find /data -type f lists all of the files in the /data directory and its subdirectories. For each file found, it runs grep looking for the string ProductionWebServer. If found, the file is appended (>>) to the file ProductionWebServerAgg.txt. Replace the cat command with cp and a directory location if you would rather copy all of the files to a single location rather than to a single file.
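
A hedged variant that copies matching files into a separate directory instead (the destination directory is a placeholder and must already exist); grep's -q option keeps the matched lines themselves from being printed:

find /data -type f -exec grep -q 'ProductionWebServer' '{}' \; -exec cp '{}' /data/aggregate/ \;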

You can also use the join command to take data that is spread across two files and aggregate it into one. Take the two files seen in Example 5-10 and Example 5-11.

Example 5-10. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
Example 5-11. user.txt
user,ip
jdoe,10.0.4.2
jsmith,10.0.4.35
msmith,10.0.4.107
tjones,10.0.4.145

The files share a common column of data, which is the IP addresses. Because of that the files can be merged using join.

$ join -t, -2 2 ips.txt user.txt

ip,OS,user
10.0.4.2,Windows 8,jdoe
10.0.4.35,Ubuntu 16,jsmith
10.0.4.107,macOS,msmith
10.0.4.145,macOS,tjones

The -t, option tells join that the columns are delimited using a comma, by default it uses a space character.

The -2 2 option tells join to use the second column of data in the second file (user.txt) as the key to perform the merge. By default join uses the first field as the key, which is appropriate for the first file (ips.txt). If you needed to join using a different field in ips.txt you would just add the option -1 n where n is replaced by the appropriate column number.

Warning

In order to use join both files must already be sorted by the column you will use to perform the merge. To do this you can use the sort command which is covered in Chapter 6.
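
A minimal sketch of pre-sorting both files on their join columns before merging; note that sort will also reorder any header line along with the data, so you may want to strip headers first (for example with tail -n +2):

sort -t, -k 1 ips.txt -o ips.sorted.txt      # sort on the first field (the IP address)
sort -t, -k 2 user.txt -o user.sorted.txt    # sort on the second field (the IP address)
join -t, -2 2 ips.sorted.txt user.sorted.txt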

Summary

In this chapter we explored ways to process common data formats including delimited, positional, JSON, and XML. The vast majority of data you collect and process will be in one of those formats.

In the next chapter we will look at how data can be analyzed and transformed into information that will provide insights into system status and drive decision making.

Exercises

  1. Given the file tasks.txt below, use the cut command to extract columns 1 (Image Name), 2 (PID), and 5 (Mem Usage).

    Image Name;PID;Session Name;Session#;Mem Usage
    System Idle Process;0;Services;0;4 K
    System;4;Services;0;2,140 K
    smss.exe;340;Services;0;1,060 K
    csrss.exe;528;Services;0;4,756 K
  2. Given the file procowner.txt below, use the join command to merge the file with tasks.txt.

    Process Owner;PID
    jdoe;0
    tjones;4
    jsmith;340
    msmith;528
  3. Use the tr command to replace all of the semicolon characters in tasks.txt with the tab character and print it to the screen.

  4. Write a command that extracts the first and last names of all of the authors in book.json.

Chapter 6. Data Analysis

In the previous chapters we used scripts to collect data and prepare it for analysis. Now we need to make sense of it all. When analyzing large amounts of data it often helps to start broad and continually narrow the search as new insights are gained into the data.

In this chapter we use the data from web server logs as input into our scripts. This is simply for demonstration purposes. The scripts and techniques can easily be modified to work with nearly any type of data.

We will use an Apache web server access log for most of the examples in this chapter. This type of log records page requests made to the web server, when they were made, and who made them. A sample of a typical log entry can be seen below. The full log file will be referenced as access.log in this book and can be downloaded at https://www.rapidcyberops.com.

Example 6-1. Sample from access.log
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /request-quote.html HTTP/1.1" 200
7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:56.0)
Gecko/20100101 Firefox/56.0"
Note

Web server logs are used simply as an example. The techniques introduced throughout this chapter can be applied to analyze a variety of data types.

The Apache web server log fields are broken out in Table 6-1.

Table 6-1. Apache Web Server Combined Log Format Fields
Field(s)   Example Value                         Description
1          192.168.0.11                          IP address of the host that requested the page
2          -                                     RFC 1413 Ident protocol identifier (- if not present)
3          -                                     The HTTP authenticated user ID (- if not present)
4-5        [12/Nov/2017:15:54:39 -0500]          Date, time, and GMT offset (timezone)
6-7        GET /request-quote.html               The page that was requested
8          HTTP/1.1                              The HTTP protocol version
9          200                                   The status code returned by the web server
10         7326                                  The size of the file returned in bytes
11         http://192.168.0.35/support.html      The referring page
12+        Mozilla/5.0 (Windows NT 6.3; Win64…   User agent identifying the browser

Note that there is a second type of Apache access log known as the Common Log Format. The format is the same as the Combined Log Format except it does not contain fields for the referring page or user agent. See https://httpd.apache.org/docs/2.4/logs.html for additional information on the Apache log format and configuration.

The Hypertext Transfer Protocol (HTTP) status codes mentioned above are very informative and tell you how the web server responded to any given request. Common codes are shown in Table 6-2:

Table 6-2. HTTP Status Codes
Code   Description
200    OK
401    Unauthorized
404    Page Not Found
500    Internal Server Error
502    Bad Gateway

Tip

For a complete list of codes see the Hypertext Transfer Protocol (HTTP) Status Code Registry at https://www.iana.org/assignments/http-status-codes

Commands in use

We introduce sort, head, and uniq to limit the data we need to process and display. The following file will be used for command examples:

Example 6-2. file1.txt
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html

sort

The sort command is used to rearrange a text file into numerical and alphabetical order. By default sort will arrange lines in ascending order starting with numbers and then letters. Uppercase letters will be placed before their corresponding lowercase letter unless otherwise specified.

Common Command Options

-r

Sort in descending order

-f

Ignore case

-n

Use numerical ordering, so that 1, 2, and 3 all sort before 10 (in the default alphabetical sorting, 2 and 3 would appear after 10).

-k

Sort based on a subset of the data (key) in a line. Fields are delimited by whitespace.

-o

Write output to a specified file.

Command Example

To sort file1.txt by the file name column and ignore the IP address column you would use the following:

sort -k 3 file1.txt

You can also sort on a subset of the field. To sort by the 2nd octet in the IP address:

sort -k 2.5,2.7 file1.txt

This will sort using characters 5 through 7 of the second field (the second octet of the IP address).

uniq

The uniq command filters out duplicate lines of data that occur adjacent to one another. To remove all duplicate lines in a file be sure to sort it before using uniq.

Common Command Options

-c

Print out the number of times a line is repeated.

-f

Ignore the specified number of fields before comparing. For example, -f 3 will ignore the first three fields in each line. Fields are delimited using spaces.

-i

Ignore letter case. By default uniq is case-sensitive.
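
Command Example

As a minimal illustration (the sample input here is made up rather than taken from file1.txt), sort the data first so that duplicate lines become adjacent, then let uniq -c count them:

$ printf 'b\na\nb\nb\n' | sort | uniq -c

1 a
3 b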

Sorting and Arranging Data

When analyzing data for the first time it is often beneficial to start by looking at the extremes: the things that occurred the most or least frequently, the smallest or largest data transfers, and so on. For example, consider the data you can collect from web server log files. An unusually high number of page accesses could indicate scanning activity or a denial-of-service attempt. An unusually high number of bytes downloaded by a host could indicate site cloning or data exfiltration.

To do that you can use the sort, head, and tail commands at the end of a pipeline such as:

…   | sort -k 2.1 -rn | head -15

which pipes the output of a script into the sort command and then pipes that sorted output into head, which will print the top 15 (in this case) lines. The sort command here is using as its sort key (-k) the second field beginning at its first character (2.1). Moreover, it is doing a reverse sort (-r) and the values will be sorted like numbers (-n). Why a numerical sort? So that 2 shows up between 1 and 3 and not between 19 and 20 (where it would land alphabetically).

By using head we take the first lines of the output. We could get the last few lines by piping the output from the sort command into tail instead of head. Using tail -15 would give us the last 15 lines. The other way to do this would be to simply remove the -r option on sort so that it does an ascending rather than descending sort.

Counting Occurrences in Data

A typical web server log can contain tens of thousands of entries. By counting how many times each page was accessed, or which IP address it was accessed from, you can gain a better understanding of general site activity. Interesting entries can include:

  • A high number of requests returning the 404 (Page Not Found) status code for a specific page; this can indicate broken hyperlinks.

  • A high number of requests from a single IP address returning the 404 status code; this can indicate probing activity looking for hidden or unlinked pages.

  • A high number of requests returning the 401 (Unauthorized) status code, particularly from the same IP address; this can indicate an attempt at bypassing authentication, such as brute-force password guessing.

To detect this type of activity we need to be able to extract key fields, such as the source IP address, and count the number of times they appear in a file. To accomplish this we will use the cut command to extract the field and then pipe the output into our new tool countem.sh.

Example 6-3. countem.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# countem.sh
#
# Description:
# Count the number of instances of an item using bash
#
# Usage:
# countem.sh < inputfile
#

declare -A cnt        # assoc. array             1
while read id xtra                               2
do
    let cnt[$id]++                               3
done
# now display what we counted
# for each key in the (key, value) assoc. array
for id in "${!cnt[@]}"                           4
do
    printf '%d %s\n'  "${cnt[$id]}"  "$id"       5
done

And here is another version, this time using awk:

Example 6-4. countem.awk
# Rapid Cybersecurity Ops
# countem.awk
#
# Description:
# Count the number of instances of an item using awk
#
# Usage:
# countem.awk < inputfile
#

awk '{ cnt[$1]++ }
END { for (id in cnt) {
        printf "%d %s\n", cnt[id], id
      }
    }'
1

Since we don’t know what IP addresses (or other strings) we might encounter, we will use an associative array, declared here with the -A option, so that we can use whatever string we read as our index.

The associative array feature is found in bash 4.0 and higher. In such an array, the index doesn't have to be a number but can be any string. So you can index the array by the IP address and thus count the occurrences of that IP address. In case you're using something older than bash 4.0, Example 6-4 is an alternate script that uses awk instead.

The array references are like others in bash, using the ${var[index]} syntax to reference an element of the array. To get all the different index values that have been used (the “keys” if you think of these arrays as (key, value) pairings), use: ${!cnt[@]}
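As a minimal illustration (not part of the script, and assuming bash 4.0 or later), here is the same mechanism on a single made-up IP address:

declare -A cnt              # associative array
ip="10.0.4.2"               # made-up example value
let cnt[$ip]++
let cnt[$ip]++
echo "${cnt[$ip]}"          # prints: 2
echo "${!cnt[@]}"           # prints the keys; here just: 10.0.4.2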

2

While we only expect one word of input per line, we put the variable xtra there to capture any other words that appear on the line. Each variable on a read command gets assigned the corresponding word from the input (i.e., the first variable gets the first word, the second variable gets the second word, and so on), but the last variable gets any and all remaining words. On the other hand, if there are fewer words of input on a line than there are variables on the read command, then those extra variables are set to the empty string. So for our purposes, if there are extra words on the input line they will all be assigned to xtra, but if there are no extra words then xtra will be given the null string (which doesn't matter either way, because we don't use it).

3

Here we use that string as the index and increment its previous value. For the first use of the index, the previous value will be unset, which will be taken as zero.

4

This syntax lets us iterate over all the various index values that we encountered. Note, however, that the order is not guaranteed; it depends on the hashing algorithm for the index values, so don't expect the keys to come out in alphabetical or any other particular order.

5

In printing out the value and key we put the values inside quotes so that we always get a single value for each argument - even if that value has a space or two inside it. That isn't expected to happen with our use of this script, but such coding practices make the scripts more robust when used in other situations.

Both will work nicely in a pipeline of commands like this:

cut -d' ' -f1 logfile | bash countem.sh

or (see note 2 above) just:

bash countem.sh < logfile

For example, to count the number of times an IP address made an HTTP request that resulted in a 404 (page not found) error:

$ awk '$9 == 404 {print $1}' access.log | bash countem.sh

1 192.168.0.36
2 192.168.0.37
1 192.168.0.11

You can also use grep 404 access.log and pipe it into countem.sh, but that would include lines where 404 appears in other places (e.g., as the byte count or as part of a file path). The use of awk here restricts the counting to lines where the returned status (the ninth field) is 404. It then prints just the IP address (field 1) and pipes the output into countem.sh to get the total number of times each IP address made a request that resulted in a 404 error.

To begin analysis of the example access.log file you can start by looking at the hosts that accessed the web server. You can use the Linux cut command to extract the first field of the log file, which contains the source IP address, and then pipe the output into the countem.sh script. The exact command and output are seen below.

$ cut -d' ' -f1 access.log | bash countem.sh | sort -rn

111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
Tip

If you do not have countem.sh available you can use the uniq command's -c option to achieve similar results, but it requires an extra pass through the data using sort to work properly.

$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn

111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26

Next, you can investigate further by looking at the host that made the largest number of requests, which, as seen above, is IP address 192.168.0.37 with 111 requests. You can use awk to filter on the IP address, pipe that into cut to extract the field that contains the request, and finally pipe that output into countem.sh to get the total number of requests for each page.

$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh

1 /uploads/2/9/1/4/29147191/31549414299.png?457
14 /files/theme/mobile49c2.js?1490908488
1 /cdn2.editmysite.com/images/editor/theme-background/stock/iPad.html
1 /uploads/2/9/1/4/29147191/2992005_orig.jpg
. . .
14 /files/theme/custom49c2.js?1490908488

The activity of this particular host is unimpressive, appearing to be standard web browsing behavior. If you take a look at the host with the next highest number of requests, you will see something a little more interesting.

$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh

1 /files/theme/mobile49c2.js?1490908488
1 /uploads/2/9/1/4/29147191/31549414299.png?457
1 /_/cdn2.editmysite.com/.../Coffee.html
1 /_/cdn2.editmysite.com/.../iPad.html
. . .
1 /uploads/2/9/1/4/29147191/601239_orig.png

This output indicates that host 192.168.0.36 accessed nearly every page on the website exactly one time. This type of activity often indicates webcrawler or site cloning activity. If you take a look at the user agent string provided by the client, it further supports this conclusion.

$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f12-17 | uniq

"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

The user agent identifies itself as HTTrack, which is a tool used to download or clone websites. While not necessarily malicious, it is interesting to note during analysis.

Tip

You can find additional information on HTTrack at http://www.httrack.com.

Totaling Numbers in Data

Rather than just count the number of times an IP address or other item occurs, what if you wanted to know the total byte count that has been sent to an IP address - or which IP addresses have requested and received the most data?

The solution is not that much different than countem.sh - you just need a few small changes. First, you need more columns of data, so tweak the input filter (the cut command) to extract two columns (IP address and byte count) rather than just the IP address. Second, change the calculation from an increment (let cnt[$id]++), a simple count, to a sum of the second field of data (let cnt[$id]+=$count).

The pipeline to invoke this will now extract two fields from the logfile, the first and the tenth.

cut -d' ' -f 1,10 access.log | bash summer.sh
Example 6-5. summer.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# summer.sh
#
# Description:
# Sum the total of field 2 values for each unique field 1
#
# Usage:
# Input Format - <input field> <number>
#

declare -A cnt        # assoc. array
while read id count
do
  let cnt[$id]+=$count
done
for id in "${!cnt[@]}"
do
    printf "%-15s %8d\n"  "${id}"  "${cnt[${id}]}" 1
done
1

Note that we've made a few changes to the output format. We've specified a field size of 15 characters for the first string (the IP address in our sample data), left-justified (via the minus sign), and 8 digits for the sum values. If the sum is larger, the full number is still printed, and if the string is longer, it is printed in full. We've done this so the data lines up, by and large, in neat columns for readability.
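For instance, a quick illustration of that format string with made-up values:

printf "%-15s %8d\n" 192.168.0.37 2575030
# the IP is left-justified in a 15-character column; the sum is right-justified in 8 digits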

You can run summer.sh against the example access.log file to get an idea of the total amount of data requested by each host. To do this use cut to extract the IP address and bytes transferred fields, and then pipe the output into summer.sh.

$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn

192.168.0.36     4371198
192.168.0.14     2876088
192.168.0.37     2575030
192.168.0.11     2537662
192.168.0.26      665693

These results can be useful in identifying hosts that have transferred unusually large amounts of data compared to other hosts. A spike could indicate data theft and exfiltration. If you identify such a host the next step would be to review the specific pages and files accessed by the suspicious host to try and classify it as malicious or benign.

Displaying Data in a Histogram

You can take counting one step further by providing a more visual display of the results. You can take the output from countem.sh or summer.sh and pipe it into yet another script, one that will produce a histogram-like display of the results.

The script to do the printing will take the first field as the index to an associative array; the second field as the value for that array element. It will then iterate through the array and print a number of hashtags to represent the count, scaled to 50 # symbols for the largest count in the list.

Example 6-6. histogram.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram.sh
#
# Description:
# Generate a horizontal bar chart of specified data
#
# Usage:
# Data input format - <label> <value>
#

function pr_bar ()                            1
{
    local -i i raw maxraw scaled              2
    raw=$1
    maxraw=$2
    ((scaled=(MAXBAR*raw)/maxraw))            3
    # min size guarantee
    ((raw > 0 && scaled == 0)) && scaled=1				4

    for((i=0; i<scaled; i++)) ; do printf '#' ; done
    printf '\n'

} # pr_bar

#
# "main"
#
declare -A RA						5
declare -i MAXBAR max
max=0
MAXBAR=50	# how large the largest bar should be

while read labl val
do
    let RA[$labl]=$val					6
    # keep the largest value; for scaling
    (( val > max )) && max=$val
done

# scale and print it
for labl in "${!RA[@]}"					7
do
    printf '%-20.20s  ' "$labl"
    pr_bar ${RA[$labl]} $max				8
done
1

We define a function to draw a single bar of the histogram. This definition must be encountered before a call to the function can be made, so it makes sense to put function definitions at the front of our script. We will be reusing this function in a future script so we could have put it in a separate file and included it here with a source command - but we didn’t.

2

We declare all these variables as local because we don’t want them to interfere with variable names in the rest of this script (or any others, if we copy/paste this script to use elsewhere). We declare all these variables as integer (that’s the -i option) because we are only going to compute values with them and not use them as strings.

3

The computation is done inside double-parentheses and inside those we don’t need to use the $ to indicate “the value of” each variable name.

4

This is an “if-less” if statement. If the expression inside the double-parentheses is true then, and only then, is the second expression (the assignment) executed. This will guarantee that scaled is never zero when the raw value is non-zero. Why? Because we’d like something to show up in that case.

5

The main part of the script begins with a declaration of the RA array as an associative array.

6

Here we reference the associative array using the label, a string, as its index.

7

Since the array isn't indexed by numbers, we can't just count integers and use them as indices. This construct gives all the various strings that were used as an index to the array, one at a time, in the for loop.

8

We use the label as an index one more time to get the count and pass it as the first parameter to our pr_bar function.

Note that the items don’t appear in the same order as the input. That’s because the hashing algorithm for the key (the index) doesn’t preserve ordering. You could take this output and pipe it into yet another sort, or you could take a slightly different approach.

Here’s a version of the histogram script that preserves order - by not using an associative array. This might also be useful on older versions of bash (pre 4.0), prior to the introduction of associative arrays. Only the “main” part of the script is shown as the function pr_bar remains the same.

Example 6-7. histogram_plain.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram_plain.sh
#
# Description:
# Generate a horizontal bar chart of specified data without
# using associative arrays, good for older versions of bash
#
# Usage:
# Data input format - <label> <value>
#

declare -a RA_key RA_value                               1
declare -i max ndx
max=0
maxbar=50    # how large the largest bar should be

ndx=0
while read labl val
do
    RA_key[$ndx]=$labl                                   2
    RA_value[$ndx]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
    let ndx++
done

# scale and print it
for ((j=0; j<ndx; j++))                                  3
do
    printf "%-20.20s  " ${RA_key[$j]}
    pr_bar ${RA_value[$j]} $max
done

This version of the script avoids the use of associative arrays - in case you are running an older version of bash (prior to 4.x), such as on macOS systems. For this version we use two separate arrays: one for the keys (labels) and one for the values. Since they are normal arrays, we have to use an integer index, so we keep a simple count in the variable ndx.

1

Here the variable names are declared as arrays. The lower-case a says that they are arrays, but not of the “associative” variety. While not strictly necessary, it is good practice.

2

The key and value pairs are stored in separate arrays, but at the same index location. This approach is “brittle” - that is, easily broken, if changes to the script ever got the two arrays out of sync.

3

Now the for loop, unlike the previous script, is a simple counting of an integer from 0 to ndx. The variable j is used here so as not to interfere with the index of the for loop inside pr_bar, although we were careful inside the function to declare its version of i as local. Do you trust it? Change the j to an i here and see if it still works (it does). Then try removing the local declaration and see if it fails (it does).

This approach with the two arrays does have one advantage. By using the numerical index for storing the label and the data, you can retrieve them in the order they were read in - in the numerical order of the index.

You can now visually see the hosts that transferred the largest number of bytes by extracting the appropriate fields from access.log, piping the results into summer.sh and then into histogram.sh.

$ cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh

192.168.0.36          ##################################################
192.168.0.37          #############################
192.168.0.11          #############################
192.168.0.14          ################################
192.168.0.26          #######

While this might not seem that useful for the small amount of sample data, being able to visualize trends is invaluable when looking across larger datasets.

In addition to looking at the number of bytes transferred by IP address or host, it is often interesting to look at the data by date and time. To do that you can use the summer.sh script, but due to the format of the access.log file you need to do a little more processing before you can pipe it into the script. If you use cut to extract the date/time and bytes transferred fields you are left with data that causes some problems for the script.

$ cut -d' ' -f4,10 access.log

[12/Nov/2017:15:52:59 2377
[12/Nov/2017:15:52:59 4529
[12/Nov/2017:15:52:59 1112

As seen in the output above, the raw data starts with a [ character. That causes a problem with the script because it denotes the beginning of an array in bash. To remedy that you can use an additional iteration of the cut command to remove the character using -c2- as an option. This option tells cut to extract the data by character starting at position 2 and going to the end of the line (-). The corrected output with the square bracket removed can be seen below.

$ cut -d' ' -f4,10 access.log | cut -c2-

12/Nov/2017:15:52:59 2377
12/Nov/2017:15:52:59 4529
12/Nov/2017:15:52:59 1112
Tip

Alternatively, you can use tr in place of the second cut. The -d option will delete the character specified, in this case the square bracket.

cut -d' ' -f4,10 access.log | tr -d '['

You also need to determine how you want to group the time-bound data; by day, month, year, hour, etc. You can do this by simply modifying the option for the second cut iteration. The table below illustrates the cut option to use to extract various forms of the date/time field. Note that these cut options are specific to Apache log files.

Table 6-3. Apache Log Date/Time Field Extraction
Date/Time Extracted     Example Output         Cut Option
Entire date/time        12/Nov/2017:19:26:09   -c2-
Month, Day, and Year    12/Nov/2017            -c2-12
Month and year          Nov/2017               -c5-12,22-
Full Time               19:26:04               -c14-
Hour                    19                     -c14-15,22-
Year                    2017                   -c9-12,22-

The histogram.sh script can be particularly useful when looking at time-based data. For example, if your organization has an internal web server that is only accessed during working hours of 9:00 AM to 5:00 PM, you can review the server log file on a daily basis using the histogram view and see if there are any spikes in activity outside of normal working hours. Large spikes of activity or data transfer outside of normal working hours could indicate exfiltration by a malicious actor. If any anomalies are detected you can filter the data by that particular date and time and review the page accesses to determine if the activity is malicious.

For example, if you want to see a histogram of the total amount of data that was retrieved on a certain day and on an hourly basis you can do the following:

$ awk '$4 ~ "12/Nov/2017" {print $0}' access.log | cut -d' ' -f4,10 |
cut -c14-15,22- | bash summer.sh | bash histogram.sh

17              ##
16              ###########
15              ############
19              ##
18              ##################################################

Here the access.log file is sent through awk to extract the entries from a particular date. Note the use of the regex match operator (~) instead of ==, since field 4 also contains time information. Those entries are piped into cut to extract the date/time and bytes transferred fields, and then piped into cut again to extract just the hour. From there it is summed by hour using summer.sh and converted into a histogram using histogram.sh. The result is a histogram that displays the total number of bytes transferred each hour on November 12, 2017.

Finding Uniqueness in Data

Previously IP address 192.168.0.37 was identified as the system that had the largest number of page requests. The next logical question is what pages did this system request? With that answer you can start to gain an understanding of what the system was doing on the server and categorize the activity as benign, suspicious, or malicious. To accomplish that you can use awk and cut and pipe the output into countem.sh.

$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 |
bash countem.sh | sort -rn | head -5

14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html

While this can be accomplished by piping together commands and scripts, that requires multiple passes through the data. This may work for many datasets, but it is too inefficient for extremely large ones. You can streamline this by writing a bash script specifically designed to extract and count page accesses that requires only a single pass over the data.

Example 6-8. pagereq.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# pagereq.sh
#
# Description:
# Count the number of page requests for a given IP address using bash
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#

declare -A cnt                                             1
while read addr d1 d2 datim gmtoff getr page therest
do
    if [[ $1 == $addr ]] ; then let cnt[$page]+=1 ; fi
done
for id in ${!cnt[@]}                                       2
do
    printf "%8d %s\n" ${cnt[$id]} $id
done
1

We declare cnt as an associative array (also known as a hash table or dictionary) so that we can use a string as the index to the array. In this program we will be using the page address (the URL) as the index.

2

The ${!cnt[@]} results in a list of all the different index values that have been encountered. Note, however, that they are not listed in any useful order.

Early versions of bash don't have associative arrays. You can use awk to do the same thing - counting the various page requests from a particular IP address - since awk has associative arrays.

Example 6-9. pagereq.awk
# Rapid Cybersecurity Ops
# pagereq.awk
#
# Description:
# Count the number of page requests for a given IP address using awk
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#

# count the number of page requests from an address ($1)
awk -v page="$1" '{ if ($1==page) {cnt[$7]+=1 } }                1
END { for (id in cnt) {                                          2
    printf "%8d %s\n", cnt[id], id
    }
}'
1

There are two very different $1 variables on this line. The first $1 is a shell variable and refers to the first argument supplied to this script when it is invoked. The second $1 is an awk variable. It refers to the first field of the input on each line. The first $1 has been assigned to the awk variable page so that it can be compared to each $1 of awk - that is, to each first field of the input data.

2

This simple syntax results in the variable id iterating over the index values of the cnt array. It is much simpler than the shell's "${!cnt[@]}" syntax, but has the same effect.

You can run pagereq.sh by providing the IP address you would like to search for and redirecting access.log as input.

$ bash pagereq.sh 192.168.0.37 < access.log | sort -rn | head -5

14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html

Identifying Anomalies in Data

On the web, a user agent string is a small piece of textual information sent by a browser to a web server that identifies the client's operating system, browser type, version, and other information. It is typically used by web servers to ensure page compatibility with the user's browser. Here is an example of a user agent string:

Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0

This user agent string identifies the system as: Windows NT version 6.3 (aka Windows 8.1); 64-bit architecture; and using the Firefox browser.

The user agent string is interesting for a few reasons: first, because of the significant amount of information it conveys, which can be used to identify the types of systems and browsers accessing the server; second, because it is configurable by the end user, which can be used to identify systems that may not be using a standard browser or may not be using a browser at all (i.e., a webcrawler).

You can identify unusual user agents by first compiling a list of known good user agents. For the purposes of this exercise we will use a very small list that is not specific to a particular version.

Example 6-10. useragents.txt
Firefox
Chrome
Safari
Edge
Tip

For a list of common user agent strings visit https://techblog.willshouse.com/2012/01/03/most-common-user-agents/

You can then read in a web server log and compare each line to each valid user agent until you get a match. If no match is found it should be considered an anomaly and printed to standard out along with the IP address of the system making the request. This provides yet another vantage point into the data, identifying systems with unusual user agents, and another path to further explore.

Example 6-11. useragents.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# useragents.sh
#
# Description:
# Read through a log looking for unknown user agents
#
# Usage:
# useragents.sh < <inputfile>
#   <inputfile> Apache access log
#


# mismatch - search through the array of known names
#  returns 1 (false) if it finds a match
#  returns 0 (true) if there is no match
function mismatch ()                                    1
{
    local -i i                                          2
    for ((i=0; i<$KNSIZE; i++))
    do
        [[ "$1" =~ .*${KNOWN[$i]}.* ]] && return 1      3
    done
    return 0
}

# read up the known ones
readarray -t KNOWN < "useragents.txt"                      4
KNSIZE=${#KNOWN[@]}                                     5

# preprocess logfile (stdin) to pick out ipaddr and user agent
awk -F'"' '{print $1, $6}' | \
while read ipaddr dash1 dash2 dtstamp delta useragent   6
do
    if mismatch "$useragent"
    then
        echo "anomaly: $ipaddr $useragent"
    fi
done
1

We will use a function for the core of this script. It will return a success (or “true”) if it finds a mismatch, that is, if it finds no match against the list of known user agents. This logic may seem a bit inverted, but it makes the if statement containing the call to mismatch read clearly.

2

Declaring our for loop index as a local variable is good practice. It’s not strictly necessary in this script but is a good habit.

3

There are two strings to compare - the input from the logfile and a line from the list of known user agents. To make for a very flexible comparison we use the regex comparison operator (the =~). The .* (meaning “zero or more instances of any character”) placed on either side of the $KNOWN array reference means that the known string can appear anywhere within the other string for a match.
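As a minimal illustration of that comparison (the string below is made up, not taken from access.log):

ua='Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'
[[ "$ua" =~ .*Firefox.* ]]  && echo "known"     # prints: known
[[ "$ua" =~ .*Netscape.* ]] || echo "unknown"   # prints: unknown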

4

Each line of the file is added as an element to the array name specified. This gives us an array of known user agents. There are two identical ways to do this in bash: readarray, as used here, or mapfile. The -t option removes the trailing newline from each line read. The file containing the list of known user agents is specified here; modify as needed.

5

This computes the size of the array. It is used inside the mismatch function to loop through the array. We calculate it here, once, outside our loop to avoid recomputing it every time the function is called.

6

The input string is a complex mix of words and quote marks. To capture the user agent string we use the double quote as the field separator. Doing that, however, means that our first field contains more than just the IP address. By using the bash read we can parse on the spaces to get the IP address. The last argument of the read takes all the remaining words, so it can capture the several words of the user agent string.

Summary

In this chapter we looked at techniques to analyze the content of log files by identifying unusual and anomalous activity. This type of analysis can provide you with insights into what occurred in the past. In the next chapter we will look at how to analyze log files and other data to provide insights into what is happening in the system in real time.

Exercises

  1. The example use of summer.sh used cut to print the 1st and 10th fields of the access.log file, like this:

$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn

Replace the cut command by using the awk command. Do you get the same results? What might be different about those two approaches?

  2. Expand the histogram.sh script to include the count at the end of each histogram bar. Here is sample output:

    192.168.0.37          #############################    2575030
    192.168.0.26          ####### 665693
  3. Expand the histogram.sh script to allow the user to supply the option -s to specify the maximum bar size. For example, histogram.sh -s 25 would limit the maximum bar size to 25 # characters. The default should remain 50 if no option is given.

  4. Download the following web log file TODO: Add Log File URL.

    1. Which IP address made the most number of requests?

    2. Which page was accessed the most number of times?

  5. Download the following Domain Name System (DNS) server log TODO: Add Log File URL

    1. What was the most requested domain?

    2. What day had the most number of requests?

  6. Modify the useragents.sh script to add some parameters

    1. Add code for an optional first parameter specifying the filename of the known user agents file. If not specified, default to useragents.txt as the script currently does.

    2. Add code for a -f option to take an argument. The argument is the filename of the logfile to read rather than reading from stdin.

  7. Modify the pagereq.sh script so that it does not need an associative array but instead works with a traditional array that uses a numerical index. Convert the IP address into a 10-12 digit number for that use. Caution: don’t have leading zeros on the number or the shell will attempt to interpret it as an octal number. For example, convert “10.124.16.3” into “10124016003”, which can be used as a numerical index.

Chapter 7. Real-Time Log Monitoring

The ability to analyze a log after an event is an important skill. It is equally important to be able to extract information from a log file in real-time to detect malicious or suspicious activity as it happens. In this chapter we will explore methods to read in log entries as they are generated, format them for output to the analyst, and generate alerts based on known indicators of compromise.

Monitoring Text Logs

The most basic method to monitor a log in real time is to use the tail command’s -f option, which continuously reads a file and outputs new lines to stdout as they are added. As in previous chapters, we will use an Apache web server access log for examples, but the techniques presented can be applied to any text-based log. To monitor the Apache access log with tail:

tail -f /var/log/apache2/access.log

Commands can be combined to provide more advanced functionality. The output from tail can be piped into grep so only entries matching specific criteria will be output. The example below monitors the Apache access log and outputs entries matching a particular IP address.

tail -f /var/log/apache2/access.log | grep '10.0.0.152'

Regular expressions can also be used. Below, only entries returning an HTTP status code of 404 Page Not Found will be displayed. The -i option is added to ignore character case.

tail -f /var/log/apache2/access.log | egrep -i 'HTTP/.*" 404'

To clean up the output it can be piped into the cut command to remove extraneous information. The example below monitors the access log for requests resulting in a 404 status code and then uses cut to only display the date/time and the page that was requested.

$ tail -f access.log | egrep --line-buffered 'HTTP/.*" 404' | cut -d' ' -f4-7

[29/Jul/2018:13:10:05 -0400] "GET /test
[29/Jul/2018:13:16:17 -0400] "GET /test.txt
[29/Jul/2018:13:17:37 -0400] "GET /favicon.ico

You can further clean up the output by piping it into tr -d '[]"' to remove the square brackets and the orphaned double quotation mark.

Note that we used the egrep command’s --line-buffered option. This forces egrep to output to stdout each time a line break occurs. Without this option buffering occurs and output is not piped into cut until a buffer is filled. We don’t want to wait that long. The option will have egrep write out each line as it finds it.

Log-Based Intrusion Detection

You can use the power of tail and egrep to monitor a log and output any entries that match known patterns of suspicious or malicious activity, often referred to as Indicators of Compromise (IOC). By doing this you can create a lightweight Intrusion Detection System (IDS). To begin, let's create a file that contains regex patterns for IOCs.

Example 7-1. ioc.txt
\.\./ 1
etc/passwd 2
etc/shadow
cmd\.exe 3
/bin/sh
/bin/bash
1

This pattern (../) is an indicator of a directory traversal attack where the attacker tries to escape from the current working directory and access files for which they otherwise would not have permission.

2

The Linux /etc/passwd and /etc/shadow files are used for system authentication and should never be available through the web server.

3

Serving the cmd.exe, /bin/sh, or /bin/bash files is an indicator of a reverse shell being returned by the web server. A reverse shell is often an indicator of a successful exploitation attempt.

Note that the IOCs must be in a regular expression format as they will be used later with egrep.
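A quick way to sanity-check the patterns is to feed egrep a made-up request line and confirm that it is echoed back; the sample line below is invented for illustration and is not from access.log.

$ echo 'GET /../../etc/passwd HTTP/1.1' | egrep -i -f ioc.txt

GET /../../etc/passwd HTTP/1.1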

Tip

IOCs for web servers are too numerous to discuss here in depth. For more examples of indicators of compromise download the latest Snort community ruleset at https://www.snort.org/downloads.

Next ioc.txt can be used with the egrep -f option. This option tells egrep to read in the regex patterns to search for from the specified file. This allows you to use tail to monitor the log file, and as each entry is added it will be compared against all of the patterns in the IOC file, outputting any entry that matches. Here is an example:

tail -f /var/log/apache2/access.log | egrep -i -f ioc.txt

Additionally, the tee command can be used to simultaneously display the alerts to the screen and save them to their own file for later processing.

tail -f /var/log/apache2/access.log | egrep --line-buffered -i -f ioc.txt |
tee -a interesting.txt

Again the --line-buffered option is used to ensure there are no problems caused by command output buffering.

Monitoring Windows Logs

As previously discussed, you need to use the wevtutil command to access Windows events. While the command is very versatile, it does not have functionality similar to tail that can be used to extract new entries as they occur. Thankfully, a simple bash script can provide similar functionality.

Example 7-2. wintail.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# wintail.sh
#
# Description:
# Perform a tail-like function on a Windows log
#

WINLOG="Application"  1

LASTLOG=$(wevtutil qe "$WINLOG" //c:1 //rd:true //f:text)  2

while true
do
	CURRENTLOG=$(wevtutil qe "$WINLOG" //c:1 //rd:true //f:text)  3
	if [[ "$CURRENTLOG" != "$LASTLOG" ]]
	then
		echo "$CURRENTLOG"
		echo "----------------------------------"
		LASTLOG="$CURRENTLOG"
	fi
done
1

This variable identifies the Windows log you want to monitor. You can use wevtutil el to obtain a list of logs currently available on the system.

2

This executes the wevtutil command to query the specified log file. The c:1 parameter causes it to return only one log entry. The rd:true parameter causes the command to read the most recent log entry. Finally, f:text returns the result as plain text rather than XML, which makes it easier to read on the screen.

3

The next few lines execute the wevtutil command again and compare the latest log entry to the last one printed to the screen. If the two are different, meaning that a new entry was added to the log, it prints the entry to the screen. If they are the same nothing happens and it loops back and checks again.

Generating a Real-Time Histogram

A tail -f provides an ongoing stream of data. What if you wanted to count how many lines are added to a file during a time interval? You could observe that stream of data, start a timer, and begin counting until a specified time interval is up; then you can stop counting and report the results.

You might divide this work into two separate processes - two separate scripts - one to count the lines and another to watch the clock. The timekeeper will notify the line counter by means of a standard POSIX inter-process communication mechanism called a “signal”. A signal is a software interrupt and there are different kinds of such interrupts. Some are fatal, that is, will cause the process to terminate (e.g., a floating point exception). Most can be ignored or caught - and an action taken when the signal is caught. Many have a predefined purpose, used by the operating system. We’ll use one of the two signals available for users, SIGUSR1. (The other is SIGUSR2.)

Shell scripts can catch the catchable interrupts with the trap command, a shell built-in command. With trap you specify a command to indicate what action you want taken and a list of signals which trigger the invocation of that command. For example:

trap warnmsg SIGINT

will cause the command warnmsg (our own script or function) to be called whenever the shell script receives a SIGINT signal, as when you type a control-C to interrupt a running process.
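Here, warnmsg is just a hypothetical name; a minimal sketch of such a handler might look like this:

function warnmsg ()
{
    echo "script interrupted" >&2    # warnmsg is a hypothetical example handler
}
trap warnmsg SIGINT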

Here is the script that performs the count.

Example 7-3. looper.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# looper.sh
#
# Description:
# Count the lines in a file being tailed -f
# Report the count interval on every SIGUSR1
#


function interval ()					1
{
    echo $(date '+%y%m%d %H%M%S') $cnt			2
    cnt=0
}

declare -i cnt=0
trap interval SIGUSR1					3

shopt -s lastpipe					4

tail -f --pid=$$ ${1:-log.file} | while read aline	5
do
    let cnt++
done
1

The function interval will be called on each signal. We define it here. It needs to be defined before we can call it, of course, but also before we can use it in our trap statement, below.

2

The date command is called to provide a timestamp for the count value that we print out. After we print the count we reset its value to 0 to start the count for the next interval.

3

Now that interval is defined, we can tell bash to call the function whenever our process receives a SIGUSR1 signal.

4

This is a crucial step. Normally when there is a pipeline of commands (such as ls -l | grep rwx | wc) then those pieces of the pipeline (each command) are run in subshells and they each end up with their own process id. This would be a problem for this script because the while loop would be in a subshell, with a different process id. Whatever process started the looper.sh script wouldn’t know the process id of the while loop to send the signal to it. Moreover, changing the value of the cnt variable in the subshell doesn’t change the value of cnt in the main process, so a signal to the main process would result in a value of zero every time. The solution is this shopt command that sets (-s) the shell option lastpipe. That option tells the shell not to create a subshell for the last command in a pipeline but to run that command in the same process as the script itself. In our case that means that the tail will run in a subshell (i.e., a different process) but the while loop will be part of the main script process. Caution: this shell option is only available in bash 4.x and above, and is only for non-interactive shells (i.e., scripts).
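As a minimal illustration of why lastpipe matters (run it as a script, since the option has no effect in interactive shells), consider:

#!/bin/bash -
declare -i cnt=0
shopt -s lastpipe                   # keep the while loop in this process
printf 'a\nb\nc\n' | while read aline ; do let cnt++ ; done
echo "$cnt"                         # prints 3; without lastpipe it would print 0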

5

Here is the tail -f command with one more option, the --pid option. We specify a process id to tell tail to exit when that process dies. We are specifying $$, the current shell script’s process id, as the one to watch. This is useful for cleanup so that we don’t get tail commands left running in the background (if, for example, this script is run in the background; see the next script which does just that.)

The script tailcount.sh starts and stops the counting; it is the script that holds the “stopwatch,” so to speak, and times these intervals.

Example 7-4. tailcount.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# tailcount.sh
#
# Description:
# Count lines every n seconds
#

# cleanup - the other processes on exit
function cleanup ()
{
  [[ -n $LOPID ]] && kill $LOPID		1
}

trap cleanup EXIT 				2

bash looper.sh $1 &				3
LOPID=$!					4
# give it a chance to start up
sleep 3

while true
do
    kill -SIGUSR1 $LOPID
    sleep 5
done >&2					5
1

Since this script will be starting other processes (other scripts), it should clean up after itself. If a process id has been stored in LOPID, the variable will be non-empty, and the function will send a signal via the kill command to that process. Because no particular signal is specified on the kill command, the default signal, SIGTERM, is sent.

2

Not a signal, EXIT is a special case for the trap statement to tell the shell to call this function (here, cleanup) when the shell that is running this script is about to exit.

3

Now the real work begins. The looper.sh script is called but is put in the “background”, that is, detached from the keyboard to run on its own while this script continues (without waiting for looper.sh to finish).

4

This saves the process id of the script that we just put in the background.

5

This redirection is just a precaution. By redirecting stdout to stderr, any and all output coming from the while loop or the kill or sleep statements (though we’re not expecting any) will be sent to stderr and not get mixed in with the output coming from looper.sh, which, though it is in the background, still writes to stdout.

In summary, looper.sh has been put in the background and its process id saved in a shell variable. Every 5 seconds this script (tailcount.sh) sends that process (which is running looper.sh) a SIGUSR1 signal, which causes looper.sh to print out its current count and restart its counting. When tailcount.sh exits, it will clean up by sending a SIGTERM to the looper.sh process so that it, too, is terminated.

With both a script to do the counting and a script to drive it with its “stopwatch”, you can use their output as input to a script that prints out a histogram-like bar to represent the count. It is invoked as follows:

bash tailcount.sh | bash livebar.sh

The livebar.sh script reads from stdin and prints its output to stdout, one line for each line of input.

Example 7-5. livebar.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# livebar.sh
#
# Description:
# Creates a rolling horizontal bar chart of live data
#
# Usage:
# <output> | bash livebar.sh
#

function pr_bar ()					1
{
    local raw maxraw scaled
    raw=$1
    maxraw=$2
    ((scaled=(maxbar*raw)/maxraw))
    ((scaled == 0)) && scaled=1		# min size guarantee
    for((i=0; i<scaled; i++)) ; do printf '#' ; done
    printf '\n'

} # pr_bar


maxbar=60   # largest no. of chars in a bar		2
MAX=60
while read dayst timst qty
do
    if (( qty > MAX ))					3
    then
	let MAX=$qty+$qty/4	# allow some room
	echo "              **** rescaling: MAX=$MAX"
    fi
    printf '%6.6s %6.6s %4d:' $dayst $timst $qty	4
    pr_bar $qty $MAX
done
1

The pr_bar function prints the bar of hashtags scaled to the maximum size based on the parameters supplied. This function might look familiar. We’re using the same function we used in histogram.sh in the previous chapter.

2

This is the longest string of hashtags we will allow on a line (to avoid line wrap).

3

How large will the values be that need to be displayed? Not knowing beforehand (although it could be supplied as an argument to the script), the script will instead keep track of a maximum. If that maximum is exceeded it will “rescale,” and the current and future lines will be scaled to the new maximum. The script adds 25% onto the maximum so that it doesn’t need to rescale if each new value goes up by just one or two each time.

4

The printf specifies a minimum and maximum width on the first two fields that are printed. They are date and time stamps and will be truncated if they exceed those widths. You wouldn’t want the count truncated, so we specify it as 4 digits wide, but the entire value will be printed regardless; if it is fewer than 4 digits it will be padded with blanks.
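For instance, a quick illustration of that truncation and padding with made-up values:

printf '%6.6s %6.6s %4d:\n' 20201010 102030 75
# 202010 102030   75:   <- the 8-character date stamp is truncated to 6 characters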

Since this script reads from stdin you can run it by itself to see how it behaves. Here’s a sample:

$ bash livebar.sh
201010 1020 20
201010   1020   20:####################
201010 1020 70
              **** rescaling: MAX=87
201010   1020   70:################################################
201010 1020 75
201010   1020   75:###################################################
^C

In this example the input is mixing with the output. You could also put the input into a file and redirect it into the script to see just the output.

$ bash livebar.sh < testdata.txt
201010   1020   20:####################
              **** rescaling: MAX=87
201010   1020   70:################################################
201010   1020   75:###################################################
$

Summary

Log files can provide tremendous insight into the operation of a system, but they also come in large quantities which makes them challenging to analyze. You can minimize this issue by creating a series of scripts to automate data formatting, aggregation, and alerting.

In the next chapter we will look at how similar techniques can be leveraged to monitor networks for configuration changes.

Exercises

  1. Add a -i option to livebar.sh to set the interval in seconds.

  2. Add a -M option to livebar.sh to set an expected maximum for input values. Use the getopts builtin to parse your options.

  3. How might you add a -f option that filters data using (using, e.g., grep)? What challenges might you encounter? What approach(es) might you take to deal with those?

  4. Modify wintail.sh to allow the user to specify the Windows log to be monitored by passing in a command line argument.

  5. Modify wintail.sh to add the capability for it to be a lightweight intrusion detection system using egrep and an IOC file.

  6. Consider the statement made in the note about buffering: “When the input is coming from a file, that usually happens quickly.” Why “usually”? Under what conditions might you see the need for the line buffering option on grep even when reading from a file?

About the Authors

Paul Troncone

https://www.linkedin.com/in/paultroncone

https://www.digadel.com

Paul Troncone has over 15 years of experience in the cybersecurity and information technology fields. In 2009 Paul founded the Digadel Corporation where he performs independent cybersecurity consulting and software development. He holds a Bachelor of Arts degree in Computer Science from Pace University, a Master of Science degree in Computer Science from the Tandon School of Engineering at New York University (Formerly Polytechnic University), and is a Certified Information Systems Security Professional. Paul has served in a variety of roles including as a vulnerability analyst, software developer, penetration tester, and college professor.

 

Carl Albing

https://www.linkedin.com/in/albing

Carl Albing is a teacher, researcher, and software engineer with a breadth of industry experience. A co-author of O’Reilly’s “bash Cookbook”, he has worked in software for companies large and small, across a variety of software industries. He has a B.A. in Mathematics, Masters in International Management, and a Ph.D. in Computer Science. He has recently spent time in academia as a Distinguished Visiting Professor in the Department of Computer Science at the U.S. Naval Academy where he taught courses on Programming Languages, Compilers, High Performance Computing, and Advanced Shell Scripting. He is currently a Research Professor in the Data Science and Analytics Group at the Naval Postgraduate School.

Rapid Cybersecurity Ops

Attack, Defend, and Analyze with bash

Paul Troncone and Carl Albing

Rapid Cybersecurity Ops

by Paul Troncone and Carl Albing

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Editor: Virginia Wilson
  • Production Editor: Justin Billing
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Rebecca Demarest
  • May 2019: First Edition

Revision History for the First Early Release

  • 2018-10-09: First Release
  • 2018-11-27: Second Release
  • 2019-01-28: Third Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492041313 for release details.

Chapter 1. Command Line Primer

The command line is one of the oldest interfaces used to interact with a computer. The command line has evolved over several decades of use and development, and is still an extremely useful and powerful way to interface with a computer. In many cases, the command line can be faster and more efficient than a Graphical User Interface (GUI) at accomplishing a task.

The bash shell and command language will be used for demonstrations throughout this book. That is due to its wide-scale adoption across multiple computing platforms and rich command set.

Commands and Arguments

The basic operation of bash is to execute a command, that is, to run another program. When several words appear on the command line bash assumes that the first word is the name of the program to run and the remaining words are the arguments to the command. For example:

mkdir -p /tmp/scratch/garble

will have bash run the command called mkdir and it will pass it two arguments -p and /tmp/scratch/garble. By convention programs generally put their options first, and have them begin with a leading "-“, as is the case here with the -p option. This particular command will create a directory called /tmp/scratch/garble. The -p option will mean that no errors will be reported and any intervening directories will be created (or attempted) as needed (e.g., if only /tmp exists, it will create /tmp/scratch before attempting to create /tmp/scratch/garble).

Standard Input/Output/Error

A running program is called a process and every process in the Unix/Linux/Posix (and thus Windows) environment has three distinct input/output file descriptors. These three are called “standard input” (or stdin, for short), “standard output” (stdout), and “standard error” (stderr).

As you might guess by its name, stdin is the default source for input to a program, by default the characters coming from the keyboard. When your script reads from stdin it is reading characters typed on the keyboard or (as we shall see shortly) it can be changed to read from a file. Stdout is the default place for sending output from a program. By default the output appears in the window which is running your shell or shell script. Standard error can also be sent output from a program, but it is (or should be) where error messages are written. It’s up to the person writing the program to direct any output to either stdout or stderr. So be conscientious when writing your scripts to send any error messages not to stdout but to stderr (as shown below).

Redirection and Piping

One of the great innovations of the shell was that it gave you a mechanism whereby you could take a running program and change where it got its input and/or change where it sent its output without modifying the program itself. If you have a program called handywork and it reads its input from stdin and writes its results to stdout, then you can change its behavior as simply as this:

handywork < data.in  > results.out

which will run handywork but will have the input come not from the keyboard but instead from the data file called data.in (assuming such a file exists and has input in the format we want). Similarly the output is being sent not to the screen but into a file called results.out (which will be created if it doesn’t exist and overwritten if it does). This technique is called “redirection” because we are re-directing input to come from a different place and re-directing output to go somewhere other than the screen.

What about stderr? The syntax is similar. We have to distinguish between stdout and stderr when redirecting data coming out of the program, and we make this distinction through the use of the file descriptor numbers. Stdin is file descriptor 0, stdout is file descriptor 1, and stderr is file descriptor 2, so we can redirect error messages this way:

handywork 2> err.msgs

which will redirect only stderr and send any such error message output to a file we called err.msgs (for obvious reasons).

Of course we can do all three on the same line:

handywork < data.in  > results.out  2> err.msgs

Sometimes we want the error messages combined with the normal output (as happens by default when both are written to the screen). We can do this with the following syntax:

handywork < data.in  > results.out 2>&1

which says to send stderr (2) to the same location as file descriptor 1 ("&1"). Note that without the ampersand, the error messages would just be sent to a file named “1”. This combining of stdout and stderr is so common that there is a useful shorthand notation:

handywork < data.in  &> results.out

If you want to discard standard output you can redirect it to a special file called /dev/null as follows:

handywork < data.in > /dev/null

To view output on the command line and simultaneously redirect that same output to a file, use the tee command. The following will display the output of handywork to the screen and also save it to results.out:

handywork < data.in | tee results.out

A file will be created or truncated (i.e., contents discarded) when output is redirected. If you want to preserve the file’s existing content you can, instead, append to the file using a double greater-than sign, like this:

handywork < data.in  >> results.out

This will execute handywork and then any output from stdout will be appended to the file results.out rather than overwriting its existing content.

Similarly this command line:

handywork < data.in  &>> results.out

will execute handywork and then append both stdout and stderr to the file results.out rather than overwriting its existing content.

Running Commands in the Background

Throughout this book we will be going beyond one-line commands and will be building complex scripts. Some of these scripts can take a significant amount of time to execute, so much so that you may not want to spend time waiting for them to complete. Instead, you can run any command or script in the background using the & operator. The script will continue to run, but you can issue other commands and/or run other scripts. For example, to run ping in the background and redirect standard output to a file:

ping 192.168.10.56 > ping.log &

You will likely want to redirect standard output and/or standard error to a file when sending tasks to the background, or the task will continue to print to the screen and interrupt other activities you are performing.

Warning

Be cautious not to confuse &, which is used to send a task to the background, with &>, which is used to perform a combined redirect of standard output and standard error.

You can use the jobs command to list any tasks currently running in the background.

$ jobs
[1]+  Running                 ping 192.168.10.56 > ping.log &

Use the fg command and the corresponding job number to bring the task back into the foreground.

$ fg 1
ping 192.168.10.56 > ping.log

If your task is currently executing in the foreground you can use CTRL-Z to suspend the process and then bg to continue the process in the background. From there you can use jobs and fg as described above.
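
For example, suppose you started the ping from above in the foreground and then decided it should run in the background instead. The interaction would look something like this (the exact job-control messages vary slightly between systems):

$ ping 192.168.10.56 > ping.log
^Z
[1]+  Stopped                 ping 192.168.10.56 > ping.log
$ bg
[1]+ ping 192.168.10.56 > ping.log &
$ jobs
[1]+  Running                 ping 192.168.10.56 > ping.log &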

From Command Line to Script

A shell script is just a file that contains the same commands that you could type on the command line. Put one or more commands into a file and you have a shell script. If you called your file myscript, you can run the script by typing bash myscript, or you can give it “execute permission” (e.g., chmod 755 myscript) and then invoke it directly: ./myscript. We often include the following line as the first line of the script, which tells the operating system which scripting language we are using:

#!/bin/bash -

Of course this assumes that bash is located in the /bin directory. If your script needs to be more portable, you could use this approach instead:

#!/usr/bin/env bash

It uses the env command to look up the location of bash and is considered the standard way to address the portability problem. It makes the assumption, however, that the env command is to be found in /usr/bin.
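
As a minimal sketch, assuming you saved the following two lines in a file named hello.sh (a name we chose for illustration), you could make it executable and run it like this:

$ cat hello.sh
#!/usr/bin/env bash
echo "hello from your first script"
$ chmod 755 hello.sh
$ ./hello.sh
hello from your first script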

Summary

In this chapter you saw how to run single commands and redirect input and output. In the next chapter we will discuss the real power of scripting, which comes from being able to run commands repeatedly, make decisions in the script, and loop over a variety of inputs.

Exercises

  1. Write a command that executes ifconfig and redirects standard output to a file named ipaddress.txt.

  2. Write a command that executes ifconfig and redirects standard output and appends it to a file named ipaddress.txt.

  3. Write a command that copies all of the files in the directory /etc/a to the directory /etc/b and redirects standard error to the file copyerror.log.

  4. Write a command that performs a directory listing (ls) on the root file directory and pipes the output into the more command.

  5. Write a command that executes mytask.sh and sends it to the background.

  6. Given the job list below, write the command that brings the Amazon ping task to the foreground.

    [1]   Running                 ping www.google.com > /dev/null &
    [2]-  Running                 ping www.amazon.com > /dev/null &
    [3]+  Running                 ping www.oreilly.com > /dev/null &

Chapter 2. Bash Primer

Bash should be thought of as a programming language whose default operation is to launch other programs. Here is a brief look at some of the features that make bash a powerful programming language, especially for scripting.

Output

As with any programming language, bash has the ability to output information to the screen. Output can be achieved using the echo command.

$ echo "Hello World"

Hello World

You may also use the printf command, which allows for some additional formatting.

$ printf "Hello World"

Hello World
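
For example, printf can substitute values into a format string using specifiers such as %s (string) and %d (decimal integer); the name and number below are arbitrary sample values:

$ printf "%s is %d years old\n" "Pat" 25

Pat is 25 years old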

Variables

Bash variables begin with an alphabetic character or underscore followed by alphanumeric characters. They are string variables unless declared otherwise. To assign a value to the variable, you write something like this:

MYVAR=textforavalue

To retrieve the value of that variable, for example to print out the value using the echo command, you use the $ in front of the variable name, like this:

echo $MYVAR

If you want to assign a series of words to the variable, that is, to preserve any whitespace, use quotation marks around the value, as in:

MYVAR='here is a longer set of words'
OTHRV="either double or single quotes will work"

The use of double quotes will allow other substitutions to occur inside the string. For example:

firstvar=beginning
secondvr="this is just the $firstvar"
echo $secondvr

This will result in the output: this is just the beginning

There are a variety of substitutions that can occur when retrieving the value of a variable; we will show those as we use them in the scripts to follow.

Warning

Remember that by using double quotes (") any substitutions that begin with the $ will still be made, whereas inside single quotes (') no substitutions of any sort are made.
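
To make the difference concrete, here is a small example reusing the firstvar variable from above:

echo "this is just the $firstvar"     # prints: this is just the beginning
echo 'this is just the $firstvar'     # prints: this is just the $firstvar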

You can also store the output of a shell command using $( ) as follows:

CMDOUT=$(pwd)

That will execute the command pwd in a sub-shell and, rather than printing the result to stdout, it will store the output of the command in the variable CMDOUT. You can also pipe together multiple commands within the $( ).
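
For example, this (purely illustrative) assignment pipes ls into wc to capture a count of the entries in the current directory; FILECNT is just a variable name we made up:

FILECNT=$(ls | wc -l)
echo "there are $FILECNT entries here"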

Positional Parameters

It is common when using command line tools to pass data into the commands using arguments or parameters. Each parameter is separated by the space character and is accessed inside of bash using a special set of identifiers. In a bash script, the first parameter passed into the script can be accessed using $1, the second using $2, and so on. $0 is a special parameter that holds the name of the script, and $# returns the total number of parameters. Take the following script:

Example 2-1. echoparams.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# echoparams.sh
#
# Description:
# Demonstrates accessing parameters in bash
#
# Usage:
# ./echoparams.sh <param 1> <param 2> <param 3>
#

echo $#
echo $0
echo $1
echo $2
echo $3

This script first prints out the number of parameters ($#), then the name of the script ($0), and then the first three parameters. Here is the output:

$ ./echoparams.sh bash is fun

3
./echoparams.sh
bash
is
fun

Input

User input is received in bash using the read command. The read command will obtain user input from the command line and store it in a specified variable. The script below reads user input into the MYVAR variable and then prints it to the screen.

read MYVAR
echo "$MYVAR"

Conditionals

Bash has a rich variety of conditionals. Many, but not all, begin with the keyword if.

Any command or program that you invoke in bash may produce some output, but it will also always return a success or failure value. In the shell this value can be found in the $? variable immediately after a command has run. A return value of 0 is considered “success” or “true”; any non-zero value is considered “error” or “false”. The simplest form of the if statement uses this fact. It takes the form:

if cmd
then
   other cmds
fi

For example, the script below attempts to change directories to /tmp. If that command is successful (returns 0) the body of the if statement will execute.

if  cd /tmp
then
    echo "here is what is in /tmp:"
    ls -l
fi

Bash can even handle a pipeline of commands in a similar fashion:

if ls | grep pdf
then
    echo found one or more pdf files here
fi

With a pipeline, it is the success/failure of the last command in the pipeline that determines if the “true” branch is taken. Here is an example where that fact matters:

ls | grep pdf | wc

This series of commands will be “true” even if no pdf string is found by the grep command. That is because the wc command (a word count of the input) will print:

0       0       0

That output indicates 0 characters, 0 words, and 0 lines when no output comes from the grep command. That is still a successful (or true) result, not an error or failure. It counted as many lines as it was given, even if it was given zero lines to count.

A more typical form of if used for comparisons makes use of the compound command [[ or the shell built-in commands [ or test. Use these to test file attributes or to make comparisons of value.

To test if a file exists on the file system:

if [[ -e $FILENAME ]]
then
    echo $FILENAME exists
fi

Table 2-1 lists additional tests that can be done on files using if comparisons.

Table 2-1. File Test Operators

File Test Operator   Use
-d                   Test if a directory exists
-e                   Test if a file exists
-r                   Test if a file exists and is readable
-w                   Test if a file exists and is writable
-x                   Test if a file exists and is executable

To test if the variable $VAL is less than the variable $MIN:

if [[ $VAL -lt $MIN ]]
then
    echo "value is too small"
fi

Table 2-2 lists additional numeric tests that can be done using if comparisons.

Table 2-2. Numeric Test Operators

Numeric Test Operator   Use
-eq                     Test for equality between numbers
-gt                     Test if one number is greater than another
-lt                     Test if one number is less than another

Warning

Be cautious of using the < symbol. Take the following code:

if [[ $VAL < $OTHR ]]

In this context the < operator performs a lexical (alphabetical) comparison, not a numerical one. That means that 12 is “less than” 2, since the strings sort in that order: just as a < b, 1 < 2, but also 12 < 2 (and 12 < 2anything).

If you want to do numerical comparisons with the less-than sign, use the double parentheses construct. It assumes that the variables are all numerical and will evaluate them as such. Empty or unset variables are evaluated as 0. Inside the parentheses you don’t need the $ operator to retrieve a value, except for positional parameters like $1 and $2 (so as not to confuse them with the constants 1 and 2). For example:

if (( VAL < 12 ))
then
    echo "value $VAL is too small"
fi

In bash you can even make branching decisions without an explicit if/then construct. Commands are typically separated by a newline - that is, they appear one per line. You can get the same effect by separating them with a semicolon. If you write cd $DIR ; ls then bash will perform the cd and then the ls.

Two commands can also be separated by either && or || symbols. If you write cd $DIR && ls then the ls command will run only if the cd command succeeds. Similarly if you write cd $DIR || echo cd failed the message will be printed only if the cd fails.

You can use the [[ syntax to make various tests, even without an explicit if.

[[ -d $DIR ]] && ls "$DIR"

means the same as if you had written

if [[ -d $DIR ]]
then
  ls "$DIR"
fi
Warning

When using && or || you will need to group multiple statements if you want more than one action within the “then” clause. For example:

[[ -d $DIR ]] || echo "error: no such directory: $DIR" ; exit

will always exit, whether or not $DIR is a directory.

What you probably want is this:

[[ -d $DIR ]] || { echo "error: no such directory: $DIR" ; exit ; }

where the braces will group both statements together.

Looping

Looping with a while statement is similar to the if construct in that it can take a single command or a pipeline of commands for the decision of true or false. It can also make use of the brackets or parentheses as in the if examples, above.

In some languages, braces ({ }) are used to group the statements that make up the body of the while loop. In others, like Python, indentation indicates which statements are the loop body. In bash, however, the statements are grouped between two keywords: do and done.

Here is a simple while loop:

i=0
while (( i < 1000 ))
do
    echo $i
    let i++
done

The loop above will execute while the variable i is less than 1000. Each time the body of the loop executes it will print the value of i to the screen. It then uses the let command to execute i++ as an arithmetic expression, thus incrementing i by 1 each time.

Here is a more complicated while loop that executes commands as part of its condition.

while ls | grep -q pdf
do
    echo -n 'there is a file with pdf in its name here: '
    pwd
    cd ..
done

A for loop is also available in bash - in three variations.

Simple numerical looping can be done using the double parentheses construct. It looks much like the for loop in C or Java, but with double parentheses and with do and done instead of braces:

for ((i=0; i < 100; i++))
do
    echo $i
done

Another useful form of the for loop is used to iterate through all the parameters that are passed to a shell script (or function within the script), that is, $1, $2, $3, and so on. Note that ARG in args.sh can be replaced with any variable name of your choice.

for ARG
do
    echo here is an argument: $ARG
done

Here is the output of args.sh when three parameters are passed in.

$ ./args.sh bash is fun

here is an argument: bash
here is an argument: is
here is an argument: fun

Finally, for an arbitrary list of values, use a similar form of the for statement simply naming each of the values you want for each iteration of the loop. That list can be explicitly written out, like this:

for VAL in 20 3 dog peach 7 vanilla
do
    echo $VAL
done

The values used in the for loop can also be generated by calling other programs or using other shell features:

for VAL in $(ls | grep pdf) {0..5}
do
    echo $VAL
done

Here the variable VAL will take, in turn, the value for each of the filenames that ls piped into grep finds with the letters pdf in its filename (e.g. “doc.pdf” or “notapdfile.txt”) and then each of the numbers 0 through 5. It may not be that sensible to have the variable VAL be a filename sometimes and a single digit another time, but this shows you that it can be done.

Functions

Define a function with syntax like this:

function myfun ()
{
  # body of the function goes here
}

Not all that syntax is necessary - you can use either "function" or "()" - you don’t need both. We recommend, and will be using, both - mostly for readability.

There are a few important considerations to keep in mind with bash functions:

  • Unless declared with the local builtin command inside the function, variables are global in scope. A for loop which sets and increments i could be messing with the value of i used elsewhere in your code.

  • The braces are the most commonly used grouping for the function body, but any of the shell’s compound command syntax is allowed - though why, e.g., would you want the function to run in a sub-shell?

  • Redirecting I/O on the braces does so for all the statements inside the function. Examples of this will be seen in upcoming chapters.

  • No parameters are declared in the function definition. Whatever and however many arguments are supplied on the invocation of the function are passed to it.

The function is called (invoked) just like any command is called in the shell. Having defined myfun as a function you can call it like this:

myfun 2 /arb "14 years"

which calls the function myfun supplying it with 3 arguments.

Function Arguments

Inside the function definition, arguments are referred to in the same way as parameters to the shell script — as $1, $2, etc. Realize that this means that they “hide” the parameters originally passed to the script. If you want access to the script’s first parameter, you need to store $1 into a variable before you call the function (or pass it as a parameter to the function).

Other variables are set accordingly, too. $# gives the number of arguments passed to the function, whereas normally it gives the number of arguments passed to the script itself. The one exception to this is $0 - it doesn’t change in the function. It retains its value as the name of the script (and not of the function).
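
Here is a small sketch (the function name and messages are our own) showing these variables in action:

function showargs ()
{
    echo "the function received $# arguments"
    echo "the first argument is: $1"
    echo "the script name is still: $0"
}

showargs alpha "beta gamma"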

Returning Values

Functions, like commands, should return a status - a 0 if all goes well and a non-zero value if some error has occurred. To return some other kind of value - a pathname or a computed value, for example - you can either set a variable to hold that value (variables are global unless declared local within the function), or you can send the result to stdout, that is, print the answer. Just don’t try to do both.

Warning

If you print the answer you’ll typically use that output as part of a pipeline of commands (e.g., myfunc args | next step | etc ) or you’ll capture the output like this: RESVAL=$( myfunc args ). In both cases the function will be run in a sub-shell and not in the current shell. Thus changes to any global variables will only be effective in that sub-shell and not in the main shell instance. They are effectively lost.
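
As a sketch of the print-and-capture approach (the function and variable names are ours):

function count_pdfs ()
{
    ls | grep -c pdf     # print the count of matching filenames to stdout
}

NUMPDF=$(count_pdfs)     # the function runs in a sub-shell; we capture its output
echo "found $NUMPDF files with pdf in the name"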

Pattern Matching in bash

When you need to name a lot of files on a command line, you don’t need to type each and every name. Bash provides pattern matching (sometimes called “wildcarding”) to allow you to specify a set of files with a pattern. The easiest one is simply an asterisk * (or “star”) which will match any number of any characters. When used by itself, therefore, it matches all files in the current directory. The asterisk can be used in conjunction with other characters. For example *.txt matches all the files in the current directory which end with the four characters .txt. The pattern /usr/bin/g* will match all the files in /usr/bin that begin with the letter g.

Another special character in pattern matching is ? the question mark, which matches a single character. For example, source.? will match source.c or source.o but not source.py or source.cpp.

The last of the three special pattern matching characters is [ ], the pair of square brackets. A match can be made with any one of the characters listed inside the square brackets, so the pattern x[abc]y matches any or all of the files named xay, xby, or xcy, assuming they exist. You can specify a range within the square brackets, like [0-9] for all digits. If the first character within the brackets is either a ! or a ^ then the pattern means anything other than the remaining characters in the brackets. For example, [aeiou] would match a vowel whereas [^aeiou] would match any character except the vowels (including digits and punctuation characters).
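
For instance, assuming the current directory contains files named report.txt, notes.txt, source.c, and source.py (hypothetical names), the patterns behave like this:

ls *.txt          # lists report.txt and notes.txt
ls source.?       # lists source.c but not source.py
ls source.[cp]    # lists source.c; source.py has two characters after the dot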

Similar to ranges, you can specify character classes within the brackets. Table 2-3 lists the character classes and their description.

Table 2-3. Pattern Matching Character Classes

Character Class   Description
[:alnum:]         Alphanumeric
[:alpha:]         Alphabetic
[:ascii:]         ASCII
[:blank:]         Space and Tab
[:cntrl:]         Control Characters
[:digit:]         Number
[:graph:]         Anything Other Than Control Characters and Space
[:lower:]         Lowercase
[:print:]         Anything Other Than Control Characters
[:punct:]         Punctuation
[:space:]         Whitespace Including Line Breaks
[:upper:]         Uppercase
[:word:]          Letters, Numbers, and Underscore
[:xdigit:]        Hexadecimal

Character classes are specified like this: [:cntrl:] within square brackets (so you have two sets of []). For example, this pattern: *[[:punct:]]jpg will match any filename that has any number of any characters followed by a punctuation character followed by the letters jpg. So it would match files named wow!jpg or some,jpg or photo.jpg but not a file named this.is.myjpg since there is no punctuation character right before the jpg.

There are more complex aspects of pattern matching if you turn on the shell option extglob (like this: shopt -s extglob) so that you can repeat patterns or negate patterns. We won’t need these in our example scripts but we encourage you to learn about them (e.g., via the bash man page).

There are a few things to keep in mind when using shell pattern matching:

  • Patterns aren’t regular expressions (discussed later); don’t confuse the two.

  • Patterns are matched against files in the file system; if the pattern begins with a pathname (e.g., /usr/lib ) then the matching will be done against files in that directory.

  • If no pattern is matched, the shell will use the special pattern matching characters as literal characters of the filename; for example, if your script says echo data > /tmp/*.out but there is no file in /tmp that ends in .out then the shell will create a file called *.out in the /tmp directory. Remove it like this: rm /tmp/\*.out by using the backslash to tell the shell not to pattern match with the asterisk.

  • No pattern matching occurs inside of quotes (either double or single quotes), so if your script says echo data > "/tmp/*.out" it will create a file called /tmp/*.out (which we recommend you avoid doing).

Note

The dot, or period, is just an ordinary character and has no special meaning in shell pattern matching - unlike in regular expressions which will be discussed later.

Writing Your First Script - Detecting Operating System Type

Now that we have gone over the fundamentals of the command line and bash, you are ready to write your first script. The bash shell is available on a variety of platforms, including Linux, macOS, and Windows (e.g., via Git Bash). As you write more complex scripts in the future, it is imperative that you know which operating system you are interacting with, as each one has a slightly different set of commands available. The osdetect.sh script helps you make that determination.

The general idea of the script is that it will look for a command that is unique to a particular operating system. The limitation is that on any given system an administrator may have created and added a command with that name, so this is not foolproof.

Example 2-2. osdetect.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# osdetect.sh
#
# Description:
# Distinguish between MS-Windows/Linux/MacOS
#
# Usage:
# Output will be one of: Linux MSWin macOS
#

if type -t wevtutil &> /dev/null           1
then
    OS=MSWin
elif type -t scutil &> /dev/null           2
then
    OS=macOS
else
    OS=Linux
fi
echo $OS
1

We use the type built-in in bash to tell us what kind of a command (alias, keyword, function, built-in, or file) its arguments are. The -t option tells it to print nothing if the command isn’t found. The command returns as “false” in that case. We redirect all the output (both stdout and stderr) to /dev/null thereby throwing it away, as we only want to know if the wevtutil command was found.

2

Again we use the type built-in but this time we are looking for the scutil command which is available on macOS systems.

Summary

The bash shell can be seen as a programming language, one with variables and if/then/else statements, loops, and functions. It has its own syntax, similar in many ways to other programming languages, but just different enough to catch you if you’re not careful.

It has its strengths - like easily invoking other programs or connecting sequences of other programs - and it has its weaknesses: it doesn’t have floating point arithmetic or much support (though some) for complex data structures.

In the chapters ahead we will describe and use many bash features and OS commands in the context of cybersecurity operations. We will further explore some of the features we have touched on here, and other more advanced or obscure features. Keep your eyes out for those features, and practice and use them in your own scripting.

Exercises

  1. Experiment with the uname command, seeing what it prints on the various operating systems. Re-write the osdetect.sh script to use the uname command, possibly with one of its options. Caution: not all options are available on every operating system.

  2. Modify the osdetect.sh script to use a function. Put the if/then/else logic inside the function and then call it from the script. Don’t have the function itself do any output. Make the output come from the main part of the script.

  3. Set the permissions on the osdetect.sh script to be executable (see man chmod) so that you can run the script without using bash as the first word on the command line. How do you now invoke the script?

  4. Write a script called argcnt.sh that tells how many arguments are supplied to the script.

    1. Modify your script to have it also echo each argument one per line.

    2. Modify your script further to label each argument like this:

      $ bash argcnt.sh this is a "real live" test
      there are 5 arguments
      arg1: this
      arg2: is
      arg3: a
      arg4: real live
      arg5: test
      $
  5. Modify argcnt.sh so it only lists the even arguments.

Chapter 3. Regular Expressions

Regular expressions (regex) are a powerful method for describing a text pattern to be matched by various tools. There is only one place in bash where regular expressions are valid, using the =~ comparison in the [[ compound command, as in an if statement. However, regular expressions are a crucial part of the larger toolkit for commands like grep, awk, and sed in particular. They are very powerful and thus worth knowing. Once mastered, you’ll wonder how you ever got along without them.

For many of the examples in this chapter we will be using the file frost.txt with its seven, yes seven, lines of text.

Example 3-1. frost.txt
1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost

The content of frost.txt will be used to demonstrate the power of regular expressions to process text data. This text was chosen because it requires no prior knowledge to understand.

Commands in Use

We introduce the grep family of commands to demonstrate the basic regex patterns.

grep

The grep command searches the content of the files for a given pattern and prints any line where the pattern is matched. To use grep, you need to provide it with a pattern and one or more filenames (or piped data).

Common Options

-c

Count the number of lines that match the pattern.

-E

Enable extended regular expressions

-f

Read the search pattern from a provided file. A file can contain more than one pattern, with each line containing a single pattern.

-i

Ignore character case.

-l

Only print the file name and path where the pattern was found.

-n

Print the line number of the file where the pattern was found.

-P

Enables the Perl regular expression engine.

-R, -r

Recursively search sub-directories.

Command Example

In general, the way grep is used is like this: grep options pattern filenames

To search the /home directory and all sub-directories for files containing the word password irrespective of uppercase/lowercase distinctions:

grep -R -i 'password' /home

grep and egrep

The grep command supports some variations, notably an extended syntax for the regex patterns (we’ll discuss the regex patterns next). There are three different ways to tell grep that you want special meaning on certain characters: 1) by preceding those characters with a backslash; or 2) by telling grep that you want the special syntax (without the need for backslash) by using the -E option when you invoke grep; or 3) by using the command named egrep which is just a script that simply invokes grep as grep -E so you don’t have to.

The only characters that are affected by the extended syntax are: ? + { | ( and ). In the examples that follow we will use grep and egrep interchangeably - they are the same binary underneath. We will choose the one to use that seems most appropriate based on what special characters we need. The special, or meta-, characters are what make grep so powerful. Here is what you need to know about the most powerful and frequently used metacharacters.

Regular Expression Metacharacters

Regular expressions are patterns that are created using a series of characters and metacharacters. Metacharacters such as "?" and "*" have special meaning beyond their literal meaning in regex.

The “.” Metacharacter

In regex, the “.” represents a single wildcard character. It will match on any single character except for a newline. As can be seen in the example below, if we try to match on the pattern T.o the first line of the frost.txt file is returned because it contains the word Two.

$ grep 'T.o' frost.txt

1    Two roads diverged in a yellow wood,

Note that line 5 is not returned even though it contains the word To. This pattern allows any character to appear between the T and o, but as written there must be a character in between. Regex patterns are also case sensitive, which is why line 3 of the file was not returned even though it contains the string too. If you want to treat "." as a period character rather than a wildcard, precede it with a backslash "\." to escape its special meaning.

The “?” Metacharacter

In regex, the “?” character makes any item that precedes it optional; it matches it zero or one time. By adding this metacharacter to the previous example we can see that the output is different.

$ egrep 'T.?o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;

This time we see that both lines 1 and 5 are returned. This is because the metacharacter "." is optional due to the "?" metacharacter that follows it. This pattern will match on any three-character sequence that begins with T and ends with o as well as the two-character sequence To.

Notice that we are using egrep here. We could have used grep -E or we could have used “plain” grep with a slightly different pattern: T.\?o putting the backslash on the question mark to give it the extended meaning.

The “*” Metacharacter

In regex, the "*" is a special character that matches the preceding item zero or more times. It is similar to the "?", the main difference being that the previous item may appear more than once.

$ grep 'T.*o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The ".*" in the pattern above allows any number of any character to appear in between the T and o. Thus the last line also matches because it contains the pattern The Ro.

The “+” Metacharacter

The "+" metacharacter is the same as the "*" except it requires the preceding item to appear at least once. In other words it matches the preceding item one or more times.

$ egrep 'T.+o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The pattern above specifies one or more of any character to appear in between the T and o. The first line of text matches because of Two - the w is 1 character between the T and the o. The second line doesn’t match the To, as in the previous example; rather, the pattern matches a much larger string — all the way to the o in undergrowth. The last line also matches because it contains the pattern The Ro.

Grouping

We can use parentheses to group together characters. Among other things, this allows us to treat the characters appearing inside the parentheses as a single item that we can later reference.

$ egrep 'And be one (stranger|traveler), long I stood' frost.txt

3    And be one traveler, long I stood

In the example above we use parentheses and the Boolean OR operator "|" to create a pattern that will match on line 3. Line 3 as written has the word traveler in it, but this pattern would match even if traveler were replaced by the word stranger.

Brackets and Character Classes

In regex the square brackets, [ ], are used to define character classes and lists of acceptable characters. Using this construct you can list exactly which characters are matched at this position in the pattern. This is particularly useful when trying to perform user input validation. As a shorthand you can specify ranges with a dash such as [a-j]. These ranges are in your locale’s collating sequence and alphabet. For the C locale, the pattern [a-j] will match one of the letters a through j. Table 3-1 provides a list of common examples when using character classes and ranges.

Table 3-1. Regex character ranges

Example       Meaning
[abc]         Match only the character a or b or c
[1-5]         Match on digits in the range 1 to 5
[a-zA-Z]      Match any lowercase or uppercase a to z
[0-9+*/-]     Match on numbers or these 4 mathematical symbols
[0-9a-fA-F]   Match a hexadecimal digit

Warning

Be careful when defining a range for digits; the range can at most go from 0 to 9. For example, the pattern [1-475] does not match numbers between 1 and 475; it matches any one of the characters in the range 1-4, or the character 7, or the character 5.

There are also predefined character classes known as shortcuts. These can be used to indicate common character classes such as numbers or letters. See Table 3-2 for a list of shortcuts.

Table 3-2. Regex shortcuts

Shortcut   Meaning
\s         Whitespace
\S         Not Whitespace
\d         Digit
\D         Not Digit
\w         Word
\W         Not Word
\x         Hexadecimal Number (e.g., 0x5F)

Note that the above shortcuts are not supported by egrep. In order to use them you must use grep with the -P option. That option enables the Perl regular expression engine to support the shortcuts. For example, to find any numbers in frost.txt:

$ grep -P '\d' frost.txt

1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost

There are other character classes (with a more verbose syntax) that are valid only within the bracket syntax, as seen in Table 3-3. They match a single character, so if you need to match many in a row, use the star or plus to get the repetition you need.

Table 3-3. Regex character classes in brackets

Character Class   Meaning
[:alnum:]         any alphanumeric character
[:alpha:]         any alphabetic character
[:cntrl:]         any control character
[:digit:]         any digit
[:graph:]         any graphical character
[:lower:]         any lowercase character
[:print:]         any printable character
[:punct:]         any punctuation
[:space:]         any whitespace
[:upper:]         any uppercase character
[:xdigit:]        any hex digit

To use one of these classes it has to be inside the brackets, so you end up with two sets of brackets. For example: grep '[[:cntrl:]]' large.data will look for lines containing control characters (ASCII 0-31, plus 127). Here is another example:

grep 'X[[:upper:][:digit:]]' idlist.txt

will match any line with an X followed by any uppercase letter or digit. It would match these lines:

User: XTjohnson
an XWing model 7
an X7wing model

They each have an uppercase X followed immediately by either another uppercase letter or by a digit.

Back References

Regex back references are one of the most powerful and often confusing regex operations. Consider the following file, tags.txt:

1    Command
2    <i>line</i>
3    is
4    <div>great</div>
5    <u>!</u>

Suppose you want to write a regular expression that will extract any line that contains a matching pair of complete HTML tags. The start tag has an HTML tag name; the ending tag has the same tag name but with a leading slash. <div> and </div> are a matching pair. You could search for these by writing a lengthy regex that contains all possible HTML tag values, or you can focus on the format of an HTML tag and use a regex back reference.

$ egrep '<([A-Za-z]*)>.*</\1>' tags.txt

2    <i>line</i>
4    <div>great</div>
5    <u>!</u>

In this example, the back reference is the \1 appearing in the latter part of the regular expression. It is referring back to the expression enclosed in the first set of parentheses, [A-Za-z]*, which has two parts. The letter range in brackets denotes a choice of any letter, uppercase or lowercase. The asterisk (or star) that follows it means to repeat that zero or more times. Therefore the \1 refers to whatever was matched by that pattern in parentheses. If [A-Za-z]* matches div then the \1 also refers to the pattern div.

The overall regular expression, then, can be described as matching a < sign (that literal character is the first one in the regex) followed by zero or more letters, then a > sign, and then zero or more of any character (“.” for any character, “*” for zero or more of the previous item), followed by another < and a slash, and then the sequence matched by the expression within the parentheses, and finally a > character. If this sequence matches any part of a line from our text file then egrep will print that line out.

You can have more than one back reference in an expression and refer to each with a \1 or \2 or \3 depending on its order in the regular expression. A \1 refers to the first set of parentheses, \2 to the second, and so on. Note that the parentheses are metacharacters - they have a special meaning. If you just want to match a literal parenthesis you need to escape its special meaning by preceding it with a backslash, as in: sin\([0-9.]*\) to match expressions like: sin(6.2) or sin(3.14159).

Note

Valid HTML doesn’t have to be all on one line; the end tag can be several lines away from the start tag. Moreover, some tags can both start and end in a single tag, such as <br/> for a break, or <p/> for an empty paragraph. We would need a more sophisticated approach to include such things in our search.

Quantifiers

Quantifiers specify the number of times an item must appear in a string. Quantifiers are defined by the curly brackets { }. For example, the pattern T{5} means that the letter T must appear consecutively exactly 5 times. The pattern T{3,6} means that the letter T must appear consecutively 3 to 6 times. The pattern T{5,} means that the letter T must appear 5 or more times.
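
For a quick illustration using frost.txt, the pattern l{2} requires two consecutive l characters, which appear only in the word yellow:

$ egrep 'l{2}' frost.txt

1    Two roads diverged in a yellow wood,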

Anchors and Word Boundaries

You can use anchors to specify that a pattern must exist at the beginning or the end of a string. The ^ character is used to anchor a pattern to the beginning of a string. For example ^[1-5] means that a matching string must start with one of the digits 1 through 5 as the first character on the line. The $ character is used to anchor a pattern to the end of a string or line. For example [1-5]$ means that a string must end with one of the digits 1 through 5.

In addition, you can use \b to identify a word boundary, that is, the transition between a word character (a letter, digit, or underscore) and a non-word character such as a space or punctuation mark. The pattern \b[1-5]\b will match on any of the digits 1 through 5 where the digit appears as its own word.
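
For example, against frost.txt, ^[1-3] anchors the match to lines that begin with the digits 1 through 3, and \b restricts a match to the standalone word one (here we use grep -P for the \b shortcut):

$ egrep '^[1-3]' frost.txt

1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood

$ grep -P '\bone\b' frost.txt

3    And be one traveler, long I stood
4    And looked down one as far as I could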

Summary

Regular expressions are extremely powerful for describing patterns and can be used in coordination with other tools to search and process data.

The uses and full syntax of regex far exceeds the scope of this book. You can visit the resources below for additional information and utilities related to regex.

In the next chapter we will discuss common data types relevant to security operations and how they can be gathered.

Exercises

  1. Write a regular expression that matches a floating point number (a number with a decimal point) such as 3.14. There can be digits on either side of the decimal point but there need not be any on one side or the other. Allow it to match just a decimal point by itself, too.

  2. Use a back reference in a regular expression to match a number that appears on both sides of an equal sign. For example, it should match “314 is = to 314” but not “6 = 7”

  3. Write a regular expression that looks for a line that begins with a digit and ends with a digit, with anything occurring in between.

  4. Write a regular expression that uses grouping to match on the following 2 IP addresses: 10.0.0.25 and 10.0.0.134.

  5. Write a regular expression that will match if the hexadecimal string 0x90 occurs more than 3 times in a row (i.e. 0x90 0x90 0x90).

Chapter 4. Data Collection

Data is the lifeblood of nearly every defensive security operation. Data tells you the current state of the system, what has happened in the past, and even what might happen in the future. Data is needed for forensic investigations, verifying compliance, and detecting malicious activity. Table 4-1 describes data that is commonly relevant to defensive operations and where it is typically located.

Table 4-1. Data of Interest

Log Files
  Description: Details on historical system activity and state. Interesting log files include web and DNS server logs; router, firewall, and intrusion detection system logs; and application logs.
  Location: In Linux most log files are located in the /var/log directory. In a Windows system logs are found in the Event Log.

Command History
  Description: List of recently executed commands.
  Location: In Linux the location of the history file can be found by executing echo $HISTFILE; it is typically located in the user’s home directory in .bash_history.

Temporary Files
  Description: Various user and system files that were recently accessed, saved, or processed.
  Location: In Windows, temp files can be found in c:\windows\temp and %USERPROFILE%\AppData\Local\. In Linux temp files are typically located in /tmp and /var/tmp. The Linux temporary directory can also be found by using the command echo $TMPDIR.

User Data
  Description: Documents, pictures, and other user-created files.
  Location: User files are typically located in /home/ in Linux and c:\Users\ in Windows.

Browser History
  Description: Web pages recently accessed by the user.
  Location: Varies widely based on operating system and browser.

Windows Registry
  Description: Hierarchical database that stores settings and other data that is critical to the operation of Windows and applications.
  Location: Windows Registry.

Throughout this chapter we will explore various methods to gather data, locally and remotely, from both Linux and Windows systems.

Commands in Use

We introduce cut, file, head, and, for Windows systems, reg and wevtutil, to gather and select data of interest from local and remote systems.

cut

cut is a command used to extract select portions of a file. It reads a supplied input file line-by-line and parses the line based on a specified delimiter. If no delimiter is specified cut will use a TAB character by default. The delimiter characters divide each line of a file into fields. You can use either the field number or character position number to extract parts of the file. Fields and characters start at position 1.

Common Command Options

-c

Specify the character(s) to extract.

-d

Specifies the character used as a field delimiter. By default, the delimiter is the TAB character.

-f

Specify the field(s) to extract.

Command Example

Example 4-1. cutfile.txt
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html

In cutfile.txt each field is delimited using a space. To extract the IP address (field position 2) you can use the following command:

$ cut -d' ' -f2 cutfile.txt

192.168.10.14
192.168.10.185

The -d' ' option specifies the space as the field delimiter. The -f2 option tells cut to return the second field, in this case, the IP address.

Warning

The cut command considers each delimiter character as separating a field. It doesn’t collapse white space. Consider the following example:

Pat   25
Pete  12

If we use cut on this file we would define the delimiter to be a space. In the first record there are 3 spaces between the name (Pat) and the number (25). Thus the number is in field #4. However for the next line, the name (Pete) is in field #3, since there are only two space characters between the name and the number. For a data file like this, it would be better to separate the name from the numbers with a single tab character and use that as the delimiter for cut.

file

The file command is used to help identify a given file’s type. This is particularly useful in Linux, as most files are not required to have an extension that can be used to identify their type (cf. .exe in Windows). The file command looks deeper than the filename by reading and analyzing the first block of data, including any magic number that identifies the file format. Even if you rename a .png image file to end with .jpg, the file command is smart enough to figure that out and tell you the correct file type (in this case, a PNG image file).

Common Command Options

-f

Read the list of files to analyze from a given file

-k

Do not stop on the first match, list all matches for the file type

-z

Look inside compressed files

Command Example

To identify the file type just pass the filename to the file command.

$ file unknownfile

unknownfile: Microsoft Word 2007+

head

The head command displays the first few lines or bytes of a file. By default head displays the first 10 lines.

Common Command Options

-n

Specify the number of lines to output. To show 15 lines you can specify it as -n 15 or -15.

-c

Specify the number of bytes to output.
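
Command Example

To display only the first line of the cutfile.txt file shown earlier in this chapter:

$ head -1 cutfile.txt

12/05/2017 192.168.10.14 test.html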

reg

The reg command is used to manipulate the Windows Registry and is available in Windows XP and later.

Common Command Parameters

add

Adds an entry to the registry.

export

Copies the specified registry entries to a file.

query

Returns a list of subkeys below the specified path.

Command Example

To list all of the root keys in the HKEY_LOCAL_MACHINE hive:

$ reg query HKEY_LOCAL_MACHINE

HKEY_LOCAL_MACHINE\BCD00000000
HKEY_LOCAL_MACHINE\HARDWARE
HKEY_LOCAL_MACHINE\SAM
HKEY_LOCAL_MACHINE\SECURITY
HKEY_LOCAL_MACHINE\SOFTWARE
HKEY_LOCAL_MACHINE\SYSTEM

wevtutil

Wevtutil is a command line utility to view and manage system logs in the Windows environment. It is available in most modern versions of Windows and is callable from Git Bash.

Common Command Parameters

el

Enumerate available logs

qe

Query a log’s events

Common Command Options

/c

Specify the maximum number of events to read

/f

Format the output as text or XML

/rd

Read direction, if set to true it will read the most recent logs first

Warning

In the Windows command prompt only a single / is needed before command options. In the Git Bash terminal two slashes (//) are needed (e.g., //c) due to the way commands are processed.

Command Example

To list all of the available logs:

wevtutil el

To view the most recent event in the System log using Git Bash:

wevtutil qe System //c:1 //rd:true
Tip

For additional information see Microsoft’s documentation at https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/wevtutil

Gathering System Information

One of the first steps in defending a system is understanding the state of the system and what it is doing. To accomplish this you need to gather data, either locally or remotely, for analysis.

Executing a Command Remotely Using SSH

The data you want may not always be available locally. You may need to connect to a remote system such as a web, File Transfer Protocol (FTP), or Secure Shell (SSH) server to obtain the desired data.

Commands can be executed remotely and securely using the Secure Shell (SSH) if the remote system is running the SSH service. In its basic form (no options) you can just add ssh and a hostname in front of any shell command to run that command on the specified host. For example, ssh myserver who will run the who command on the remote machine myserver. If you need to specify a different username, ssh username@myserver who and ssh -l username myserver who both do the same thing; just replace username with the username you would like to use to log in. You can redirect the output to a file on your local system, or to a file on the remote system.

To run a command on a remote system and redirect the output to a file on your local system:

ssh myserver ps > /tmp/ps.out

To run a command on a remote system and redirect the output to a file on the remote system:

ssh myserver ps \> /tmp/ps.out

The backslash will escape the special meaning of the redirect (in the current shell) and simply pass the redirect character as the second word of the three words sent to myserver. When executed on the remote system it will be interpreted by that shell and redirect the output on the remote machine (myserver) and leave it there.

In addition you can take scripts that reside on your local system and run them on a remote system using SSH. To run the osdetect.sh script remotely:

ssh myserver bash < ./osdetect.sh

This runs the bash command on the remote system, but passes into it the lines of the osdetect.sh script directly from your local system. This avoids the need for a two-step process of, first, transferring the script to the remote system and then running that copied script. Output from running the script comes back to your local system and can be captured by re-directing stdout as we have shown with many other commands.

Gathering Linux Log Files

Log files for a Linux system are normally stored in the /var/log/ directory. To easily collect the log files into a single file use the tar command:

tar -czf ${HOSTNAME}_logs.tar.gz /var/log/

The option -c is used to create an archive file, -z to compress it with gzip, and -f to specify a name for the output file. The HOSTNAME variable is a bash variable that is automatically set by the shell to the name of the current host. We include it in our filename so the output file will be given the same name as the system, which will help later with organization if logs are collected from multiple systems. Note that you will need to be logged in as a privileged user or use sudo in order to successfully copy the log files.

Table 4-2 lists some important and common Linux logs and their standard location.

Table 4-2. Linux Log Files

Log Location         Description
/var/log/apache2/    Access and error logs for the Apache web server
/var/log/auth.log    Information on user logins, privileged access, and remote authentication
/var/log/kern.log    Kernel logs
/var/log/messages    General non-critical system information
/var/log/syslog      General system logs

To find more information on where log files are being stored for a given system refer to /etc/syslog.conf or /etc/rsyslog.conf on most Linux distributions.

Gathering Windows Log Files

In the Windows environment wevtutil can be used to manipulate and gather log files. Luckily this command is callable from Git Bash. The winlogs.sh script uses the wevtutil el parameter to list all available logs, and then the epl parameter to export each log to a file.

Example 4-2. winlogs.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# winlogs.sh
#
# Description:
# Gather copies of Windows log files
#
# Usage:
# winlogs.sh [-z]
#   -z Tar and zip the output
#

TGZ=0
if (( $# > 0 ))						1
then
    if [[ ${1:0:2} == '-z' ]]				2
    then
	TGZ=1	# tgz flag to tar/zip the log files
	shift	# remove the -z so a user-supplied log directory becomes $1
    fi
fi
SYSNAM=$(hostname)
LOGDIR=${1:-/tmp/${SYSNAM}_logs}			3

mkdir -p $LOGDIR					4

wevtutil el | while read ALOG				5
do
    ALOG="${ALOG%$'\r'}"				6
    echo "${ALOG}:"					7
    wevtutil epl "$ALOG"  "${LOGDIR}/${SYSNAM}_${ALOG// /_}.evtx"  8
done

if (( TGZ == 1 ))					  9
then
    cd ${LOGDIR} && tar -czvf ${SYSNAM}_logs.tgz *.evtx   10
fi
1

The script begins with a simple initialization and then an if statement, one that checks to see if any arguments were provided to the script. The $# is a special shell variable whose value is the number of arguments supplied on the command line when this script is invoked. This conditional for the if is an arithmetic expression, because of the double parentheses. Therefore the comparison can use the greater-than character > and it will do a numerical comparison. If that symbol is used in an if expression with square brackets rather than double parentheses, the greater-than character > does a comparison of lexical ordering — alphabetical order. You would need to use -gt for a numerical comparison inside square brackets.

For this script the only argument we are supporting is a -z option to indicate that the log files should all be zipped up into a single tar file when it’s done collecting log files. This also means that we can use a simplistic type of argument parsing. We will use a more sophisticated argument parser (getopts) in an upcoming script.

2

This check takes a substring of the 1st argument ($1) starting at the beginning of the string (an offset of zero bytes), two bytes long. If the argument is, in fact, a -z then we will set a flag. The script also does a shift to remove that argument. What was the second argument, if any, is now the first. The third, if any, becomes the second, and so on.

3

If the user wants to specify a location for the logs it can be specified as an argument to the script. The optional -z argument, if supplied, has already been shift-ed out of the way, so any user supplied path would now be the first argument. If no value was supplied on the command line then the expression inside the braces will return a default value as indicated to the right of the minus sign. We use the braces around SYSNAM because the _logs would otherwise be considered part of the variable name.

4

The -p option to mkdir will create the directory and any intervening directories. It will also not give an error message if the directory exists.

5

Here we invoke wevtutil el to list all the possible log files. The output is piped into a while loop which will read one line, that is, one log filename, at a time.

6

Since this is running on an MSWindows system, each line printed by wevtutil will end with both a newline (\n) and a return (\r) character. We remove the return character from the right-hand side of the string using the % operator. To specify the (non-printing) return character, we use the $'string' construct, which substitutes certain backslash-escaped characters with non-printing characters (as defined in the ANSI C standard). So the two characters \r are replaced with a single ASCII 13 character, the return character.
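
You can watch the removal happen in isolation; SAMPLE is just a throwaway variable holding a string that ends in a return character:

$ SAMPLE=$'Application\r'
$ SAMPLE="${SAMPLE%$'\r'}"
$ echo "[${SAMPLE}]"
[Application]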

7

We echo the filename to provide an indication to the user of progress being made and which log is currently being fetched.

8

The fourth word on this line is the filename into which we want wevtutil to store the log file it is producing. Since the name of the log as provided may have blanks we replace any blank with an underscore character. While not strictly necessary, it avoids requiring quotes when using the filename. The syntax, in general, is ${VAR/old/new} to retrieve the value of VAR with a substitution: replacing old with new. Using a double slash, ${VAR//old/new} replaces all occurrences, not just the first.
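
Here is the substitution in isolation, using a made-up log name that contains two blanks:

$ ALOG="Key Management Service"
$ echo "${ALOG/ /_}"
Key_Management Service
$ echo "${ALOG// /_}"
Key_Management_Service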

Warning

A common mistake is to type ${VAR/old/new/} but the trailing slash is not part of the syntax and will simply be added to the resulting string if a substitution is made. For example, if VAR=embolden then ${VAR/old/new/} would return embnew/en.

9

This is another arithmetic expression, enclosed in double parentheses. Within those expressions bash doesn’t require the $ in front of most variable names. It would still be needed for positional parameters like $1 to avoid confusion with the integer 1.

10

Here we separate two commands with a double ampersand && which tells the shell to execute the second command only if the first command succeeds. That way the tar doesn’t happen unless the cd is successful.

Gathering System Information

If you are able to arbitrarily execute commands on a system you can use standard OS commands to collect a variety of information about the system. The exact commands you use will vary based on the operating system you are interfacing with. Table 4-3 shows common commands that can yield a great deal of information from a system. Note that the command may differ depending on whether it is run in a Linux or Windows environment.

Table 4-3. Local Data Gathering Commands
Linux Command       Windows Git Bash Equivalent   Purpose
uname -a            uname -a                      Operating system version information
cat /proc/cpuinfo   systeminfo                    Display system hardware and related info
ifconfig            ipconfig                      Network interface information
route               route print                   Display routing table
arp -a              arp -a                        Display Address Resolution Protocol (ARP) table
netstat -a          netstat -a                    Display network connections
mount               net share                     Display file systems
ps -e               tasklist                      Display running processes

The script getlocal.sh, below, is designed to identify the operating system type using osdetect.sh, run the various commands appropriate for that operating system, and record the results to a file. The output from each command is stored in Extensible Markup Language (XML) format, i.e., delimited with XML tags, for easier processing later on. Invoke the script like this: bash getlocal.sh < cmds.txt, where the file cmds.txt contains a list of commands similar to that shown in Table 4-3. The format it expects is those same fields, separated by vertical bars, plus an additional field: the XML tag with which to mark the output of the command. (Also, lines beginning with a # are considered comments and will be ignored.)

Here is what a cmds.txt file might look like:

# Linux Command  |MSWin  Bash |XML tag    |Purpose
#----------------+------------+-----------+------------------------------
uname -a         |uname -a    |uname      |O.S. version etc
cat /proc/cpuinfo|systeminfo  |sysinfo    |system hardware and related info
ifconfig         |ipconfig    |nwinterface|Network interface information
route            |route print |nwroute    |routing table
arp -a           |arp -a      |nwarp      |ARP table
netstat -a       |netstat -a  |netstat    |network connections
mount            |net share   |diskinfo   |mounted disks
ps -e            |tasklist    |processes  |running processes

Here is the source for the script.

Example 4-3. getlocal.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# getlocal.sh
#
# Description:
# Gathers general system information and dumps it to a file
#
# Usage:
# bash getlocal.sh < cmds.txt
#   cmds.txt is a file with list of commands to run
#

# SepCmds - separate the commands from the line of input
function SepCmds()
{
      LCMD=${ALINE%%|*}                   11
      REST=${ALINE#*|}                    12
      WCMD=${REST%%|*}                    13
      REST=${REST#*|}
      TAG=${REST%%|*}                     14

      if [[ $OSTYPE == "MSWin" ]]
      then
         CMD="$WCMD"
      else
         CMD="$LCMD"
      fi
}

function DumpInfo ()
{                                                              5
    printf '<systeminfo host="%s" type="%s"' "$HOSTNAME" "$OSTYPE"
    printf ' date="%s" time="%s">\n' "$(date '+%F')" "$(date '+%T')"
    readarray CMDS                           6
    for ALINE in "${CMDS[@]}"                7
    do
       # ignore comments
       if [[ ${ALINE:0:1} == '#' ]] ; then continue ; fi     8

      SepCmds

      if [[ ${CMD:0:3} == N/A ]]             9
      then
          continue
      else
          printf "<%s>\n" $TAG               10
          $CMD
          printf "</%s>\n" $TAG
      fi
    done
    printf "</systeminfo>\n"
}

OSTYPE=$(./osdetect.sh)                     1
HOSTNM=$(hostname)                          2
TMPFILE="${HOSTNM}.info"                    3

# gather the info into the tmp file; errors, too
DumpInfo  > $TMPFILE  2>&1                  4
1

After the two function definitions the script begins here, invoking our osdetect.sh script (from a previous chapter). We’ve specified the current directory as its location. You could put it elsewhere but then be sure to change the specified path from ./ to wherever you put it and/or add that location to your PATH variable.

Note

To make things more efficient you can include the code from osdetect.sh directly in getlocal.sh.

2

Next we run the hostname program in a subshell to retrieve the name of this system for use in the next line but also later in the DumpInfo function.

3

We use the hostname as part of the temporary filename where we will put all our output.

4

Here is where we invoke the function that will do most of the work of this script. We redirect both stdout and stderr (to the same file) when invoking the function so that the function doesn’t have to put redirects on any of its output statements; it can write to stdout and this invocation will redirect all the output as needed. Another way to do this would have been to put the redirect on the closing brace of the DumpInfo function definition. Redirecting stdout might instead be left to the user who invokes this script; it would simply write to stdout by default. But if the user wants the output in a file, the user has to create a tempfile name and has to remember to redirect stderr as well. Our approach is suitable for a less experienced user.
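
That alternative would look something like the following sketch; it is shown only to illustrate the syntax and is not how getlocal.sh is actually written:

function DumpInfo ()
{
    # ... function body as above ...
    printf "</systeminfo>\n"
} > $TMPFILE 2>&1

With the redirection attached to the closing brace it is applied each time the function is called, so the call at the bottom of the script would then be simply DumpInfo with no redirects.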

5

Here is where the “guts” of the script begins. This function begins with some output of an XML tag called <systeminfo>, which will have its closing tag written out at the end of this function.

6

The readarray command in bash will read all the lines of input (until end-of-file, or, for keyboard input, until control-D is typed). Each line will be its own entry in the array named, in this case, CMDS.
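
As a standalone illustration you can try readarray at the prompt; type a couple of lines and end the input with control-D:

$ readarray CMDS
uname -a
hostname
$ echo "${#CMDS[@]}"
2

Each element of CMDS holds one complete line of what was typed.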

7

This for loop will loop over the values of the CMDS array, that is, over each line, one at a time.

8

This line uses the substring operation to take the character at position 0, of length 1, from the variable ALINE. The hashtag (or pound sign) is in quotes so that the shell doesn’t interpret it as the start of the script’s own comment.

If the line is not a comment, the script will call the SepCmds function. More about that function later; it separates the line of input into CMD and TAG, where CMD will be the appropriate command for a Linux or MSWindows system, depending on where we run the script.

9

Here again we use the substring operation from the start of the string (position 0), of length 3, to look for the string that indicates there is no appropriate command on this particular operating system for the desired information. The continue statement tells bash to skip to the next iteration of the loop.

10

If we do have an appropriate action to take, this section of code will print the specified XML tag on either side of the invocation of the specified command. Notice that we just invoke the command by retrieving the value of the variable CMD.

11

Here we isolate the Linux command from a line of our input file by removing all the characters to the right of the vertical bar, including the bar itself. The %% says to make the longest match possible on the right side of the variable’s value and remove it from the value it returns (i.e., ALINE isn’t changed).

12

Here the # removes the shortest match from the left-hand side of the variable’s value. Thus, it removes the Linux command that was just put in LCMD.

13

Again we remove everything to the right of the vertical bar but this time we are working with REST, modified in the previous statement. This gives us the MSWindows command.

14

Here we extract the XML tag using the same substitution operations we’ve seen twice already.

All that’s left in this function is the decision, based on the operating system type, as to which value to return as the value in CMD. All variables are “global” unless explicitly declared as local within a function. None of ours are local, so they can be used (set, changed, or used) throughout the script.
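
To see those expansions at work outside the script, you can walk a sample line from cmds.txt through them by hand:

$ ALINE='route            |route print |nwroute    |routing table'
$ echo "${ALINE%%|*}"      # the Linux command (LCMD)
route
$ REST="${ALINE#*|}"
$ echo "${REST%%|*}"       # the MSWindows command (WCMD)
route print
$ REST="${REST#*|}"
$ echo "${REST%%|*}"       # the XML tag (TAG)
nwroute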

When running this script you can use the cmds.txt file as shown or change its values to get whatever set of information you want to collect. You can also run it without redirecting the input from a file; simply type (or copy/paste) the input once the script is invoked.

Gathering the Windows Registry

The Windows Registry is a vast repository of settings that define how the system and applications will behave. Specific registry key values can often be used to identify the presence of malware and other intrusions. Because of that a copy of the registry is useful when later performing analysis of the system.

To export the entire Windows Registry to a file:

regedit //E ${HOSTNAME}_reg.bak

Note that two forward slashes are used before the E option because we are calling regedit from Git Bash; only one would be needed if using the Windows command prompt. We use ${HOSTNAME} as part of the output file name to make it easier to organize later on.

If needed, the reg command can also be used to export sections of the registry or individual subkeys. To export the HKEY_LOCAL_MACHINE hive:

reg export HKEY_LOCAL_MACHINE $(uname -n)_hklm.bak

Searching the File System

The ability to search the system is critical for everything from organizing files, to incident response, to forensic investigation. The find and grep commands are extremely powerful and can be used to perform a variety of search functions.

Searching by Filename

Searching by filename is one of the most basic search methods. This is useful if the exact filename is known, or a portion of the filename is known. To search the /home directory and subdirectories for filenames containing the word password:

find /home -name '*password*'

Note that the * character at the beginning and end of the search string is a wildcard, meaning it will match any (or no) characters. This is a shell pattern and is not the same as a regular expression. Additionally, you can use the -iname option instead of -name to make the search case-insensitive.

Tip

If you want to suppress errors, such as Permission Denied, when using find you can do so by redirecting stderr to /dev/null or to a log file.

find /home -name '*password*' 2>/dev/null

Searching for Hidden Files

Hidden files are often interesting as they can be used by people or malware looking to avoid detection. In Linux, names of hidden files begin with a period. To find hidden files in the /home directory and subdirectories:

find /home -name '.*'
Tip

The .* in the example above is a shell pattern which is not the same as a regular expression. In the context of find the pattern provided will match on any file that begins with a period and is followed by any number of additional characters (denoted by the * wildcard character).

In Windows, hidden files are designated by a file attribute, not the filename. From the Windows command prompt you can identify hidden files on the c:\ drive by:

dir c:\ /S /A:H

The /S option tells dir to recursively traverse subdirectories and the /A:H displays files with the hidden attribute. Unfortunately Git Bash intercepts the dir command and instead executes ls, which means it cannot easily be run from bash. This can be solved by using the find command’s -exec option coupled with the Windows attrib command.

The find command has the ability to run a specified command for each file that is found. To do that, use the -exec option after specifying your search criteria. find replaces any curly brackets ({}) with the pathname of the file that was found, and the escaped semicolon terminates the command expression.

$ find /c -exec attrib '{}' \; | egrep '^.{4}H.*'

A   H                C:\Users\Bob\scripts\hist.txt
A   HR               C:\Users\Bob\scripts\winlogs.sh

The find command will execute the Windows attrib command for each file it identifies on the c:\ drive (denoted as /c), thereby printing out each file’s attributes. The egrep command is then used with a regular expression to identify lines where the 5th character is the letter H, which will be true if the file’s hidden attribute is set.

If you want to clean up the output further and only display the file path you can do so by piping the output of egrep into the cut command.

$ find . -exec attrib '{}' \; | egrep '^.{4}H.*' | cut -c22-

C:\Users\Bob\scripts\hist.txt
C:\Users\Bob\scripts\winlogs.sh

The -c option tells cut to use character position numbers for slicing. 22- tells cut to begin at character 22, which is the beginning of the file path, and continue to the end of the line (-). This can be useful if you want to pipe the file path into another command for further processing.
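
For example, to simply count how many hidden files were found, you could send the cleaned-up paths into wc; for the two files shown above, this would print 2:

find /c -exec attrib '{}' \; | egrep '^.{4}H.*' | cut -c22- | wc -l

The same pattern works for any downstream command that expects one pathname per line.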

Searching by File Size

The find command’s -size option can be used to find files based on file size. This can be useful to help identify unusually large files, or to identify the largest or smallest files on a system.

To search for files greater than 5 GB in size in the /home directory and subdirectories:

find /home -size +5G

To identify the largest files in the system you can combine find with a few other commands:

find / -type f -exec ls -s '{}' \; | sort -n -r | head -5

First we use find / -type f to list all of the files in and under the root directory. Each file is passed to ls -s which will identify its size in blocks (not bytes). The list is then sorted from highest to lowest, and the top five are displayed using head. To see the smallest files in the system tail can be used in place of head, or you can remove the reverse (-r) option from sort.

Tip

In the shell you can use !! to represent the last command that was executed. You can use this to execute a command again, or include it in a series of piped commands. For example, suppose you just ran the following command:

find / -type f -exec ls -s '{}' \;

You can then use !! to run that command again or feed it into a pipeline.

!! | sort -n -r | head -5

The shell will automatically replace !! with the last command that was executed.

Give it a try!

You can also use the ls command directly to find the largest files and eliminate the use of find entirely, which is significantly more efficient. To do that, just add the -R option to ls, which will cause it to recursively list the files under the specified directory.

ls / -R -s | sort -n -r | head -5

Searching by Time

The file system can also be searched based on when files were last accessed or modified. This can be useful when investigating incidents to identify recent system activity. It can also be useful for malware analysis to identify files that have been accessed or modified during program execution.

To search for files in the /home directory and subdirectories modified less than 5 minutes ago:

find /home -mmin -5

To search for files modified less than 24 hours ago:

find /home -mtime -1

The number specified with the -mtime option is a multiple of 24 hours, so 1 means 24 hours, 2 means 48 hours, and so on. A negative number means “less than” the number specified, a positive number means “greater than”, and an unsigned number means “exactly”.

To search for files modified more than 2 days, i.e., 48 hours, ago:

find /home -mtime +2

To search for files accessed less than 24 hours ago use the -atime option:

find /home -atime -1

To search for files in the /home directory accessed less than 24 hours ago and copy (cp) each file to the current working directory (./):

find /home -type f -atime -1 -exec cp '{}' ./ \;

The use of -type f tells find to match only ordinary files, ignoring directories and other special file types. You may also copy the files to any directory of your choosing by replacing the ./ with an absolute or relative path.

Warning

Be sure that your current working directory is not somewhere in the /home hierarchy, or the copies themselves will be found and copied again.

Searching for Content

The grep command can be used to search for content inside of files. To search for files in the /home directory and subdirectories that contain the string password:

grep -r -i /home -e 'password'

The -r option recursively searches all directories below /home, -i specifies a case-insensitive search, and -e specifies the regex pattern string to search for.

Tip

The -n option can be used to identify the line number on which the search string is found, and -w can be used to match only whole words.
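
For example:

grep -r -i -n -w /home -e 'password'

Each match is then prefixed with the filename and the line number where the whole word password was found.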

You can combine grep with find to easily copy matching files to your current working directory (or any specified directory):

find /home -type f -exec grep '{}' -e 'password' \; -exec cp '{}' ./ \;

First we use find /home -type f to identify all of the files in and below the /home directory. Each file found is passed to grep to search its content for password. Each file matching the grep criteria is then passed to the cp command to copy the file to the current working directory (./). This combination of commands may take a considerable amount of time to execute and is a good candidate to run as a background task.

Searching by File Type

Searching a system for specific file types can be challenging. You cannot rely on the file extension, if one even exists, as that can be manipulated by the user. Thankfully the file command can help identify types by comparing the contents of a file to known patterns called Magic Numbers. Table 4-4 lists common Magic Numbers and their starting location inside of files.

Table 4-4. Magic Numbers
File Type                        Magic Number Pattern (Hex)   Magic Number Pattern (ASCII)   File Offset (Bytes)
JPEG                             FF D8 FF DB                  ÿØÿÛ                           0
DOS Executable                   4D 5A                        MZ                             0
Executable and Linkable Format   7F 45 4C 46                  .ELF                           0
Zip File                         50 4B 03 04                  PK..                           0

To begin, you need to identify the type of file for which you want to search. Let’s assume you want to find all of the PNG image files on the system. First you would take a known good file such as Title.png, run it through the file command, and examine the output.

$ file Title.png

Title.png: PNG image data, 366 x 84, 8-bit/color RGBA, non-interlaced

As expected file identifies the known good Title.png file as PNG image data and also provides the dimensions and various other attributes. Based on this information you need to determine what part of the file command output to use for the search, and generate the appropriate regular expression. In many cases, such as with forensic discovery, you are likely better off gathering more information than less; you can always further filter the data later. To do that you will use a very broad regular expression that will simply search for the word PNG in the output from the file command.

.*PNG.*

You can of course make more advanced regular expressions to identify specific files. For example, if you wanted to find PNG files that have a dimension of 100 x 100:

.*PNG.* 100 x 100.*

If you want to find PNG and JPEG files:

.*(PNG|JPEG).*

Once you have the regular expression you can write a script to run the file command against every file on the system looking for a match. When a match is found typesearch.sh will print the file path to standard out.

Example 4-4. typesearch.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# typesearch.sh
#
# Description:
# Search the file system for a given file type. It prints out the
# pathname when found.
#
# Usage:
# typesearch.sh [-c dir] [-i] [-R|r] <pattern> <path>
#   -c Copy files found to dir
#   -i Ignore case
#   -R|r Recursively search subdirectories
#   <pattern> File type pattern to search for
#   <path> Path to start search
#

DEEPORNOT="-maxdepth 1"		# just the current dir; default

# PARSE option arguments:
while getopts 'c:irR' opt; do                         1
  case "${opt}" in                                    2
    c) # copy found files to specified directory
	       COPY=YES
	       DESTDIR="$OPTARG"                             3
	       ;;
    i) # ignore u/l case differences in search
	       CASEMATCH='-i'
	       ;;
    [Rr]) # recursive                                 4
        unset DEEPORNOT;;                             5
    *)  # unknown/unsupported option                  6
        # error mesg will come from getopts, so just exit
        exit 2 ;;
  esac
done
shift $((OPTIND - 1))                                 7


PATTERN=${1:-PDF document}                            8
STARTDIR=${2:-.}	# by default start here

find $STARTDIR $DEEPORNOT -type f | while read FN     9
do
    file $FN | grep -q $CASEMATCH "$PATTERN"          10
    if (( $? == 0 ))   # found one                    11
    then
	        echo $FN
	        if [[ $COPY ]]                               12
	        then
	            cp -p $FN $DESTDIR                       13
	        fi
    fi
done
1

This script supports options which alter its behavior, as described in the opening comments of the script. The script needs to parse these options to tell which ones have been provided and which have been omitted. For anything more than a single option or two it makes sense to use the getopts shell built-in. With the while loop we will keep calling getopts until it returns a non-zero value, telling us that there are no more options. The options we want to look for are provided in the string c:irR. Whichever option is found is returned in opt, the variable name we supplied.

2

We are using a case statement here, which is a multi-way branch; it will take the branch whose pattern (the text before the closing parenthesis) matches the value. We could have used an if/elif/else construct, but the case statement reads well and makes the options clearly visible.

3

The c option has a : after it in the list of supported options which indicates to getopts that the user will also supply an argument for that option. For this script that optional argument is the directory into which copies will be made. When getopts parses an option with an argument like this it puts the argument in the variable named OPTARG and we save it in DESTDIR because another call to getopts may change OPTARG.

4

The script supports either an uppercase R or a lowercase r for this option. Case statements specify a pattern to be matched, not just a simple literal, so we wrote [Rr]) for this case, using the brackets construct to indicate that either letter is considered a match.

5

The other options set variables to cause their action to occur. In this case we unset the previously set variable. When that variable is referenced later as $DEEPORNOT it will have no value so it will effectively disappear from the command line where it is used.

6

Here is another pattern, the asterisk, which matches anything. If no other pattern has been matched, this case will be executed. It is, in effect, an “else” clause for the case statement.

7

When we’re done parsing the options we can get rid of the ones we’ve already processed with a shift. A single shift gets rid of a single argument, so that the second argument becomes the first, the third becomes the second, and so on. Specifying a number, as in shift 5, will get rid of the first 5 arguments so that $6 becomes $1, $7 becomes $2, and so on. Calls to getopts keep track of the next argument to be processed in the shell variable OPTIND, so shifting by OPTIND - 1 gets rid of any and all of the options that we parsed. After this shift, $1 will refer to the first non-option argument, whether or not any options were supplied when the user invoked the script.
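
To see that bookkeeping in isolation, here is a toy script (call it optdemo.sh; the name is just for illustration) that echoes what getopts and OPTIND are doing:

#!/bin/bash -
while getopts 'c:irR' opt; do
    echo "found option: $opt (OPTIND is now $OPTIND)"
done
shift $((OPTIND - 1))
echo "first non-option argument: $1"

Invoked as bash optdemo.sh -i -c /tmp 'PNG' /home, it should print something like this:

found option: i (OPTIND is now 2)
found option: c (OPTIND is now 4)
first non-option argument: PNG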

8

The two possible arguments that aren’t -option format are the pattern we’re searching for and the directory where we want to start our search. When we refer to a bash variable we can add a :- to say “if that value is empty or unset then return this default value instead”. We give a default value for PATTERN as PDF document and the default for STARTDIR is . which refers to the current directory.

9

We invoke the find command, telling it to start its search in $STARTDIR. Remember that $DEEPORNOT may be unset, and thus add nothing to the command line, or it may be the default -maxdepth 1, telling find not to go any deeper than this directory. We’ve added -type f so that we only find plain files (not directories, special device files, or FIFOs). That isn’t strictly necessary, and you could remove it if you want to be able to search for those kinds of files. The names of the files found are piped into the while loop, which will read them one at a time into the variable FN.

10

The -q option to grep tells it to be quiet and not output anything. We don’t need to see what phrase it found, only that it found it.

11

The $? construct is the value returned by the previous command. A successful result means that grep found the pattern supplied.
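
You can see the same return-value behavior directly at the command line; on a typical Linux system the first grep succeeds and the second does not:

$ grep -q root /etc/passwd ; echo $?
0
$ grep -q nosuchuser /etc/passwd ; echo $?
1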

12

This checks to see if COPY has a value. If it is null the if will be false.

13

The -p option to the cp command will preserve the mode, ownership and timestamps of the file, in case that information is important to your analysis.

If you are looking for a lighter weight but less capable solution you can perform a similar search using the find command’s exec option as seen in the example below.

find / -type f -exec file '{}' \; | egrep '.*PNG.*' | cut -d' ' -f1

Here we send each item found by the find command into file to identify its type. We then pipe the output of file into egrep and filter it looking for the PNG keyword. The use of cut is simply to clean up the output and make it more readable.

Warning

Be cautious if using the file command on an untrusted system. The file command uses the magic pattern file located at /usr/share/misc/. A malicious user could modify this file such that certain file types would not be identified. A better option is to mount the suspect drive to a known-good system and search from there.

Searching by Message Digest Value

A cryptographic hash function is a one-way function that transforms an input message of arbitrary length into a fixed length message digest. Common hash algorithms include MD5, SHA-1, and SHA-256. Take the following two files:

Example 4-5. hashfilea.txt
This is hash file A
Example 4-6. hashfileb.txt
This is hash file B

Notice that the files are identical except for the last letter in the sentence. You can use the sha1sum command to compute the SHA-1 message digest of each file.

$ sha1sum hashfilea.txt hashfileb.txt

6a07fe595f9b5b717ed7daf97b360ab231e7bbe8 *hashfilea.txt
2959e3362166c89b38d900661f5265226331782b *hashfileb.txt

Even though there was only a small difference between the two files, they generated completely different message digests. Had the files been the same, the message digests would have also been the same. You can use this property of hashing to search the system for a specific file if you know its digest. The advantage is that the search will not be influenced by the filename, location, or any other attributes; the disadvantage is that the files need to be exactly the same. If the file contents have changed in any way, the search will fail.

Example 4-7. hashsearch.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# hashsearch.sh
#
# Description:
# Recursively search a given directory for a file that
# matches a given SHA-1 hash
#
# Usage:
# hashsearch.sh <hash> <directory>
#   hash - SHA-1 hash value to file to find
#   directory - Top directory to start search
#

HASH=$1
DIR=${2:-.}	# default is here, cwd

# convert pathname into an absolute path
function mkabspath ()				6
{
    if [[ $1 == /* ]]				7
    then
    	ABS=$1
    else
    	ABS="$PWD/$1"				8
    fi
}

find $DIR -type f |				1
while read fn
do
    THISONE=$(sha1sum "$fn")			2
    THISONE=${THISONE%% *}			3
    if [[ $THISONE == $HASH ]]
    then
	mkabspath "$fn"				4
	echo $ABS				5
    fi
done
1

We’ll look for any plain file for our hash. We need to avoid special files - reading a FIFO would cause our program to hang as it waited for someone to write into the FIFO. Reading a block special or character special file would also not be a good idea. The -type f assures that we only get plain files. It prints those filenames, one per line, to stdout which we redirect via a pipe into the while read commands.

2

This computes the hash value in a subshell and captures its output (i.e., whatever it writes to stdout) and assigns it to the variable. The quotes are needed in case the filename has spaces in its name.

3

This reassignment removes from the right hand side the largest substring beginning with a space. The output from sha1sum is both the computed hash and the filename. We only want the hash value, so we remove the filename with this substitution.

4

We call the mkabspath function putting the filename in quotes. The quotes make sure that the entire filename shows up as a single argument to the function, even if the filename has one or more spaces in the name.

5

Remember that shell variables are global unless declared to be local within a function. Therefore the value of ABS that was set in the call to mkabspath is available to us here.

6

This is our declaration of the function. When declaring a function you can omit either the keyword function or the parentheses but not both.

7

For the comparison we are using shell pattern matching on the right hand side. This will check to see if the first parameter begins with a slash. If it does, then this is already an absolute pathname and we need do nothing further.

8

When the parameter is only a relative path, it is relative to the current location so we pre-pend the current working directory thereby making it absolute. The variable PWD is a shell variable that is set to the current directory via the cd command.
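
For example, if you paste the mkabspath function into an interactive shell and the current directory happens to be /home/bob (a hypothetical path), a relative name is expanded like this, while an absolute name such as /etc/passwd would be left unchanged:

$ mkabspath 'scripts/hist.txt'
$ echo "$ABS"
/home/bob/scripts/hist.txt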

Transferring Data

Once you have gathered all of the desired data, the next step is to move it off of the origin system for further analysis. To do that you can copy the data to a removable device or upload it to a centralized server. If you are going to upload the data be sure to do so using a secure method such as Secure Copy (SCP). The example below uses scp to upload the file some_system.tar.gz to the home directory of user bob on remote system 10.0.0.45.

scp some_system.tar.gz bob@10.0.0.45:/home/bob/some_system.tar.gz

For convenience you can add a line at the end of your collection scripts to automatically use scp to upload data to a specified host. Remember to give your files unique names so as not to overwrite existing files and to make later analysis easier.
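
For example, a line like the following could be appended to getlocal.sh. The user, address, and destination directory are placeholders you would replace, and it assumes key-based authentication has already been set up:

scp "${HOSTNM}.info" bob@10.0.0.45:/home/bob/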

Warning

Be cautious of how you perform SSH or SCP authentication within scripts. It is not recommended that you include passwords in your scripts. The preferred method is to use SSH certificates. The keys and certificates can be generated using the ssh-keygen command.

Summary

Gathering data is an important step in defensive security operations. When collecting data be sure to transfer and store it using secure methods (i.e. encrypted). As a general rule, gather all data that you think is relevant; you can easily delete data later, but you cannot analyze data you did not collect. Before collecting data, first confirm you have permission and/or legal authority to do so.

Also be aware that when dealing with adversaries, they will often try to hide their presence by deleting or obfuscating data. To counter that be sure to use multiple methods when searching for files (name, hash, contents, etc).

In the next chapter we will explore techniques for processing data and preparing it for analysis.

Exercises

  1. Write the command to search the file system for any file named dog.png.

  2. Write the command to search the file system for any file containing the text confidential.

  3. Write the command to search the file system for any file containing the text secret or confidential and copy the file to your current working directory.

  4. Write the command to execute ls -R / on the remote system 192.168.10.32 and write the output to a file named filelist.txt on your local system.

  5. Modify getlocal.sh to automatically upload the results to a specified server using SCP.

  6. Modify hashsearch.sh to have an option (-1) to quit after finding a match. If the option is not specified, it will keep searching for additional matches.

  7. Modify hashsearch.sh to simplify the full pathname that it prints out.

    1. If the string it output was /home/usr07/subdir/./misc/x.data modify it to remove the redundant ./ before printing it out.

    2. If the string was /home/usr/07/subdir/../misc/x.data modify it to remove the ../ and also the subdir/ before printing it out.

  8. Modify winlogs.sh to indicate its progress by printing the logfile name over the top of the previous logfile name. (Hint: use a return character rather than a newline)

  9. Modify winlogs.sh to show a simple progress bar of plus signs building from left to right. Use a separate invocation of wevtutil el to get the count of the number of logs and scale this to, say, a width of 60.

Chapter 5. Data Processing

In the previous chapter you gathered lots of data. Likely that data is in a variety of formats including free-form text, comma separated values (CSV), and Extensible Markup Language (XML). In this chapter we show you how to parse and manipulate that data so you can extract key elements for analysis.

Commands in Use

We introduce awk, join, sed, tail, and tr to prepare data for analysis.

awk

Awk is not just a command, but actually a programming language designed for processing text. There are entire books dedicated to this subject. Awk will be explained in more detail throughout this book, but here we provide just a brief example of its usage.

Common Command Options

-f

Read in the awk program from a specified file

Command Example

Take the file awkusers.txt:

Example 5-1. awkusers.txt
Mike Jones
John Smith
Kathy Jones
Jane Kennedy
Tim Scott

You can use awk to print each line where the user’s last name is Jones.

$ awk '$2 == "Jones" {print $0}' awkusers.txt

Mike Jones
Kathy Jones

Awk will iterate through each line of the input file reading in each word (separated by whitespace by default) into fields. Field $0 represents the entire line, $1 the first word, $2 the second word, etc. An awk program consists of patterns and corresponding code to be executed when that pattern is matched. In this example there is only one pattern. We test $2 to see if that field is equal to Jones. If it is, awk will run the code in the braces which, in this case, will print the entire line.
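
Building on the same file, you can print a single field instead of the whole line; for example, just the first names of the Joneses:

$ awk '$2 == "Jones" {print $1}' awkusers.txt

Mike
Kathy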

Note

If we left off the explicit comparison and instead wrote awk ' /Jones/ {print $0}' then the string inside the slashes is a regular expression to match anywhere in the input line. It would print all the names as before, but it would also find lines where Jones might be the first name or part of a longer name (such as “Jonestown”).

join

Join combines the lines of two files that share a common field. In order for join to function properly the input files must be sorted.

Common Command Options

-j

Join using the specified field number. Fields start at 1.

-t

Specify the character to use as the field separator. Space is the default field separator.

--header

Use the first line of each file as a header.

Command Example

Take the following files:

Example 5-2. usernames.txt
1,jdoe
2,puser
3,jsmith
Example 5-3. accesstime.txt
0745,file1.txt,1
0830,file4.txt,2
0830,file5.txt,3

Both files share a common field of data, which is the user ID. In accesstime.txt the user ID is in the third column. In usernames.txt the user ID is in the first column. You can merge these two files using join as follows:

$ join -1 3 -2 1 -t, accesstime.txt usernames.txt

1,0745,file1.txt,jdoe
2,0830,file4.txt,puser
3,0830,file5.txt,jsmith

The -1 3 option tells join to use the third column in the first file (accesstime.txt), and -2 1 specifies the first column in the second file (usernames.txt) for use when merging the files. The -t, option specifies the comma character as the field delimiter.

sed

Sed allows you to perform edits, such as replacing characters, on a stream of data.

Common Command Options

-i

Edit the specified file and overwrite in place

Command Example

The sed command is quite powerful and can be used for a variety of functions; however, replacing characters or sequences of characters is one of the most common. Take the file ips.txt:

Example 5-4. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS

You can use sed to replace all of the instances of the 10.0.4.35 IP address with 10.0.4.27.

$ sed 's/10\.0\.4\.35/10.0.4.27/g' ips.txt

ip,OS
10.0.4.2,Windows 8
10.0.4.27,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS

In this example, sed uses the following format with each component separated by a forward slash:

s/<regular expression>/<replace with>/<flags>

The first part of the command (s) tells sed to substitute. The second part of the command (10\.0\.4\.35) is a regular expression pattern. The third part (10.0.4.27) is the value with which to replace the regex pattern matches. The fourth part is optional flags, which in this case (g, for global) tells sed to replace all instances on a line (not just the first) that match the regex pattern.

tail

The tail command is used to output the last lines of a file. By default tail will output the last 10 lines of a file.

Common Command Options

-f

Continuously monitor the file and output lines as they are added

-n

Output the number of lines specified

Command Example

To output the last line in the somefile.txt file:

$ tail -n 1 somefile.txt

12/30/2017 192.168.10.185 login.html

tr

The tr command is used to translate or map from one character to another. It is also often used to delete unwanted or extraneous characters. It only reads from stdin and writes to stdout so you typically see it with redirects for the input and output files.

Common Command Options

-d

delete the specified characters from the input stream

-s

squeeze, that is, replace repeated instances of a character with a single instance

Command Example

You can translate all the backslashes into forward slashes and all the colons to vertical bars with the tr command:

tr '\\:'  '/|' < infile.txt  > outfile.txt

If the contents of infile.txt looked like this:

drive:path\name
c:\Users\Default\file.txt

then after running the tr command, outfile.txt would contain this:

drive|path/name
c|/Users/Default/file.txt

The characters from the first argument are mapped to the corresponding characters in the second argument. Two backslashes are needed to specify a single backslash character because the backslash has a special meaning to tr; it is used to indicate special characters like newline \n, return \r, or tab \t. You use the single quotes around the arguments to avoid any special interpretation by bash.

Tip

Files from Windows systems often come with both a Carriage Return and a Line Feed (CR & LF) character at the end of each line. Linux and macOS systems will have only the newline character to end a line. If you transfer a file to Linux and want to get rid of those extra return characters, here is how you might do that with the tr command:

tr -d '\r' < fileWind.txt  > fileFixed.txt

Conversely, you can convert Linux line endings to Windows line endings using sed:

$ sed -i 's/$/\r/' fileLinux.txt

The -i option makes the changes in place and writes them back to the input file.

Processing Delimited Files

Many of the files you will collect and process are likely to contain text, which makes the ability to manipulate text from the command line a critical skill. Text files are often broken into fields using a delimiter such as a space, tab, or comma. One of the more common formats is known as Comma Separated Values (CSV). As the name indicates, CSV files are delimited using commas, and fields may or may not be surrounded in double quotes ("). The first line of a CSV file is often the field headers. Here is an example:

Example 5-5. csvex.txt
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4

To extract just the name from the file you can use cut by specifying the field delimiter as a comma and the field number you would like returned.

$ cut -d',' -f1 csvex.txt

"name"
"John Smith"
"Jane Smith"
"Bill Jones"

Note that the field values are still enclosed in double quotations. This may not be desirable for certain applications. To remove the quotations you can simply pipe the output into tr with its -d option.

$ cut -d',' -f1 csvex.txt | tr -d '"'

name
John Smith
Jane Smith
Bill Jones

You can further process the data by removing the field header using the tail command’s -n option.

$ cut -d',' -f1 csvex.txt | tr -d '"' | tail -n +2

John Smith
Jane Smith
Bill Jones

The -n +2 option tells tail to output the contents of the file starting at line number 2, thus removing the field header.

Tip

You can also give cut a list of fields to extract, such as -f1-3 to extract fields 1 through 3, or a list such as -f1,4 to extract fields 1 and 4.

Iterating Through Delimited Data

While you can use cut to extract entire columns of data, there are instances where you will want to process the file and extract fields line-by-line; in this case you are better off using awk.

Let’s suppose you want to check each user’s password hash in csvex.txt against the dictionary file of known passwords passwords.txt.

Example 5-6. csvex.txt
"name","username","phone","password hash"
"John Smith","jsmith","555-555-1212",5f4dcc3b5aa765d61d8327deb882cf99
"Jane Smith","jnsmith","555-555-1234",e10adc3949ba59abbe56e057f20f883e
"Bill Jones","bjones","555-555-6789",d8578edf8458ce06fbc5bb76a58c5ca4
Example 5-7. passwords.txt
password,md5hash
123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
welcome,40be4e59b9a2a2b5dffb918c0e86b3d7
ninja,3899dcbab79f92af727c2190bbd8abc5
abc123,e99a18c428cb38d5f260853678922e03
123456789,25f9e794323b453885f5181f1b624d0b
12345678,25d55ad283aa400af464c76d713c07ad
sunshine,0571749e2ac330a7455809c6b0e7af90
princess,8afa847f50a716e64932d995c8e7435a
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4

You can extract each user’s hash from csvex.txt using awk as follows:

$ awk -F "," '{print $4}' csvex.txt

"password hash"
5f4dcc3b5aa765d61d8327deb882cf99
e10adc3949ba59abbe56e057f20f883e
d8578edf8458ce06fbc5bb76a58c5ca4

By default awk uses the space character as a field delimiter, so the -F option is used to specify a custom field delimiter (,) and then print out the fourth field ($4), which is the password hash. You can then use grep to take the output from awk, one line at a time, and search for it in the passwords.txt dictionary file, outputting any matches.

$ grep "$(awk -F "," '{print $4}' csvex.txt)" passwords.txt

123456,e10adc3949ba59abbe56e057f20f883e
password,5f4dcc3b5aa765d61d8327deb882cf99
qwerty,d8578edf8458ce06fbc5bb76a58c5ca4

Processing by Character Position

If a file has fixed-width field sizes you can use the cut command’s -c option to extract data by character position. In csvex.txt the (U.S. 10-digit) phone number is an example of a fixed-width field.

$ cut -d',' -f3 csvex.txt | cut -c2-13 | tail -n +2

555-555-1212
555-555-1234
555-555-6789

Here you first use cut in delimited mode to extract the phone number at field 3. Since each phone number is the same number of characters you can use the cut character position option (-c) to extract the characters in between the quotations. Finally, tail is used to remove the file header.

Processing XML

Extensible Markup Language (XML) allows you to arbitrarily create tags and elements that describe data. Below is an example XML document.

Example 5-8. book.xml
<book title="Rapid Cybersecurity Ops" edition="1">    1
  <author>                                             2
    <firstName>Paul</firstName>                        3
    <lastName>Troncone</lastName>
  </author>                                            4
  <author>
    <firstName>Carl</firstName>
    <lastName>Albing</lastName>
  </author>
</book>
1

This is a start tag that contains two attributes, also known as name/value pairs. Attribute values must always be quoted.

2

This is a start tag.

3

This is an element that has content.

4

This is an end tag.

For useful processing, you must be able to search through the XML and extract data from within the tags, which can be done using grep. Let’s find all of the firstName elements. The -o option is used so that only the text matching the regex pattern is returned, rather than the entire line.

$ grep -o '<firstName>.*<\/firstName>' book.xml

<firstName>Paul</firstName>
<firstName>Carl</firstName>

Note that the regex pattern above will only find the XML element if the start and end tags are on the same line. To find the pattern across multiple lines you need to make use of two special features. First, add the -z option to grep, which treats newlines like any ordinary character in its searching and adds a null (ASCII 0) at the end of each string it finds. Then add the -P option and (?s) to the regex pattern, which is a Perl-specific pattern match modifier. It modifies the . metacharacter to also match on the newline character.

$ grep -Pzo '(?s)<author>.*?<\/author>' book.xml

<author>
  <firstName>Paul</firstName>
  <lastName>Troncone</lastName>
</author><author>
  <firstName>Carl</firstName>
  <lastName>Albing</lastName>
</author>
Warning

The -P option is not available for all versions of grep including those included with macOS.

To strip the XML start and end tags and extract the content you can pipe your output into sed.

$ grep -Po '<firstName>.*?<\/firstName>' book.xml | sed 's/<[^>]*>//g'

Paul
Carl

The sed expression can be described as s/expr/other/ to replace (or substitute) some expression (expr) with something else (other). The expression can be just literal characters or a more complex regex. If an expression has no “other” portion, such as s/expr//, then it replaces anything that matches the regular expression with nothing, essentially removing it. The regex pattern we use in the above example, namely the <[^>]*> expression, is a little confusing, so let’s break it down.

< - The pattern begins with a literal less-than character <

[^>]* - Zero or more (indicated by the asterisk) characters from the set of characters inside the brackets; the first character is a ^ which means “not” any of the remaining characters listed. Here that’s just the solitary greater-than character, so [^>] matches any character that is not >

> - The pattern ends with a literal >

This should match a single XML tag, from its opening less-than to its closing greater-than character, but not more than that.

Processing JSON

JavaScript Object Notation (JSON) is another popular file format, particularly for exchanging data through Application Programming Interfaces (APIs). JSON is a simple format that consists of objects, arrays, and name/value pairs. Here is a sample JSON file:

Example 5-9. book.json
{                                         1
  "title": "Rapid Cybersecurity Ops",     2
  "edition": 1,
  "authors": [                            3
    {
      "firstName": "Paul",
      "lastName": "Troncone"
    },
    {
      "firstName": "Carl",
      "lastName": "Albing"
    }
  ]
}
1

This is an object. Objects begin with { and end with }.

2

This is a name/value pair. Values can be a string, number, array, boolean, or null.

3

This is an array. Arrays begin with [ and end with ].

Tip

For more information on the JSON format visit http://json.org/

When processing JSON you are likely going to want to extract key/value pairs. To do that you can use grep. Let’s extract the firstName key/value pair from book.json.

$ grep -o '"firstName": ".*"' book.json

"firstName": "Paul"
"firstName": "Carl"

Again, the -o option is used to return only the characters that match the pattern rather than the entire line of the file.

If you want to remove the key and only display the value you can do so by piping the output into cut, extracting the second field, and removing the quotations with tr.

$ grep -o '"firstName": ".*"' book.json | cut -d " " -f2 | tr -d '\"'

Paul
Carl

We will perform more advanced processing of JSON in a later chapter.

Aggregating Data

Data is often collected from a variety of sources, and in a variety of files and formats. Before you can analyze the data you must get it all into the same place and in a format that is conducive to analysis.

Suppose you want to search a treasure trove of data files for any system named ProductionWebServer. Recall that in previous scripts we wrapped our collected data in XML tags of the format <systeminfo host="">. During collection we also named our files using the hostname. You can now use either of those attributes to find and aggregate the data into a single location.

find /data -type f -exec grep '{}' -e 'ProductionWebServer' \; \
-exec cat '{}' >> ProductionWebServerAgg.txt \;

The command find /data -type f lists all of the files in the /data directory and its subdirectories. For each file found, it runs grep looking for the string ProductionWebServer. If found, the file is appended (>>) to the file ProductionWebServerAgg.txt. Replace the cat command with cp and a directory location if you would rather copy all of the files to a single location rather than to a single file.

You can also use the join command to take data that is spread across two files and aggregate it into one. Take the two files seen in Example 5-10 and Example 5-11.

Example 5-10. ips.txt
ip,OS
10.0.4.2,Windows 8
10.0.4.35,Ubuntu 16
10.0.4.107,macOS
10.0.4.145,macOS
Example 5-11. user.txt
user,ip
jdoe,10.0.4.2
jsmith,10.0.4.35
msmith,10.0.4.107
tjones,10.0.4.145

The files share a common column of data, which is the IP addresses. Because of that the files can be merged using join.

$ join -t, -2 2 ips.txt user.txt

ip,OS,user
10.0.4.2,Windows 8,jdoe
10.0.4.35,Ubuntu 16,jsmith
10.0.4.107,macOS,msmith
10.0.4.145,macOS,tjones

The -t, option tells join that the columns are delimited using a comma; by default it uses a space character.

The -2 2 option tells join to use the second column of data in the second file (user.txt) as the key to perform the merge. By default join uses the first field as the key, which is appropriate for the first file (ips.txt). If you needed to join using a different field in ips.txt you would just add the option -1 n where n is replaced by the appropriate column number.

Warning

In order to use join both files must already be sorted by the column you will use to perform the merge. To do this you can use the sort command which is covered in Chapter 6.
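
For example, if user.txt were not already ordered by its IP address column, a sketch of the fix would be to sort it on field 2 before joining; the -o option writes the result back to the same file. (In practice you would also need to keep the header row out of the sort, or use the --header option of join.)

sort -t, -k2 user.txt -o user.txt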

Summary

In this chapter we explored ways to process common data formats including delimited, positional, JSON, and XML. The vast majority of data you collect and process will be in one of those formats.

In the next chapter we will look at how data can be analyzed and transformed into information that will provide insights into system status and drive decision making.

Exercises

  1. Given the file tasks.txt below, use the cut command to extract columns 1 (Image Name), 2 (PID), and 5 (Mem Usage).

    Image Name;PID;Session Name;Session#;Mem Usage
    System Idle Process;0;Services;0;4 K
    System;4;Services;0;2,140 K
    smss.exe;340;Services;0;1,060 K
    csrss.exe;528;Services;0;4,756 K
  2. Given the file procowner.txt below, use the join command to merge the file with tasks.txt.

    Process Owner;PID
    jdoe;0
    tjones;4
    jsmith;340
    msmith;528
  3. Use the tr command to replace all of the semicolon characters in tasks.txt with the tab character and print it to the screen.

  4. Write a command that extracts the first and last names of all of the authors in book.json.

Chapter 6. Data Analysis

In the previous chapters we used scripts to collect data and prepare it for analysis. Now we need to make sense of it all. When analyzing large amounts of data it often helps to start broad and continually narrow the search as new insights are gained into the data.

In this chapter we use the data from web server logs as input into our scripts. This is simply for demonstration purposes. The scripts and techniques can easily be modified to work with nearly any type of data.

We will use an Apache web server access log for most of the examples in this chapter. This type of log records page requests made to the web server, when they were made, and who made them. A sample of a typical log entry can be seen below. The full log file will be referenced as access.log in this book and can be downloaded at https://www.rapidcyberops.com.

Example 6-1. Sample from access.log
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /request-quote.html HTTP/1.1" 200
7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:56.0)
Gecko/20100101 Firefox/56.0"
Note

Web server logs are used simply as an example. The techniques introduced throughout this chapter can be applied to analyze a variety of data types.

The Apache web server log fields are broken out in Table 6-1.

Table 6-1. Apache Web Server Combined Log Format Fields
Field                                 Description                                             Field Number
192.168.0.11                          IP address of the host that requested the page          1
-                                     RFC 1413 Ident protocol identifier (- if not present)   2
-                                     The HTTP authenticated user ID (- if not present)       3
[12/Nov/2017:15:54:39 -0500]          Date, time, and GMT offset (timezone)                   4 - 5
GET /request-quote.html               The page that was requested                             6 - 7
HTTP/1.1                              The HTTP protocol version                               8
200                                   The status code returned by the web server              9
7326                                  The size of the file returned in bytes                  10
http://192.168.0.35/support.html      The referring page                                      11
Mozilla/5.0 (Windows NT 6.3; Win64…   User agent identifying the browser                      12+

Note that there is a second type of Apache access log known as the Common Log Format. The format is the same as the Combined Log Format except it does not contain fields for the referring page or user agent. See https://httpd.apache.org/docs/2.4/logs.html for additional information on the Apache log format and configuration.

The Hypertext Transfer Protocol (HTTP) status codes mentioned above are often very informational and let you know how the web server responded to any given request. Common codes are seen in Table 6-2:

Table 6-2. HTTP Status Codes
Code   Description
200    OK
401    Unauthorized
404    Page Not Found
500    Internal Server Error
502    Bad Gateway

Tip

For a complete list of codes see the Hypertext Transfer Protocol (HTTP) Status Code Registry at https://www.iana.org/assignments/http-status-codes

Commands in use

We introduce sort, head, and uniq to limit the data we need to process and display. The following file will be used for command examples:

Example 6-2. file1.txt
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html

sort

The sort command is used to rearrange a text file into numerical and alphabetical order. By default sort will arrange lines in ascending order starting with numbers and then letters. Uppercase letters will be placed before their corresponding lowercase letter unless otherwise specified.

Common Command Options

-r

Sort in descending order

-f

Ignore case

-n

Use numerical ordering, so that 1, 2, 3 all sort before 10 (in the default alphabetical sorting, 2 and 3 would appear after 10).

-k

Sort based on a subset of the data (key) in a line. Fields are delimited by whitespace.

-o

Write output to a specified file.

Command Example

To sort file1.txt by the file name column (the third field) and ignore the date and IP address columns you would use the following:

sort -k 3 file1.txt

You can also sort on a subset of a field. To sort by the 2nd octet in the IP address:

sort -k 2.5,2.7 file1.txt

This will sort using characters 5 through 7 of the second field.

uniq

The uniq command filters out duplicate lines of data that occur adjacent to one another. To remove all duplicate lines in a file be sure to sort it before using uniq.

Common Command Options

-c

Print out the number of times a line is repeated.

-f

Ignore the specified number of fields before comparing. For example, -f 3 will ignore the first three fields in each line. Fields are delimited using spaces.

-i

Ignore letter case. By default uniq is case-sensitive.
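Command Example

To see how many times each line occurs in a file, sort it first and then pipe it into uniq -c. As a quick sketch (ips.txt here is a hypothetical file containing one IP address per line, with repeats):

sort ips.txt | uniq -c

The sort makes identical lines adjacent, and uniq -c then collapses them, printing each unique address once, preceded by the number of times it appeared.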

Sorting and Arranging Data

When analyzing data for the first time it is often beneficial to start by looking at the extremes: the things that occurred the most or least frequently, the smallest or largest data transfers, and so on. For example, consider the data you can collect from web server log files. An unusually high number of page accesses could indicate scanning activity or a denial-of-service attempt. An unusually high number of bytes downloaded by a host could indicate site cloning or data exfiltration.

To do that you can use the sort, head, and tail commands at the end of a pipeline such as:

…   | sort -k 2.1 -rn | head -15

which pipes the output of a script into the sort command and then pipes that sorted output into head, which will print the top 15 (in this case) lines. The sort command here is using as its sort key (-k) the second field beginning at its first character (2.1). Moreover, it is doing a reverse sort (-r) and the values will be sorted like numbers (-n). Why a numerical sort? So that 2 shows up between 1 and 3 and not between 19 and 20 (which is alphabetical order).

By using head we take the first lines of the output. We could get the last few lines by piping the output from the sort command into tail instead of head. Using tail -15 would give us the last 15 lines. The other way to do this would be to simply remove the -r option on sort so that it does an ascending rather than descending sort.
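For example, either of these variants of the pipeline above (sketches, not tied to a particular script) would show the 15 smallest values rather than the largest:

…   | sort -k 2.1 -rn | tail -15
…   | sort -k 2.1 -n | head -15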

Counting Occurrences in Data

A typical web server log can contain tens of thousands of entries. By counting each time a page was accessed, or by which IP address it was accessed from you can gain a better understanding of general site activity. Interesting entries can include:

  • A high number of requests returning the 404 (Page Not Found) status code for a specific page; this can indicate broken hyperlinks.

  • A high number of requests from a single IP address returning the 404 status code; this can indicate probing activity looking for hidden or unlinked pages.

  • A high number of requests returning the 401 (Unauthorized) status code, particularly from the same IP address; this can indicate an attempt at bypassing authentication, such as brute-force password guessing.

To detect this type of activity we need to be able to extract key fields, such as the source IP address, and count the number of times they appear in a file. To accomplish this we will use the cut command to extract the field and then pipe the output into our new tool countem.sh.

Example 6-3. countem.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# countem.sh
#
# Description:
# Count the number of instances of an item using bash
#
# Usage:
# countem.sh < inputfile
#

declare -A cnt        # assoc. array             1
while read id xtra                               2
do
    let cnt[$id]++                               3
done
# now display what we counted
# for each key in the (key, value) assoc. array
for id in "${!cnt[@]}"                           4
do
    printf '%d %s\n'  "${cnt[$id]}"  "$id"       5
done

And here is another version, this time using awk:

Example 6-4. countem.awk
# Rapid Cybersecurity Ops
# countem.awk
#
# Description:
# Count the number of instances of an item using awk
#
# Usage:
# countem.awk < inputfile
#

awk '{ cnt[$1]++ }
END { for (id in cnt) {
        printf "%d %s\n", cnt[id], id
      }
    }'
1

Since we don’t know what IP addresses (or other strings) we might encounter, we will use an associative array, declared here with the -A option, so that we can use whatever string we read as our index.

The associative array feature of bash is found in bash 4.0 and higher. In such an array, the index doesn’t have to be a number but can be any string. So you can index the array by the IP address and thus count the occurrences of that IP address. In case you’re using something older than bash 4.0, Example 6-4 is an alternate script that uses awk instead.

The array references are like others in bash, using the ${var[index]} syntax to reference an element of the array. To get all the different index values that have been used (the “keys” if you think of these arrays as (key, value) pairings), use: ${!cnt[@]}

2

While we only expect one word of input per line, we put the variable xtra there to capture any other words that appear on the line. Each variable on a read command gets assigned the corresponding word from the input (i.e., the first variable gets the first word, the second variable gets the second word, and so on), but the last variable gets any and all remaining words. On the other hand, if there are fewer words of input on a line than there are variables on the read command, then those extra variables are set to the empty string. So for our purposes, if there are extra words on the input line they’ll all be assigned to xtra, but if there are no extra words then xtra will be given the value of the null string (which won’t matter either way, because we don’t use it).

3

Here we use that string as the index and increment its previous value. For the first use of the index, the previous value will be unset, which will be taken as zero.

4

This syntax lets us iterate over all the various index values that we encountered. Note, however, that the order is not guaranteed; it depends on the hashing algorithm for the index values, so don’t expect anything like alphabetical order.

5

In printing out the value and key we put the values inside quotes so that we always get a single value for each argument - even if that value had a space or two inside it. It isn’t expected to happen with our use of this script, but such coding practices make the scripts more robust when used in other situations.

Both will work nicely in a pipeline of commands like this:

cut -d' ' -f1 logfile | bash countem.sh

or (see note 2 above) just:

bash countem.sh < logfile

For example, to count the number of times an IP address made an HTTP request that resulted in a 404 (page not found) error:

$ awk '$9 == 404 {print $1}' access.log | bash countem.sh

1 192.168.0.36
2 192.168.0.37
1 192.168.0.11

You could also use grep 404 access.log and pipe it into countem.sh, but that would include lines where 404 appears in other places (e.g., the byte count or part of a file path). The use of awk here restricts the counting to lines where the returned status (the ninth field) is 404. It then prints just the IP address (field 1) and pipes the output into countem.sh to get the total number of times each IP address made a request that resulted in a 404 error.

To begin analysis of the example access.log file you can start by looking at the hosts that accessed the web server. You can use the Linux cut command to extract the first field of the log file, which contains the source IP address, and then pipe the output into the countem.sh script. The exact command and output is seen below.

$ cut -d' ' -f1 access.log | bash countem.sh | sort -rn

111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
Tip

If you do not have countem.sh available you can use the uniq command’s -c option to achieve similar results, but it will require an extra pass through the data using sort to work properly.

$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn

111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26

Next, you can further investigate by looking at the host that made the largest number of requests, which, as seen above, is IP address 192.168.0.37 with 111 requests. You can use awk to filter on the IP address, then pipe that into cut to extract the field that contains the request, and finally pipe that output into countem.sh to provide the total number of requests for each page.

$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh

1 /uploads/2/9/1/4/29147191/31549414299.png?457
14 /files/theme/mobile49c2.js?1490908488
1 /cdn2.editmysite.com/images/editor/theme-background/stock/iPad.html
1 /uploads/2/9/1/4/29147191/2992005_orig.jpg
. . .
14 /files/theme/custom49c2.js?1490908488

The activity of this particular host is unimpressive, appearing to be standard web browsing behavior. If you take a look at the host with the next highest number of requests, you will see something a little more interesting.

$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh

1 /files/theme/mobile49c2.js?1490908488
1 /uploads/2/9/1/4/29147191/31549414299.png?457
1 /_/cdn2.editmysite.com/.../Coffee.html
1 /_/cdn2.editmysite.com/.../iPad.html
. . .
1 /uploads/2/9/1/4/29147191/601239_orig.png

This output indicates that host 192.168.0.36 accessed nearly every page on the website exactly one time. This type of activity often indicates webcrawler or site-cloning activity. If you take a look at the user agent string provided by the client it further supports this conclusion.

$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f12-17 | uniq

"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

The user agent identifies itself as HTTrack, which is a tool used to download or clone websites. While not necessarily malicious, it is interesting to note during analysis.

Tip

You can find additional information on HTTrack at http://www.httrack.com.

Totaling Numbers in Data

Rather than just count the number of times an IP address or other item occurs, what if you wanted to know the total byte count that has been sent to an IP address - or which IP addresses have requested and received the most data?

The solution is not much different from countem.sh; it needs only a few small changes. First, you need an additional column of data, so you tweak the input filter (the cut command) to extract two columns (IP address and byte count) rather than just the IP address. Second, you change the calculation from an increment (let cnt[$id]++), a simple count, to a sum of that second field of data (let cnt[$id]+=$data).

The pipeline to invoke this will now extract two fields from the logfile, the first and the tenth (the IP address and the byte count).

cut -d' ' -f 1,10 access.log | bash summer.sh
Example 6-5. summer.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# summer.sh
#
# Description:
# Sum the total of field 2 values for each unique field 1
#
# Usage:
# Input Format - <input field> <number>
#

declare -A cnt        # assoc. array
while read id count
do
  let cnt[$id]+=$count
done
for id in "${!cnt[@]}"
do
    printf "%-15s %8d\n"  "${id}"  "${cnt[${id}]}" 1
done
1

Note that we’ve made a few other changes to the output format: a field width of 15 characters for the first string (the IP address in our sample data), left-justified (via the minus sign), and 8 digits for the sum values. If the sum is larger it will print the larger number, and if the string is longer it will be printed in full. We’ve done this to get the data to align, by and large, nicely in columns for readability.
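For instance, here is what that format string produces for one line of our sample data, typed at the command line just to illustrate (not part of the script):

$ printf '%-15s %8d\n' 192.168.0.37 2575030

192.168.0.37     2575030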

You can run summer.sh against the example access.log file to get an idea of the total amount of data requested by each host. To do this use cut to extract the IP address and bytes transferred fields, and then pipe the output into summer.sh.

$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn

192.168.0.36     4371198
192.168.0.14     2876088
192.168.0.37     2575030
192.168.0.11     2537662
192.168.0.26      665693

These results can be useful in identifying hosts that have transferred unusually large amounts of data compared to other hosts. A spike could indicate data theft and exfiltration. If you identify such a host, the next step would be to review the specific pages and files it accessed to try to classify the activity as malicious or benign.

Displaying Data in a Histogram

You can take counting one step further by providing a more visual display of the results. You can take the output from countem.sh or summer.sh and pipe it into yet another script, one that will produce a histogram-like display of the results.

The script to do the printing will take the first field as the index to an associative array; the second field as the value for that array element. It will then iterate through the array and print a number of hashtags to represent the count, scaled to 50 # symbols for the largest count in the list.

Example 6-6. histogram.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram.sh
#
# Description:
# Generate a horizontal bar chart of specified data
#
# Usage:
# Data input format - <label> <value>
#

function pr_bar ()                            1
{
    local -i i raw maxraw scaled              2
    raw=$1
    maxraw=$2
    ((scaled=(MAXBAR*raw)/maxraw))            3
    # min size guarantee
    ((raw > 0 && scaled == 0)) && scaled=1				4

    for((i=0; i<scaled; i++)) ; do printf '#' ; done
    printf '\n'

} # pr_bar

#
# "main"
#
declare -A RA						5
declare -i MAXBAR max
max=0
MAXBAR=50	# how large the largest bar should be

while read labl val
do
    let RA[$labl]=$val					6
    # keep the largest value; for scaling
    (( val > max )) && max=$val
done

# scale and print it
for labl in "${!RA[@]}"					7
do
    printf '%-20.20s  ' "$labl"
    pr_bar ${RA[$labl]} $max				8
done
1

We define a function to draw a single bar of the histogram. This definition must be encountered before a call to the function can be made, so it makes sense to put function definitions at the front of our script. We will be reusing this function in a future script so we could have put it in a separate file and included it here with a source command - but we didn’t.

2

We declare all these variables as local because we don’t want them to interfere with variable names in the rest of this script (or any others, if we copy/paste this script to use elsewhere). We declare all these variables as integer (that’s the -i option) because we are only going to compute values with them and not use them as strings.

3

The computation is done inside double-parentheses and inside those we don’t need to use the $ to indicate “the value of” each variable name.

4

This is an “if-less” if statement. If the expression inside the double-parentheses is true then, and only then, is the second expression (the assignment) executed. This will guarantee that scaled is never zero when the raw value is non-zero. Why? Because we’d like something to show up in that case.
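Written out as a conventional if statement, that line would look like this (shown only for comparison; the script keeps the compact form):

if (( raw > 0 && scaled == 0 ))
then
    scaled=1
fi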

5

The main part of the script begins with a declaration of the RA array as an associative array.

6

Here we reference the associative array using the label, a string, as its index.

7

Since the array isn’t indexed by numbers, we can’t just count integers and use them as indices. This construct gives all the various strings that were used as an index to the array, one at a time, in the for loop.

8

We use the label as an index one more time to get the count and pass it as the first parameter to our pr_bar function.

Note that the items don’t appear in the same order as the input. That’s because the hashing algorithm for the key (the index) doesn’t preserve ordering. You could take this output and pipe it into yet another sort, as sketched below, or you could take a slightly different approach.
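For example, appending a plain sort to the full pipeline shown later in this section would order the output lines by their first field, the label (a sketch):

cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh | sort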

Here’s a version of the histogram script that preserves order - by not using an associative array. This might also be useful on older versions of bash (pre 4.0), prior to the introduction of associative arrays. Only the “main” part of the script is shown as the function pr_bar remains the same.

Example 6-7. histogram_plain.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram_plain.sh
#
# Description:
# Generate a horizontal bar chart of specified data without
# using associative arrays, good for older versions of bash
#
# Usage:
# Data input format - <label> <value>
#

declare -a RA_key RA_value                               1
declare -i max ndx
max=0
MAXBAR=50    # how large the largest bar should be

ndx=0
while read labl val
do
    RA_key[$ndx]=$labl                                   2
    RA_value[$ndx]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
    let ndx++
done

# scale and print it
for ((j=0; j<ndx; j++))                                  3
do
    printf "%-20.20s  " ${RA_key[$j]}
    pr_bar ${RA_value[$j]} $max
done

This version of the script avoids the use of associative arrays in case you are running an older version of bash (prior to 4.x), such as on macOS systems. For this version we use two separate arrays: one for the labels and one for the counts. Since they are normal arrays we have to use an integer index, so we keep a simple count in the variable ndx.

1

Here the variable names are declared as arrays. The lower-case a says that they are arrays, but not of the “associative” variety. While not strictly necessary, it is good practice.

2

The key and value pairs are stored in separate arrays, but at the same index location. This approach is “brittle” - that is, easily broken if changes to the script ever get the two arrays out of sync.

3

Now the for loop, unlike the previous script, is a simple count of an integer from 0 to ndx. The variable j is used here so as not to interfere with the index in the for loop inside pr_bar, although we were careful enough inside the function to declare its version of i as local to the function. Do you trust it? Change the j to an i here and see if it still works (it does). Then try removing the local declaration and see if it fails (it does).

This approach with the two arrays does have one advantage. By using a numerical index for storing the label and the data, you can retrieve them in the order they were read in, that is, in the numerical order of the index.

You can now visually see the hosts that transferred the largest number of bytes by extracting the appropriate fields from access.log, piping the results into summer.sh and then into histogram.sh.

$ cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh

192.168.0.36          ##################################################
192.168.0.37          #############################
192.168.0.11          #############################
192.168.0.14          ################################
192.168.0.26          #######

While this might not seem that useful for the small amount of sample data, being able to visualize trends is invaluable when looking across larger datasets.

In addition to looking at the number of bytes transferred by IP address or host, it is often interesting to look at the data by date and time. To do that you can use the summer.sh script, but due to the format of the access.log file you need to do a little more processing before you can pipe it into the script. If you use cut to extract the date/time and bytes transferred fields you are left with data that causes some problems for the script.

$ cut -d' ' -f4,10 access.log

[12/Nov/2017:15:52:59 2377
[12/Nov/2017:15:52:59 4529
[12/Nov/2017:15:52:59 1112

As seen in the output above, the raw data starts with a [ character. That causes a problem with the script because it denotes the beginning of an array in bash. To remedy that you can use an additional iteration of the cut command to remove the character using -c2- as an option. This option tells cut to extract the data by character starting at position 2 and going to the end of the line (-). The corrected output with the square bracket removed can be seen below.

$ cut -d' ' -f4,10 access.log | cut -c2-

12/Nov/2017:15:52:59 2377
12/Nov/2017:15:52:59 4529
12/Nov/2017:15:52:59 1112
Tip

Alternatively, you can use tr in place of the second cut. The -d option will delete the character specified, in this case the square bracket.

cut -d' ' -f4,10 access.log | tr -d '['

You also need to determine how you want to group the time-bound data; by day, month, year, hour, etc. You can do this by simply modifying the option for the second cut iteration. The table below illustrates the cut option to use to extract various forms of the date/time field. Note that these cut options are specific to Apache log files.

Table 6-3. Apache Log Date/Time Field Extraction
Date/Time Extracted     Example Output          cut Option

Entire date/time        12/Nov/2017:19:26:09    -c2-
Month, Day, and Year    12/Nov/2017             -c2-12,22-
Month and year          Nov/2017                -c5-12,22-
Full Time               19:26:04                -c14-
Hour                    19                      -c14-15,22-
Year                    2017                    -c9-12,22-
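For example, to total the bytes transferred per day rather than per hour, use the month, day, and year option from the table; the ,22- keeps the byte count that summer.sh needs (a sketch):

cut -d' ' -f4,10 access.log | cut -c2-12,22- | bash summer.sh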

The histogram.sh script can be particularly useful when looking at time-based data. For example, if your organization has an internal web server that is only accessed during working hours of 9:00 AM to 5:00 PM, you can review the server log file on a daily basis using the histogram view and see if there are any spikes in activity outside of normal working hours. Large spikes of activity or data transfer outside of normal working hours could indicate exfiltration by a malicious actor. If any anomalies are detected you can filter the data by that particular date and time and review the page accesses to determine if the activity is malicious.

For example, if you want to see a histogram of the total amount of data that was retrieved on a certain day and on an hourly basis you can do the following:

$ awk '$4 ~ "12/Nov/2017" {print $0}' access.log | cut -d' ' -f4,10 |
cut -c14-15,22- | bash summer.sh | bash histogram.sh

17              ##
16              ###########
15              ############
19              ##
18              ##################################################

Here the access.log file is sent through awk to extract the entries from a particular date. Note the use of the match operator (~) instead of == since field 4 also contains time information. Those entries are piped into cut to extract the date/time and bytes transferred fields, and then piped into cut again to extract just the hour. From there the data is summed by hour using summer.sh and converted into a histogram using histogram.sh. The result is a histogram that displays the total number of bytes transferred each hour on November 12, 2017.

Finding Uniqueness in Data

Previously IP address 192.168.0.37 was identified as the system that had the largest number of page requests. The next logical question is what pages did this system request? With that answer you can start to gain an understanding of what the system was doing on the server and categorize the activity as benign, suspicious, or malicious. To accomplish that you can use awk and cut and pipe the output into countem.sh.

$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 |
bash countem.sh | sort -rn | head -5

14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html

While this can be accomplished by piping together commands and scripts, that approach requires multiple passes through the data. It works for many datasets, but it is too inefficient for extremely large ones. You can streamline this by writing a bash script specifically designed to extract and count page accesses that requires only a single pass over the data.

Example 6-8. pagereq.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# pagereq.sh
#
# Description:
# Count the number of page requests for a given IP address using bash
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#

declare -A cnt                                             1
while read addr d1 d2 datim gmtoff getr page therest
do
    if [[ $1 == $addr ]] ; then let cnt[$page]+=1 ; fi
done
for id in ${!cnt[@]}                                       2
do
    printf "%8d %s\n" ${cnt[$id]} $id
done
1

We declare cnt as an associative array (also known as a hash table or dictionary) so that we can use a string as the index to the array. In this program we will be using the page address (the URL) as the index.

2

The ${!cnt[@]} results in a list of all the different index values that have been encountered. Note, however, that they are not listed in any useful order.

Early versions of bash don’t have associative arrays, but you can use awk to do the same thing - counting the various page requests from a particular IP address - since awk has associative arrays.

Example 6-9. pagereq.awk
# Rapid Cybersecurity Ops
# pagereq.awk
#
# Description:
# Count the number of page requests for a given IP address using awk
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#

# count the number of page requests from an address ($1)
awk -v page="$1" '{ if ($1==page) {cnt[$7]+=1 } }                1
END { for (id in cnt) {                                          2
    printf "%8d %s\n", cnt[id], id
    }
}'
1

There are two very different $1 variables on this line. The first $1 is a shell variable and refers to the first argument supplied to this script when it is invoked. The second $1 is an awk variable. It refers to the first field of the input on each line. The first $1 has been assigned to the awk variable page so that it can be compared to each $1 of awk - that is, to each first field of the input data.

2

This simple syntax results in the variable id iterating over the index values of the cnt array. It is much simpler than the shell’s "${!cnt[@]}" syntax, but has the same effect.

You can run pagereq.sh by providing the IP address you would like to search for and redirect access.log as input.

$ bash pagereq.sh 192.168.0.37 < access.log | sort -rn | head -5

14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html
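The awk version can be invoked in the same way and should produce the same counts:

bash pagereq.awk 192.168.0.37 < access.log | sort -rn | head -5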

Identifying Anomalies in Data

On the web, a user agent string is a small piece of textual information sent by a browser to a web server that identifies the client’s operating system, browser type, version, and other information. It is typically used by web servers to ensure page compatibility with the user’s browser. Here is an example of a user agent string:

Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0

This user agent string identifies the system as: Windows NT version 6.3 (aka Windows 8.1); 64-bit architecture; and using the Firefox browser.

The user agent string is interesting for a few reasons: first, because of the significant amount of information it conveys, which can be used to identify the types of systems and browsers accessing the server; second, because it is configurable by the end user, which can be used to identify systems that may not be using a standard browser or may not be using a browser at all (i.e., a webcrawler).

You can identify unusual user agents by first compiling a list of known good user agents. For the purposes of this exercise we will use a very small list that is not specific to a particular version.

Example 6-10. useragents.txt
Firefox
Chrome
Safari
Edge
Tip

For a list of common user agent strings visit https://techblog.willshouse.com/2012/01/03/most-common-user-agents/

You can then read in a web server log and compare each line to each valid user agent until you get a match. If no match is found it should be considered an anomaly and printed to standard out along with the IP address of the system making the request. This provides yet another vantage point into the data, identifying systems with unusual user agents, and another path to further explore.

Example 6-11. useragents.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# useragents.sh
#
# Description:
# Read through a log looking for unknown user agents
#
# Usage:
# useragents.sh < <inputfile>
#   <inputfile> Apache access log
#


# mismatch - search through the array of known names
#  returns 1 (false) if it finds a match
#  returns 0 (true) if there is no match
function mismatch ()                                    1
{
    local -i i                                          2
    for ((i=0; i<$KNSIZE; i++))
    do
        [[ "$1" =~ .*${KNOWN[$i]}.* ]] && return 1      3
    done
    return 0
}

# read up the known ones
readarray -t KNOWN < "useragents.txt"                      4
KNSIZE=${#KNOWN[@]}                                     5

# preprocess logfile (stdin) to pick out ipaddr and user agent
awk -F'"' '{print $1, $6}' | \
while read ipaddr dash1 dash2 dtstamp delta useragent   6
do
    if mismatch "$useragent"
    then
        echo "anomaly: $ipaddr $useragent"
    fi
done
1

We will use a function for the core of this script. It will return a success (or “true”) if it finds a mismatch, that is, if it finds no match against the list of known user agents. This logic may seem a bit inverted, but it makes the if statement containing the call to mismatch read clearly.

2

Declaring our for loop index as a local variable is good practice. It’s not strictly necessary in this script but is a good habit.

3

There are two strings to compare - the input from the logfile and a line from the list of known user agents. To make for a very flexible comparison we use the regex comparison operator (the =~). The .* (meaning “zero or more instances of any character”) placed on either side of the $KNOWN array reference means that the known string can appear anywhere within the other string for a match.
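For example, here is that comparison at the command line, using the sample user agent string from earlier in this chapter (just to illustrate the operator):

$ agent="Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"
$ [[ "$agent" =~ .*Firefox.* ]] && echo "match"

match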

4

Each line of the file is added as an element to the array name specified. This gives us an array of known user agents. There are two identical ways to do this in bash: readarray, as used here, or mapfile. The -t option removes the trailing newline from each line read. The file containing the list of known user agents is specified here; modify as needed.

5

This computes the size of the array. It is used inside the mismatch function to loop through the array. We calculate it here, once, outside our loop to avoid recomputing it every time the function is called.

6

The input string is a complex mix of words and quote marks. To capture the user agent string we use the double-quote as the field separator. Doing that, however, means that our first field contains more than just the ip address. By using the bash read we can parse on the spaces to get the ip address. The last argument of the read takes all the remaining words and so it can capture all the several words of the user agent string.
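To run the script against the sample log, redirect access.log into it:

bash useragents.sh < access.log

Any request whose user agent does not contain one of the known browser names, such as the HTTrack client seen earlier, will be reported as an anomaly along with the requesting IP address.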

Summary

In this chapter we looked at techniques to analyze the content of log files by identifying unusual and anomalous activity. This type of analysis can provide you with insights into what occurred in the past. In the next chapter we will look at how to analyze log files and other data to gain insights into what is happening in the system in real time.

Exercises

  1. The example use of summer.sh used cut to print the 1st and 10th fields of the access.log file, like this:

$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn

Replace the cut command by using the awk command. Do you get the same results? What might be different about those two approaches?

  2. Expand the histogram.sh script to include the count at the end of each histogram bar. Here is sample output:

    192.168.0.37          #############################    2575030
    192.168.0.26          ####### 665693
  3. Expand the histogram.sh script to allow the user to supply the option -s that specifies the maximum bar size. For example, histogram.sh -s 25 would limit the maximum bar size to 25 # characters. The default should remain at 50 if no option is given.

  4. Download the following web log file TODO: Add Log File URL.

    1. Which IP address made the most number of requests?

    2. Which page was accessed the most number of times?

  5. Download the following Domain Name System (DNS) server log TODO: Add Log File URL

    1. What was the most requested domain?

    2. What day had the most number of requests?

  6. Modify the useragents.sh script to add some parameters:

    1. Add code for an optional first parameter that is the filename of the known user agents. If not specified, default to the name useragents.txt as the script currently uses.

    2. Add code for a -f option to take an argument. The argument is the filename of the logfile to read rather than reading from stdin.

  7. Modify the pagereq.sh script to not need an associative array but to work with a traditional array that uses a numerical index. Convert the IP address into a 10-12 digit number for that use. Caution: don’t have leading zeros on the number or the shell will attempt to interpret it as an octal number. Example: convert “10.124.16.3” into “10124016003”, which can be used as a numerical index.

Chapter 7. Real-Time Log Monitoring

The ability to analyze a log after an event is an important skill. It is equally important to be able to extract information from a log file in real time to detect malicious or suspicious activity as it happens. In this chapter we will explore methods to read in log entries as they are generated, format them for output to the analyst, and generate alerts based on known indicators of compromise.

Monitoring Text Logs

The most basic method to monitor a log in real time is to use the tail command’s -f option, which continuously reads a file and outputs new lines to stdout as they are added. As in previous chapters, we will use an Apache web server access log for examples, but the techniques presented can be applied to any text-based log. To monitor the Apache access log with tail:

tail -f /var/log/apache2/access.log

Commands can be combined to provide more advanced functionality. The output from tail can be piped into grep so only entries matching specific criteria will be output. The example below monitors the Apache access log and outputs entries matching a particular IP address.

tail -f /var/log/apache2/access.log | grep '10.0.0.152'

Regular expressions can also be used. Below, only entries returning an HTTP status code of 404 Page Not Found will be displayed. The -i option is added to ignore character case.

tail -f /var/log/apache2/access.log | egrep -i 'HTTP/.*" 404'

To clean up the output it can be piped into the cut command to remove extraneous information. The example below monitors the access log for requests resulting in a 404 status code and then uses cut to only display the date/time and the page that was requested.

$ tail -f access.log | egrep --line-buffered 'HTTP/.*" 404' | cut -d' ' -f4-7

[29/Jul/2018:13:10:05 -0400] "GET /test
[29/Jul/2018:13:16:17 -0400] "GET /test.txt
[29/Jul/2018:13:17:37 -0400] "GET /favicon.ico

You can further clean up the output by piping it into tr -d '[]"' to remove the square brackets and the orphaned double quote.
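For example (a sketch extending the pipeline above):

$ tail -f access.log | egrep --line-buffered 'HTTP/.*" 404' | cut -d' ' -f4-7 | tr -d '[]"'

29/Jul/2018:13:10:05 -0400 GET /test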

Note that we used the egrep command’s --line-buffered option. This forces egrep to output to stdout each time a line break occurs. Without this option buffering occurs and output is not piped into cut until a buffer is filled. We don’t want to wait that long. The option will have egrep write out each line as it finds it.

Log-Based Intrusion Detection

You can use the power of tail and egrep to monitor a log and output any entries that match known patterns of suspicious or malicious activity, often referred to as Indicators of Compromise (IOCs). By doing this you can create a lightweight Intrusion Detection System (IDS). To begin, let’s create a file that contains regex patterns for IOCs.

Example 7-1. ioc.txt
\.\./ 1
etc/passwd 2
etc/shadow
cmd\.exe 3
/bin/sh
/bin/bash
1

This pattern (../) is an indicator of a directory traversal attack where the attacker tries to escape from the current working directory and access files for which they otherwise would not have permission.

2

The Linux etc/passwd and etc/shadow files are used for system authentication and should never be available through the web server.

3

Serving the cmd.exe, /bin/sh, or /bin/bash files is an indicator of a reverse shell being returned by the web server. A reverse shell is often an indicator of a successful exploitation attempt.

Note that the IOCs must be in a regular expression format as they will be used later with egrep.

Tip

IOCs for web servers are too numerous to discuss here in depth. For more examples of indicators of compromise download the latest Snort community ruleset at https://www.snort.org/downloads.

Next ioc.txt can be used with the egrep -f option. This option tells egrep to read in the regex patterns to search for from the specified file. This allows you to use tail to monitor the log file, and as each entry is added it will be compared against all of the patterns in the IOC file, outputting any entry that matches. Here is an example:

tail -f /var/log/apache2/access.log | egrep -i -f ioc.txt

Additionally, the tee command can be used to simultaneously display the alerts to the screen and save them to their own file for later processing.

tail -f /var/log/apache2/access.log | egrep --line-buffered -i -f ioc.txt |
tee -a interesting.txt

Again the --line-buffered option is used to ensure there are no problems caused by command output buffering.

Monitoring Windows Logs

As previously discussed, you need to use the wevtutil command to access Windows events. While the command is very versatile, it does not have functionality similar to tail that can be used to extract new entries as they occur. Thankfully, a simple bash script can provide similar functionality.

Example 7-2. wintail.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# wintail.sh
#
# Description:
# Perform a tail-like function on a Windows log
#

WINLOG="Application"  1

LASTLOG=$(wevtutil qe "$WINLOG" //c:1 //rd:true //f:text)  2

while true
do
	CURRENTLOG=$(wevtutil qe "$WINLOG" //c:1 //rd:true //f:text)  3
	if [[ "$CURRENTLOG" != "$LASTLOG" ]]
	then
		echo "$CURRENTLOG"
		echo "----------------------------------"
		LASTLOG="$CURRENTLOG"
	fi
done
1

This variable identifies the Windows log you want to monitor. You can use wevtutil el to obtain a list of logs currently available on the system.

2

This executes the wevtutil command to query the specified log file. The c:1 parameter causes it to return only one log entry. The rd:true parameter causes the command to read the most recent log entry. Finally, f:text returns the result as plain text rather than XML which makes it easy to read from the screen.

3

The next few lines execute the wevtutil command again and compare the latest log entry to the last one printed to the screen. If the two are different, meaning that a new entry was added to the log, it prints the entry to the screen. If they are the same nothing happens and it loops back and checks again.

Generating a Real-Time Histogram

A tail -f provides an ongoing stream of data. What if you wanted to count how many lines are added to a file during a time interval? You could observe that stream of data, start a timer, and begin counting until a specified time interval is up; then you can stop counting and report the results.

You might divide this work into two separate processes - two separate scripts - one to count the lines and another to watch the clock. The timekeeper will notify the line counter by means of a standard POSIX inter-process communication mechanism called a “signal”. A signal is a software interrupt and there are different kinds of such interrupts. Some are fatal, that is, will cause the process to terminate (e.g., a floating point exception). Most can be ignored or caught - and an action taken when the signal is caught. Many have a predefined purpose, used by the operating system. We’ll use one of the two signals available for users, SIGUSR1. (The other is SIGUSR2.)

Shell scripts can catch the catchable interrupts with the trap command, a shell built-in command. With trap you specify a command to indicate what action you want taken and a list of signals which trigger the invocation of that command. For example:

trap warnmsg SIGINT

will cause the command warnmsg (our own script or function) to be called whenever the shell script receives a SIGINT signal, as when you type a control-C to interrupt a running process.
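A minimal handler might look like the following sketch (warnmsg here is just an illustration, not a script used elsewhere in this book):

function warnmsg ()
{
    # report the interrupt on stderr, then exit with a failure status
    echo "received SIGINT; cleaning up and exiting" >&2
    exit 1
}

trap warnmsg SIGINT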

Here is the script that performs the count.

Example 7-3. looper.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# looper.sh
#
# Description:
# Count the lines in a file being tailed -f
# Report the count interval on every SIGUSR1
#


function interval ()					1
{
    echo $(date '+%y%m%d %H%M%S') $cnt			2
    cnt=0
}

declare -i cnt=0
trap interval SIGUSR1					3

shopt -s lastpipe					4

tail -f --pid=$$ ${1:-log.file} | while read aline	5
do
    let cnt++
done
1

The function interval will be called on each signal. We define it here. It needs to be defined before we can call it, of course, but also before we can use it in our trap statement, below.

2

The date command is called to provide a timestamp for the count value that we print out. After we print the count we reset its value to 0 to start the count for the next interval.

3

Now that interval is defined, we can tell bash to call the function whenever our process receives a SIGUSR1 signal.

4

This is a crucial step. Normally when there is a pipeline of commands (such as ls -l | grep rwx | wc) then those pieces of the pipeline (each command) are run in subshells and they each end up with their own process id. This would be a problem for this script because the while loop would be in a subshell, with a different process id. Whatever process started the looper.sh script wouldn’t know the process id of the while loop to send the signal to it. Moreover, changing the value of the cnt variable in the subshell doesn’t change the value of cnt in the main process, so a signal to the main process would result in a value of zero every time. The solution is this shopt command that sets (-s) the shell option lastpipe. That option tells the shell not to create a subshell for the last command in a pipeline but to run that command in the same process as the script itself. In our case that means that the tail will run in a subshell (i.e., a different process) but the while loop will be part of the main script process. Caution: this shell option is only available in bash 4.x and above, and is only for non-interactive shells (i.e., scripts).
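You can see the subshell behavior for yourself with a few lines of bash run without the shopt (an illustration only):

cnt=0
printf 'a\nb\nc\n' | while read x ; do let cnt++ ; done
echo $cnt     # prints 0 - the while loop incremented its own copy in a subshell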

5

Here is the tail -f command with one more option, the --pid option. We specify a process id to tell tail to exit when that process dies. We are specifying $$, the current shell script’s process id, as the one to watch. This is useful for cleanup so that we don’t get tail commands left running in the background (if, for example, this script is run in the background; see the next script which does just that.)

The script tailcount.sh starts and stops the counting; it is the script with the “stopwatch,” so to speak, and it times these intervals.

Example 7-4. tailcount.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# tailcount.sh
#
# Description:
# Count lines every n seconds
#

# cleanup - the other processes on exit
function cleanup ()
{
  [[ -n $LOPID ]] && kill $LOPID		1
}

trap cleanup EXIT 				2

bash looper.sh $1 &				3
LOPID=$!					4
# give it a chance to start up
sleep 3

while true
do
    kill -SIGUSR1 $LOPID
    sleep 5
done >&2					5
1

Since this script will be starting other processes (other scripts), it should clean up after itself. If the process id has been stored in LOPID, the variable will be non-empty and therefore the function will send a signal via the kill command to that process. Since no particular signal is specified on the kill command, the default signal, SIGTERM, is sent.

2

Not a signal, EXIT is a special case for the trap statement to tell the shell to call this function (here, cleanup) when the shell that is running this script is about to exit.

3

Now the real work begins. The looper.sh script is called but is put in the “background”, that is, detached from the keyboard to run on its own while this script continues (without waiting for looper.sh to finish).

4

This saves the process id of the script that we just put in the background.

5

This redirection is just a precaution. By redirecting stdout into stderr then any and all output coming from the while loop or the kill or sleep statements (though we’re not expecting any) will be sent to stderr and not get mixed in with any output coming from looper.sh which, though it is in the background, still writes to stdout.

In summary, looper.sh has been put in the background and its process id saved in a shell variable. Every 5 seconds this script (tailcount.sh) sends that process (which is running looper.sh) a SIGUSR1 signal, which causes looper.sh to print out its current count and restart its counting. When tailcount.sh exits it will clean up by sending a SIGTERM to the looper.sh process so that it, too, is terminated.

With both a script to do the counting and a script to drive it with its “stopwatch”, you can use their output as input to a script that prints out a histogram-like bar to represent the count. It is invoked as follows:

bash tailcount.sh | bash livebar.sh

The livebar.sh script reads from stdin and prints its output to stdout, one line for each line of input.

Example 7-5. livebar.sh
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# livebar.sh
#
# Description:
# Creates a rolling horizontal bar chart of live data
#
# Usage:
# <output> | bash livebar.sh
#

function pr_bar ()					1
{
    local raw maxraw scaled
    raw=$1
    maxraw=$2
    ((scaled=(maxbar*raw)/maxraw))
    ((scaled == 0)) && scaled=1		# min size guarantee
    for((i=0; i<scaled; i++)) ; do printf '#' ; done
    printf '\n'

} # pr_bar


maxbar=60   # largest no. of chars in a bar		2
MAX=60
while read dayst timst qty
do
    if (( qty > MAX ))					3
    then
	let MAX=$qty+$qty/4	# allow some room
	echo "              **** rescaling: MAX=$MAX"
    fi
    printf '%6.6s %6.6s %4d:' $dayst $timst $qty	4
    pr_bar $qty $MAX
done
1

The pr_bar function prints the bar of hashtags scaled to the maximum size based on the parameters supplied. This function might look familiar. We’re using the same function we used in histogram.sh in the previous chapter.

2

This is the longest string of hashtags we will allow on a line (to avoid line wrap).

3

How large will the values be that need to be displayed? Not knowing beforehand (although the maximum could be supplied as an argument to the script), the script will instead keep track of a running maximum. If that maximum is exceeded it will “rescale,” and the current and future lines will be scaled to the new maximum. The script adds 25% onto the maximum so that it doesn’t need to rescale every time a new value goes up by just one or two.

4

The printf specifies a min and max width on the first two fields that are printed. They are date and time stamps and will be truncated if they exceed those widths. You wouldn’t want the count truncated so we specify it will be 4 digits wide but the entire value will be printed regardless. If it is smaller than 4 it will be padded with blanks.

Since this script reads from stdin you can run it by itself to see how it behaves. Here’s a sample:

$ bash livebar.sh
201010 1020 20
201010   1020   20:####################
201010 1020 70
              **** rescaling: MAX=87
201010   1020   70:################################################
201010 1020 75
201010   1020   75:###################################################
^C

In this example the input is mixing with the output. You could also put the input into a file and redirect it into the script to see just the output.

$ bash livebar.sh < testdata.txt
201010   1020   20:####################
              **** rescaling: MAX=87
201010   1020   70:################################################
201010   1020   75:###################################################
$

Summary

Log files can provide tremendous insight into the operation of a system, but they also come in large quantities, which makes them challenging to analyze. You can minimize this issue by creating a series of scripts to automate data formatting, aggregation, and alerting.

In the next chapter we will look at how similar techniques can be leveraged to monitor networks for configuration changes.

Exercises

  1. Add a -i option to livebar.sh to set the interval in seconds.

  2. Add a -M option to livebar.sh to set an expected maximum for input values. Use the getopts builtin to parse your options.

  3. How might you add a -f option that filters data (using, e.g., grep)? What challenges might you encounter? What approach(es) might you take to deal with those?

  4. Modify wintail.sh to allow the user to specify the Windows log to be monitored by passing in a command line argument.

  5. Modify wintail.sh to add the capability for it to be a lightweight intrusion detection system using egrep and an IOC file.

  6. Consider the statement made in the note about buffering: “When the input is coming from a file, that usually happens quickly.” Why “usually”? Under what conditions might you see the need for the line buffering option on grep even when reading from a file?

About the Authors

Paul Troncone

https://www.linkedin.com/in/paultroncone

https://www.digadel.com

Paul Troncone has over 15 years of experience in the cybersecurity and information technology fields. In 2009 Paul founded the Digadel Corporation where he performs independent cybersecurity consulting and software development. He holds a Bachelor of Arts degree in Computer Science from Pace University, a Master of Science degree in Computer Science from the Tandon School of Engineering at New York University (Formerly Polytechnic University), and is a Certified Information Systems Security Professional. Paul has served in a variety of roles including as a vulnerability analyst, software developer, penetration tester, and college professor.

 

Carl Albing

https://www.linkedin.com/in/albing

Carl Albing is a teacher, researcher, and software engineer with a breadth of industry experience. A co-author of O’Reilly’s “bash Cookbook”, he has worked in software for companies large and small, across a variety of software industries. He has a B.A. in Mathematics, Masters in International Management, and a Ph.D. in Computer Science. He has recently spent time in academia as a Distinguished Visiting Professor in the Department of Computer Science at the U.S. Naval Academy where he taught courses on Programming Languages, Compilers, High Performance Computing, and Advanced Shell Scripting. He is currently a Research Professor in the Data Science and Analytics Group at the Naval Postgraduate School.