In the previous chapters we used scripts to collect data and prepare it for analysis. Now we need to make sense of it all. When analyzing large amounts of data it often helps to start broad and continually narrow the search as new insights are gained into the data.
In this chapter we use the data from web server logs as input into our scripts. This is simply for demonstration purposes. The scripts and techniques can easily be modified to work with nearly any type of data.
We will use an Apache web server access log for most of the examples in this chapter. This type of log records page requests made to the web server, when they were made, and who made them. A sample of a typical log entry can be seen below. The full log file will be referenced as access.log in this book and can be downloaded at https://www.rapidcyberops.com.
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /request-quote.html HTTP/1.1" 200 7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0"
Web server logs are used simply as an example. The techniques introduced throughout this chapter can be applied to analyze a variety of data types.
The Apache web server log fields are broken out in Table 6-1.
| Field | Description | Field Number |
|---|---|---|
| 192.168.0.11 | IP address of the host that requested the page | 1 |
| - | RFC 1413 Ident protocol identifier (- if not present) | 2 |
| - | The HTTP authenticated user ID (- if not present) | 3 |
| [12/Nov/2017:15:54:39 -0500] | Date, time, and GMT offset (timezone) | 4 - 5 |
| GET /request-quote.html | The page that was requested | 6 - 7 |
| HTTP/1.1 | The HTTP protocol version | 8 |
| 200 | The status code returned by the web server | 9 |
| 7326 | The size of the file returned in bytes | 10 |
| http://192.168.0.35/support.html | The referring page | 11 |
| Mozilla/5.0 (Windows NT 6.3; Win64… | User agent identifying the browser | 12+ |
Note that there is a second type of Apache access log known as the Common Log Format. The format is the same as the Combined Log Format except it does not contain fields for the referring page or user agent. See https://httpd.apache.org/docs/2.4/logs.html for additional information on the Apache log format and configuration.
The Hypertext Transfer Protocol (HTTP) status codes mentioned above are often very informative and let you know how the web server responded to any given request. Common codes are shown in Table 6-2:
| Code | Description |
|---|---|
| 200 | OK |
| 401 | Unauthorized |
| 404 | Page Not Found |
| 500 | Internal Server Error |
| 502 | Bad Gateway |
For a complete list of codes see the Hypertext Transfer Protocol (HTTP) Status Code Registry at https://www.iana.org/assignments/http-status-codes
We introduce the sort, head, and uniq commands to limit the data we need to process and display. The following file, file1.txt, will be used for the command examples:

```
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html
```
The sort command is used to rearrange a text file into numerical and alphabetical order. By default, sort arranges lines in ascending order, starting with numbers and then letters. Uppercase letters are placed before their corresponding lowercase letters unless otherwise specified.

Common command options:

| Option | Description |
|---|---|
| -r | Sort in descending order |
| -f | Ignore case |
| -n | Use numerical ordering, so that 1, 2, 3 all sort before 10 (in the default alphabetical ordering, 2 and 3 would appear after 10) |
| -k | Sort based on a subset of the data (key) in a line; fields are delimited by whitespace |
| -o | Write output to the specified file |
To sort file1.txt by the file name column (the third field) and ignore the date and IP address columns, you would use the following:

sort -k 3 file1.txt
You can also sort on a subset of a field. To sort by the second octet of the IP address:

sort -k 2.5,2.7 file1.txt

This will sort using characters 5 through 7 of the second field.
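If you want to see the effect of the -n option for yourself, here is a quick, self-contained check (the input is generated inline rather than read from a file):

```
$ printf '2\n10\n1\n3\n' | sort
1
10
2
3
$ printf '2\n10\n1\n3\n' | sort -n
1
2
3
10
```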
The uniq command filters out duplicate lines of data that occur adjacent to one another. To remove all duplicate lines in a file, be sure to sort it before using uniq.

Common command options:

| Option | Description |
|---|---|
| -c | Print out the number of times a line is repeated |
| -f | Ignore the specified number of fields before comparing; for example, -f 3 will ignore the first three fields in each line (fields are delimited by spaces) |
| -i | Ignore letter case; by default uniq is case sensitive |
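Here is a quick illustration of why sorting before uniq matters (again the input is generated inline; the exact column spacing of uniq -c may vary slightly between systems):

```
$ printf 'b\na\nb\nb\na\n' | uniq -c
      1 b
      1 a
      2 b
      1 a
$ printf 'b\na\nb\nb\na\n' | sort | uniq -c
      2 a
      3 b
```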
When analyzing data for the first time, it is often beneficial to start by looking at the extremes: the things that occurred most or least frequently, the smallest or largest data transfers, and so on. For example, consider the data you can collect from web server log files. An unusually high number of page accesses could indicate scanning activity or a denial-of-service attempt. An unusually high number of bytes downloaded by a host could indicate site cloning or data exfiltration.
To do that you can use the sort, head, and tail commands at the end of a pipeline such as:
… | sort -k 2.1 -rn | head -15
which pipes the output of a script into the sort command and then pipes that sorted output into head, which will print the top 15 (in this case) lines. The sort command here is using as its sort key (-k) the second field, beginning at its first character (2.1). Moreover, it is doing a reverse sort (-r) and the values will be sorted like numbers (-n). Why a numerical sort? So that 2 shows up between 1 and 3 and not between 19 and 20 (which is where it would land in alphabetical order).
By using head we take the first lines of the output. We could get the last few lines by piping the output from the sort command into tail instead of head. Using tail -15 would give us the last 15 lines. The other way to do this would be to simply remove the -r option on sort so that it does an ascending rather than descending sort.
A typical web server log can contain tens of thousands of entries. By counting the number of times each page was accessed, or the number of requests made from each IP address, you can gain a better understanding of general site activity. Interesting entries can include:
A high number of requests returning the 404 (Page Not Found) status code for a specific page; this can indicate broken hyperlinks.
A high number of requests from a single IP address returning the 404 status code; this can indicate probing activity looking for hidden or unlinked pages.
A high number of requests returning the 401 (Unauthorized) status code, particularly from the same IP address; this can indicate an attempt at bypassing authentication, such as brute-force password guessing.
To detect this type of activity we need to be able to extract key fields, such as the source IP address, and count the number of times they appear in a file. To accomplish this we will use the cut command to extract the field and then pipe the output into our new tool countem.sh.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# countem.sh
#
# Description:
# Count the number of instances of an item using bash
#
# Usage:
# countem.sh < inputfile
#
declare -A cnt              # assoc. array
while read id xtra
do
    let cnt[$id]++
done
# now display what we counted
# for each key in the (key, value) assoc. array
for id in "${!cnt[@]}"
do
    printf '%d %s\n' "${cnt[$id]}" "$id"
done
```
And here is another version, this time using awk:
```bash
# Rapid Cybersecurity Ops
# countem.awk
#
# Description:
# Count the number of instances of an item using awk
#
# Usage:
# countem.awk < inputfile
#
awk '{ cnt[$1]++ }
END { for (id in cnt) {
          printf "%d %s\n", cnt[id], id
      }
}'
```

Since we don’t know what IP addresses (or other strings) we might encounter, we will use an associative array, declared here with the -A option, so that we can use whatever string we read as our index.
The associative array feature of bash is found in bash 4.0 and higher. In such an array, the index doesn't have to be a number but can be any string. So you can index the array by the IP address and thus count the occurrences of that IP address. In case you're using something older than bash 4.0, Example 6-4 is an alternate script that uses awk instead.
The array references are like others in bash, using the ${var[index]} syntax to reference an element of the array. To get all the different index values that have been used (the “keys” if you think of these arrays as (key, value) pairings), use: ${!cnt[@]}

While we only expect one word of input per line, we put the variable xtra there to capture any other words that appear on the line. Each variable on a read command gets assigned the corresponding word from the input (i.e., the first variable gets the first word, the second variable gets the second word, and so on), but the last variable gets any and all remaining words. On the other hand, if there are fewer words of input on a line than there are variables on the read command, then those extra variables are set to the empty string. So for our purposes, if there are extra words on the input line, they will all be assigned to xtra, but if there are no extra words then xtra will be given the value of the null string (which won't matter either way, because we don't use it).
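You can see this word-splitting behavior directly at the command line (the here-string is used purely for the demonstration):

```
$ read first second rest <<< "one two three four five"
$ echo "[$first] [$second] [$rest]"
[one] [two] [three four five]
$ read first second rest <<< "one"
$ echo "[$first] [$second] [$rest]"
[one] [] []
```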

Here we use that string as the index and increment its previous value. For the first use of the index, the previous value will be unset, which will be taken as zero.

This syntax lets us iterate over all the various index values that we encountered. Note, however, that the order is not guaranteed; it depends on the hashing algorithm used for the index values, so don't expect the results to come out in any particular order, such as alphabetical.

In printing out the value and key we put the values inside quotes so that we always get a single value for each argument - even if that value had a space or two inside it. It isn’t expected to happen with our use of this script, but such coding practices make the scripts more robust when used in other situations.
Both will work nicely in a pipeline of commands like this:
cut -d' ' -f1 logfile | bash countem.sh
or (see note 2 above) just:
bash countem.sh < logfile
For example, to count the number of times an IP address made an HTTP request that resulted in a 404 (Page Not Found) error:
$ awk '$9 == 404 {print $1}' access.log | bash countem.sh
1 192.168.0.36
2 192.168.0.37
1 192.168.0.11
You can also use grep 404 access.log and pipe it into countem.sh, but that would include lines where 404 appears in other places (e.g. the byte count, or part of a file path). The use of awk here restricts the counting only to lines where the returned status (the ninth field) is 404. It then prints just the IP address (field 1) and pipes the output into countem.sh to get the total number of times each IP address made a request that resulted in a 404 error.
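The same pattern covers the 401 (Unauthorized) case described earlier; only the status code in the awk test changes:

```bash
awk '$9 == 401 {print $1}' access.log | bash countem.sh | sort -rn
```

Any address that shows up here with a large count is a candidate for closer review as a possible brute-force password-guessing attempt.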
To begin analysis of the example access.log file you can start by looking at the hosts that accessed the web server. You can use the Linux cut command to extract the first field of the log file, which contains the source IP address, and then pipe the output into the countem.sh script. The exact command and output is seen below.
```
$ cut -d' ' -f1 access.log | bash countem.sh | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
```
If you do not have countem.sh available, you can use the uniq command with the -c option to achieve similar results, but it will require an extra pass through the data using sort to work properly.
```
$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
```
Next, you can further investigate the host that made the largest number of requests, which, as seen above, is IP address 192.168.0.37 with 111 requests. You can use awk to filter on the IP address, then pipe that into cut to extract the field that contains the request, and finally pipe that output into countem.sh to provide the total number of requests for each page.
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh
1 /uploads/2/9/1/4/29147191/31549414299.png?457
14 /files/theme/mobile49c2.js?1490908488
1 /cdn2.editmysite.com/images/editor/theme-background/stock/iPad.html
1 /uploads/2/9/1/4/29147191/2992005_orig.jpg
. . .
14 /files/theme/custom49c2.js?1490908488
The activity of this particular host is unimpressive, appearing to be standard web browsing behavior. If you take a look at the host with the next highest number of requests, you will see something a little more interesting.
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh
1 /files/theme/mobile49c2.js?1490908488
1 /uploads/2/9/1/4/29147191/31549414299.png?457
1 /_/cdn2.editmysite.com/.../Coffee.html
1 /_/cdn2.editmysite.com/.../iPad.html
. . .
1 /uploads/2/9/1/4/29147191/601239_orig.png
This output indicates that host 192.168.0.36 accessed nearly every page on the website exactly one time. This type of activity often indicates a web crawler or site-cloning activity. A look at the user agent string provided by the client further supports this conclusion.
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f12-17 | uniq
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
The user agent identifies itself as HTTrack, which is a tool used to download or clone websites. While not necessarily malicious, it is interesting to note during analysis.
You can find additional information on HTTrack at http://www.httrack.com.
Rather than just count the number of times an IP address or other item occurs, what if you wanted to know the total byte count that has been sent to an IP address - or which IP addresses have requested and received the most data?
The solution is not that much different from countem.sh; you just need a few small changes. First, you need to tweak the input filter (the cut command) to extract two columns (IP address and byte count) rather than just the IP address. Second, you need to change the calculation from an increment (let cnt[$id]++), a simple count, to a sum of that second field of data (let cnt[$id]+=$data).
The pipeline to invoke this will now extract two fields from the logfile, the first and the last.
cut -d' ' -f 1,10 access.log | bash summer.sh
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# summer.sh
#
# Description:
# Sum the total of field 2 values for each unique field 1
#
# Usage:
#   Input Format - <input field> <number>
#
declare -A cnt              # assoc. array
while read id count
do
    let cnt[$id]+=$count
done
for id in "${!cnt[@]}"
do
    printf "%-15s %8d\n" "${id}" "${cnt[${id}]}"
done
```

Note that we've made a few other changes to the output format. We've specified a field width of 15 characters for the first string (the IP address in our sample data), left-justified (via the minus sign), and 8 digits for the sum values. If the sum is larger, the full number is printed; if the string is longer, it is printed in full. We've done this to get the data to align, by and large, nicely in columns for readability.
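If you want to see exactly how those field widths behave, you can run the format string by hand with sample values; the second example shows a sum that overflows the 8-digit field and is simply printed in full:

```
$ printf "%-15s %8d\n" "192.168.0.37" 2575030
192.168.0.37     2575030
$ printf "%-15s %8d\n" "10.100.200.250" 123456789
10.100.200.250  123456789
```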
You can run summer.sh against the example access.log file to get an idea of the total amount of data requested by each host. To do this use cut to extract the IP address and bytes transferred fields, and then pipe the output into summer.sh.
```
$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn
192.168.0.36     4371198
192.168.0.14     2876088
192.168.0.37     2575030
192.168.0.11     2537662
192.168.0.26      665693
```
These results can be useful in identifying hosts that have transferred unusually large amounts of data compared to other hosts. A spike could indicate data theft and exfiltration. If you identify such a host the next step would be to review the specific pages and files accessed by the suspicious host to try and classify it as malicious or benign.
You can take counting one step further by providing a more visual display of the results. You can take the output from countem.sh or summer.sh and pipe it into yet another script, one that will produce a histogram-like display of the results.
The script to do the printing will take the first field as the index to an associative array and the second field as the value for that array element. It will then iterate through the array and print a number of hash marks (#) to represent the count, scaled to 50 # symbols for the largest count in the list.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram.sh
#
# Description:
# Generate a horizontal bar chart of specified data
#
# Usage:
# Data input format - <label> <value>
#
function pr_bar ()
{
    local -i i raw maxraw scaled
    raw=$1
    maxraw=$2
    ((scaled=(MAXBAR*raw)/maxraw))
    # min size guarantee
    ((raw > 0 && scaled == 0)) && scaled=1

    for ((i=0; i<scaled; i++)) ; do printf '#' ; done
    printf '\n'

} # pr_bar

#
# "main"
#
declare -A RA
declare -i MAXBAR max

max=0
MAXBAR=50    # how large the largest bar should be

while read labl val
do
    let RA[$labl]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
done

# scale and print it
for labl in "${!RA[@]}"
do
    printf '%-20.20s ' "$labl"
    pr_bar ${RA[$labl]} $max
done
```

We define a function to draw a single bar of the histogram.
This definition must be encountered before a call to the function can be made, so it makes sense to put function definitions at the front of our script.
We will be reusing this function in a future script so we could have put it in a separate file and included it here with a source command - but we didn’t.
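Had we gone the separate-file route, it would look something like this (bar_funcs.sh is a hypothetical filename):

```bash
# pull in the pr_bar function definition from a separate (hypothetical) file
source ./bar_funcs.sh
```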

We declare all these variables as local because we don’t want them to interfere with variable names in the rest of this script (or any others, if we copy/paste this script to use elsewhere).
We declare all these variables as integer (that’s the -i option) because we are only going to compute values with them and not use them as strings.

The computation is done inside double-parentheses and inside those we don’t need to use the $ to indicate “the value of” each variable name.

This is an “if-less” if statement. If the expression inside the double-parentheses is true then, and only then, is the second expression (the assignment) executed. This will guarantee that scaled is never zero when the raw value is non-zero. Why? Because we’d like something to show up in that case.
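The same construct is handy anywhere a one-line conditional reads more cleanly than a full if/then/fi; for example, a hypothetical default-setting line:

```bash
# use a default bar width if the caller didn't set one
(( MAXBAR == 0 )) && MAXBAR=50
```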

The main part of the script begins with a declaration of the RA array as an associative array.

Here we reference the associative array using the label, a string, as its index.

Since the array isn't indexed by numbers, we can't just count integers and use them as indices. This construct gives all the various strings that were used as an index to the array, one at a time, in the for loop.

We use the label as an index one more time to get the count and pass it as the first parameter to our pr_bar function.
Note that the items don’t appear in the same order as the input. That’s because the hashing algorithm for the key (the index) doesn’t preserve ordering. You could take this output and pipe it into yet another sort, or you could take a slightly different approach.
Here’s a version of the histogram script that preserves order - by not using an associative array. This might also be useful on older versions of bash (pre 4.0), prior to the introduction of associative arrays. Only the “main” part of the script is shown as the function pr_bar remains the same.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram_plain.sh
#
# Description:
# Generate a horizontal bar chart of specified data without
# using associative arrays, good for older versions of bash
#
# Usage:
# Data input format - <label> <value>
#
declare -a RA_key RA_value
declare -i max ndx

max=0
MAXBAR=50    # how large the largest bar should be

ndx=0
while read labl val
do
    RA_key[$ndx]=$labl
    RA_value[$ndx]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
    let ndx++
done

# scale and print it
for (( j=0; j<ndx; j++ ))
do
    printf "%-20.20s " ${RA_key[$j]}
    pr_bar ${RA_value[$j]} $max
done
```
This version of the script avoids the use of associative arrays - in case you are running an older version of bash (prior to 4.x), such as on MacOS systems. For this version we use two separate arrays, one for the index value and one for the counts. Since they are normal arrays we have to use an integer index and so we will keep a simple count in the variable ndx.

Here the variable names are declared as arrays. The lower-case a says that they are arrays, but not of the “associative” variety. While not strictly necessary, it is good practice.

The key and value pairs are stored in separate arrays, but at the same index location. This approach is “brittle” - that is, easily broken, if changes to the script ever got the two arrays out of sync.

Now the for loop, unlike the previous script, is a simple counting of an integer from 0 to ndx. The variable j is used here so as not to interfere with the index in the for loop inside pr_bar, although we were careful enough inside the function to declare its version of i as local to the function. Do you trust it? Change the j to an i here and see if it still works (it does). Then try removing the local declaration and see if it fails (it does).
This approach with the two arrays does have one advantage. By using a numerical index for storing the label and the data, you can retrieve them in the order they were read in, simply by iterating through the numerical index.
You can now visually see the hosts that transferred the largest number of bytes by extracting the appropriate fields from access.log, piping the results into summer.sh and then into histogram.sh.
```
$ cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh
192.168.0.36         ##################################################
192.168.0.37         #############################
192.168.0.11         #############################
192.168.0.14         ################################
192.168.0.26         #######
```
While this might not seem that useful for the small amount of sample data, being able to visualize trends is invaluable when looking across larger datasets.
In addition to looking at the number of bytes transferred by IP address or host, it is often interesting to look at the data by date and time. To do that you can use the summer.sh script, but due to the format of the access.log file you need to do a little more processing before you can pipe it into the script. If you use cut to extract the date/time and bytes transferred fields you are left with data that causes some problems for the script.
```
$ cut -d' ' -f4,10 access.log
[12/Nov/2017:15:52:59 2377
[12/Nov/2017:15:52:59 4529
[12/Nov/2017:15:52:59 1112
```
As seen in the output above, the raw data starts with a [ character. That causes a problem for the script because the bracket is part of bash's array subscript syntax, so it breaks the arithmetic when it shows up inside the index. To remedy that you can use an additional invocation of the cut command to remove the character using -c2- as an option. This option tells cut to extract the data by character, starting at position 2 and going to the end of the line (-). The corrected output with the square bracket removed can be seen below.
```
$ cut -d' ' -f4,10 access.log | cut -c2-
12/Nov/2017:15:52:59 2377
12/Nov/2017:15:52:59 4529
12/Nov/2017:15:52:59 1112
```
Alternatively, you can use tr in place of the second cut. The -d option will delete the character specified, in this case the square bracket.
cut -d' ' -f4,10 access.log | tr -d '['
You also need to determine how you want to group the time-bound data; by day, month, year, hour, etc. You can do this by simply modifying the option for the second cut iteration. The table below illustrates the cut option to use to extract various forms of the date/time field. Note that these cut options are specific to Apache log files.
| Date/Time Extracted | Example Output | Cut Option |
|---|---|---|
| Entire date/time | 12/Nov/2017:19:26:09 | -c2- |
| Month, day, and year | 12/Nov/2017 | -c2-12 |
| Month and year | Nov/2017 | -c5-12,22- |
| Full time | 19:26:04 | -c14- |
| Hour | 19 | -c14-15,22- |
| Year | 2017 | -c9-12,22- |
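For example, using the Month and year row from the table above, you could total the bytes transferred per month:

```bash
cut -d' ' -f4,10 access.log | cut -c5-12,22- | bash summer.sh
```

The same pattern works for any of the other rows; only the character positions passed to the second cut change.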
The histogram.sh script can be particularly useful when looking at time-based data. For example, if your organization has an internal web server that is only accessed during working hours of 9:00 AM to 5:00 PM, you can review the server log file on a daily basis using the histogram view and see if there are any spikes in activity outside of normal working hours. Large spikes of activity or data transfer outside of normal working hours could indicate exfiltration by a malicious actor. If any anomalies are detected you can filter the data by that particular date and time and review the page accesses to determine if the activity is malicious.
For example, if you want to see a histogram of the total amount of data that was retrieved on a certain day and on an hourly basis you can do the following:
$ awk '$4 ~ "12/Nov/2017" {print $0}' access.log | cut -d' ' -f4,10 |
cut -c14-15,22- | bash summer.sh | bash histogram.sh
17 ##
16 ###########
15 ############
19 ##
18 ##################################################
Here the access.log file is sent through awk to extract the entries from a particular date. Note the use of the pattern-match operator (~) instead of ==, since field 4 also contains time information. Those entries are piped into cut to extract the date/time and bytes-transferred fields, and then piped into cut again to extract just the hour. From there the data is summed by hour using summer.sh and converted into a histogram using histogram.sh. The result is a histogram that displays the total number of bytes transferred each hour on November 12, 2017.
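To see why the match operator matters here, compare it with a strict equality test on field 4; because that field still begins with the bracket and includes the time, == never matches:

```bash
# field 4 looks like: [12/Nov/2017:15:52:59
awk '$4 == "12/Nov/2017" {print $0}' access.log   # matches nothing
awk '$4 ~  "12/Nov/2017" {print $0}' access.log   # matches every entry from that day
```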
Previously IP address 192.168.0.37 was identified as the system that had the largest number of page requests. The next logical question is what pages did this system request? With that answer you can start to gain an understanding of what the system was doing on the server and categorize the activity as benign, suspicious, or malicious. To accomplish that you can use awk and cut and pipe the output into countem.sh.
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 |
bash countem.sh | sort -rn | head -5
14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html
While this can be accomplished by piping together commands and scripts, it requires multiple passes through the data. That may work for many datasets, but it is too inefficient for extremely large ones. You can streamline this by writing a bash script specifically designed to extract and count page accesses, which requires only a single pass over the data.
```bash
# Rapid Cybersecurity Ops
# pagereq.sh
#
# Description:
# Count the number of page requests for a given IP address using bash
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#
declare -A cnt
while read addr d1 d2 datim gmtoff getr page therest
do
    if [[ $1 == $addr ]] ; then let cnt[$page]+=1 ; fi
done
for id in ${!cnt[@]}
do
    printf "%8d %s\n" ${cnt[$id]} $id
done
```

We declare cnt as an associative array (also known as a hash table or dictionary) so that we can use a string as the index to the array. In this program we will be using the page address (the URL) as the index.

The ${!cnt[@]} results in a list of all the different index values that have been encountered. Note, however, that they are not listed in any useful order.
Early versions of bash don’t have associative arrays. You can use awk to do the same thing - counting the various page requests from a particular ip address - since awk has associative arrays.
```bash
# Rapid Cybersecurity Ops
# pagereq.awk
#
# Description:
# Count the number of page requests for a given IP address using awk
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#
# count the number of page requests from an address ($1)
awk -v page="$1" '{ if ($1==page) {cnt[$7]+=1 } }
END { for (id in cnt) {
          printf "%8d %s\n", cnt[id], id
      }
}'
```

There are two very different $1 variables on this line.
The first $1 is a shell variable and refers to the first argument supplied to this script when it is invoked.
The second $1 is an awk variable. It refers to the first field of the input on each line.
The first $1 has been assigned to the awk variable page so that it can be compared to each $1 of awk - that is, to each first field of the input data.

This simple syntax results in the variable id iterating over the index values of the cnt array. It is much simpler syntax than the shell's "${!cnt[@]}" syntax, but with the same effect.
You can run pagereq.sh by providing the IP address you would like to search for and redirect access.log as input.
```
$ bash pagereq.sh 192.168.0.37 < access.log | sort -rn | head -5
      14 /files/theme/plugin49c2.js?1490908488
      14 /files/theme/mobile49c2.js?1490908488
      14 /files/theme/custom49c2.js?1490908488
      14 /files/main_styleaf0e.css?1509483497
       3 /consulting.html
```
On the web a User Agent String is a small piece of textual information sent by a browser to a web server that identifies the client’s operating system, browser type, version, and other information. It is typically used by web servers to ensure page compatibility with the user’s browser. Here is an example of a user agent string:
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0
This user agent string identifies the system as: Windows NT version 6.3 (aka Windows 8.1); 64-bit architecture; and using the Firefox browser.
The user agent string is interesting for a few reasons: first, because of the significant amount of information it conveys, which can be used to identify the types of systems and browsers accessing the server; second, because it is configurable by the end user, which can be used to identify systems that may not be using a standard browser or may not be using a browser at all (e.g., a web crawler).
You can identify unusual user agents by first compiling a list of known good user agents. For the purposes of this exercise we will use a very small list, stored in the file useragents.txt, that is not specific to particular browser versions.
```
Firefox
Chrome
Safari
Edge
```
For a list of common user agent strings visit https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
You can then read in a web server log and compare each line to each valid user agent until you get a match. If no match is found it should be considered an anomaly and printed to standard out along with the IP address of the system making the request. This provides yet another vantage point into the data, identifying systems with unusual user agents, and another path to further explore.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# useragents.sh
#
# Description:
# Read through a log looking for unknown user agents
#
# Usage:
# useragents.sh < <inputfile>
#   <inputfile> Apache access log
#
# mismatch - search through the array of known names
# returns 1 (false) if it finds a match
# returns 0 (true) if there is no match
function mismatch ()
{
    local -i i
    for (( i=0; i<$KNSIZE; i++ ))
    do
        [[ "$1" =~ .*${KNOWN[$i]}.* ]] && return 1
    done
    return 0
}

# read up the known ones
readarray -t KNOWN < "useragents.txt"
KNSIZE=${#KNOWN[@]}

# preprocess logfile (stdin) to pick out ipaddr and user agent
awk -F'"' '{print $1, $6}' | \
while read ipaddr dash1 dash2 dtstamp delta useragent
do
    if mismatch "$useragent"
    then
        echo "anomaly: $ipaddr $useragent"
    fi
done
```

We will use a function for the core of this script.
It will return a success (or “true”) if it finds a mismatch, that is, if it finds no match against the list of known user agents.
This logic may seem a bit inverted, but it makes the if statement containing the call to mismatch read clearly.

Declaring our for loop index as a local variable is good practice.
It’s not strictly necessary in this script but is a good habit.

There are two strings to compare - the input from the logfile and a line from the list of known user agents.
To make for a very flexible comparison we use the regex comparison operator (the =~).
The .* (meaning “zero or more instances of any character”) placed on either side of the $KNOWN array reference means that the known string can appear anywhere within the other string for a match.
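A standalone test shows the comparison in action (the user agent value here is just a sample string):

```
$ agent="Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0"
$ known="Firefox"
$ [[ "$agent" =~ .*${known}.* ]] && echo "match"
match
```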

Each line of the file is added as an element to the array name specified.
This gives us an array of known user agents.
There are two identical ways to do this in bash: readarray, as used here, or mapfile. The -t option removes the trailing newline from each line read.
The file containing the list of known user agents is specified here; modify as needed.

This computes the size of the array.
It is used inside the mismatch function to loop through the array.
We calculate it here, once, outside our loop to avoid recomputing it every time the function is called.
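A quick check of that syntax at the command line:

```
$ KNOWN=( "Firefox" "Chrome" "Safari" "Edge" )
$ echo "${#KNOWN[@]}"
4
```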

The input string is a complex mix of words and quote marks.
To capture the user agent string we use the double-quote as the field separator.
Doing that, however, means that our first field contains more than just the ip address.
By using the bash read we can parse on the spaces to get the ip address.
The last argument of the read takes all the remaining words and so it can capture all the several words of the user agent string.
In this chapter we looked at techniques to analyze the content of log files by identifying unusual and anomalous activity. This type of analysis can provide you with insights into what occurred in the past. In the next chapter we will look at how to analyze log files and other data to provide insights into what is happening in the system in real time.
The example use of summer.sh used cut to print the 1st and 10th fields of the access.log file, like this:
$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn
Replace the cut command by using the awk command. Do you get the same results? What might be different about those two approaches?
Expand the histogram.sh script to include the count at the end of each histogram bar. Here is sample output:
```
192.168.0.37         ############################# 2575030
192.168.0.26         ####### 665693
```
Expand the histogram.sh script to allow the user to supply the option -s that specifies the maximum bar size. For example histogram.sh -s 25 would limit the maximum bar size to 25 # characters. The default should remain at 50 if no option is given.
Download the following web log file TODO: Add Log File URL.
Which IP address made the most requests?
Which page was accessed the most times?
Download the following Domain Name System (DNS) server log TODO: Add Log File URL
What was the most requested domain?
Which day had the most requests?
Modify the useragents.sh script to add some parameters
Add code for an optional first parameter that specifies the filename of the known user agents list. If not specified, default to useragents.txt as the script currently uses.
Add code for a -f option to take an argument. The argument is the filename of the logfile to read rather than reading from stdin.
Modify the pagereq.sh script to not need an associative array but to work with a traditional array that uses a numerical index. Convert the ip address into a 10-12 digit number for that use. Caution: don’t have leading zeros on the number or the shell will attempt to interpret it as an octal number. Example: convert “10.124.16.3” into “10124016003” which can be used as a numerical index.
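One possible way to handle the conversion step (a sketch only; the surrounding script changes are left as the exercise): split the address on the dots, then zero-pad every octet except the first so the result never starts with a zero.

```bash
# sketch: convert "10.124.16.3" into 10124016003
ip="10.124.16.3"
IFS=. read -r o1 o2 o3 o4 <<< "$ip"
printf -v ipnum '%d%03d%03d%03d' "$o1" "$o2" "$o3" "$o4"
echo "$ipnum"    # prints 10124016003
```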