In the previous chapters we used scripts to collect data and prepare it for analysis. Now we need to make sense of it all. When analyzing large amounts of data it often helps to start broad and continually narrow the search as new insights are gained into the data.
In this chapter we use the data from web server logs as input into our scripts. This is simply for demonstration purposes. The scripts and techniques can easily be modified to work with nearly any type of data.
We will use an Apache web server access log for most of the examples in this chapter. This type of log records page requests made to the web server, when they were made, and who made them. A sample of a typical log entry can be seen below. The full log file will be referenced as access.log in this book and can be downloaded at https://www.rapidcyberops.com.
192.168.0.11 - - [12/Nov/2017:15:54:39 -0500] "GET /request-quote.html HTTP/1.1" 200 7326 "http://192.168.0.35/support.html" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0"
Web server logs are used simply as an example. The techniques introduced throughout this chapter can be applied to analyze a variety of data types.
The Apache web server log fields are broken out in Table 6-1.
| Field | Description | Field Number |
|---|---|---|
| 192.168.0.11 | IP address of the host that requested the page | 1 |
| - | RFC 1413 Ident protocol identifier (- if not present) | 2 |
| - | The HTTP authenticated user ID (- if not present) | 3 |
| [12/Nov/2017:15:54:39 -0500] | Date, time, and GMT offset (timezone) | 4 - 5 |
| GET /request-quote.html | The page that was requested | 6 - 7 |
| HTTP/1.1 | The HTTP protocol version | 8 |
| 200 | The status code returned by the web server | 9 |
| 7326 | The size of the file returned in bytes | 10 |
| http://192.168.0.35/support.html | The referring page | 11 |
| Mozilla/5.0 (Windows NT 6.3; Win64… | User agent identifying the browser | 12+ |
Note that there is a second type of Apache access log known as the Common Log Format. The format is the same as the Combined Log Format except it does not contain fields for the referring page or user agent. See https://httpd.apache.org/docs/2.4/logs.html for additional information on the Apache log format and configuration.
The Hypertext Transfer Protocol (HTTP) status codes mentioned above are often very informative and let you know how the web server responded to any given request. Common codes are shown in Table 6-2:
| Code | Description |
|---|---|
| 200 | OK |
| 401 | Unauthorized |
| 404 | Page Not Found |
| 500 | Internal Server Error |
| 502 | Bad Gateway |
For a complete list of codes see the Hypertext Transfer Protocol (HTTP) Status Code Registry at https://www.iana.org/assignments/http-status-codes
We introduce the sort, head, and uniq commands to limit the data we need to process and display. The following file, file1.txt, will be used for the command examples:

```
12/05/2017 192.168.10.14 test.html
12/30/2017 192.168.10.185 login.html
```
The sort command is used to rearrange a text file into numerical and alphabetical order. By default, sort arranges lines in ascending order, starting with numbers and then letters. Uppercase letters are placed before their corresponding lowercase letters unless otherwise specified.

Common command options:

| Option | Description |
|---|---|
| -r | Sort in descending order |
| -f | Ignore case |
| -n | Use numerical ordering, so that 1, 2, 3 all sort before 10 (in the default alphabetical ordering, 2 and 3 would appear after 10) |
| -k | Sort based on a subset of the data (key) in a line; fields are delimited by whitespace |
| -o | Write output to the specified file |
To sort file1.txt by the file name column (the third field) and ignore the date and IP address columns, you would use the following:

sort -k 3 file1.txt
You can also sort on a subset of a field. To sort by the second octet of the IP address:

sort -k 2.5,2.7 file1.txt

This will sort using characters 5 through 7 of the second field.
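If you want to see the effect of the -n option for yourself, here is a quick, self-contained check (the input is generated inline rather than read from a file):

```
$ printf '2\n10\n1\n3\n' | sort
1
10
2
3
$ printf '2\n10\n1\n3\n' | sort -n
1
2
3
10
```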
The uniq command filters out duplicate lines of data that occur adjacent to one another. To remove all duplicate lines in a file, be sure to sort it before using uniq.

Common command options:

| Option | Description |
|---|---|
| -c | Print out the number of times a line is repeated |
| -f | Ignore the specified number of fields before comparing; for example, -f 3 will ignore the first three fields in each line (fields are delimited by spaces) |
| -i | Ignore letter case; by default uniq is case sensitive |
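Here is a quick illustration of why sorting before uniq matters (again the input is generated inline; the exact column spacing of uniq -c may vary slightly between systems):

```
$ printf 'b\na\nb\nb\na\n' | uniq -c
      1 b
      1 a
      2 b
      1 a
$ printf 'b\na\nb\nb\na\n' | sort | uniq -c
      2 a
      3 b
```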
When analyzing data for the first time, it is often beneficial to start by looking at the extremes: the things that occurred most or least frequently, the smallest or largest data transfers, and so on. For example, consider the data you can collect from web server log files. An unusually high number of page accesses could indicate scanning activity or a denial-of-service attempt. An unusually high number of bytes downloaded by a host could indicate site cloning or data exfiltration.
To do that you can use the sort, head, and tail commands at the end of a pipeline such as:
… | sort -k 2.1 -rn | head -15
which pipes the output of a script into the sort command and then pipes that sorted output into head, which will print the top 15 (in this case) lines. The sort command here is using as its sort key (-k) the second field, beginning at its first character (2.1). Moreover, it is doing a reverse sort (-r) and the values will be sorted like numbers (-n). Why a numerical sort? So that 2 shows up between 1 and 3 and not between 19 and 20 (which is where it would land in alphabetical order).
By using head we take the first lines of the output. We could get the last few lines by piping the output from the sort command into tail instead of head. Using tail -15 would give us the last 15 lines. The other way to do this would be to simply remove the -r option on sort so that it does an ascending rather than descending sort.
A typical web server log can contain tens of thousands of entries. By counting the number of times each page was accessed, or the number of requests made from each IP address, you can gain a better understanding of general site activity. Interesting entries can include:
A high number of requests returning the 404 (Page Not Found) status code for a specific page; this can indicate broken hyperlinks.
A high number of requests from a single IP address returning the 404 status code; this can indicate probing activity looking for hidden or unlinked pages.
A high number of requests returning the 401 (Unauthorized) status code, particularly from the same IP address; this can indicate an attempt at bypassing authentication, such as brute-force password guessing.
To detect this type of activity we need to be able to extract key fields, such as the source IP address, and count the number of times they appear in a file. To accomplish this we will use the cut command to extract the field and then pipe the output into our new tool countem.sh.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# countem.sh
#
# Description:
# Count the number of instances of an item using bash
#
# Usage:
# countem.sh < inputfile
#
declare -A cnt              # assoc. array
while read id xtra
do
    let cnt[$id]++
done
# now display what we counted
# for each key in the (key, value) assoc. array
for id in "${!cnt[@]}"
do
    printf '%d %s\n' "${cnt[$id]}" "$id"
done
```
And here is another version, this time using awk:
```bash
# Rapid Cybersecurity Ops
# countem.awk
#
# Description:
# Count the number of instances of an item using awk
#
# Usage:
# countem.awk < inputfile
#
awk '{ cnt[$1]++ }
END { for (id in cnt) {
          printf "%d %s\n", cnt[id], id
      }
}'
```

Since we don’t know what IP addresses (or other strings) we might encounter, we will use an associative array, declared here with the -A option, so that we can use whatever string we read as our index.
The associative array feature of bash is found in bash 4.0 and higher. In such an array, the index doesn't have to be a number but can be any string. So you can index the array by the IP address and thus count the occurrences of that IP address. In case you're using something older than bash 4.0, Example 6-4 is an alternate script that uses awk instead.
The array references are like others in bash, using the ${var[index]} syntax to reference an element of the array. To get all the different index values that have been used (the “keys” if you think of these arrays as (key, value) pairings), use: ${!cnt[@]}

While we only expect one word of input per line, we put the variable xtra there to capture any other words that appear on the line. Each variable on a read command gets assigned the corresponding word from the input (i.e., the first variable gets the first word, the second variable gets the second word, and so on), but the last variable gets any and all remaining words. On the other hand, if there are fewer words of input on a line than there are variables on the read command, then those extra variables are set to the empty string. So for our purposes, if there are extra words on the input line, they will all be assigned to xtra, but if there are no extra words then xtra will be given the value of the null string (which won't matter either way, because we don't use it).
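You can see this word-splitting behavior directly at the command line (the here-string is used purely for the demonstration):

```
$ read first second rest <<< "one two three four five"
$ echo "[$first] [$second] [$rest]"
[one] [two] [three four five]
$ read first second rest <<< "one"
$ echo "[$first] [$second] [$rest]"
[one] [] []
```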

Here we use that string as the index and increment its previous value. For the first use of the index, the previous value will be unset, which will be taken as zero.

This syntax lets us iterate over all the various index values that we encountered. Note, however, that the order is not guaranteed; it depends on the hashing algorithm used for the index values, so don't expect the results to come out in any particular order, such as alphabetical.

In printing out the value and key we put the values inside quotes so that we always get a single value for each argument - even if that value had a space or two inside it. It isn’t expected to happen with our use of this script, but such coding practices make the scripts more robust when used in other situations.
Both will work nicely in a pipeline of commands like this:
cut -d' ' -f1 logfile | bash countem.sh
or (see note 2 above) just:
bash countem.sh < logfile
For example, to count the number of times an IP address made an HTTP request that resulted in a 404 (Page Not Found) error:
$ awk '$9 == 404 {print $1}' access.log | bash countem.sh
1 192.168.0.36
2 192.168.0.37
1 192.168.0.11
You can also use grep 404 access.log and pipe it into countem.sh, but that would include lines where 404 appears in other places (e.g. the byte count, or part of a file path). The use of awk here restricts the counting only to lines where the returned status (the ninth field) is 404. It then prints just the IP address (field 1) and pipes the output into countem.sh to get the total number of times each IP address made a request that resulted in a 404 error.
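The same pattern covers the 401 (Unauthorized) case described earlier; only the status code in the awk test changes:

```bash
awk '$9 == 401 {print $1}' access.log | bash countem.sh | sort -rn
```

Any address that shows up here with a large count is a candidate for closer review as a possible brute-force password-guessing attempt.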
To begin analysis of the example access.log file you can start by looking at the hosts that accessed the web server. You can use the Linux cut command to extract the first field of the log file, which contains the source IP address, and then pipe the output into the countem.sh script. The exact command and output is seen below.
```
$ cut -d' ' -f1 access.log | bash countem.sh | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
```
If you do not have countem.sh available, you can use the uniq command with the -c option to achieve similar results, but it will require an extra pass through the data using sort to work properly.
```
$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
111 192.168.0.37
55 192.168.0.36
51 192.168.0.11
42 192.168.0.14
28 192.168.0.26
```
Next, you can further investigate the host that made the largest number of requests, which, as seen above, is IP address 192.168.0.37 with 111 requests. You can use awk to filter on the IP address, then pipe that into cut to extract the field that contains the request, and finally pipe that output into countem.sh to provide the total number of requests for each page.
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh
1 /uploads/2/9/1/4/29147191/31549414299.png?457
14 /files/theme/mobile49c2.js?1490908488
1 /cdn2.editmysite.com/images/editor/theme-background/stock/iPad.html
1 /uploads/2/9/1/4/29147191/2992005_orig.jpg
. . .
14 /files/theme/custom49c2.js?1490908488
The activity of this particular host is unimpressive, appearing to be standard web browsing behavior. If you take a look at the host with the next highest number of requests, you will see something a little more interesting.
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f7 | bash countem.sh
1 /files/theme/mobile49c2.js?1490908488
1 /uploads/2/9/1/4/29147191/31549414299.png?457
1 /_/cdn2.editmysite.com/.../Coffee.html
1 /_/cdn2.editmysite.com/.../iPad.html
. . .
1 /uploads/2/9/1/4/29147191/601239_orig.png
This output indicates that host 192.168.0.36 accessed nearly every page on the website exactly one time. This type of activity often indicates a web crawler or site-cloning activity. A look at the user agent string provided by the client further supports this conclusion.
$ awk '$1 == "192.168.0.36" {print $0}' access.log | cut -d' ' -f12-17 | uniq
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
The user agent identifies itself as HTTrack, which is a tool used to download or clone websites. While not necessarily malicious, it is interesting to note during analysis.
You can find additional information on HTTrack at http://www.httrack.com.
Rather than just count the number of times an IP address or other item occurs, what if you wanted to know the total byte count that has been sent to an IP address - or which IP addresses have requested and received the most data?
The solution is not that much different from countem.sh; you just need a few small changes. First, you need to tweak the input filter (the cut command) to extract two columns (IP address and byte count) rather than just the IP address. Second, you need to change the calculation from an increment (let cnt[$id]++), a simple count, to a sum of that second field of data (let cnt[$id]+=$data).
The pipeline to invoke this will now extract two fields from the logfile, the first and the last.
cut -d' ' -f 1,10 access.log | bash summer.sh
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# summer.sh
#
# Description:
# Sum the total of field 2 values for each unique field 1
#
# Usage:
#   Input Format - <input field> <number>
#
declare -A cnt              # assoc. array
while read id count
do
    let cnt[$id]+=$count
done
for id in "${!cnt[@]}"
do
    printf "%-15s %8d\n" "${id}" "${cnt[${id}]}"
done
```

Note that we've made a few other changes to the output format. We've specified a field width of 15 characters for the first string (the IP address in our sample data), left-justified (via the minus sign), and 8 digits for the sum values. If the sum is larger, the full number is printed; if the string is longer, it is printed in full. We've done this to get the data to align, by and large, nicely in columns for readability.
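If you want to see exactly how those field widths behave, you can run the format string by hand with sample values; the second example shows a sum that overflows the 8-digit field and is simply printed in full:

```
$ printf "%-15s %8d\n" "192.168.0.37" 2575030
192.168.0.37     2575030
$ printf "%-15s %8d\n" "10.100.200.250" 123456789
10.100.200.250  123456789
```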
You can run summer.sh against the example access.log file to get an idea of the total amount of data requested by each host. To do this use cut to extract the IP address and bytes transferred fields, and then pipe the output into summer.sh.
```
$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn
192.168.0.36     4371198
192.168.0.14     2876088
192.168.0.37     2575030
192.168.0.11     2537662
192.168.0.26      665693
```
These results can be useful in identifying hosts that have transferred unusually large amounts of data compared to other hosts. A spike could indicate data theft and exfiltration. If you identify such a host the next step would be to review the specific pages and files accessed by the suspicious host to try and classify it as malicious or benign.
You can take counting one step further by providing a more visual display of the results. You can take the output from countem.sh or summer.sh and pipe it into yet another script, one that will produce a histogram-like display of the results.
The script to do the printing will take the first field as the index to an associative array and the second field as the value for that array element. It will then iterate through the array and print a number of hash marks (#) to represent the count, scaled to 50 # symbols for the largest count in the list.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram.sh
#
# Description:
# Generate a horizontal bar chart of specified data
#
# Usage:
# Data input format - <label> <value>
#
function pr_bar ()
{
    local -i i raw maxraw scaled
    raw=$1
    maxraw=$2
    ((scaled=(MAXBAR*raw)/maxraw))
    # min size guarantee
    ((raw > 0 && scaled == 0)) && scaled=1

    for ((i=0; i<scaled; i++)) ; do printf '#' ; done
    printf '\n'

} # pr_bar

#
# "main"
#
declare -A RA
declare -i MAXBAR max

max=0
MAXBAR=50    # how large the largest bar should be

while read labl val
do
    let RA[$labl]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
done

# scale and print it
for labl in "${!RA[@]}"
do
    printf '%-20.20s ' "$labl"
    pr_bar ${RA[$labl]} $max
done
```

We define a function to draw a single bar of the histogram.
This definition must be encountered before a call to the function can be made, so it makes sense to put function definitions at the front of our script.
We will be reusing this function in a future script so we could have put it in a separate file and included it here with a source command - but we didn’t.
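Had we gone the separate-file route, it would look something like this (bar_funcs.sh is a hypothetical filename):

```bash
# pull in the pr_bar function definition from a separate (hypothetical) file
source ./bar_funcs.sh
```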

We declare all these variables as local because we don’t want them to interfere with variable names in the rest of this script (or any others, if we copy/paste this script to use elsewhere).
We declare all these variables as integer (that’s the -i option) because we are only going to compute values with them and not use them as strings.

The computation is done inside double-parentheses and inside those we don’t need to use the $ to indicate “the value of” each variable name.

This is an “if-less” if statement. If the expression inside the double-parentheses is true then, and only then, is the second expression (the assignment) executed. This will guarantee that scaled is never zero when the raw value is non-zero. Why? Because we’d like something to show up in that case.
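The same construct is handy anywhere a one-line conditional reads more cleanly than a full if/then/fi; for example, a hypothetical default-setting line:

```bash
# use a default bar width if the caller didn't set one
(( MAXBAR == 0 )) && MAXBAR=50
```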

The main part of the script begins with a declaration of the RA array as an associative array.

Here we reference the associative array using the label, a string, as its index.

Since the array isn't indexed by numbers, we can't just count integers and use them as indices. This construct gives all the various strings that were used as an index to the array, one at a time, in the for loop.

We use the label as an index one more time to get the count and pass it as the first parameter to our pr_bar function.
Note that the items don’t appear in the same order as the input. That’s because the hashing algorithm for the key (the index) doesn’t preserve ordering. You could take this output and pipe it into yet another sort, or you could take a slightly different approach.
Here’s a version of the histogram script that preserves order - by not using an associative array. This might also be useful on older versions of bash (pre 4.0), prior to the introduction of associative arrays. Only the “main” part of the script is shown as the function pr_bar remains the same.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# histogram_plain.sh
#
# Description:
# Generate a horizontal bar chart of specified data without
# using associative arrays, good for older versions of bash
#
# Usage:
# Data input format - <label> <value>
#
declare -a RA_key RA_value
declare -i max ndx

max=0
MAXBAR=50    # how large the largest bar should be

ndx=0
while read labl val
do
    RA_key[$ndx]=$labl
    RA_value[$ndx]=$val
    # keep the largest value; for scaling
    (( val > max )) && max=$val
    let ndx++
done

# scale and print it
for (( j=0; j<ndx; j++ ))
do
    printf "%-20.20s " ${RA_key[$j]}
    pr_bar ${RA_value[$j]} $max
done
```
This version of the script avoids the use of associative arrays - in case you are running an older version of bash (prior to 4.x), such as on MacOS systems. For this version we use two separate arrays, one for the index value and one for the counts. Since they are normal arrays we have to use an integer index and so we will keep a simple count in the variable ndx.

Here the variable names are declared as arrays. The lower-case a says that they are arrays, but not of the “associative” variety. While not strictly necessary, it is good practice.

The key and value pairs are stored in separate arrays, but at the same index location. This approach is “brittle” - that is, easily broken, if changes to the script ever got the two arrays out of sync.

Now the for loop, unlike the previous script, is a simple counting of an integer from 0 to ndx. The variable j is used here so as not to interfere with the index in the for loop inside pr_bar, although we were careful enough inside the function to declare its version of i as local to the function. Do you trust it? Change the j to an i here and see if it still works (it does). Then try removing the local declaration and see if it fails (it does).
This approach with the two arrays does have one advantage. By using a numerical index for storing the label and the data, you can retrieve them in the order they were read in, simply by iterating through the numerical index.
You can now visually see the hosts that transferred the largest number of bytes by extracting the appropriate fields from access.log, piping the results into summer.sh and then into histogram.sh.
```
$ cut -d' ' -f1,10 access.log | bash summer.sh | bash histogram.sh
192.168.0.36         ##################################################
192.168.0.37         #############################
192.168.0.11         #############################
192.168.0.14         ################################
192.168.0.26         #######
```
While this might not seem that useful for the small amount of sample data, being able to visualize trends is invaluable when looking across larger datasets.
In addition to looking at the number of bytes transferred by IP address or host, it is often interesting to look at the data by date and time. To do that you can use the summer.sh script, but due to the format of the access.log file you need to do a little more processing before you can pipe it into the script. If you use cut to extract the date/time and bytes transferred fields you are left with data that causes some problems for the script.
```
$ cut -d' ' -f4,10 access.log
[12/Nov/2017:15:52:59 2377
[12/Nov/2017:15:52:59 4529
[12/Nov/2017:15:52:59 1112
```
As seen in the output above, the raw data starts with a [ character. That causes a problem for the script because the bracket is part of bash's array subscript syntax, so it breaks the arithmetic when it shows up inside the index. To remedy that you can use an additional invocation of the cut command to remove the character using -c2- as an option. This option tells cut to extract the data by character, starting at position 2 and going to the end of the line (-). The corrected output with the square bracket removed can be seen below.
```
$ cut -d' ' -f4,10 access.log | cut -c2-
12/Nov/2017:15:52:59 2377
12/Nov/2017:15:52:59 4529
12/Nov/2017:15:52:59 1112
```
Alternatively, you can use tr in place of the second cut. The -d option will delete the character specified, in this case the square bracket.
cut -d' ' -f4,10 access.log | tr -d '['
You also need to determine how you want to group the time-bound data; by day, month, year, hour, etc. You can do this by simply modifying the option for the second cut iteration. The table below illustrates the cut option to use to extract various forms of the date/time field. Note that these cut options are specific to Apache log files.
| Date/Time Extracted | Example Output | Cut Option |
|---|---|---|
| Entire date/time | 12/Nov/2017:19:26:09 | -c2- |
| Month, day, and year | 12/Nov/2017 | -c2-12 |
| Month and year | Nov/2017 | -c5-12,22- |
| Full time | 19:26:04 | -c14- |
| Hour | 19 | -c14-15,22- |
| Year | 2017 | -c9-12,22- |
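For example, using the Month and year row from the table above, you could total the bytes transferred per month:

```bash
cut -d' ' -f4,10 access.log | cut -c5-12,22- | bash summer.sh
```

The same pattern works for any of the other rows; only the character positions passed to the second cut change.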
The histogram.sh script can be particularly useful when looking at time-based data. For example, if your organization has an internal web server that is only accessed during working hours of 9:00 AM to 5:00 PM, you can review the server log file on a daily basis using the histogram view and see if there are any spikes in activity outside of normal working hours. Large spikes of activity or data transfer outside of normal working hours could indicate exfiltration by a malicious actor. If any anomalies are detected you can filter the data by that particular date and time and review the page accesses to determine if the activity is malicious.
For example, if you want to see a histogram of the total amount of data that was retrieved on a certain day and on an hourly basis you can do the following:
$ awk '$4 ~ "12/Nov/2017" {print $0}' access.log | cut -d' ' -f4,10 |
cut -c14-15,22- | bash summer.sh | bash histogram.sh
17 ##
16 ###########
15 ############
19 ##
18 ##################################################
Here the access.log file is sent through awk to extract the entries from a particular date. Note the use of the pattern-match operator (~) instead of ==, since field 4 also contains time information. Those entries are piped into cut to extract the date/time and bytes-transferred fields, and then piped into cut again to extract just the hour. From there the data is summed by hour using summer.sh and converted into a histogram using histogram.sh. The result is a histogram that displays the total number of bytes transferred each hour on November 12, 2017.
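To see why the match operator matters here, compare it with a strict equality test on field 4; because that field still begins with the bracket and includes the time, == never matches:

```bash
# field 4 looks like: [12/Nov/2017:15:52:59
awk '$4 == "12/Nov/2017" {print $0}' access.log   # matches nothing
awk '$4 ~  "12/Nov/2017" {print $0}' access.log   # matches every entry from that day
```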
Previously IP address 192.168.0.37 was identified as the system that had the largest number of page requests. The next logical question is what pages did this system request? With that answer you can start to gain an understanding of what the system was doing on the server and categorize the activity as benign, suspicious, or malicious. To accomplish that you can use awk and cut and pipe the output into countem.sh.
$ awk '$1 == "192.168.0.37" {print $0}' access.log | cut -d' ' -f7 |
bash countem.sh | sort -rn | head -5
14 /files/theme/plugin49c2.js?1490908488
14 /files/theme/mobile49c2.js?1490908488
14 /files/theme/custom49c2.js?1490908488
14 /files/main_styleaf0e.css?1509483497
3 /consulting.html
While this can be accomplished by piping together commands and scripts, it requires multiple passes through the data. That may work for many datasets, but it is too inefficient for extremely large ones. You can streamline this by writing a bash script specifically designed to extract and count page accesses, which requires only a single pass over the data.
```bash
# Rapid Cybersecurity Ops
# pagereq.sh
#
# Description:
# Count the number of page requests for a given IP address using bash
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#
declare -A cnt
while read addr d1 d2 datim gmtoff getr page therest
do
    if [[ $1 == $addr ]] ; then let cnt[$page]+=1 ; fi
done
for id in ${!cnt[@]}
do
    printf "%8d %s\n" ${cnt[$id]} $id
done
```

We declare cnt as an associative array (also known as a hash table or dictionary) so that we can use a string as the index to the array. In this program we will be using the page address (the URL) as the index.

The ${!cnt[@]} results in a list of all the different index values that have been encountered. Note, however, that they are not listed in any useful order.
Early versions of bash don’t have associative arrays. You can use awk to do the same thing - counting the various page requests from a particular ip address - since awk has associative arrays.
```bash
# Rapid Cybersecurity Ops
# pagereq.awk
#
# Description:
# Count the number of page requests for a given IP address using awk
#
# Usage:
# pagereq <ip address> < inputfile
#   <ip address> IP address to search for
#
# count the number of page requests from an address ($1)
awk -v page="$1" '{ if ($1==page) {cnt[$7]+=1 } }
END { for (id in cnt) {
          printf "%8d %s\n", cnt[id], id
      }
}'
```

There are two very different $1 variables on this line.
The first $1 is a shell variable and refers to the first argument supplied to this script when it is invoked.
The second $1 is an awk variable. It refers to the first field of the input on each line.
The first $1 has been assigned to the awk variable page so that it can be compared to each $1 of awk - that is, to each first field of the input data.

This simple syntax results in the variable id iterating over the index values of the cnt array. It is much simpler syntax than the shell's "${!cnt[@]}" syntax, but with the same effect.
You can run pagereq.sh by providing the IP address you would like to search for and redirect access.log as input.
```
$ bash pagereq.sh 192.168.0.37 < access.log | sort -rn | head -5
      14 /files/theme/plugin49c2.js?1490908488
      14 /files/theme/mobile49c2.js?1490908488
      14 /files/theme/custom49c2.js?1490908488
      14 /files/main_styleaf0e.css?1509483497
       3 /consulting.html
```
On the web a User Agent String is a small piece of textual information sent by a browser to a web server that identifies the client’s operating system, browser type, version, and other information. It is typically used by web servers to ensure page compatibility with the user’s browser. Here is an example of a user agent string:
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0
This user agent string identifies the system as: Windows NT version 6.3 (aka Windows 8.1); 64-bit architecture; and using the Firefox browser.
The user agent string is interesting for a few reasons: first, because of the significant amount of information it conveys, which can be used to identify the types of systems and browsers accessing the server; second, because it is configurable by the end user, which can be used to identify systems that may not be using a standard browser or may not be using a browser at all (e.g., a web crawler).
You can identify unusual user agents by first compiling a list of known good user agents. For the purposes of this exercise we will use a very small list, stored in the file useragents.txt, that is not specific to particular browser versions.
```
Firefox
Chrome
Safari
Edge
```
For a list of common user agent strings visit https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
You can then read in a web server log and compare each line to each valid user agent until you get a match. If no match is found it should be considered an anomaly and printed to standard out along with the IP address of the system making the request. This provides yet another vantage point into the data, identifying systems with unusual user agents, and another path to further explore.
```bash
#!/bin/bash -
#
# Rapid Cybersecurity Ops
# useragents.sh
#
# Description:
# Read through a log looking for unknown user agents
#
# Usage:
# useragents.sh < <inputfile>
#   <inputfile> Apache access log
#
# mismatch - search through the array of known names
# returns 1 (false) if it finds a match
# returns 0 (true) if there is no match
function mismatch ()
{
    local -i i
    for (( i=0; i<$KNSIZE; i++ ))
    do
        [[ "$1" =~ .*${KNOWN[$i]}.* ]] && return 1
    done
    return 0
}

# read up the known ones
readarray -t KNOWN < "useragents.txt"
KNSIZE=${#KNOWN[@]}

# preprocess logfile (stdin) to pick out ipaddr and user agent
awk -F'"' '{print $1, $6}' | \
while read ipaddr dash1 dash2 dtstamp delta useragent
do
    if mismatch "$useragent"
    then
        echo "anomaly: $ipaddr $useragent"
    fi
done
```

We will use a function for the core of this script.
It will return a success (or “true”) if it finds a mismatch, that is, if it finds no match against the list of known user agents.
This logic may seem a bit inverted, but it makes the if statement containing the call to mismatch read clearly.

Declaring our for loop index as a local variable is good practice.
It’s not strictly necessary in this script but is a good habit.

There are two strings to compare - the input from the logfile and a line from the list of known user agents.
To make for a very flexible comparison we use the regex comparison operator (the =~).
The .* (meaning “zero or more instances of any character”) placed on either side of the $KNOWN array reference means that the known string can appear anywhere within the other string for a match.
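A standalone test shows the comparison in action (the user agent value here is just a sample string):

```
$ agent="Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0"
$ known="Firefox"
$ [[ "$agent" =~ .*${known}.* ]] && echo "match"
match
```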

Each line of the file is added as an element to the array name specified.
This gives us an array of known user agents.
There are two identical ways to do this in bash: readarray, as used here, or mapfile. The -t option removes the trailing newline from each line read.
The file containing the list of known user agents is specified here; modify as needed.

This computes the size of the array.
It is used inside the mismatch function to loop through the array.
We calculate it here, once, outside our loop to avoid recomputing it every time the function is called.
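A quick check of that syntax at the command line:

```
$ KNOWN=( "Firefox" "Chrome" "Safari" "Edge" )
$ echo "${#KNOWN[@]}"
4
```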

The input string is a complex mix of words and quote marks.
To capture the user agent string we use the double-quote as the field separator.
Doing that, however, means that our first field contains more than just the ip address.
By using the bash read we can parse on the spaces to get the ip address.
The last argument of the read takes all the remaining words and so it can capture all the several words of the user agent string.
In this chapter we looked at techniques to analyze the content of log files by identifying unusual and anomalous activity. This type of analysis can provide you with insights into what occurred in the past. In the next chapter we will look at how to analyze log files and other data to provide insights into what is happening in the system in real time.
The example use of summer.sh used cut to print the 1st and 10th fields of the access.log file, like this:
$ cut -d' ' -f1,10 access.log | bash summer.sh | sort -k 2.1 -rn
Replace the cut command by using the awk command. Do you get the same results? What might be different about those two approaches?
Expand the histogram.sh script to include the count at the end of each histogram bar. Here is sample output:
```
192.168.0.37         ############################# 2575030
192.168.0.26         ####### 665693
```
Expand the histogram.sh script to allow the user to supply the option -s that specifies the maximum bar size. For example histogram.sh -s 25 would limit the maximum bar size to 25 # characters. The default should remain at 50 if no option is given.
Download the following web log file TODO: Add Log File URL.
Which IP address made the most requests?
Which page was accessed the most times?
Download the following Domain Name System (DNS) server log TODO: Add Log File URL
What was the most requested domain?
Which day had the most requests?
Modify the useragents.sh script to add some parameters
Add code for an optional first parameter that specifies the filename of the known user agents list. If not specified, default to useragents.txt as the script currently uses.
Add code for a -f option to take an argument. The argument is the filename of the logfile to read rather than reading from stdin.
Modify the pagereq.sh script to not need an associative array but to work with a traditional array that uses a numerical index. Convert the ip address into a 10-12 digit number for that use. Caution: don’t have leading zeros on the number or the shell will attempt to interpret it as an octal number. Example: convert “10.124.16.3” into “10124016003” which can be used as a numerical index.
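One possible way to handle the conversion step (a sketch only; the surrounding script changes are left as the exercise): split the address on the dots, then zero-pad every octet except the first so the result never starts with a zero.

```bash
# sketch: convert "10.124.16.3" into 10124016003
ip="10.124.16.3"
IFS=. read -r o1 o2 o3 o4 <<< "$ip"
printf -v ipnum '%d%03d%03d%03d' "$o1" "$o2" "$o3" "$o4"
echo "$ipnum"    # prints 10124016003
```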