Lexical features are derived by analyzing the lexical units of sentences; lexical semantics are composed of full words or semi-formed words. We will analyze the lexical features of each URL and extract them from the URLs that are available. Specifically, we will extract the different URL components, such as the address, which comprises the hostname, the path, and so on.
We start by importing the required packages, as shown in the following code:
from urlparse import urlparse   # note: urlparse is the Python 2 module name
import re                       # regular expressions for tokenizing URLs
import urllib2
import urllib
from xml.dom import minidom
import csv
import pygeoip
Once these packages have been imported, we tokenize the URLs. Tokenizing is the process of chopping a URL into several pieces; a token is one of the pieces that results from this breakdown, and taken together the tokens are used for semantic processing. Let's look at an example of tokenization using the phrase The quick brown fox jumps over the lazy dog, which produces the following tokens:
Tokens are:
The
quick
brown
fox
jumps
over
the
lazy
dog
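As a minimal sketch of how these tokens can be produced (assuming the same re.split pattern we apply to URLs below), consider the following:

import re

# Tokenize the example phrase with the same \W+ pattern used for URLs.
phrase = 'The quick brown fox jumps over the lazy dog'
print 'Tokens are:'
for token in re.split(r'\W+', phrase):
    print token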
Before we go ahead and start tokenizing URLs, we need to check whether they are IP addresses by using the following code:
def get_IPaddress(tokenized_words):
    # A dotted-quad IP address tokenizes into four consecutive numeric
    # tokens, so we count runs of numeric tokens.
    count = 0
    for element in tokenized_words:
        if unicode(element).isnumeric():
            count = count + 1
        else:
            if count >= 4:
                return 1
            else:
                count = 0
    if count >= 4:
        return 1
    return 0
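As a quick sanity check (the URLs here are hypothetical examples, not from any dataset), a dotted-quad host yields 1 and a named host yields 0:

# Hypothetical usage: a numeric host produces four consecutive numeric
# tokens, so the function returns 1; a named host returns 0.
print get_IPaddress(re.split(r'\W+', 'http://192.168.2.1/login'))    # 1
print get_IPaddress(re.split(r'\W+', 'http://example.com/login'))    # 0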
We then move on to tokenizing the URLs:
def url_tokenize(url):
    # Split the URL on non-word characters and collect length statistics.
    tokenized_word = re.split(r'\W+', url)
    num_element = 0
    sum_of_element = 0
    largest = 0
    for element in tokenized_word:
        l = len(element)
        sum_of_element += l
        # Exclude empty elements when computing the average length:
        if l > 0:
            num_element += 1
        if largest < l:
            largest = l
    try:
        # Return [average token length, token count, longest token length].
        return [float(sum_of_element) / num_element, num_element, largest]
    except ZeroDivisionError:
        return [0, num_element, largest]
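As an illustrative call (the URL is made up for this example), the function returns the average token length, the token count, and the length of the longest token:

# Hypothetical usage; tokens are http, example, com, confirm, login.
print url_tokenize('http://example.com/confirm/login')
# [5.2, 5, 7]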
Phishing URLs that lure people to malicious sites are usually longer than benign URLs, and their tokens are typically separated by dots. Drawing on several previous analyses of malicious emails, we search for these patterns in the tokens.
To search for these patterns in the tokens, we will go through the following steps (a short usage sketch follows the list):
- We look for .exe files in the URL tokens. If the URL contains a pointer to an .exe file, we flag it, as shown in the following code:
def url_has_exe(url):
    # Flag URLs that reference an .exe file.
    if url.find('.exe') != -1:
        return 1
    else:
        return 0
- We then look for common words that are associated with phishing. We count the presence of words such as 'confirm', 'account', 'banking', 'secure', 'rolex', 'login', 'signin', as shown in the following code:
def get_sec_sensitive_words(tokenized_words):
    # Words commonly associated with phishing pages.
    sec_sen_words = ['confirm', 'account', 'banking', 'secure', 'rolex', 'login', 'signin']
    count = 0
    for element in sec_sen_words:
        # Count each sensitive word that appears among the URL tokens.
        if element in tokenized_words:
            count = count + 1
    return count
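Putting the two checks together on a made-up phishing-style URL (both the URL and the expected outputs are illustrative assumptions):

# Hypothetical usage of the two pattern checks above.
url = 'http://secure-banking.example.com/login/update.exe'
tokens = re.split(r'\W+', url)
print url_has_exe(url)                   # 1 -- the URL references an .exe file
print get_sec_sensitive_words(tokens)    # 3 -- 'secure', 'banking' and 'login'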