TF-IDF (term frequency-inverse document frequency) measures how important a word is to a document relative to the rest of the corpus the word is drawn from.
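As a reference point, the classic formulation multiplies how often a term appears in a document by the logarithm of how rare it is across the corpus (scikit-learn's TfidfVectorizer applies a smoothed variant of this by default):

tf-idf(t, d) = tf(t, d) * log(N / df(t))

Here, tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents that contain t.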
We generate the TF-IDF features from the URLs using the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

url_vectorizer = TfidfVectorizer(tokenizer=url_cleanse)
x = url_vectorizer.fit_transform(inputurls)

# Hold out 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
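The preceding snippet assumes a custom tokenizer named url_cleanse and a labelled dataset (the inputurls list and its labels y) prepared earlier. The exact tokenizer is not shown here; a minimal sketch of one, splitting a URL on common delimiters and dropping very frequent pieces, could look like this:

import re

def url_cleanse(url):
    # Split a URL into word-like tokens on slashes, dots, dashes, and query characters
    tokens = re.split(r'[/\-.?=&_]', url.lower())
    # Drop empty tokens and pieces that appear in almost every URL
    return [t for t in tokens if t and t not in ('com', 'www', 'http:', 'https:')]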
We then fit a logistic regression model on the training split, as follows:
from sklearn.linear_model import LogisticRegression

l_regress = LogisticRegression()           # Logistic regression classifier
l_regress.fit(x_train, y_train)
l_score = l_regress.score(x_test, y_test)  # Accuracy on the held-out test set
print("score: {0:.2f} %".format(100 * l_score))
url_vectorizer_save = url_vectorizer       # Keep a reference to the fitted vectorizer
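Accuracy alone can be misleading if the good and bad classes are imbalanced. As an optional check that is not part of the original workflow, a classification report gives per-class precision and recall on the same test split:

from sklearn.metrics import classification_report

y_test_predict = l_regress.predict(x_test)
print(classification_report(y_test, y_test_predict))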
Finally, we save the model and the vectorizer to files so that we can reuse them later, as follows:
file = "model.pkl"
with open(file, 'wb') as f:
pickle.dump(l_regress, f)
f.close()
file2 = "vector.pkl"
with open(file2,'wb') as f2:
pickle.dump(vectorizer_save, f2)
f2.close()
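As a side note, scikit-learn also supports joblib for persisting fitted estimators, which can be more efficient for objects that carry large NumPy arrays. An equivalent alternative to the pickle calls above (assuming joblib is installed) would be:

from joblib import dump

dump(l_regress, "model.joblib")
dump(url_vectorizer_save, "vector.joblib")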
We now test the model we fitted in the preceding code to check whether it correctly predicts which URLs are good and which are bad, as shown in the following code:
# We load a bunch of URLs that we want to check are legit or not
urls = ['hackthebox.eu', 'facebook.com']

file1 = "model.pkl"
with open(file1, 'rb') as f1:
    l_regress = pickle.load(f1)

file2 = "vector.pkl"
with open(file2, 'rb') as f2:
    url_vectorizer = pickle.load(f2)

x = url_vectorizer.transform(urls)
y_predict = l_regress.predict(x)
print(urls)
print(y_predict)
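If this check is something you run often, it may be worth wrapping the load-and-predict steps in a small helper. The function below is only a convenience sketch built from the calls already shown above; the name classify_urls is not part of the original code:

def classify_urls(url_list, model_file="model.pkl", vector_file="vector.pkl"):
    # Load the pickled model and vectorizer, then predict a label per URL
    with open(model_file, 'rb') as f:
        model = pickle.load(f)
    with open(vector_file, 'rb') as f:
        vectorizer = pickle.load(f)
    features = vectorizer.transform(url_list)
    return dict(zip(url_list, model.predict(features)))

print(classify_urls(['hackthebox.eu', 'facebook.com']))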
However, there is a shortcoming in this approach: some URLs are already known to be good or bad, so there is no need to classify them again. Instead, we can maintain a whitelist and only run the model on the remaining URLs, as follows:
# We can use the whitelist to make the predictions
whitelisted_url = ['hackthebox.eu', 'root-me.org']
some_url = [i for i in urls if i not in whitelisted_url]

file1 = "model.pkl"
with open(file1, 'rb') as f1:
    l_regress = pickle.load(f1)

file2 = "vector.pkl"
with open(file2, 'rb') as f2:
    url_vectorizer = pickle.load(f2)

x = url_vectorizer.transform(some_url)
y_predict = l_regress.predict(x)

# Add the whitelisted sites back, labelled as good
for site in whitelisted_url:
    some_url.append(site)
print(some_url)

l_predict = list(y_predict)
for j in range(0, len(whitelisted_url)):
    l_predict.append('good')
print(l_predict)
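Note that this relies on the URL list and the prediction list staying in the same order when the whitelisted entries are appended. A slightly more robust variant, offered here only as an alternative to the code above, pairs each URL with its label explicitly:

# Pair each non-whitelisted URL with its prediction, then add the whitelist as 'good'
results = dict(zip([i for i in urls if i not in whitelisted_url], y_predict))
results.update({site: 'good' for site in whitelisted_url})
print(results)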