Hands-On Machine Learning for Cybersecurity

In the first step, we will write a machine learning system using image-processing techniques that will be able to read letters from images.

We import the relevant packages; cv2 is the OpenCV package, and imutils provides convenience helpers for it:

import os
import os.path
import cv2
import glob
import imutils

We read the CAPTCHA images from one folder and write the extracted letter images to another:

CAPTCHA_IMAGES_PATH = "input_captcha_images"
LETTER_IMAGES_PATH = "output_letter_images"

We list all of the CAPTCHA images that are present in the input folder and loop over all of the images:

captcha_image_files = glob.glob(os.path.join(CAPTCHA_IMAGES_PATH, "*"))
counts = {}
for (x, captcha_image_file) in enumerate(captcha_image_files):
    print("[INFO] processing image {}/{}".format(x + 1, len(captcha_image_files)))
    # The filename (without its extension) is the text shown in the CAPTCHA
    filename = os.path.basename(captcha_image_file)
    captcha_correct_text = os.path.splitext(filename)[0]
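
The label for each CAPTCHA comes from its filename. A minimal sketch of that step, assuming a hypothetical file named 2A2X9.png (the name is made up for illustration and does not come from the dataset):

```python
import os.path

# Hypothetical path; the real dataset's filenames are not shown in the text.
captcha_image_file = os.path.join("input_captcha_images", "2A2X9.png")

filename = os.path.basename(captcha_image_file)       # strips the folder
captcha_correct_text = os.path.splitext(filename)[0]  # strips the extension

print(captcha_correct_text)
```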

After loading the image, we convert it into grayscale and add extra padding to the image:

    text_image = cv2.imread(captcha_image_file)
    text_to_gray = cv2.cvtColor(text_image, cv2.COLOR_BGR2GRAY)
    # Pad 8 pixels on every side by repeating the edge pixels
    text_to_gray = cv2.copyMakeBorder(text_to_gray, 8, 8, 8, 8, cv2.BORDER_REPLICATE)
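
The cv2.BORDER_REPLICATE mode pads an image by repeating its outermost pixels. The following pure-Python sketch illustrates that idea on a nested list; replicate_border is a hypothetical helper written for this illustration, not OpenCV's implementation:

```python
def replicate_border(rows, pad):
    """Pad a 2-D list of pixel values by repeating its edge pixels."""
    # Extend every row to the left and right with its first and last pixel.
    widened = [[row[0]] * pad + list(row) + [row[-1]] * pad for row in rows]
    # Repeat the (already widened) first and last rows above and below.
    top = [list(widened[0]) for _ in range(pad)]
    bottom = [list(widened[-1]) for _ in range(pad)]
    return top + widened + bottom


tiny = [[1, 2],
        [3, 4]]
padded = replicate_border(tiny, 1)
for row in padded:
    print(row)
```

With a padding of 8 on each side, as in the snippet above, the image grows by 16 pixels in both width and height.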

The image is converted into pure black and white, and the contours of the image are also found:

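    # Otsu's method picks the threshold; THRESH_BINARY_INV inverts the result
    image_threshold = cv2.threshold(text_to_gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

    # Find the outer contours of the thresholded image
    image_contours = cv2.findContours(image_threshold.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)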
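
Otsu's method chooses the threshold that maximizes the variance between the two resulting pixel classes, and the inverse binary mode then maps pixels above the threshold to 0 and all others to 255. The sketch below illustrates both ideas in pure Python on a flat list of 8-bit values; the function names are hypothetical, and the exact threshold OpenCV picks in tie cases may differ:

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += hist[level]
        sum_low += level * hist[level]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels at or below this level yet
        if weight_high == 0:
            break     # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binary_inverse(pixels, threshold):
    """Pixels above the threshold become 0; all others become 255."""
    return [0 if value > threshold else 255 for value in pixels]


# Dark pixels (around 10) mixed with light pixels (around 200).
pixels = [10, 12, 11, 200, 205, 198, 9, 201]
threshold = otsu_threshold(pixels)
binary = binary_inverse(pixels, threshold)
print(binary)
```

After this step the dark pixels are non-zero, which is what contour detection treats as foreground.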

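The tuple returned by cv2.findContours differs between OpenCV versions, so we extract the contour list in a version-independent way:

    # imutils.grab_contours handles the OpenCV 2.4, 3, and 4 return formats
    image_contours = imutils.grab_contours(image_contours)
    letterImage_regions = []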
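
cv2.findContours returns a two-element tuple (contours, hierarchy) in OpenCV 2.4 and OpenCV 4, and a three-element tuple (image, contours, hierarchy) in OpenCV 3. A minimal sketch of selecting the contour list by tuple length; pick_contours is a hypothetical helper for illustration, not part of OpenCV or imutils:

```python
def pick_contours(find_contours_result):
    """Return the contour list from a cv2.findContours-style result tuple."""
    if len(find_contours_result) == 2:
        # (contours, hierarchy): OpenCV 2.4 and OpenCV 4
        return find_contours_result[0]
    if len(find_contours_result) == 3:
        # (image, contours, hierarchy): OpenCV 3
        return find_contours_result[1]
    raise ValueError("unexpected findContours result length")


# Stand-in tuples; real results hold NumPy arrays rather than strings.
two_part = pick_contours((["c1", "c2"], "hierarchy"))
three_part = pick_contours(("image", ["c1", "c2"], "hierarchy"))
print(two_part, three_part)
```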

We loop over the contours and compute the bounding rectangle of each one:

    for contour in image_contours:
        # Top-left corner, width, and height of the box around the contour
        (x_axis, y_axis, wid, hig) = cv2.boundingRect(contour)
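
A bounding rectangle is the smallest upright box that contains every point of a contour. The sketch below computes one in pure Python, assuming integer pixel coordinates and a box that includes both its first and last pixel; bounding_rect is a hypothetical stand-in for illustration, not the OpenCV function:

```python
def bounding_rect(points):
    """Return (x, y, width, height) of the upright box around (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    # +1 because both the first and last pixel column/row are inside the box.
    return (x_min, y_min, max(xs) - x_min + 1, max(ys) - y_min + 1)


contour_points = [(12, 30), (20, 28), (25, 40), (14, 45)]
rect = bounding_rect(contour_points)
print(rect)
```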

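We compare the width and height of each rectangle. A region that is much wider than it is tall most likely contains two letters joined together, so we split it into two halves:

        if wid / hig > 1.25:
            # Too wide for one letter: split into left and right halves
            half_width = int(wid / 2)
            letterImage_regions.append((x_axis, y_axis, half_width, hig))
            letterImage_regions.append((x_axis + half_width, y_axis, half_width, hig))
        else:
            letterImage_regions.append((x_axis, y_axis, wid, hig))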
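
The width-to-height rule above can be tried out on plain tuples. A small sketch, with split_wide_region as a hypothetical helper name and the 1.25 ratio taken from the snippet above:

```python
def split_wide_region(x, y, w, h, max_ratio=1.25):
    """Split a box into two halves when it is too wide for a single letter."""
    if w / h > max_ratio:
        half_width = int(w / 2)
        return [(x, y, half_width, h), (x + half_width, y, half_width, h)]
    return [(x, y, w, h)]


wide = split_wide_region(10, 5, 40, 20)    # ratio 2.0, gets split
narrow = split_wide_region(60, 5, 18, 20)  # ratio 0.9, kept whole
print(wide)
print(narrow)
```

Note that for an odd width the two halves together are one pixel narrower than the original box, because int() rounds the half width down.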

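If we detect more or fewer than five characters in the image, we skip it, because the letter extraction did not work for that CAPTCHA. Otherwise, we sort the regions from left to right so that they line up with the letters in the filename:

    if len(letterImage_regions) != 5:
        continue

    # Sort by x coordinate so the regions match the left-to-right text
    letterImage_regions = sorted(letterImage_regions, key=lambda region: region[0])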
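
Contours are not guaranteed to arrive in reading order, which is why the regions are sorted by their x coordinate before being matched to the text. A short sketch of the count check and the sort, using made-up region tuples and a hypothetical helper name:

```python
def order_regions(regions, expected_count=5):
    """Return regions sorted left to right, or None if the count is wrong."""
    if len(regions) != expected_count:
        return None
    return sorted(regions, key=lambda region: region[0])


# Made-up (x, y, width, height) tuples in arbitrary order.
regions = [(70, 8, 18, 24), (10, 8, 20, 24), (50, 9, 17, 23),
           (30, 8, 19, 24), (90, 7, 20, 25)]
ordered = order_regions(regions)
print([region[0] for region in ordered])
print(order_regions(regions[:4]))
```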

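We pair each bounding box with the corresponding letter from the filename and crop the letter out of the grayscale image with a 2-pixel margin:

    for letterboundingbox, letter_in_text in zip(letterImage_regions, captcha_correct_text):
        x_axis, y_axis, wid, hig = letterboundingbox

        # Rows are indexed first (y), then columns (x)
        letter_in_image = text_to_gray[y_axis - 2:y_axis + hig + 2, x_axis - 2:x_axis + wid + 2]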
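
One caveat with a fixed margin is that a slice start such as y_axis - 2 becomes negative when a box lies within two pixels of the image edge, and a negative slice start in Python counts from the end. The following is a sketch of a defensive variant that clamps the margin to the image bounds; it uses a nested list instead of a NumPy array, and crop_with_margin is a hypothetical helper:

```python
def crop_with_margin(image, x, y, w, h, margin=2):
    """Crop rows/columns around a box, keeping the margin inside the image."""
    top = max(y - margin, 0)
    left = max(x - margin, 0)
    bottom = min(y + h + margin, len(image))
    right = min(x + w + margin, len(image[0]))
    return [row[left:right] for row in image[top:bottom]]


# A 6x6 "image" whose pixel value encodes its position as row * 10 + column.
image = [[r * 10 + c for c in range(6)] for r in range(6)]
patch = crop_with_margin(image, x=1, y=1, w=2, h=2)
print(len(patch), len(patch[0]))

# The unclamped slice starts at -1 and therefore selects nothing here.
print(image[1 - 2:1 + 2 + 2])
```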

Finally, we save each letter image in the folder for its letter, using a running counter as the filename:

        save_p = os.path.join(LETTER_IMAGES_PATH, letter_in_text)

        # Create the per-letter folder on first use
        if not os.path.exists(save_p):
            os.makedirs(save_p)

        # Zero-padded counter, for example 000001.png
        c = counts.get(letter_in_text, 1)
        p = os.path.join(save_p, "{}.png".format(str(c).zfill(6)))
        cv2.imwrite(p, letter_in_image)
        counts[letter_in_text] = c + 1
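
The naming scheme can be checked without writing any files. A small sketch that only builds the paths; next_letter_path is a hypothetical helper, and the folder name matches LETTER_IMAGES_PATH from earlier:

```python
import os.path


def next_letter_path(counts, letter, root="output_letter_images"):
    """Return the next numbered PNG path for a letter and bump its counter."""
    count = counts.get(letter, 1)
    counts[letter] = count + 1
    return os.path.join(root, letter, "{}.png".format(str(count).zfill(6)))


counts = {}
paths = [next_letter_path(counts, letter) for letter in "ABA"]
for path in paths:
    print(path)
```

Each letter keeps its own counter, so the second "A" is stored as 000002.png in the A folder while "B" starts again at 000001.png.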
