Table of Contents for
Python Web Penetration Testing Cookbook


Python Web Penetration Testing Cookbook by Dave Mound, published by Packt Publishing, 2015
  1. Cover
  2. Table of Contents
  3. Python Web Penetration Testing Cookbook
  4. Python Web Penetration Testing Cookbook
  5. Credits
  6. About the Authors
  7. About the Reviewers
  8. www.PacktPub.com
  9. Disclaimer
  10. Preface
  11. What you need for this book
  12. Who this book is for
  13. Sections
  14. Conventions
  15. Reader feedback
  16. Customer support
  17. 1. Gathering Open Source Intelligence
  18. Gathering information using the Shodan API
  19. Scripting a Google+ API search
  20. Downloading profile pictures using the Google+ API
  21. Harvesting additional results from the Google+ API using pagination
  22. Getting screenshots of websites with QtWebKit
  23. Screenshots based on a port list
  24. Spidering websites
  25. 2. Enumeration
  26. Performing a ping sweep with Scapy
  27. Scanning with Scapy
  28. Checking username validity
  29. Brute forcing usernames
  30. Enumerating files
  31. Brute forcing passwords
  32. Generating e-mail addresses from names
  33. Finding e-mail addresses from web pages
  34. Finding comments in source code
  35. 3. Vulnerability Identification
  36. Automated URL-based Directory Traversal
  37. Automated URL-based Cross-site scripting
  38. Automated parameter-based Cross-site scripting
  39. Automated fuzzing
  40. jQuery checking
  41. Header-based Cross-site scripting
  42. Shellshock checking
  43. 4. SQL Injection
  44. Checking jitter
  45. Identifying URL-based SQLi
  46. Exploiting Boolean SQLi
  47. Exploiting Blind SQL Injection
  48. Encoding payloads
  49. 5. Web Header Manipulation
  50. Testing HTTP methods
  51. Fingerprinting servers through HTTP headers
  52. Testing for insecure headers
  53. Brute forcing login through the Authorization header
  54. Testing for clickjacking vulnerabilities
  55. Identifying alternative sites by spoofing user agents
  56. Testing for insecure cookie flags
  57. Session fixation through a cookie injection
  58. 6. Image Analysis and Manipulation
  59. Hiding a message using LSB steganography
  60. Extracting messages hidden in LSB
  61. Hiding text in images
  62. Extracting text from images
  63. Enabling command and control using steganography
  64. 7. Encryption and Encoding
  65. Generating an MD5 hash
  66. Generating an SHA 1/128/256 hash
  67. Implementing SHA and MD5 hashes together
  68. Implementing SHA in a real-world scenario
  69. Generating a Bcrypt hash
  70. Cracking an MD5 hash
  71. Encoding with Base64
  72. Encoding with ROT13
  73. Cracking a substitution cipher
  74. Cracking the Atbash cipher
  75. Attacking one-time pad reuse
  76. Predicting a linear congruential generator
  77. Identifying hashes
  78. 8. Payloads and Shells
  79. Extracting data through HTTP requests
  80. Creating an HTTP C2
  81. Creating an FTP C2
  82. Creating a Twitter C2
  83. Creating a simple Netcat shell
  84. 9. Reporting
  85. Converting Nmap XML to CSV
  86. Extracting links from a URL to Maltego
  87. Extracting e-mails to Maltego
  88. Parsing Sslscan into CSV
  89. Generating graphs using plot.ly
  90. Index

Spidering websites

Many tools provide the ability to map out websites, but you are often limited in the style of output or the location in which the results are provided. This baseline spidering script allows you to map out websites in short order, with the ability to alter it as you please.

Getting ready

In order for this script to work, you'll need the BeautifulSoup library. You can install it with apt-get install python-bs4 or, alternatively, with pip install beautifulsoup4. It's as easy as that.
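
If you want to double-check that the library is in place before running the recipe, a quick import test does the job (a minimal sketch; it only imports bs4 and prints its version):

# quick sanity check that BeautifulSoup 4 is installed
try:
    import bs4
    print "bs4 version:", bs4.__version__
except ImportError:
    print "bs4 is missing - install it with pip install beautifulsoup4"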

How to do it…

This is the script that we will be using:

import urllib2
from bs4 import BeautifulSoup
import sys

# Lists for links found on the target page (urls) and on each of those pages (urls2)
urls = []
urls2 = []

# The target URL is taken from the command line
tarurl = sys.argv[1]

# First pass: fetch the target page and collect every link on it
url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
    newline = line.get('href')
    try:
        if newline[:4] == "http":
            if tarurl in newline:
                urls.append(str(newline))
        elif newline[:1] == "/":
            combline = tarurl + newline
            urls.append(str(combline))
    except:
        pass

# Second pass: fetch each link found in the first pass and collect its links
for uurl in urls:
    url = urllib2.urlopen(uurl).read()
    soup = BeautifulSoup(url)
    for line in soup.find_all('a'):
        newline = line.get('href')
        try:
            if newline[:4] == "http":
                if tarurl in newline:
                    urls2.append(str(newline))
            elif newline[:1] == "/":
                combline = tarurl + newline
                urls2.append(str(combline))
        except:
            pass

# Turn the results into a set to remove duplicates, then print each URL
urls3 = set(urls2)
for value in urls3:
    print value

How it works…

We first import the necessary libraries and create two empty lists called urls and urls2. These allow us to run through the spidering process twice. Next, we take the target URL as an argument appended to the script on the command line. It is run like this:

$ python spider.py http://www.packtpub.com

We then open the provided URL and pass the response to BeautifulSoup:

url = urllib2.urlopen(tarurl).read() 
soup = BeautifulSoup(url) 

BeautifulSoup parses the content and allows us to pull out only the parts that we want:

for line in soup.find_all('a'):
    newline = line.get('href')

We then pull all of the content that is marked with an a tag in HTML and grab the href attribute within each tag. This allows us to grab all the URLs listed in the page.
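
To illustrate what this step gives us, the following standalone snippet (a minimal sketch using made-up HTML, not part of the recipe) shows find_all('a') and get('href') in action:

from bs4 import BeautifulSoup

# Made-up page fragment, purely for illustration
html = '<a href="/about">About</a><a href="http://example.com/blog">Blog</a><a>no href</a>'
soup = BeautifulSoup(html)

for line in soup.find_all('a'):
    # get('href') returns the attribute value, or None when the tag has no href
    print line.get('href')

# Prints: /about, http://example.com/blog, None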

The next section handles relative and absolute links. If a link is relative, it starts with a slash to indicate that the page is hosted locally on the web server. If a link is absolute, it contains the full address including the domain. With the following code, we ensure that we can, as external users, open all the links we find and list them as absolute links:

if newline[:4] == "http":
    if tarurl in newline:
        urls.append(str(newline))
elif newline[:1] == "/":
    combline = tarurl + newline
    urls.append(str(combline))
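
If you want more robust handling of relative links (for example, hrefs such as page.html or ../index.html that do not start with a slash), the standard library's urlparse module can be swapped in. This is an alternative sketch, not the recipe's original approach, and the helper name make_absolute is just for illustration:

from urlparse import urljoin, urlparse

def make_absolute(tarurl, newline):
    # Resolve any relative href against the target URL
    absolute = urljoin(tarurl, newline)
    # Only keep links that stay on the target's domain
    if urlparse(absolute).netloc == urlparse(tarurl).netloc:
        return absolute
    return None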

We then repeat the process once more for each page we found, iterating through every element of the original urls list:

for uurl in urls:

Other than a change in the referenced lists and variables, the code remains the same.

Finally, for ease of output, we take the full list of URLs we have collected and turn it into a set. This removes duplicates from the list and allows us to output it neatly. We iterate through the values in the set and output them one by one.
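
If you would rather print the deduplicated links in a stable order, sorting the set first is an easy tweak (a small sketch of the same output step):

urls3 = set(urls2)
for value in sorted(urls3):
    print value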

There's more…

This tool can be tied in with any of the functionality shown earlier and later in this book. It can be tied to the Getting screenshots of websites with QtWebKit recipe to allow you to take screenshots of every page. You can tie it to the e-mail address finder in Chapter 2, Enumeration, to gather e-mail addresses from every page, or you can find another use for this simple technique to map web pages.

The script can easily be changed to add levels of depth, going from the current two links deep to any value set by a system argument. The output can be changed to add in URLs present on each page, or to turn it into a CSV to allow you to map vulnerabilities to pages for easy notation. A sketch of both tweaks follows.
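
One possible way to make both of those changes is sketched below; the extra depth argument, the get_links helper, and the CSV layout are assumptions for illustration, not part of the original recipe:

import csv
import sys
import urllib2
from bs4 import BeautifulSoup

def get_links(tarurl, page):
    # Collect absolute, on-site links from a single page
    found = []
    try:
        soup = BeautifulSoup(urllib2.urlopen(page).read())
    except Exception:
        return found
    for line in soup.find_all('a'):
        newline = line.get('href')
        if not newline:
            continue
        if newline[:4] == "http" and tarurl in newline:
            found.append(str(newline))
        elif newline[:1] == "/":
            found.append(str(tarurl + newline))
    return found

if __name__ == "__main__":
    tarurl = sys.argv[1]
    depth = int(sys.argv[2])      # how many links deep to spider
    current = [tarurl]
    rows = []                     # each row: depth level, source page, link found
    for level in range(1, depth + 1):
        nextlevel = []
        for page in current:
            for link in get_links(tarurl, page):
                rows.append([level, page, link])
                nextlevel.append(link)
        current = set(nextlevel)  # deduplicate before the next level
    with open("spider.csv", "wb") as out:   # "wb" because this follows the recipe's Python 2 style
        writer = csv.writer(out)
        writer.writerow(["depth", "source", "link"])
        writer.writerows(rows)

You could then run it with something like python spider.py http://www.packtpub.com 3 and open spider.csv in a spreadsheet for notation.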