The image downloader script reads an HTML page, strips out all tags except <img>, parses the src="URL" attribute from each <img> tag, and downloads the images to the specified directory. The script accepts a web page URL and the destination directory as command-line arguments.
The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is anything other than three; if so, the script prints a usage example and exits. Otherwise, this code parses the URL and destination directory:
while [ -n "$1" ]
do
  case $1 in
    -d) shift; directory=$1; shift ;;
     *) url=${url:-$1}; shift ;;
  esac
done
The while loop runs until all the arguments have been processed. The shift command shifts the arguments one position to the left, so that $1 takes the value of $2, $2 takes the value of $3, and so on. Hence, we can evaluate all the arguments through $1 itself.
The case statement checks the first argument ($1). If it matches -d, the next argument must be a directory name, so the arguments are shifted and the directory name is saved. If the argument is any other string, it is treated as the URL.
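For reference, the argument-count check described earlier might look like the following minimal sketch; the exact usage message is an assumption rather than the script's verbatim text:
if [ $# -ne 3 ];
then
  echo "Usage: $0 URL -d DIRECTORY"
  exit 1
fi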
The advantage of parsing arguments in this way is that we can place the -d argument anywhere in the command line:
$ ./img_downloader.sh -d DIR URL
Or:
$ ./img_downloader.sh URL -d DIR
egrep -o "<img src=[^>]*>" will print only the matching strings, which are the <img> tags including their attributes. The [^>]* phrase matches all the characters except the closing >, that is, <img src="image.jpg">.
sed 's/<img src=\"\([^"]*\).*/\1/g' extracts the URL from the src="URL" string.
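For example, using a made-up image path, a tag found by egrep can be piped into this sed command to pull out just the path:
$ echo '<img src="/photos/a.png">' | sed 's/<img src=\"\([^"]*\).*/\1/g'
/photos/a.png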
There are two types of image source paths: relative and absolute. Absolute paths are full URLs that start with http:// or https://. Relative URLs start with / or with the image name itself. An example of an absolute URL is http://example.com/image.jpg. An example of a relative URL is /image.jpg.
For relative URLs, the starting / should be replaced with the base URL so that, for example, /image.jpg becomes http://example.com/image.jpg. The script initializes baseurl by extracting it from the input URL with the following command:
baseurl=$(echo $url | egrep -o "https?://[a-z.\-]+")
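For example, with an assumed page URL, this extracts just the protocol and host name:
$ url="http://example.com/gallery/index.html"
$ echo $url | egrep -o "https?://[a-z.\-]+"
http://example.com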
The output of the previously described sed command is piped into another sed command that replaces a leading / with the base URL, and the results are saved in a file named after the script's PID (/tmp/$$.list).
sed "s,^/,$baseurl/," > /tmp/$$.list
The final while loop iterates through each line of the list and uses curl to download the images. The --silent argument is used with curl to prevent progress messages from being printed on the screen.
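A minimal sketch of such a download loop follows; it assumes one fully qualified image URL per line in /tmp/$$.list and may differ in detail from the actual script:
cd "$directory" || exit 1
while read -r filename
do
  # -O saves each file under its remote name; --silent hides curl's progress output
  curl --silent -O "$filename"
done < "/tmp/$$.list"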