The image downloader script reads an HTML page, strips out all tags except <img>, parses the src="URL" attribute from each <img> tag, and downloads the images to the specified directory. The script accepts a web page URL and the destination directory as command-line arguments.
The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is anything other than three; if so, the script prints a usage example and exits. Otherwise, this code parses the URL and destination directory:
while [ -n "$1" ]
do
  case $1 in
    -d) shift; directory=$1; shift ;;
     *) url=${url:-$1}; shift ;;
  esac
done
The while loop runs until all the arguments have been processed. The shift command shifts the arguments one position to the left, so that $1 takes the value of $2, $2 takes the value of $3, and so on. Hence, we can evaluate all the arguments through $1 itself.
The case statement checks the first argument ($1). If it matches -d, the next argument must be a directory name, so the arguments are shifted and the directory name is saved. If the argument is any other string, it is treated as the URL.
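For reference, the argument-count check described earlier might look like the following minimal sketch; the exact usage message is an assumption rather than the script's verbatim text:
if [ $# -ne 3 ];
then
  echo "Usage: $0 URL -d DIRECTORY"
  exit 1
fi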
The advantage of parsing arguments in this way is that we can place the -d argument anywhere in the command line:
$ ./img_downloader.sh -d DIR URL
Or:
$ ./img_downloader.sh URL -d DIR
egrep -o "<img src=[^>]*>" will print only the matching strings, which are the <img> tags including their attributes. The [^>]* phrase matches all the characters except the closing >, that is, <img src="image.jpg">.
sed 's/<img src=\"\([^"]*\).*/\1/g' extracts the URL from the src="URL" string.
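For example, using a made-up image path, a tag found by egrep can be piped into this sed command to pull out just the path:
$ echo '<img src="/photos/a.png">' | sed 's/<img src=\"\([^"]*\).*/\1/g'
/photos/a.png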
There are two types of image source paths: relative and absolute. Absolute paths are full URLs that start with http:// or https://. Relative URLs start with / or with the image name itself. An example of an absolute URL is http://example.com/image.jpg. An example of a relative URL is /image.jpg.
For relative URLs, the starting / should be replaced with the base URL so that, for example, /image.jpg becomes http://example.com/image.jpg. The script initializes baseurl by extracting it from the input URL with the following command:
baseurl=$(echo $url | egrep -o "https?://[a-z.\-]+")
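For example, with an assumed page URL, this extracts just the protocol and host name:
$ url="http://example.com/gallery/index.html"
$ echo $url | egrep -o "https?://[a-z.\-]+"
http://example.com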
The output of the previously described sed command is piped into another sed command that replaces a leading / with the base URL, and the results are saved in a file named after the script's PID (/tmp/$$.list).
sed "s,^/,$baseurl/," > /tmp/$$.list
The final while loop iterates through each line of the list and uses curl to download the images. The --silent argument is used with curl to prevent progress messages from being printed on the screen.
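A minimal sketch of such a download loop follows; it assumes one fully qualified image URL per line in /tmp/$$.list and may differ in detail from the actual script:
cd "$directory" || exit 1
while read -r filename
do
  # -O saves each file under its remote name; --silent hides curl's progress output
  curl --silent -O "$filename"
done < "/tmp/$$.list"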