We are going to define a regular expression that detects links, and apply it to an HTML file in order to pretty-print all the links that occur in that file:
- Let's first include all the necessary headers, and declare that we use the std namespace by default:
#include <iostream>
#include <iterator>
#include <regex>
#include <algorithm>
#include <iomanip>
using namespace std;
- We will later generate an iterable range of strings. These strings always come in pairs of a link and its description. Therefore, let's write a little helper function that pretty-prints them:
template <typename InputIt>
void print(InputIt it, InputIt end_it)
{
while (it != end_it) {
- In each loop step, we increment the iterator twice and take copies of the link and the link description they contain. Between the two iterator dereferences, we add another guarding if branch that checks whether we prematurely reached the end of the iterable range, just for safety:
const string link {*it++};
if (it == end_it) { break; }
const string desc {*it++};
- Now, let's print the link with its description in a nicely prettified form and that's it:
cout << left << setw(28) << desc
<< " : " << link << '\n';
}
}
- In the main function, we read in everything that comes from standard input. To do this, we construct a string from the whole standard input via an input stream iterator. Because we want the whole user input as-is, we apply noskipws first. This modifier deactivates the whitespace skipping that formatted input performs by default, so spaces and newlines end up in the string as well:
int main()
{
cin >> noskipws;
const std::string in {istream_iterator<char>{cin}, {}};
- Now we need to define a regular expression that describes what we assume an HTML link looks like. The parentheses, (), within the regular expression define groups. These are the parts of the link we want to access, namely the URL it links to and its description:
const regex link_re {
"<a href=\"([^\"]*)\"[^<]*>([^<]*)</a>"};
- The sregex_token_iterator class has the same look and feel as istream_iterator. We give it the whole string as an iterable input range, together with the regular expression we just defined. There is also a third parameter, {1, 2}, which is an initializer list of integer values. It specifies that, for every match, we want to iterate over capture groups 1 and 2:
sregex_token_iterator it {
begin(in), end(in), link_re, {1, 2}};
- Now we have an iterator that will emit the links and link descriptions if it finds any. We provide it together with a default constructed iterator of the same type to the print function we implemented before:
print(it, {});
}
- The regular expression we used is pretty much hardcoded to a fixed assumption of how links look in the HTML document. Making it more general is left as an exercise for you. Compiling and running the program gives us the following output. Here, the curl program simply downloads an HTML page of the ISO C++ website from the Internet, and we pipe it into our program. Of course, it would also be possible to write cat some_html_file.html | ./link_extraction:
$ curl -s "https://isocpp.org/blog" | ./link_extraction
Sign In / Suggest an Article : https://isocpp.org/member/login
Register                     : https://isocpp.org/member/register
Get Started!                 : https://isocpp.org/get-started
Tour                         : https://isocpp.org/tour
C++ Super-FAQ                : https://isocpp.org/faq
Blog                         : https://isocpp.org/blog
Forums                       : https://isocpp.org/forums
Standardization              : https://isocpp.org/std
About                        : https://isocpp.org/about
Current ISO C++ status       : https://isocpp.org/std/status
(...and many more...)