Expert C++ Programming

We are going to define a regular expression that detects links and apply it to an HTML file in order to pretty-print all the links that occur in that file.

Let's first include all the necessary headers, and declare that we use the std namespace by default:

      #include <iostream>
      #include <iterator>
      #include <regex>
      #include <string>
      #include <algorithm>
      #include <iomanip>

      using namespace std;

We will later generate an iterable range, which consists of strings. These strings always occur in pairs of a link and a link description. Therefore, let's write a little helper function, which pretty prints these:

      template <typename InputIt>
      void print(InputIt it, InputIt end_it)
      {
          while (it != end_it) {

In each loop step, we increment the iterator twice and take copies of the link and the link description it yields. Between the two iterator dereferences, we add a guarding if branch that checks whether we have prematurely reached the end of the iterable range, just for safety:

              const string link {*it++};
              if (it == end_it) { break; }
              const string desc {*it++};

Now, let's print the link with its description in a nicely prettified form and that's it:

              cout << left << setw(28) << desc
                   << " : " << link << '\n';
          }
      }
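The guard between the two dereferences only matters when the range holds an odd number of strings. The following minimal sketch runs the same loop on a plain vector to make that visible. The names format_pairs and format_sample, the sample data, and the choice to write into a string instead of cout are assumptions made for illustration; they are not part of the recipe:

```cpp
#include <iomanip>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of the print helper: the loop is the same, but the
// output goes into a string so that the result can be inspected.
template <typename InputIt>
std::string format_pairs(InputIt it, InputIt end_it)
{
    std::ostringstream out;
    while (it != end_it) {
        const std::string link {*it++};
        if (it == end_it) { break; }   // odd element count: drop the unpaired link
        const std::string desc {*it++};
        out << std::left << std::setw(28) << desc
            << " : " << link << '\n';
    }
    return out.str();
}

// Hypothetical sample data: two complete pairs plus one dangling link.
std::string format_sample()
{
    const std::vector<std::string> items {
        "https://isocpp.org/faq",  "C++ Super-FAQ",
        "https://isocpp.org/blog", "Blog",
        "https://isocpp.org/tour"};
    return format_pairs(std::begin(items), std::end(items));
}
```

Calling format_sample() yields two formatted lines. The fifth string has no description to pair with, so the guard ends the loop before it is printed.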

In the main function, we are reading in everything that comes from standard input. To do this, we are constructing a string from the whole standard input via an input stream iterator. In order to prevent tokenizing, because we want the whole user input as-is, we use noskipws. This modifier deactivates whitespace skipping and tokenizing:

      int main()
      {
          cin >> noskipws;
          const std::string in {istream_iterator<char>{cin}, {}};
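As a standalone aside, separate from the program we are building, the effect of noskipws can be checked on a string stream instead of standard input. This is only a sketch; the function name slurp and its keep_whitespace parameter are hypothetical:

```cpp
#include <iterator>
#include <sstream>
#include <string>

// Read every remaining character of a stream into a string.
// With keep_whitespace == false, the stream keeps its default skipws
// behaviour, so spaces and newlines are dropped before each character.
std::string slurp(std::istream& is, bool keep_whitespace)
{
    if (keep_whitespace) { is >> std::noskipws; }
    return std::string {std::istream_iterator<char>{is}, {}};
}
```

Fed with the text "a b", a newline, and "c", slurp returns "abc" when whitespace skipping stays active, and the unchanged input when noskipws is set.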

Now we need to define a regular expression that describes how we assume an HTML link to look. The parentheses, (), within the regular expression define groups. These are the parts of the link we want to access--the URL it links to, and its description:

          const regex link_re {
              "<a href=\"([^\"]*)\"[^<]*>([^<]*)</a>"};

The sregex_token_iterator class has the same look and feel as istream_iterator. We give it the whole string as an iterable input range, together with the regular expression we just defined. The third parameter, {1, 2}, is an initializer list of integer values. It specifies that we want to iterate over groups 1 and 2 of every match the expression finds:

          sregex_token_iterator it {
              begin(in), end(in), link_re, {1, 2}};

Now we have an iterator that will emit the links and link descriptions if it finds any. We provide it together with a default constructed iterator of the same type to the print function we implemented before:

          print(it, {});
      }
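To see what the token iterator emits without involving standard input, the same pattern can be applied to a hard-coded string. The following is a minimal sketch: the function name extract_tokens is hypothetical, and the pattern is written as a raw string literal with a custom delimiter so that the quotes inside it need no escaping:

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Collect the submatches of groups 1 and 2 for every link the pattern finds.
// The result alternates between link and description, just like the range
// that the print helper consumes.
std::vector<std::string> extract_tokens(const std::string& html)
{
    // Same pattern as in the recipe, written as a raw string literal.
    static const std::regex link_re {
        R"re(<a href="([^"]*)"[^<]*>([^<]*)</a>)re"};

    std::sregex_token_iterator it {
        std::begin(html), std::end(html), link_re, {1, 2}};
    const std::sregex_token_iterator end_it;

    std::vector<std::string> tokens;
    for (; it != end_it; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

For an input with two links, the returned vector holds four strings: the first URL, the first description, the second URL, and the second description.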

Compiling and running the program gives us the following output. I ran the curl program, which simply downloads an HTML page from the Internet, on the ISO C++ homepage. Of course, it would also be possible to write cat some_html_file.html | ./link_extraction. The regular expression we used is pretty much hardcoded to a fixed assumption of how links look in the HTML document. Making it more general is left as an exercise for you:

      $ curl -s "https://isocpp.org/blog" | ./link_extraction 
      Sign In / Suggest an Article : https://isocpp.org/member/login
      Register                     : https://isocpp.org/member/register
      Get Started!                 : https://isocpp.org/get-started
      Tour                         : https://isocpp.org/tour
      C++ Super-FAQ                : https://isocpp.org/faq
      Blog                         : https://isocpp.org/blog
      Forums                       : https://isocpp.org/forums
      Standardization              : https://isocpp.org/std
      About                        : https://isocpp.org/about
      Current ISO C++ status       : https://isocpp.org/std/status
      (...and many more...)
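One possible direction for generalizing the regular expression is sketched below. It is only an illustration under stated assumptions, not a complete answer: the function name extract_links and the pattern itself are hypothetical. The pattern tolerates upper-case tag and attribute names, other attributes in front of href, and single as well as double quotes, but it is still not an HTML parser. It uses sregex_iterator instead of sregex_token_iterator, so that both groups of a match can be read together:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, slightly more tolerant link pattern. Each result pair
// holds the URL (group 1) and the link description (group 2).
std::vector<std::pair<std::string, std::string>>
extract_links(const std::string& html)
{
    static const std::regex link_re(
        R"re(<a\s[^>]*href\s*=\s*["']([^"']*)["'][^>]*>([^<]*)</a>)re",
        std::regex::icase);

    std::vector<std::pair<std::string, std::string>> links;
    const std::sregex_iterator end_it;
    for (std::sregex_iterator it(html.begin(), html.end(), link_re);
         it != end_it; ++it) {
        links.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return links;
}
```

Links whose description contains nested tags, such as an image inside the anchor, are still missed, because the second group stops at the first < character.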
