It is crucial to understand the whole program first, because it then becomes clear that all the CPU-intensive work happens in a single line of code in the main function:
transform(begin(v), end(v), begin(r), to_iteration_count);
The vector v contains all the indices that are mapped to complex coordinates, which are in turn iterated over with the Mandelbrot algorithm. The result of each iteration is saved in the vector r.
In the original program, this is the single line that consumes all the processing time for calculating the fractal image. All the code that precedes it is just setup work, and all the code that follows it is just for printing. This means that parallelizing this line is the key to more performance.
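To make that work concrete, here is a hedged sketch of what a to_iteration_count-style function might do: map a linear index to a point c in the complex plane and count how many iterations of z = z * z + c it takes for z to escape. The scaling constants and the iteration limit are assumptions for illustration, not the program's exact values:

#include <complex>
#include <cstddef>

constexpr size_t w {100}, h {40}, max_iterations {1000};

size_t to_iteration_count(size_t index)
{
    // Map the index to a coordinate c in a rectangle around the Mandelbrot set
    // (the exact scaling is an assumption for this sketch).
    const double x (static_cast<double>(index % w) / w * 3.0 - 2.0);  // roughly [-2, 1]
    const double y (static_cast<double>(index / w) / h * 2.0 - 1.0);  // roughly [-1, 1]
    const std::complex<double> c {x, y};

    // Iterate z = z * z + c until z escapes or we hit the iteration limit.
    std::complex<double> z {};
    size_t iterations {0};
    while (std::abs(z) < 2.0 && iterations < max_iterations) {
        z = z * z + c;
        ++iterations;
    }
    return iterations;
}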
One possible approach to parallelizing this is to break up the whole linear range from begin(v) to end(v) into chunks of equal size and distribute them evenly across all cores. This way, all cores would share the work. That is exactly what would happen if we used the parallel version of std::transform with a parallel execution policy. Unfortunately, this is not the right strategy for this problem, because every point in the Mandelbrot set requires a very different number of iterations: chunks that cover points inside the set take far longer than chunks that cover mostly empty regions, so some cores finish early and sit idle.
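For comparison, this is roughly what that rejected chunk-based strategy would look like with a C++17 parallel execution policy. The trivial body of to_iteration_count is only there to keep the snippet self-contained; the interesting part is the policy argument:

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

static size_t to_iteration_count(int index)
{
    // Stand-in workload; the real function performs the Mandelbrot iteration.
    return static_cast<size_t>(index % 100);
}

int main()
{
    std::vector<int> v(100 * 40);
    std::iota(begin(v), end(v), 0);
    std::vector<size_t> r(v.size());

    // The library splits [begin(v), end(v)) into chunks and distributes them
    // over a thread pool. For Mandelbrot, this balances the load poorly,
    // because the per-point iteration counts differ so much.
    std::transform(std::execution::par,
                   begin(v), end(v), begin(r), to_iteration_count);
}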
Our approach here is to turn every single vector item, each of which represents a character cell that is later printed on the terminal, into an asynchronously calculated future value. As the source and target vectors are w * h items large, which means 100 * 40 in our case, we have a vector of 4000 future values that are calculated asynchronously. If our system had 4000 CPU cores, this would mean that we start 4000 threads that do the Mandelbrot iteration truly concurrently. On a normal system with fewer cores, each core will simply process one asynchronous item after the other.
Because the transform call with the asynchronized version of to_iteration_count does no calculation itself, but only sets up threads and future objects, it returns practically immediately. The original version of the program blocked at this point because the iterations took so long.
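A minimal sketch of how such an asynchronized version could be produced, assuming a small wrapper (the name asynchronize and the toy workload are assumptions for illustration, not necessarily the program's exact helper):

#include <future>
#include <iostream>

// Wrap a callable f so that every call launches the real work via std::async
// on its own thread and immediately returns a std::future for the result.
template <typename F>
auto asynchronize(F f)
{
    return [f](auto... xs) {
        return std::async(std::launch::async, f, xs...);
    };
}

int main()
{
    auto slow_square ([](int x) { return x * x; });   // stand-in for to_iteration_count
    auto async_square (asynchronize(slow_square));

    auto fut (async_square(12));     // returns at once with a std::future<int>
    std::cout << fut.get() << '\n';  // blocks only here; prints 144
}

With a wrapper like this, the transform line would fill r with futures instead of plain results, which is why it returns right away.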
The parallelized version of the program does, of course, block somewhere too. The function that prints all our values on the terminal must access the results from within the futures. In order to do that, it calls x.get() on all the values. And this is the trick: while it waits for the first value to be printed, a lot of other values are being calculated at the same time. So by the time the get() call on the first future returns, the next future might already be ready for printing, too!
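The following self-contained sketch imitates that behavior with a small grid of dummy futures; the workload and the printing threshold are made up, but the blocking pattern of get() is the same:

#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

int main()
{
    const size_t w {8}, h {3};

    // A small grid of futures standing in for the Mandelbrot cells; the real
    // program fills its result vector via transform with the asynchronized
    // to_iteration_count.
    std::vector<std::future<size_t>> r;
    for (size_t i {0}; i < w * h; ++i) {
        r.emplace_back(std::async(std::launch::async,
                                  [i] { return i % 7; }));  // dummy workload
    }

    // get() blocks on one cell at a time, but all the other futures keep
    // computing on their own threads while we wait.
    for (size_t i {0}; i < r.size(); ++i) {
        std::cout << (r[i].get() > 3 ? '*' : '.');
        if ((i + 1) % w == 0) { std::cout << '\n'; }
    }
}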
If w * h resulted in much larger numbers, there would be measurable overhead in creating and synchronizing all these futures. Here, the overhead is not too significant. On my laptop with an Intel i7 processor with four hyperthreading-capable cores (which results in eight virtual cores), the parallel version of this program ran roughly 3 to 5 times faster than the original program. Ideal parallelization would make it 8 times faster. Of course, this speedup will vary between different computers, because it depends on a lot of factors.