One might think that it would be easiest to use MPI to allocate one instance of the application to a single CPU core on each cluster node, and this would indeed be the easiest solution. It would, however, not be the fastest one.
Although MPI is likely the best choice for communication between processes across a network in this context, within a single system (whether single- or multi-CPU) multithreading makes a lot of sense.
The main reason for this is simply that communication between threads is significantly faster than inter-process communication, especially when using a generalized communication layer such as MPI.
One could write an application that uses MPI to communicate across the cluster's network, allocating one instance of the application to each MPI node. The application itself would detect the number of CPU cores on that system and create one thread for each core. This hybrid approach, commonly called hybrid MPI, is widely used for the advantages it provides:
- Faster communication – threads within a process communicate through fast shared memory rather than via messages.
- Fewer MPI messages – fewer messages mean lower bandwidth usage and lower latency.
- Avoiding data duplication – data can be shared between threads instead of sending the same message to a range of processes (see the sketch after this list).
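As a minimal sketch of that last point, with an arbitrary buffer size and a simple summing workload: MPI_Bcast gives every process its own copy of a buffer, whereas the OpenMP threads within one process all read the same memory (compiled with, for example, mpic++ -fopenmp):

#include <cstdio>
#include <vector>
#include <mpi.h>

int main(int argc, char *argv[]) {
	MPI_Init(&argc, &argv);

	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	// Across processes: MPI_Bcast physically copies the buffer from
	// rank 0 into the private address space of every other rank.
	const int count = 1000000;
	std::vector<double> data(count, 1.0);
	MPI_Bcast(data.data(), count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

	// Within one process: no copies at all. Every OpenMP thread reads
	// the same shared buffer directly.
	double sum = 0.0;
	#pragma omp parallel for reduction(+:sum)
	for (int i = 0; i < count; ++i)
		sum += data[i];

	printf("Rank %d: sum = %f\n", rank, sum);

	MPI_Finalize();
}

On a node running eight threads inside a single process, the broadcast copy is thus made once per process rather than once per core.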
Implementing this can be done the way we have seen in previous chapters, using the multithreading features found in C++11 and later versions, as sketched below. The other option is to use OpenMP, as we saw at the very beginning of this chapter.
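A minimal sketch of the C++11 route, assuming a hypothetical worker() routine and starting one thread per detected hardware thread:

#include <cstdio>
#include <thread>
#include <vector>
#include <mpi.h>

// Hypothetical worker routine; one instance runs on each core.
void worker(int rank, unsigned id) {
	printf("Process %d: worker thread %u running.\n", rank, id);
}

int main(int argc, char *argv[]) {
	// Request a threading level: MPI_THREAD_FUNNELED means threads may
	// exist, but only the main thread makes MPI calls.
	int provided = 0;
	MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	// Detect the number of hardware threads on this node and start
	// one worker per core.
	unsigned cores = std::thread::hardware_concurrency();
	if (cores == 0) { cores = 1; }  // the call may return 0 if unknown

	std::vector<std::thread> workers;
	for (unsigned i = 0; i < cores; ++i) {
		workers.emplace_back(worker, rank, i);
	}
	for (auto& t : workers) {
		t.join();
	}

	MPI_Finalize();
}

Note the use of MPI_Init_thread() instead of MPI_Init(): when threads coexist with MPI, it is good practice to request a threading level explicitly, and MPI_THREAD_FUNNELED suffices as long as only the main thread performs MPI calls.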
The obvious advantage of using OpenMP is that it requires very little effort from the developer's side. If all that one needs is to get more instances of the same routine running, all it takes are a few small modifications to mark the code to be used by the worker threads.
For example:
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
	int numprocs, rank, len;
	char procname[MPI_MAX_PROCESSOR_NAME];
	int tnum = 0, tc = 1;

	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Get_processor_name(procname, &len);

	// Spawn a team of OpenMP threads. Each thread gets its own private
	// copies of the thread number (tnum) and team size (tc).
	#pragma omp parallel default(shared) private(tnum, tc)
	{
		tc = omp_get_num_threads();
		tnum = omp_get_thread_num();
		printf("Thread %d out of %d from process %d out of %d on %s\n",
			tnum, tc, rank, numprocs, procname);
	}

	MPI_Finalize();
	return 0;
}
The above code combines OpenMP with MPI. To compile it, we would run, for example:
$ mpicc -fopenmp hellohybrid.c -o hellohybrid
Next, to run the application, we would use mpirun or equivalent:
$ export OMP_NUM_THREADS=8
$ mpirun -np 2 --hostfile my_hostfile -x OMP_NUM_THREADS ./hellohybrid
The mpirun command would run two MPI processes using the hellohybrid binary, passing the environment variable we exported to each new process via the -x flag. The OpenMP runtime then uses the value of that variable to create that number of threads.
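To verify that the variable actually reached the OpenMP runtime inside a process, a quick check such as this minimal sketch can be used; omp_get_max_threads() reports the team size the runtime will use for the next parallel region:

#include <cstdio>
#include <omp.h>

int main() {
	// With OMP_NUM_THREADS=8 exported, this prints 8.
	printf("Maximum OpenMP threads: %d\n", omp_get_max_threads());
}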
Assuming we have at least two MPI nodes in our MPI host file, we would end up with two MPI processes across two nodes, each running eight threads, which would fit a quad-core CPU with Hyper-Threading or an octa-core CPU.
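With hypothetical hostnames node1 and node2, the resulting output would look something like the following, with the ordering of the sixteen lines varying between runs:

Thread 0 out of 8 from process 0 out of 2 on node1
Thread 3 out of 8 from process 0 out of 2 on node1
Thread 0 out of 8 from process 1 out of 2 on node2
Thread 7 out of 8 from process 1 out of 2 on node2
...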