MPI Foundations @ TACC

Contact: Victor Eijkhout

Resources:

  • Webcast I
  • Webcast II

Terminology

  • Socket: the processor chip
  • Processor: we don’t use that word
  • Core: one instruction-stream processing unit
  • Process: preferred terminology in talking about MPI

  • SPMD model: single program, multiple data


prototype: the declaration of a function

Collectives

Gathering

  • reduction: reduce, n to 1 (MPI_Reduce)
  • gather: collect, subsets into one set (MPI_Gather)

Spreading

  • broadcast: identical data, 1 to n (MPI_Bcast)
  • scatter: subsetting (MPI_Scatter; varying counts: MPI_Scatterv)

int MPI_Reduce(
    void* send_data,        // send buffer, e.g. &x for a scalar x
    void* recv_data,
    int count,              // number of elements, e.g. 1 for a scalar
    MPI_Datatype datatype,
    MPI_Op op,
    int root,               // no root argument in MPI_Allreduce
    MPI_Comm communicator)

int MPI_Gather(
   void *sendbuf, int sendcnt, MPI_Datatype sendtype,
   void *recvbuf, int recvcnt, MPI_Datatype recvtype,
   int root, MPI_Comm comm)

int MPI_Scatter(
   void* sendbuf, int sendcount, MPI_Datatype sendtype,
   void* recvbuf, int recvcount, MPI_Datatype recvtype,
   int root, MPI_Comm comm)

Predefined reduction operators:

  • MPI_MAX: the maximum element
  • MPI_MIN: the minimum element
  • MPI_SUM: sum of the elements
  • MPI_PROD: product of the elements
  • MPI_LAND: logical and across the elements
  • MPI_LOR: logical or across the elements
  • MPI_BAND: bitwise and across the bits of the elements
  • MPI_BOR: bitwise or across the bits of the elements
  • MPI_MAXLOC: the maximum value and the rank of the process that owns it
  • MPI_MINLOC: the minimum value and the rank of the process that owns it


void pointer: address in memory of the data


for a scalar x, write &x or (void*)&x


mpi4py: comm.recv is slow but handles arbitrary Python objects; comm.Recv is fast (buffer-based)

Scan or 'parallel prefix'

Barrier: synchronize processes

  • almost never needed
  • only for timing

Naive realization

  • root sends to all: \(p\) messages of cost \(\alpha + \beta n\) each
  • better: point-to-point schemes

Distributed data

  • local_index + rank
  • global_index

Load balancing

\(f(i) = \lfloor iN/p \rfloor\); proc \(i\) owns \([f(i), f(i+1))\)

Local info. exchange

Matrices in parallel: distribute domain, not the matrix

p2p: ping-pong between two sides A & B; a send on one side must match a recv on the other

int MPI_Send(
  const void* buf, int count, MPI_Datatype datatype,
  int dest, int tag, MPI_Comm comm)

IN buf: initial address of send buffer (choice)
IN count: number of elements in send buffer (non-negative integer)
IN datatype: datatype of each send buffer element (handle)
IN dest: rank of destination (integer)
IN tag: message tag (integer)
IN comm: communicator (handle)

send & recv

communication across nodes is roughly 100× slower than within a node


send & recv are blocking operations

  • deadlock
  • might work (unsafe: depends on internal buffering)

Bypass blocking

  • odds & evens: exchange sort, compare-and-swap
  • pairwise exchange


  • MPI_Isend, MPI_Irecv: nonblocking send and receive

a nonblocking operation needs a blocking MPI_Wait before the buffer can safely be reused; in blocking communication the buffer is safe as soon as the call returns.

Latency hiding: no need to wait until the communication is done. MPI_Test: a non-blocking "wait" — if no message has come in yet, do local work and test again.


  • MPI_Bsend, MPI_Ibsend: buffered send
  • MPI_Ssend, MPI_Issend: synchronous send
  • MPI_Rsend, MPI_Irsend: ready send

  • One-sided communication: ‘just’ put/get the data somewhere
  • Derived data types: send strided/irregular/inhomogeneous data
  • Sub-communicators: work with subsets of MPI_COMM_WORLD
  • I/O: efficient file operations
  • Non-blocking collectives
Published: Fri 14 April 2017. By Dongming Jin
