Introduction to MPI on the CRAY Darter

Parallel Programming with MPI on the CRAY Darter
Joint Institute for Computational Sciences
March 27, 2015

Purpose

The purpose of this lab is to become familiar with MPI account setup and documentation, as well as with compiling and running elementary MPI programs on the CRAY Darter.


Contents

  1. Introduction to MPI on the CRAY Darter
    1. Workstation Login and Environment Setup
    2. MPI Programming - Exercise Information
    3. General Information
    4. Hello World 1 - The Minimal MPI Program
    5. Hello World 2 - Hello Again
    6. Pi Calculation
    7. Timing an MPI Code
    8. PingPong - Calculating Transfer Rates
    9. Ring 1 - Sending Messages around a Ring
    10. Simple Array Assignment
    11. Matrix Multiplication
    12. 2D Laplace Equation
  2. Other Important Information
    1. Logging Off
    2. Documentation for MPI and mpich, and Additional Resources
    3. Acknowledgments


Workstation Login and Environment Setup

  1. First of all, you need to log in to your temporary account on Darter via the Secure Shell protocol (SSH). To do this, please refer to the NICS User Support page.
  2. On Darter, access to compute resources is managed by the Portable Batch System (PBS). Batch scripts are run on service nodes that have access to the home, project, and software directories. Executables launched with the aprun command do not have access to these directories; they have access only to the Lustre scratch directory, /lustre/medusa/YOUR_USER_ID.
  3. In the LAB-EXAMPLE folder you will find three subdirectories:

    C/
    F77/
    INSTRUCTION/

    The first two folders contain all the material necessary for the workstation exercises, written in either C or Fortran77. Select the directory that suits your programming preferences. You are now ready to proceed into the world of MPI.


MPI Programming - Exercise Information

In this lab you will utilize the most fundamental MPI calls necessary for development of any MPI code, as well as learn how to compile and run MPI code. During this lab you will encounter the following exercises:

  1. Hello World 1 - exercise (hello1 directory)
  2. Hello World 2 - exercise (hello2 directory)
  3. Pi Calculation - exercise (pical directory)
  4. Timing an MPI Code - exercise (timing directory)
  5. PingPong - exercise (pingpong directory)
  6. Ring 1 - exercise (ring directory)
  7. Simple Array Assignment - exercise (array directory)
  8. Matrix Multiplication - exercise (matmul directory)
  9. Laplace Equation - exercise (laplace directory)

But first, the following . . .


General Information

During the workshop exercises you may be asked to write code; if possible, write it on your own. To do so, use the files with the .start extension that are provided in the directories where programming is required.

For those of you wishing to concentrate only on the message passing aspects of the code, files with .template extensions have been provided in the exercise directories that require programming. To modify the template files, first copy filename.c.template to filename.c or filename.f.template to filename.f, depending on whether you are using C or FORTRAN. For example:

% cp filename.c.template filename.c
% cp filename.f.template filename.f

Next, invoke your favorite text editor and modify the template by replacing all of the XXXXX's with the appropriate MPI calls.

For your convenience, there are completed solutions to the programming exercises available as files with .soln extensions. Remember, the only way to learn how to program is by actually programming, SO LOOK AT THE SOLUTIONS ONLY AS A LAST RESORT.

Finally, extra exercises have been provided at the end of some of the main exercises. Please, do not work on these until after you have completed the main exercises for the day. These extra exercises have been provided without templates or solutions, and your lab assistants may be able to give you only very general help with them. In other words, try them at your own risk.


Hello World 1 - The Minimal MPI Program

The objective of this exercise is not to write a code but to demonstrate the fundamentals of compiling an MPI program and submitting it via qsub.

  1. Examine the "Hello World!" program hello.c/hello.f. Notice that every process prints "Hello World!" and that the "Hello World!" program:
    1. Includes a header,
    2. Initializes MPI,
    3. Prints a "Hello World!" message, and
    4. Finalizes MPI
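
    For reference, a minimal program with this structure looks roughly like the C sketch below. The provided hello.c/hello.f may differ in details (for example, in exactly what each rank prints), so treat this only as an illustration of the four steps.

      /* Minimal MPI "Hello World" sketch illustrating the four steps above. */
      #include <stdio.h>
      #include <mpi.h>                          /* 1. include the MPI header */

      int main(int argc, char *argv[])
      {
          int rank, size;

          MPI_Init(&argc, &argv);               /* 2. initialize MPI         */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank); /*    which process am I?    */
          MPI_Comm_size(MPI_COMM_WORLD, &size); /*    how many processes?    */

          printf("Hello World from rank %d of %d\n", rank, size);  /* 3. print */

          MPI_Finalize();                       /* 4. finalize MPI           */
          return 0;
      }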

  2. Compile "Hello World!". For the version of MPI that we are using there are several ways to compile a program. We will use the commands cc and ftn to compile our C and FORTRAN programs, respectively. These compilers locate the MPI libraries and header files as needed. To compile, you can either compile at the prompt or use the provided makefile. If you use the provided makefile, first read it to understand what it is doing.
  3. For the "Hello World!" program, enter either

    % cc hello.c -o hello    (for C programs), or
    % ftn hello.f -o hello   (for FORTRAN programs)

    or, enter

    % make

    Again, if you use the provided makefile, first make sure you understand what it is doing.

  4. Now we want to run the "Hello World!" program. This elementary problem will use 8 processors and will assign a rank to each of them. The program will then output 8 lines, one per process, depending on the rank of the process:

    Hello World (from master node)
    Hello WORLD!!! (from worker node)
    Hello WORLD!!! (from worker node)
    Hello WORLD!!! (from worker node)
    Hello WORLD!!! (from worker node)
    Hello WORLD!!! (from worker node)
    Hello WORLD!!! (from worker node)
    Hello WORLD!!! (from worker node)

  5. Since a parallel program runs on several processors, there needs to be some method of starting the program on each of the different processors. On Darter this is done using batch scripts. A batch script can be used to run a set of commands on the system's compute partition. It is a shell script containing PBS flags and commands to be interpreted by a shell. Batch scripts are submitted to the batch manager, PBS, where they are parsed. Based on the parsed data, PBS places the script in the queue as a job. Once the job makes its way through the queue, the script is executed on the head node of the allocated resources. An example batch script is shown below, followed by how to submit it and some common usage tips.

    Example Batch Script

    1: #!/bin/bash
    2: #PBS -A XXXYYY
    3: #PBS -N test
    4: #PBS -j oe
    5: #PBS -l walltime=1:00:00,size=256
    6:
    7: cd $PBS_O_WORKDIR
    8: date
    9: aprun -n 256 ./a.out

    This batch script can be broken down as follows: line 1 names the shell that interprets the script; line 2 (-A) gives the account to be charged; line 3 (-N) sets the job name; line 4 (-j oe) joins standard error into standard output; line 5 (-l) requests one hour of walltime and 256 cores; line 7 changes to the directory from which the job was submitted; line 8 prints the date; and line 9 uses aprun to launch the executable on 256 cores of the compute partition.

    Submitting Batch Jobs

    Batch scripts can be submitted for execution using the qsub command on Darter. For example, the following will submit the batch script named test.pbs:

    % qsub test.pbs

    If successfully submitted, a PBS job ID will be returned. This ID can be used to track the job.

    For more information about qsub see the man pages.

    Look at the running job page for more (and some redundant) information. In particular, look at the PBS commands for submitting jobs, removing jobs from the queue, etc.

  6. We will use the provided pbssub PBS script to submit our job for the hello1 exercise. After compiling the source code you will find an executable file in the current hello1 directory. To submit the job to the CRAY Darter queue, use the provided pbssub file, but first be sure to examine it.
  7. Submit the job with the command:

    % qsub pbssub

  8. Did the job enter the batch queue? Check with the showq command. Where did the job run?

Hello World 2 - Hello Again!

The objective of this exercise is to become familiar with the basic MPI routines used in almost any MPI program. You are asked to write an SPMD (Single Program, Multiple Data) program where, again, each process checks its rank and decides whether it is the master (if its rank is 0) or a worker (if its rank is 1 or greater). (A sketch of the send and receive calls you will need appears after the list below.)

  1. The SPMD programs should:
    1. Include the header,
    2. Initialize MPI,
    3. Check its rank, and
      1. if the process is the master, then send a "Hello World!" message, in characters, to each of the workers
      2. if the process is a worker, then receive the "Hello World!" message and print it out
    4. Finalize MPI
  2. Compile your program at the command line or via the makefile. Run the code on 8 processes using the qsub pbssub command. You can also run it on 16, 24, etc., processes, keeping in mind that the number of requested cores must be a multiple of 8.
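
If you want to see the shape of the message-passing calls before writing your own, here is a minimal sketch. The message length, tag, and buffer size are illustrative, and this is only one possible structure; the lab's reference version is in the .soln file.

    /* Sketch of the master/worker exchange: rank 0 sends "Hello World!"
       as characters to every other rank, which receives and prints it.
       Buffer size and tag (0) are illustrative. */
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        char msg[20];
        int  rank, size, i;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                            /* master */
            strcpy(msg, "Hello World!");
            for (i = 1; i < size; i++)
                MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, i, 0, MPI_COMM_WORLD);
        } else {                                    /* worker */
            MPI_Recv(msg, 20, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            printf("Process %d received: %s\n", rank, msg);
        }

        MPI_Finalize();
        return 0;
    }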


Pi Calculation

This program calculates pi using a simple numerical integration related to the arctangent function. (A generic sketch of this kind of integration appears at the end of this exercise.)

  1. Review the mpi_pi.c/mpi_pi.f code to get an idea of what it does.
  2. Compile the code. A makefile has been provided for you; to compile with it, simply type: make. Or, to compile at the command line, use the cc compiler for the C code or the ftn compiler for the FORTRAN code:

    % cc mpi_pi.c -o mpi_pi
    % ftn mpi_pi.f -o mpi_pi

  3. Run the mpi_pi code using the provided pbssub file.
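
For orientation, the classic way such programs estimate pi is to approximate the integral of 4/(1 + x*x) from 0 to 1 (which equals 4 arctan 1 = pi) with a midpoint rule, splitting the intervals across processes and combining the partial sums with MPI_Reduce. The sketch below shows that general pattern; the provided mpi_pi code may organize the work differently, so use it only as a guide.

    /* Generic sketch of a distributed midpoint-rule estimate of pi.
       The number of intervals (n) and the use of MPI_Reduce are
       illustrative; check mpi_pi.c/mpi_pi.f for the provided approach. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int    rank, size, i, n = 1000000;
        double h, x, mypi = 0.0, pi = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        h = 1.0 / n;
        for (i = rank; i < n; i += size) {   /* each rank takes every size-th interval */
            x = (i + 0.5) * h;               /* midpoint of interval i */
            mypi += 4.0 / (1.0 + x * x);
        }
        mypi *= h;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %.16f\n", pi);

        MPI_Finalize();
        return 0;
    }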


Timing an MPI code

The objective of this exercise is to investigate the amount of time required for message passing between two processes.

In this exercise different size messages are sent back and forth between two processes a number of times. Timings are made for each message before it is sent and after it has been received. The difference is computed to obtain the actual communication time. Finally, the average communication time and the bandwidth are calculated and output to the screen.

We will run this code on two nodes (one process on each node) passing messages of length 1, 100, 10,000, and 1,000,000. You can record your results in a table like the one below.

    length        communication time (sec)    bandwidth (Megabit/Sec)
    1             0.000001                    65.440140
    100           0.000002                    2936.930591
    10,000        0.000052                    12321.465896
    1,000,000     0.005133                    12468.521884
A makefile and a pbssub file have been provided:

    to compile the code, type:  make
    to submit the job, type:    qsub pbssub
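
For reference, a bandwidth figure like those in the table is derived from the message size and the measured transfer time. The small stand-alone snippet below shows one such conversion; it assumes 8-byte elements and a time measured in seconds, and the function and variable names are illustrative (the provided timing code computes and prints its own values).

    /* Sketch: convert a message size in bytes and a measured one-way
       transfer time in seconds into a bandwidth in Megabit/Sec. */
    #include <stdio.h>

    double bandwidth_mbit(double bytes, double seconds)
    {
        return (8.0 * bytes) / (seconds * 1.0e6);
    }

    int main(void)
    {
        /* e.g. 1,000,000 eight-byte elements (8,000,000 bytes) in 0.005133 s */
        printf("%f Megabit/Sec\n", bandwidth_mbit(8.0e6, 0.005133));
        return 0;
    }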

PingPong - Calculating Transfer Rates

The objective of this exercise is to introduce some intermediate MPI features, and to understand how a possible deadlock situation can occur during message passing.

Write a program (pingpong) in which two processes pass a message (a certain number of real or float numbers) back and forth (perhaps 100 times). You will use the MPI_Wtime() routine as a timer in this exercise. This routine returns a time expressed in seconds, so in order to time something, two calls are needed, and the difference between them gives the total elapsed wall-clock time. (A minimal sketch of this timing pattern appears after the numbered steps below.)

  1. In the program, pingpong, it is safer to use MPI_Ssend, since MPI_Send may or may not be synchronous, and its use may result in a deadlock situation.
  2. Compile pingpong (with make, or at the command line: cc pingpong.c -o pingpong) and run it.
  3. Insert timing calls (see man MPI_Wtime) to estimate the time taken for one message on a one way trip. Calculate the transfer rate in bytes per second. What did you find?
  4. Add a loop around the timing calls, varying the length of the message from 1 to 10001 in steps of 1000, to investigate how the time taken varies with the size of the message.
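
The timing pattern asked for in step 3 is simply two MPI_Wtime() calls bracketing the communication; their difference is the elapsed wall-clock time in seconds. Below is a minimal sketch of that pattern only, with a placeholder function standing in for your ping-pong loop (write the actual exchange yourself, and consult the .soln file only as a last resort).

    /* Sketch of the MPI_Wtime() timing pattern: two calls bracket the
       communication and the difference is the elapsed time in seconds.
       exchange_messages() is a placeholder for your ping-pong loop. */
    #include <stdio.h>
    #include <mpi.h>

    void exchange_messages(void)
    {
        /* your MPI_Ssend / MPI_Recv ping-pong loop goes here */
    }

    int main(int argc, char *argv[])
    {
        double t0, t1;

        MPI_Init(&argc, &argv);

        t0 = MPI_Wtime();       /* time before the exchange */
        exchange_messages();
        t1 = MPI_Wtime();       /* time after the exchange  */

        printf("elapsed: %g seconds\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }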

Ring 1 - Sending Messages around a Ring

Consider a set of processes arranged in a ring as shown below. Use a token passing method to compute the sum of the ranks of the processes.

      1
     / \
    0   2
     \ /
      3


Figure 1: Four processors arranged in a ring. Messages are sent from 0 to 1 to 2 to 3 and back to 0; the sum of the ranks is 6.

Each processor stores its rank in MPI_COMM_WORLD in an integer and sends this value to the processor on its right. It then receives an integer from its left neighbor. It keeps track of the sum of all the integers received. The processors continue passing on the values they receive until they get their own rank back. Each process should finish by printing out the sum of the values. Use synchronous sends MPI_Ssend() (blocking) or MPI_Issend() (non-blocking) for this program. Watch out for deadlock situations. If you use non-blocking sends, make sure that you do not overwrite information. You are asked to use synchronous message passing because the standard send can be either buffered or synchronous, and you should learn to program for either possibility.
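
One deadlock-free way to structure a single step of the exchange is sketched below, using the non-blocking synchronous send so the receive can be posted while the send is still pending. The variable names are illustrative, and the rest of the ring logic (accumulating the sum and looping until your own rank comes back around) is left to you.

    /* Sketch of one step of the ring exchange with MPI_Issend: the receive
       is posted while the send is still outstanding, so neighbouring
       processes cannot block each other. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int         rank, size, right, left, sendbuf, recvbuf;
        MPI_Request request;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        right = (rank + 1) % size;            /* neighbour to send to      */
        left  = (rank - 1 + size) % size;     /* neighbour to receive from */

        sendbuf = rank;                       /* the first token is your own rank */
        MPI_Issend(&sendbuf, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &request);
        MPI_Recv(&recvbuf, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &status);
        MPI_Wait(&request, &status);          /* reuse sendbuf only after the wait */

        printf("rank %d received %d from rank %d\n", rank, recvbuf, left);

        MPI_Finalize();
        return 0;
    }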


Simple Array Assignment

This is a simple array assignment used to demonstrate the distribution of data among multiple tasks and the communications required to accomplish that distribution.

The master distributes an equal portion of the array to each worker. Each worker receives its portion of the array and performs a simple value assignment to each of its elements. Each worker then sends its portion of the array back to the master. As the master receives a portion of the array from each worker, selected elements are displayed.

Note: For this example, the number of processes should be set to an odd number (aprun -n 7) to ensure even distribution of the array to the numtasks-1 worker tasks.
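
A sketch of this distribute/compute/collect pattern is shown below. The array size, the tags, and the value assigned to each element are illustrative; the provided array code differs in its details, so use this only to see the shape of the sends and receives.

    /* Sketch of the master/worker array distribution described above.
       ARRAYSIZE and the tags (0 for distribution, 1 for collection) are
       illustrative. Run with a process count such that ARRAYSIZE divides
       evenly by numtasks-1. */
    #include <stdio.h>
    #include <mpi.h>

    #define ARRAYSIZE 60000

    int main(int argc, char *argv[])
    {
        static float data[ARRAYSIZE];
        int   rank, numtasks, chunksize, offset, i, worker;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        chunksize = ARRAYSIZE / (numtasks - 1);         /* equal share per worker */

        if (rank == 0) {                                /* master */
            for (worker = 1; worker < numtasks; worker++) {
                offset = (worker - 1) * chunksize;
                MPI_Send(&data[offset], chunksize, MPI_FLOAT, worker, 0, MPI_COMM_WORLD);
            }
            for (worker = 1; worker < numtasks; worker++) {
                offset = (worker - 1) * chunksize;
                MPI_Recv(&data[offset], chunksize, MPI_FLOAT, worker, 1, MPI_COMM_WORLD, &status);
            }
            printf("selected element data[100] = %f\n", data[100]);
        } else {                                        /* worker */
            MPI_Recv(data, chunksize, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
            for (i = 0; i < chunksize; i++)
                data[i] = i * 1.0f;                     /* the simple value assignment */
            MPI_Send(data, chunksize, MPI_FLOAT, 0, 1, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }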


Matrix Multiplication

This example is a simple matrix multiplication program.

The data is distributed among the workers who perform the actual multiplication in smaller blocks and send back their respective results to the master.

Note: The C and FORTRAN versions of this code differ because of the way arrays are stored/passed. C arrays are stored in row-major order while FORTRAN arrays are stored in column-major order.
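
To see why the storage order matters, note that the same logical element (row i, column j of an N-by-N matrix) sits at a different linear offset in memory in the two layouts. The small stand-alone snippet below (illustrative only, not part of the exercise code) prints the two offsets for 0-based indices:

    /* Row-major (C) versus column-major storage: with 0-based row index i
       and column index j in an N-by-N matrix, the element is at offset
       i*N + j in row-major order but at offset j*N + i in column-major
       order (the order Fortran uses). */
    #include <stdio.h>

    #define N 4

    int main(void)
    {
        int i = 2, j = 3;
        printf("row-major    offset = i*N + j = %d\n", i * N + j);
        printf("column-major offset = j*N + i = %d\n", j * N + i);
        return 0;
    }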


2D Laplace Equation

This example solves a two-dimensional Laplace equation using the point Jacobi iteration method over a rectangular domain. The initial guess for the function is zero, and the boundaries are held at 100 throughout the calculation. Domain decomposition is used for the parallel implementation. Run this exercise on 4 processes (aprun -n 4).
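
As a reminder of what the point Jacobi iteration does, the serial kernel below replaces every interior point with the average of its four neighbours on each sweep; the provided laplace code wraps this kind of kernel in the domain decomposition and the message passing between the 4 processes. The grid size and iteration count here are illustrative only.

    /* Serial sketch of the point Jacobi iteration for the 2D Laplace
       equation: interior points are repeatedly replaced by the average of
       their four neighbours, with the boundary held at 100. */
    #include <stdio.h>

    #define NX 16
    #define NY 16

    int main(void)
    {
        double u[NX][NY] = {{0.0}}, unew[NX][NY];
        int    i, j, iter;

        for (i = 0; i < NX; i++)                 /* boundaries held at 100 */
            u[i][0] = u[i][NY - 1] = 100.0;
        for (j = 0; j < NY; j++)
            u[0][j] = u[NX - 1][j] = 100.0;

        for (iter = 0; iter < 1000; iter++) {
            for (i = 1; i < NX - 1; i++)
                for (j = 1; j < NY - 1; j++)
                    unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                         u[i][j - 1] + u[i][j + 1]);
            for (i = 1; i < NX - 1; i++)
                for (j = 1; j < NY - 1; j++)
                    u[i][j] = unew[i][j];
        }
        printf("value at the centre after 1000 sweeps: %f\n", u[NX / 2][NY / 2]);
        return 0;
    }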


Logging Off

Type exit to close the connection with the CRAY Darter machine.


Documentation for MPI and mpich, and Additional Resources

  • There are man pages available for MPI which should now be installed in your MANPATH. Look at the following man pages to see some introductory information about MPI.
  • % man MPI
    % man cc
    % man ftn
    % man qsub
    % man MPI_Init
    % man MPI_Finalize

  • You can also refer to the NICS User Support page for Darter-specific MPI implementation details.
  • The MPI man pages are also available online.
  • The MPICH home page, http://www.mpich.org, is maintained by Argonne National Laboratory, which also maintains a general MPI page.


Acknowledgments

The original MPI training materials for workstations were developed under the Joint Information Systems Committee (JISC) New Technologies Initiative by the Training and Education Centre at Edinburgh Parallel Computing Centre (EPCC-TEC), University of Edinburgh, United Kingdom.

Thanks also to Blaise Barney from Cornell University Theory Center for his modifications of the labs available through the MHPCC on the World Wide Web. These labs have since been modified for this workshop.


Joint Institute for Computational Sciences