Users are discouraged from running applications directly on the login machine, because a runaway command may bring the system down. Instead, the HPC cluster uses the Sun Grid Engine (SGE) job scheduler to manage compute jobs. The scheduler launches your job on a compute node with the appropriate resources.
Some programs, like R or paup, can be run interactively. If you wish to run an interactive program, you can use the qlogin command to automatically log you in to an available compute node. Once logged into the compute node, you can run your program as usual. If your program is SMP-parallel (multiple processors on a single node/machine) you can request a specific number of processors from a single machine. The following example requests a login session with 4 processors, all on a single machine:
qlogin -pe smp 4
If the requested processors are not immediately available, the qlogin command will return with a message stating that the job could not be scheduled. You can tell qlogin to wait until the requested resources become available by passing -now no. For example:
qlogin -now no -pe smp 4
When you are done, just type logout and you’ll be returned to your original login session.
The cluster provides several “helper” commands (runpaup, rungamess, etc.) that take care of the overhead of scheduling a program to run on a compute node.
Each of the “run” commands writes its standard output to PGM-USERNAME-###.out and its errors to PGM-USERNAME-###.err in the current working directory. PGM is the program name (talys, mrbayes, etc.), USERNAME is your username (e.g., jsmith), and ### is the scheduler-assigned job number.
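Because these file names follow a fixed pattern, the names to look for can be worked out as soon as the job number is known. The following is a minimal sketch of that pattern; the function name run_output_files is made up for illustration and is not one of the cluster's commands.

```shell
#!/bin/bash
# Hypothetical helper (not a cluster command): print the .out and .err file
# names that a "run" command is described as writing (PGM-USERNAME-###).
run_output_files() {
    local pgm="$1" user="$2" job_id="$3"
    printf '%s-%s-%s.out\n' "$pgm" "$user" "$job_id"
    printf '%s-%s-%s.err\n' "$pgm" "$user" "$job_id"
}

# Example: a MrBayes job with job number 5842 submitted by jsmith
run_output_files mrbayes jsmith 5842
```

For the example call, this prints mrbayes-jsmith-5842.out followed by mrbayes-jsmith-5842.err.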
The following script is an example job submission script. While the script is a standard Bash shell script, you do not execute it directly, but rather pass it to the qsub command. For this reason, consider using the convention of saving such files with a .qsub extension. Parameters for SGE are placed in special comments beginning with #$. For a complete description, see the qsub man page (man qsub).
#!/bin/bash
# example.qsub
# Example qsub job submission script for launching a parallel MrBayes job

# Run the job in the current working directory
#$ -cwd

# Job name:
#$ -N example-job

# Use the orte (OpenMPI) runtime environment with 16 CPUs
#$ -pe orte 16

# Send email upon job end/abort/suspend to the specified address
# Possible values for -m are:
#   'b' Mail is sent at the beginning of the job.
#   'e' Mail is sent at the end of the job.
#   'a' Mail is sent when the job is aborted or rescheduled.
#   's' Mail is sent when the job is suspended.
#   'n' No mail is sent.
#$ -M email@example.com -m eas

# Output file name, all stdout goes here
#$ -o mrbayes-out

# Errors output file name, all stderr is written to this file
#$ -e mrbayes-err

# Export all current environment variables to the scheduled job
#$ -V

# Execute the program
mpirun -np 16 /share/apps/mrbayes/mb-multi myinput.nex
Programs using a parallel library other than MPI (e.g., GAMESS) can also be launched in this manner, as can serial programs such as Paup.
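A serial job needs no parallel environment, so the -pe request can simply be left out of the script. As a rough sketch, the commands below write such a script to a file. The file name serial-example.qsub, the job name, the output file names, and the program ./my_program with its input file are all placeholders, not software known to be installed on the cluster.

```shell
# Write a minimal serial job script to disk (all names are placeholders).
cat > serial-example.qsub <<'EOF'
#!/bin/bash
# serial-example.qsub
# Sketch of a job submission script for a serial (single-CPU) program

# Run the job in the current working directory
#$ -cwd

# Job name:
#$ -N serial-example

# Output and error file names
#$ -o serial-out
#$ -e serial-err

# Export all current environment variables to the scheduled job
#$ -V

# Execute the program (placeholder command, replace with your own)
./my_program myinput.dat
EOF
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the script while it is being written, so the #$ lines reach the file unchanged.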
To schedule the job defined in example.qsub, use the qsub command:
qsub example.qsub
Jobs may be monitored from the command line with the qstat command or via the web on the cluster’s Job Queue web page.
Jobs launched via the “run” commands, as well as custom jobs and qlogin sessions, are assigned a job ID and can be monitored with the qstat command, which lists all of your running and scheduled jobs. To see a listing of all running and scheduled jobs for all users, type qstat -u '*'. When your job completes, it will no longer be listed in the qstat output.
Here is an example of user jsmith monitoring a job.
[jsmith@login-0-0 best]$ qstat
job-ID  prior    name       user    state  submit/start at      queue  slots  ja-task-ID
5842    0.00000  best-jsmi  jsmith  qw     12/14/2009 12:18:07         4
The output of the qstat command shows a list of your jobs and what state they’re in. In this case there is only one job (job 5842), and it is in the qw, or “queued, waiting”, state, so it has not started running yet. The qstat command will show an ‘r’ for jobs that are running, and an ‘E’ (for example, Eqw) for jobs in an error state.
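For scripting, the job ID and state columns can be pulled out of the qstat listing. The sketch below assumes the column layout of the example listing (job ID in the first field, state in the fifth); the function name job_states is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read qstat output on standard input and print
# "JOB_ID STATE" for each job line. Header lines are skipped because
# their first field is not a number.
job_states() {
    awk '$1 ~ /^[0-9]+$/ { print $1, $5 }'
}

# On the cluster the typical use would be:  qstat | job_states
# Here the sample listing from the text is fed in instead, so the
# example runs without a scheduler.
printf '%s\n' \
    'job-ID prior name user state submit/start at queue slots ja-task-ID' \
    '5842 0.00000 best-jsmi jsmith qw 12/14/2009 12:18:07 4' | job_states
```

For the sample listing this prints 5842 qw.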
On-campus users can also view all running and scheduled jobs on the Job Queue web page.
To cancel a job, use the qdel command and pass it the job ID of the job you’d like to terminate. Here’s an example:
[jsmith@login-0-0 ~]$ qdel 5842
jsmith has registered the job 5842 for deletion