The Sun Grid Engine (SGE) system manages the department batch queue. Grid Engine runs jobs on the departmental and research compute nodes.
The CS Grid Engine setup organizes compute resources into two queues.
Two additional queues exist to hold computers owned by specific research groups.
Jobs queued in compsci are low-priority jobs in SGE parlance. Low-priority jobs have the advantage that they can also run on the nodes owned by research groups, such as the architecture group or the Donald Lab, so they have the largest pool of potential machines to run on. However, if a high-priority job is submitted when all resources are in use, a low-priority job will be slowed down by 95% to give the high-priority job 95% of the CPU.
For the basics of Grid Engine operation, please see the following links
All jobs submitted to Grid Engine must be shell scripts, and must be submitted from one of the cluster machines. Grid Engine scans the script text for qsub option flags, so the same flags can be given on the qsub command line or embedded in the script. Lines in the script beginning with #$ are interpreted as containing qsub flags.
The following job runs the program hostname. The script passes Grid Engine the -cwd flag, which runs the job in the directory that was current when qsub was executed. This is equivalent to running: qsub -cwd job.sh.
    #!/bin/sh
    #$ -cwd
    hostname
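To check which flags a script carries before submitting it, the embedded #$ lines can be listed with a short helper. This is a minimal sketch, not part of Grid Engine: the function name embedded_flags and the demo script are hypothetical, and the helper only prints the flags without validating them.

```shell
#!/bin/sh
# Print the qsub flags embedded in a job script, one "#$" line per output line.
# embedded_flags is a hypothetical helper, not a Grid Engine command.
embedded_flags() {
    # Keep only lines that begin with "#$" and strip that prefix
    # together with any whitespace that follows it.
    sed -n 's/^#\$[[:space:]]*//p' "$1"
}

# Demo: write a small job script to a temporary file and inspect it.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/sh
#$ -cwd
#$ -N demo_job
hostname
EOF

embedded_flags "$demo_script"
rm -f "$demo_script"
```

For the demo script the helper prints -cwd and -N demo_job on separate lines, which are the same flags one would otherwise type after qsub on the command line.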
Here is a sample MPICH2 job for Grid Engine. This script will run in the grisman_mpich2 parallel environment with 2 slave processes.
    #!/bin/csh -f
    # ---------------------------
    # job name
    #$ -N MPI_Job
    #
    # pe request
    #$ -pe grisman_mpich2 2
    #
    # Operate in current working directory
    #$ -cwd
    #
    # ---------------------------
    # csh sets environment variables with setenv, not export
    setenv MPIEXEC_RSH /usr/bin/rsh
    # NSLOTS and TMPDIR are set by Grid Engine for the running job
    mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines my_mpiprogram
You can use the program cluster_scan or qstat to monitor the cluster.
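As a small illustration of monitoring with qstat, the sketch below lists only the current user's jobs using qstat's -u option. The wrapper name show_my_jobs is hypothetical, and the check for qstat is there only because the command is normally available on the cluster machines and not elsewhere. cluster_scan is site-specific and is not shown.

```shell
#!/bin/sh
# List the current user's Grid Engine jobs.
# show_my_jobs is a hypothetical wrapper around the real qstat command.
show_my_jobs() {
    # qstat is only expected to exist on cluster machines.
    if ! command -v qstat >/dev/null 2>&1; then
        echo "qstat not found: run this from a cluster machine" >&2
        return 1
    fi
    # -u restricts the listing to the named user's jobs.
    qstat -u "${USER:-$(id -un)}"
}
```

Calling show_my_jobs from a cluster machine prints the same listing as running qstat -u with your own user name.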
Please be aware that compute cluster machines are not backed up. Copy any important data to filesystems that are backed up to avoid losing it. In addition, keep in mind that the cluster is a shared resource, and minimize network traffic to shared resources such as disk space. If you need to read and write lots of data, copy the data to a local disk, compute the results there, and then store the results on longer-term storage.
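The copy, compute, store pattern described above can be sketched as a small shell function. This is one possible approach under stated assumptions, not a departmental tool: the name run_staged is hypothetical, the command is assumed to take the input file as its last argument and write its result to standard output, and TMPDIR is assumed to point at local disk inside a running job (the function falls back to /tmp otherwise).

```shell
#!/bin/sh
# Stage input to local scratch, compute there, then copy the result back.
# Usage: run_staged INPUT_FILE RESULT_DIR COMMAND [ARGS...]
# run_staged is a hypothetical helper; COMMAND receives the local copy of
# the input as its last argument and writes its result to standard output.
run_staged() {
    input=$1
    result_dir=$2
    shift 2

    # Private scratch directory; assumes TMPDIR is on local disk.
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX") || return 1

    # One read from shared storage, then all work happens locally.
    cp "$input" "$scratch/input" || { rm -rf "$scratch"; return 1; }
    "$@" "$scratch/input" > "$scratch/output"
    status=$?

    # One write back to longer-term storage, only if the command succeeded.
    if [ "$status" -eq 0 ]; then
        mkdir -p "$result_dir" && cp "$scratch/output" "$result_dir/output"
        status=$?
    fi

    rm -rf "$scratch"
    return "$status"
}
```

For example, run_staged data.txt "$HOME/results" sort would sort a local copy of data.txt and leave the sorted text in $HOME/results/output, touching shared storage only for the initial copy and the final result.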