GRAIL's camelot cluster

By Will Knottenbelt and Matt Johnson.


From Will:

[Image: The camelot cluster in the DoC machine room. Picture by Uli Harder.]

The GRAIL project's camelot cluster is now available for use by AESOP members. It was bought specifically for the GRAIL grant, but all AESOP members should feel free to use it (and the associated file space) for their projects!

The camelot cluster consists of 16 dual-processor, dual-core nodes called camelot01..16. Each node is a Sun Fire X4100 with two 64-bit Opteron 275 processors and 8GB of RAM, and is connected with both gigabit ethernet and Infiniband interfaces. The nodes run the 64-bit version of the current DoC Linux distribution; 32-bit binaries may run but there are no guarantees, so you are strongly recommended to recompile any software as native 64-bit.

The Infiniband fabric runs at 2.5Gbit/s, managed by a Silverstorm 9024 switch.

A fileserver called excalibur (not for general login at the moment) provides NFS access to /vol/grail/, a 4.7TByte filesystem stored on a Sun StorEdge 3120 across twelve 500GB disks. This space is not currently backed up; CSG plan to back it up in the future but are not yet able to do so.


Further information from Matt Johnson on MPI, Sun Grid Engine and MPI-universe jobs.

MPI

Firstly, revised instructions for using standard, "unmanaged" MPI on the camelots:

  1. Rebuild your MPI source on camelot01! MVAPICH (the variant of MPICH built to use the Infiniband interconnect on the camelots):

    1. is compiled 64-bit native
    2. does not use quite the same imports as standard "p4pe" or "p4ch" MPI!
  2. Run your MPI code as normal (a minimal sketch of such a run follows this list). Your machinefile can now just list the standard hostnames, one per line, e.g.:

    camelot01
    camelot02
    camelot03
    …
    

    i.e. the "-ib" suffix is no longer required. (In fact, it was never required!)

    SSH on the camelots has been reconfigured, and now uses host-based authentication within the cluster. As a result, ssh-agents are no longer required.
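
The sketch below shows what such an unmanaged run might look like. The machinefile name, process count and binary name (my_mpi_prog) are placeholders, and mpirun_rsh will also accept the hostnames directly on the command line instead of a -hostfile:

    # machines: one camelotNN hostname per line (plain names, no "-ib" suffix), e.g.
    #   camelot01
    #   camelot02
    #   camelot03
    #   camelot04

    # Launch 4 MPI processes via ssh, one per host listed in the machinefile.
    mpirun_rsh -ssh -np 4 -hostfile machines ./my_mpi_prog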


Sun Grid Engine

On to Sun Grid Engine… SGE has been installed on the camelot cluster. It allows jobs to be run in a batch environment, in both a vanilla universe and an MPI/MVAPICH universe.

To submit a "vanilla" job:

  1. Write a job script (a minimal sketch is given after this list). You can find an example of this in:

    /vol/grail/users/mwj/sge/testing/simple.sh
  2. Add the SGE tools to your environment.

    If you are a csh or tcsh user:

    source /vol/grail/software/sge-current/camelot/common/settings.csh

    If you are a bash, ksh or sh user:

    . /vol/grail/software/sge-current/camelot/common/settings.sh
  3. Submit the job. The tool for this is called qsub, and it (along with all the SGE tools) has a very good manpage – run man qsub. man sge_intro is also worth a read.

    qsub /path/to/yourscript.sh
  4. You can find out the state of your job by using qstat. Again, the man page gives much more information, but the simplest use is to just run:

    qstat

    If you do this immediately after you have submitted a job, it will probably look something like this:

    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
          25 0.00000 simple.sh  mwj          qw    07/08/2006 11:53:05                                    1

    So you have the job-ID, the job's scheduling priority (prior), the name of the job, who owns it, its state (qw: queued and waiting), when it was submitted, and the number of slots it has been allocated (1).

    A little later, once SGE has found a match for the job, it should look something like:

    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
          25 0.55500 simple.sh  mwj          r     07/08/2006 11:54:41 all.q@camelot03.doc.ic.ac.uk       1

    Here, you can see the job has been assigned a scheduling priority (0.55500), it is in state r (running), it started running at 11:54:41, and it is running in the all.q queue, specifically on camelot03.

    Once the job has finished, it will (rather unceremoniously) dump the output in your home directory, in files called:

    <jobname>.[eo]<jobID>

    e.g. "simple.sh.e25" and "simple.sh.o25" in this case.

    The "e" file contains the contents of STDERR, and the "o" file STDOUT.

    You can use qsub switches (-o and -e) to change where this output is placed.

  5. If you want to delete a job, issue:

    qdel <jobID>
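
As promised above, here is a rough sketch of what a vanilla job script might contain. The #$ lines are ordinary SGE directives that qsub reads from the script; the job name, shell and output paths below are illustrative only, so check simple.sh and man qsub for the authoritative options:

    #!/bin/sh
    # Run the job under /bin/sh.
    #$ -S /bin/sh
    # Give the job a readable name (otherwise the script name is used).
    #$ -N my_vanilla_job
    # Execute in the directory the job was submitted from.
    #$ -cwd
    # Redirect STDOUT and STDERR rather than letting them land in $HOME
    # (paths are placeholders).
    #$ -o /vol/grail/users/yourusername/logs/my_vanilla_job.out
    #$ -e /vol/grail/users/yourusername/logs/my_vanilla_job.err

    # The actual work: any ordinary shell commands go here.
    echo "Hello from $(hostname)"

Submitting this with qsub and watching it with qstat then works exactly as described in steps 3 and 4.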

MPI-universe

Submitting an MPI-universe job is very, very similar!

  1. Write a job script (again, a rough sketch follows this list). You can find an example of this in:

    /vol/grail/users/mwj/sge/testing/mvapich/mvapichl.sh

    You should call mpirun_rsh -ssh to launch your MPI job under SGE.

  2. Add the SGE tools to your environment.

    If you are a csh or tcsh user:

    source /vol/grail/software/sge-current/camelot/common/settings.csh

    If you are a bash, ksh or sh user:

    . /vol/grail/software/sge-current/camelot/common/settings.sh
  3. Submit the job. Very similar, but we need a little extra magic to tell SGE we are submitting into a Parallel Environment, and how many slots we wish to use for the job.

    qsub -pe mvapichl 4 /vol/grail/users/mwj/sge/testing/mvapich/mvapichl.sh

    This says: Add the job …/mvapichl.sh to the general queue, but use the mvapichl parallel environment with 4 slots.

    You can specify this in the job script as well; see mvapichl-inline.sh for an example. In this case, you would merely run:

    qsub /vol/grail/users/mwj/sge/testing/mvapich/mvapichl-inline.sh

    …and the PE would automatically be used.

  4. Monitoring the job is very similar. qstat works the same way, but standard qstat only shows the node on which rank 0 is running. In order to find out where the other processes are running, you can use:

    qstat -f

    The output from this is relatively self-explanatory.

  5. The output files are similar, but you will also find "pe" and "po" files, which are created by the parallel environment in setting up the machinefile for the MPI process.
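
As a rough guide, the sketch promised above shows what the launch section of such a job script might look like. It assumes the mvapichl parallel environment writes its machinefile to $TMPDIR/machines, which is the usual convention for MPI parallel environments but should be checked against mvapichl.sh; the binary name is a placeholder:

    #!/bin/sh
    #$ -S /bin/sh
    #$ -cwd
    # Request the parallel environment inline, equivalent to "qsub -pe mvapichl 4".
    #$ -pe mvapichl 4

    # $NSLOTS is set by SGE to the number of slots actually granted;
    # the machinefile location below is assumed, not guaranteed.
    mpirun_rsh -ssh -np $NSLOTS -hostfile $TMPDIR/machines ./my_mpi_prog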


Thanks to Matt for the thorough information.