Deepthought2

From GeoWiki


This is an overview of the key things you need to get started on Deepthought2, focusing on what is common to people in our department. For a more general discussion, please check out the DIT Deepthought Usage page.

Overview

The Deepthought2 supercomputer consists of 2 login nodes and >400 compute nodes. After logging into login.deepthought2.umd.edu you'll be assigned to either login-1 or login-2. The login nodes are to be used to compile code, move data, launch jobs, etc., and they have all the same software as the compute nodes (and more). Please do not run multi-core jobs directly on these nodes; odds are their IT will yell at us eventually. Jobs run on 1 or more of the compute nodes, which are accessible only from the login nodes. Each of the standard compute nodes has:

  • 20 cores
  • 128 GB of memory
  • 750 GB temporary hard drive space
  • FDR Infiniband network connection between nodes

There are additional nodes that have increased memory (1TB) or GPUs if needed.

Disk Space

There are several areas of disk space available on Deepthought for the users:

  • Home Directory - Everyone is given 10GB of space in their home directory. Access to the home directory is slow from the compute nodes, so try not to keep data or programs here that are needed during a run. Also, you'll get warnings if you launch an MPI job from here.
  • Lustre - 1 Petabyte of storage is available under /lustre/<username>. Although there are currently no user quotas on this file system, one will inevitably be imposed eventually. Code to be run should be stored here, though this file system is not backed up. Lustre can slow down considerably depending on how people are using DT2 at the time, so to make your experiments as fast as possible, copy the bulk of the data you need to the node's local /tmp folder at run time.
  • Scratch Disk - Each compute node has 750GB of space available under /tmp/. It is faster than Lustre but cannot be shared between nodes, and it is erased when a job completes.
  • Ram Disk - Each compute node can use part of its 128GB of memory as a ram disk located under /dev/shm/. This is temporary disk space similar to /tmp, but it resides entirely in memory and so is extremely fast.
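As a rough illustration of the "copy to /tmp at run time" advice, here is a minimal sketch of a staging helper you could put at the top of a job script. The function name stage_in, the per-job directory naming, and the example paths in the comments are all made up for illustration; only /lustre/<username> and /tmp come from the description above.

```shell
#!/bin/bash
# Sketch only: copy input data from Lustre to node-local scratch before a run.
# stage_in SRC_DIR [SCRATCH_ROOT] prints the path of the staged copy.
# (stage_in and the directory naming are illustrative, not a DT2 convention.)
stage_in() {
    local src="$1"
    local root="${2:-/tmp}"
    # One directory per job so that jobs sharing a node do not collide.
    # SLURM sets SLURM_JOB_ID inside a job; "manual" is a fallback for testing.
    local dest="$root/${USER:-user}_${SLURM_JOB_ID:-manual}"
    mkdir -p "$dest" || return 1
    # Copy the contents of SRC_DIR (not the directory itself) into the scratch dir.
    cp -r "$src"/. "$dest"/ || return 1
    echo "$dest"
}

# Hypothetical usage inside a job script (paths are placeholders):
#   work=$(stage_in /lustre/$USER/my_experiment/input)
#   cd "$work" && /lustre/$USER/my_experiment/model.exe
#   cp -r "$work"/output /lustre/$USER/my_experiment/
```

Remember that /tmp is erased when the job completes, so anything you want to keep has to be copied back to Lustre before the script exits. For a multi-node job, each node would need its own copy.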

Accounts

Every job submitted to the cluster is billed to an account. You can find out the accounts you have access to by using sbalance on the command line.

Allocations

When submitting a job using sbatch -A <account> <script> or using #SBATCH -A <account> within a script, there are two main categories of pools to use:

  • High Priority - This is our main pool of hours; it is replenished at the beginning of each month with 610,000 CPU hours. If there are other jobs waiting to run on Deepthought, your job will be placed in the queue with a high priority. These accounts must absolutely be used first, as unused hours disappear at the end of the month. Moreover, using the standard accounts below means you are eating into next month’s High Priority allocation. Select between ved-prj-hi, ved-lab-hi, schmerr-lab-hi or schmerr-prj-hi.
  • Standard - These accounts queue jobs at a lower priority. They hold the hours left over from the previous month, plus hours that you can borrow from the next month. So, try to limit usage once the balance gets below 610 kSU. Select between ved-prj, ved-lab, schmerr-lab or schmerr-prj.

In addition to specifying an account to charge when running a job, the partition used can be specified with sbatch -A <account> -p <partition> <script>. Normally you should only do this in 2 circumstances:

  • scavenger - by running with sbatch -A <account> -p scavenger your job will run only if there are no other normal jobs in the queue. No hours will be charged to the department's account, but your job may be interrupted if a normal job enters the queue and there are no other nodes available. It will then be put back in the queue and will wait to run again, so your job script must be easy to stop and restart. This might be quite useful for the data assimilation people doing their usual DA cycles, which can easily be restarted.
  • debug - by running with sbatch -A <account> -p debug your job will be placed in the queue with a high priority (regardless of the account specified) though will only run for a maximum of 15 minutes.
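To make "easily stopped and restarted" a bit more concrete, here is one possible sketch of a restartable cycle loop for a scavenger job. It records the last finished cycle in a checkpoint file and resumes from the next one when the job is requeued. The function name run_cycles, the checkpoint file, and the run_da_cycle command in the usage comment are hypothetical; adapt them to however your own cycles are driven.

```shell
#!/bin/bash
# Sketch only: run cycles FIRST..LAST, resuming after the last completed one.
# run_cycles FIRST LAST CHECKPOINT_FILE COMMAND [ARGS...]
# COMMAND is called once per cycle with the cycle number appended.
run_cycles() {
    local first="$1" last="$2" ckpt="$3"
    shift 3
    local cycle="$first"
    # If a previous (interrupted) run left a checkpoint, start after it.
    if [ -s "$ckpt" ]; then
        cycle=$(( $(cat "$ckpt") + 1 ))
    fi
    while [ "$cycle" -le "$last" ]; do
        "$@" "$cycle" || return 1
        # Record completion only after the cycle finished successfully.
        echo "$cycle" > "$ckpt"
        cycle=$(( cycle + 1 ))
    done
}

# Hypothetical usage in a script submitted with -p scavenger:
#   run_cycles 1 100 /lustre/$USER/exp1/last_cycle.txt ./run_da_cycle.sh
```

The checkpoint file should live on Lustre rather than /tmp, since the job may restart on a different node and /tmp is erased when a job ends.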

Viewing Remaining Hours

Running sbalance --all will show how many hours are remaining for our department in the given month (e.g. ved-lab-hi), as well as the hours left over from the previous month plus the hours that can be borrowed from the next month (e.g. ved-lab). Additionally, you can see the individual usage of the members of our department.

Running Jobs

The usual way of running a job is to create a script file that is submitted to the scheduling system with the sbatch command. Extensive details on this can be found at DIT's info on running jobs. In summary, your script will consist of at least the following lines at the top:

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 2:00:00

indicating the number of nodes you want, and the maximum running time. The script is then placed in the queue with sbatch -A <account> script_name

Keep in mind that each node has 20 cores, and using any core on a node results in being charged for the entire node, so optimize your configuration accordingly (e.g. it would be a waste to request 22 cores, since you would be charged for 40). So, for example, using 10 nodes for a whole day would charge 10 × 20 × 24 = 4,800 hours to the department's account.
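The whole-node charging rule is easy to check with a little shell arithmetic before you submit. The sketch below assumes the 20 cores per node stated above; the helper names nodes_needed and charged_hours are made up for this example.

```shell
#!/bin/bash
# Sketch only: estimate what a request will cost under whole-node charging.
CORES_PER_NODE=20   # standard Deepthought2 compute node

# nodes_needed CORES: number of nodes required, rounding up.
nodes_needed() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# charged_hours NODES WALL_HOURS: core-hours billed to the account.
charged_hours() {
    echo $(( $1 * CORES_PER_NODE * $2 ))
}

nodes_needed 22       # prints 2, i.e. you are billed for 40 cores
charged_hours 10 24   # prints 4800
```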

Debugging

For debugging purposes, instead of running directly on a login node, it is recommended to first request a node with the salloc command, for example:

login-1:~ salloc -A ved-job -p debug -t 15:00
salloc: Granted job allocation 5227967
salloc: Waiting for resource configuration
salloc: Nodes compute-b28-47 are ready for job

The above command successfully requested a compute node for 15 minutes (-p debug gives a higher priority in the queue, but limits the time to 15 minutes; drop this option if a longer time is needed). Then log in to the allocated node:

ssh -Y compute-b28-47.deepthought2.umd.edu

Checking run status

To view a list of all jobs you have running, you can use the squeue command, for example:

      login-1:~ squeue -u moulik
       JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     2736589 ved-lab-hi test1   moulik  R   20:14:29      1 compute-b20-4
     2736588 ved-lab-hi test2   moulik  R   20:15:45      1 compute-b20-2
 

Email Notification

To receive email notification of your job finishing (or crashing) you can set the --mail-type= and --mail-user= parameters at the top of your job's batch script, for example:

#!/bin/bash
#SBATCH -n 20
#SBATCH -t 1:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=moulik@umd.edu

Sample SLURM Scripts

Below are a number of sample scripts that can be used as templates for building your own SLURM submission scripts for use on Deepthought2. If you choose to copy one of these sample scripts, please make sure you understand what each of the sbatch directives does before using it to submit your jobs. Otherwise, you may not get the result you want and may waste valuable computing resources.

Basic, single-processor job

This script can serve as the template for many single-processor applications. The --mem flag (or alternatively --mem-per-cpu) can be used to request the appropriate amount of memory for your job. Please make sure to test your application and set this value to a reasonable number based on actual memory use. The %j in the --output line (-o can also be used) tells SLURM to substitute the job ID into the name of the output file. You can also add -e or --error with an error file name to separate the output and error logs.

Sample script single_processor_job.sh:

#!/bin/sh
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --mail-type=ALL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>   # Where to send mail	
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=600mb                   # Memory limit
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.out   # Standard output and error log

pwd; hostname; date

module load python

echo "Running plot script on a single CPU core"

# Run your program with correct path and command line options
./YOURPROGRAM INPUT
#python /homes/moulik/plot_template.py

date

Threaded or multi-processor job

This script can serve as a template for applications that are capable of using multiple processors on a single server or physical computer. These applications are commonly referred to as threaded, OpenMP, PTHREADS, or shared memory applications. While they can use multiple processors, they cannot make use of multiple servers and all the processors must be on the same node.

These applications require shared memory and can only run on one node; as such, it is important to remember the following:

  • You must set --nodes=1, and then set --cpus-per-task to the number of OpenMP threads you wish to use.
  • You must make the application aware of how many processors to use. How that is done depends on the application:
    • For some applications, set OMP_NUM_THREADS to a value less than or equal to the number of cpus-per-task you set.
    • For some applications, use a command line option when calling that application.

Sample script multi_processor_job.sh:

#!/bin/sh
#SBATCH --job-name=parallel_job_test # Job name
#SBATCH --mail-type=ALL              # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # Where to send mail	
#SBATCH --nodes=1                    # Use one node
#SBATCH --ntasks=1                   # Run a single task	
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem=600mb                  # Total memory limit
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.out     # Standard output and error log

pwd; hostname; date

echo "Running prime number generator program on $SLURM_CPUS_ON_NODE CPU cores"

module load gcc/5.2.0 

# Run your program with correct path and command line options
./YOURPROGRAM INPUT

date


Another example, setting OMP_NUM_THREADS:

Sample script multi_processor_job2.sh:

#!/bin/sh
#SBATCH --job-name=parallel_job_test # Job name
#SBATCH --mail-type=ALL              # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # Where to send mail	
#SBATCH --nodes=1                    # Use one node
#SBATCH --ntasks=1                   # Run a single task	
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem=600mb                  # Total memory limit
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.out     # Standard output and error log

export OMP_NUM_THREADS=4

# Load required modules; for example, if your program was
# compiled with Intel compiler, use the following 
module load intel

# Run your program with correct path and command line options
./YOURPROGRAM INPUT

MPI job

This script can serve as a template for MPI, or message passing interface, applications. These are applications that can use multiple processors that may, or may not, be on multiple servers.

Our testing has found that it is best to be very specific about how you want your MPI ranks laid out across nodes and even sockets (multi-core CPUs). SLURM and OpenMPI have some conflicting behavior if you leave too much to chance. Please refer to the full SLURM sbatch documentation, but the following directives are the main directives to pay attention to:

  • -c, --cpus-per-task=<ncpus>
    • Advise the Slurm controller that ensuing job steps will require ncpus number of processors per task.
  • -m, --distribution=arbitrary|<block|cyclic|plane=<options>[:block|cyclic|fcyclic]>
    • Specify alternate distribution methods for remote processes.
    • We recommend -m cyclic:cyclic, which tells SLURM to distribute tasks cyclically over nodes and sockets.
  • -N, --nodes=<minnodes[-maxnodes]>
    • Request that a minimum of minnodes nodes be allocated to this job.
  • -n, --ntasks=<number>
    • Number of tasks (MPI ranks)
  • --ntasks-per-node=<ntasks>
    • Request that ntasks be invoked on each node
  • --ntasks-per-socket=<ntasks>
    • Request the maximum ntasks be invoked on each socket

The following example requests 24 tasks, each with one core. It further specifies that these should be split evenly across 2 nodes, and that within each node the 12 tasks should be split evenly across the two sockets. So each CPU on the two nodes will have 6 tasks, each with its own dedicated core. The distribution option ensures that MPI ranks are distributed cyclically over nodes and sockets.

SLURM is very flexible and allows users to be very specific about their resource requests. Thinking about your application and doing some testing will be important to determine the best request for your specific use.
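Since the directives have to agree with each other (and with the 20 cores on a node), it can help to sanity-check a layout before submitting. The following is only a sketch of such a check; the function name check_layout and its output format are invented for this example, and it only looks at cores per node, not at the per-socket split.

```shell
#!/bin/bash
# Sketch only: check that a requested layout fits on the nodes.
# check_layout NODES TASKS_PER_NODE CPUS_PER_TASK [CORES_PER_NODE]
# Prints the matching --ntasks value and the cores used on each node.
check_layout() {
    local nodes="$1" tasks_per_node="$2" cpus_per_task="$3"
    local cores_per_node="${4:-20}"   # 20 cores on a standard DT2 node
    local cores_used=$(( tasks_per_node * cpus_per_task ))
    if [ "$cores_used" -gt "$cores_per_node" ]; then
        echo "error: $cores_used cores per node requested, only $cores_per_node available" >&2
        return 1
    fi
    echo "ntasks=$(( nodes * tasks_per_node )) cores_per_node=$cores_used"
}

# 2 nodes, 12 ranks per node, 1 core per rank:
check_layout 2 12 1   # prints: ntasks=24 cores_per_node=12
```

Note that a layout using 12 of the 20 cores on each node is still charged for both full nodes.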

Sample script mpi_job.sh:

#!/bin/sh
#SBATCH --job-name=mpi_job_test      # Job name
#SBATCH --mail-type=ALL              # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>  # Where to send mail	
#SBATCH --ntasks=24                  # Number of MPI ranks
#SBATCH --cpus-per-task=1            # Number of cores per MPI rank 
#SBATCH --nodes=2                    # Number of nodes
#SBATCH --ntasks-per-node=12         # How many tasks on each node
#SBATCH --ntasks-per-socket=6        # How many tasks on each CPU or socket
#SBATCH --distribution=cyclic:cyclic # Distribute tasks cyclically on nodes and sockets
#SBATCH --mem-per-cpu=600mb          # Memory per processor
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=mpi_test_%j.out     # Standard output and error log
pwd; hostname; date

echo "Running prime number generator program on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS tasks, each with $SLURM_CPUS_PER_TASK cores."

module load intel/2016.0.109 openmpi/1.10.2

srun --mpi=pmi2 /ufrc/data/training/SLURM/prime/prime_mpi

date

Hybrid MPI/Threaded job

This script can serve as a template for hybrid MPI/Threaded applications. These are MPI applications where each MPI rank is threaded and can use multiple processors.

Our testing has found that it is best to be very specific about how you want your MPI ranks laid out across nodes and even sockets (multi-core CPUs). SLURM and OpenMPI have some conflicting behavior if you leave too much to chance. Please refer to the full SLURM sbatch documentation, as well as the information in the MPI example above.

The following example requests 8 tasks, each with 4 cores. It further specifies that these should be split evenly across 2 nodes, and that within each node the 4 tasks should be split evenly across the two sockets. So each CPU on the two nodes will have 2 tasks, each with 4 cores. Note that this script does not set --distribution; add --distribution=cyclic:cyclic as in the MPI example if you want the ranks distributed cyclically over nodes and sockets.

Sample script hybrid_pthreads_job.sh:

#!/bin/sh
#SBATCH --job-name=hybrid_job_test      # Job name
#SBATCH --mail-type=ALL                 # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>     # Where to send mail	
#SBATCH --ntasks=8                      # Number of MPI ranks
#SBATCH --cpus-per-task=4               # Number of cores per MPI rank 
#SBATCH --nodes=2                       # Number of nodes
#SBATCH --ntasks-per-node=4             # How many tasks on each node
#SBATCH --ntasks-per-socket=2           # How many tasks on each CPU or socket
#SBATCH --mem-per-cpu=100mb             # Memory per core
#SBATCH --time=00:05:00                 # Time limit hrs:min:sec
#SBATCH --output=hybrid_test_%j.out     # Standard output and error log

pwd; hostname; date
 
module load intel/2016.0.109 openmpi/1.10.2 raxml/8.2.8
 
srun --mpi=pmi2 raxmlHPC-HYBRID-SSE3 -T $SLURM_CPUS_PER_TASK \
      -f a -m GTRGAMMA -s /ufrc/data/training/SLURM/dna.phy -p $RANDOM \
      -x $RANDOM -N 500 -n dna
 
date

The following example requests 8 tasks, each with 8 cores. It further specifies that these should be split evenly across 4 nodes, and that within each node the 2 tasks should be split one per socket. So each CPU on the four nodes will have 1 task with 8 cores. As above, this script does not set --distribution; add --distribution=cyclic:cyclic if you want the ranks distributed cyclically over nodes and sockets.

Also note setting OMP_NUM_THREADS so that OpenMP knows how many threads to use per task.

Sample script hybrid_OpenMP_job.sh:

#!/bin/bash

#SBATCH --job-name=LAMMPS
#SBATCH --output=LAMMPS_%j.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email_address>
#SBATCH --nodes=4              # Number of nodes
#SBATCH --ntasks=8             # Number of MPI ranks
#SBATCH --ntasks-per-node=2    # Number of MPI ranks per node
#SBATCH --ntasks-per-socket=1  # Number of tasks per processor socket on the node
#SBATCH --cpus-per-task=8      # Number of OpenMP threads for each MPI process/rank
#SBATCH --mem-per-cpu=2000mb   # Per processor memory request
#SBATCH --time=4-00:00:00      # Walltime in hh:mm:ss or d-hh:mm:ss

date
hostname

module load intel/2016.0.109 openmpi/1.10.2

export OMP_NUM_THREADS=8

srun --mpi=pmi2 /path/to/app/lmp_gator2 < in.Cu.v.24nm.eq_xrd
  • Note that MPI gets -np from SLURM automatically.
  • Note there are many directives available to control processor layout.
    • Some to pay particular attention to are:
      • --nodes if you care exactly how many nodes are used
      • --ntasks-per-node to limit number of tasks on a node
      • --distribution one of several directives (see also --contiguous, --cores-per-socket, --mem_bind, --ntasks-per-socket, --sockets-per-node) to control how tasks, cores and memory are distributed among nodes, sockets and cores. While SLURM will generally make appropriate decisions for setting up jobs, careful use of these directives can significantly enhance job performance and users are encouraged to profile application performance under different conditions.

Array job

Note that we take the simplest 'single-threaded' example from above and extend it to an array of jobs. Modify the following script using the parallel, MPI, or hybrid job layout as needed.

Sample script array_job.sh:

#!/bin/sh
#SBATCH --job-name=array_job_test   # Job name
#SBATCH --mail-type=ALL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address> # Where to send mail	
#SBATCH --nodes=1                   # Use one node
#SBATCH --ntasks=1                  # Run a single task
#SBATCH --mem-per-cpu=1gb           # Memory per processor
#SBATCH --time=00:05:00             # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.out    # Standard output and error log
#SBATCH --array=1-5                 # Array range
pwd; hostname; date

echo This is task $SLURM_ARRAY_TASK_ID

date

Note the use of %A for the master job ID of the array, and the %a for the task ID in the output filename.
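In practice each array task usually needs to work on a different input. A common pattern, sketched below, is to keep one input per line in a text file and let each task pick the line matching its SLURM_ARRAY_TASK_ID. The function name input_for_task and the file name inputs.txt are hypothetical.

```shell
#!/bin/bash
# Sketch only: pick the input for this array task from a list file.
# input_for_task LIST_FILE [TASK_ID] prints line TASK_ID of LIST_FILE.
# TASK_ID defaults to SLURM_ARRAY_TASK_ID (set by SLURM in array jobs),
# or 1 when run outside of a job.
input_for_task() {
    local list="$1"
    local id="${2:-${SLURM_ARRAY_TASK_ID:-1}}"
    sed -n "${id}p" "$list"
}

# Hypothetical usage in the array script above, with --array=1-5
# and a 5-line inputs.txt:
#   infile=$(input_for_task inputs.txt)
#   ./YOURPROGRAM "$infile"
```

The array range then has to match the number of lines in the list file.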

GPU job


#!/bin/bash
#SBATCH --job-name=gpuMemTest
#SBATCH --output=gpuMemTest.out
#SBATCH --error=gpuMemTest.err
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --mail-type=ALL
#SBATCH --mail-user=moulik@umd.edu
#SBATCH --account=ved-lab
#SBATCH --gres=gpu:12

module load cuda/8.0

cudaMemTest=/homes/moulik/CODES/cuda_memtest

cudaDevs=$(echo $CUDA_VISIBLE_DEVICES | sed -e 's/,/ /g')

for cudaDev in $cudaDevs
do
  echo cudaDev = $cudaDev
  $cudaMemTest --num_passes 1 --device $cudaDev > gpuMemTest.out.$cudaDev 2>&1 &
done
wait

Software

By default, very little software is automatically available; you have to specify the software you want with a series of module commands, both before compiling your code and within your sbatch scripts. For a detailed explanation and a list of modules that *could* be available on Deepthought2, see the DIT Deepthought Software Guide.

NOTE: That list covers the modules available before Deepthought2. To get a clean installation on the supercomputer and remove old, unused code, the IT team decided not to move modules over to the Deepthought2 compute nodes until users request them. Because of this, a module might load on the login nodes but not on the compute nodes: the login nodes use a different filesystem that contains all of the previously available modules, whereas the compute nodes do not. Clicking on any of the modules in the DIT Deepthought Software Guide will tell you whether it is available on Deepthought2. Any modules not available can be requested. If you have IT create new modules that may be of use to others in Geology, please update the list below.

When specifying modules to load, always list them in the following order:

  1. intel (or nothing if using gfortran)
  2. openmpi
  3. netcdf and/or hdf4/5
  4. netcdf-fortran
  5. other stuff

Confirmed working software

The following list of software and versions is confirmed to be on the Deepthought 2 compute nodes:

  • hdf4/4.2.10
  • hdf5/1.8.13
  • intel/2013.1.039
  • netcdf/4.3.2
  • netcdf-fortran/4.4.1
  • nco/4.4.6
  • openmpi/gnu/1.6.5 and openmpi/intel/1.8.1
  • python/2.7.8

Other packages built for us can be seen Here.

Sample gfortran environment

The following is an example of a configuration known to be working with the gfortran compiler:

#!/bin/bash
module load openmpi/gnu/1.6.5
module load netcdf/4.3.2
module load netcdf-fortran

export NETCDF=$NETCDF_FORTRAN_ROOT
export LD_LIBRARY_PATH=$NETCDF_LIBDIR:$NETCDF_FORTRAN_LIBDIR:$LD_LIBRARY_PATH
export FC=mpif90
export F77=mpif90
export LDFLAGS="$(nc-config --flibs --libs)"
export CPFLAGS="$(nc-config --fflags)"

Sample ifort environment

The following is an example of a configuration known to be working with the Intel compiler. Pay special attention to the ulimit command at the bottom; it is required to get any large program working with Intel on these computers. For those using a csh shell instead, the equivalent command is limit stacksize unlimited

#!/bin/bash
module load intel
module load openmpi/intel/1.8.1
module load netcdf/4.3.2
module load netcdf-fortran

export NETCDF=$NETCDF_FORTRAN_ROOT
export LD_LIBRARY_PATH=$NETCDF_LIBDIR:$NETCDF_FORTRAN_LIBDIR:$LD_LIBRARY_PATH
export FC=mpif90
export F77=mpif90
export LDFLAGS="$(nc-config --flibs --libs)"
export CPFLAGS="$(nc-config --fflags)"

ulimit -s unlimited