Difference between revisions of "Deepthought2"
Latest revision as of 20:35, 1 July 2018
This is an overview of the key areas to get you started, focused on things common to users in our department. For a more general discussion, please check out the DIT Deepthought Usage page.
Overview
The Deepthought2 supercomputer consists of 2 login nodes and more than 400 compute nodes. After logging into login.deepthought2.umd.edu you'll be assigned to either login-1 or login-2. The login nodes are to be used to compile code, move data, launch jobs, etc., and they have all the same software as the compute nodes (and more). Please be sure not to use these nodes to directly run a multi-core job. Jobs run on 1 or more of the compute nodes, which are accessible only from the login nodes. Each of the standard compute nodes has:
- 20 cores
- 128 GB of memory
- 750 GB temporary hard drive space
- FDR Infiniband network connection between nodes
There are additional nodes that have increased memory (1TB) or GPUs if needed.
Disk Space
There are several areas of disk space available on Deepthought for the users:
- Home Directory - Everyone is given 10GB of space in their home directory. Access to the home directory is slow from the compute nodes, so try not to store data/programs needed while running in here. Also, you'll get warnings if launching an MPI job from here.
- Lustre - 1 Petabyte of storage is available under /lustre/<username>. Although there are currently no user quotas for this file system, one will inevitably be imposed eventually. Code to be run should be stored here, though this system is not backed up. Lustre can slow down considerably depending on how people are using DT2 at that time, so to make your experiments as fast as possible it is best to copy the bulk of the data you need to the node's local /tmp folder at run time.
- Scratch Disk - Each compute node has 750GB of space available under /tmp/. It is faster than the Lustre directory, though it cannot be shared between nodes. This drive is erased after a job's completion.
- Ram Disk - Each compute node can use part of its 128GB of memory as a ram disk located under /dev/shm/. This is temporary disk space similar to /tmp; however, it resides entirely in memory, and so will be extremely fast.
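As a minimal sketch of the "copy data to the node's /tmp at run time" advice above, a job script could stage its inputs with a small helper like the one below. The function name stage_in and the example paths are hypothetical and only for illustration; adjust them to your own directory layout.

```bash
#!/bin/bash
# Copy input data from shared storage to node-local scratch before a run.
# Hypothetical helper: the function name and the example paths are illustrative.
stage_in() {
    local src="$1"    # e.g. /lustre/<username>/inputs (shared, can be slow)
    local dest="$2"   # e.g. /tmp/$USER/inputs (local to the compute node)
    mkdir -p "$dest" || return 1
    # -a preserves timestamps and permissions; "$src/." copies the directory contents
    cp -a "$src/." "$dest/"
}
```

Inside a job script this might be called as stage_in /lustre/$USER/inputs /tmp/$USER/inputs before the main program starts. Remember that /tmp is erased when the job completes, so any results written there need to be copied back to Lustre before the script exits.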
Allocations
Every job submitted to the cluster is billed to an account. You can find out the accounts you have access to by using sbalance on the command line.
Accounts
When submitting a job using sbatch -A <account> <script> or using #SBATCH -A <account> within a script, there are two main categories of pools to use:
- High Priority - These are our main pool of hours and are replenished at the beginning of each month with 610,000 CPU hours. If there are other jobs waiting to run on Deepthought, your job will be placed in the queue with a high priority. These accounts must absolutely be used first, as their hours disappear at the end of the month. Moreover, using the standard accounts below means you are eating into next month's High Priority allocations. Select between ved-prj-hi, ved-lab-hi, schmerr-lab-hi or schmerr-prj-hi.
- Standard - These accounts queue jobs at a lower priority. They hold the hours left over from the previous month, plus hours that can be borrowed from the next month. So, try to limit usage once this balance gets below 610 kSU. Select between ved-prj, ved-lab, schmerr-lab or schmerr-prj.
Partitions
In addition to specifying an account to charge when running a job, the partition used can be specified with sbatch -A <account> -p <partition> <script>. Normally you should only do this in 2 circumstances:
- scavenger - by running with sbatch -A <account> -p scavenger your job will run only if there are no other normal jobs in the queue. Although no hours will be charged to the account, your job may be interrupted if a normal job enters the queue and there are no other nodes available. Your job will then be put back in the queue and will wait to run again, so your job script must be easy to stop and restart. The benefit of scavenger is of course that no hours will be charged to the department. This might be quite useful for the data assimilation people, whose usual DA cycles can easily be restarted.
- debug - by running with sbatch -A <account> -p debug your job will be placed in the queue with a high priority (regardless of the account specified), though it will only run for a maximum of 15 minutes.
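Since a scavenger job can be interrupted and requeued at any time, one way to make a script restartable is to record each finished step in a marker file and skip completed steps on the next attempt. The sketch below shows the idea only; the function name run_cycles, the marker directory, and the per-cycle command are all hypothetical.

```bash
#!/bin/bash
# Run numbered cycles, skipping any that already finished in an earlier attempt.
# Hypothetical sketch: run_cycles, the marker directory, and the step command
# are illustrative names, not part of Slurm or of the cluster setup.
run_cycles() {
    local marker_dir="$1"   # directory on shared storage that survives a requeue
    local n_cycles="$2"
    shift 2                 # remaining arguments: command to run for each cycle
    mkdir -p "$marker_dir" || return 1
    local i
    for ((i = 1; i <= n_cycles; i++)); do
        if [ -e "$marker_dir/cycle_$i.done" ]; then
            continue        # finished before the job was interrupted
        fi
        "$@" "$i" || return 1
        touch "$marker_dir/cycle_$i.done"   # record completion only after success
    done
}
```

A job script could then call, for example, run_cycles /lustre/$USER/da_markers 10 ./run_one_cycle.sh, where the marker directory lives on Lustre rather than /tmp because /tmp is wiped between attempts.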
Viewing Remaining Hours
Running sbalance --all will show how many hours are remaining for our department in the given month (e.g. ved-lab-hi) as well as the hours left over from the previous month plus the hours that can be borrowed from the next month (e.g. ved-lab). Additionally, you can see the individual usage of the members in our department.
Interactive Debugging
Interactive sessions allow you to connect to a compute node and work on that node directly. This allows you to develop how your jobs might run (e.g. test that commands run as expected before putting them in a script) and do heavy development tasks that cannot be done on the login nodes (e.g. use many cores). Interactive sessions can be started with either the sinteractive or the salloc command on Deepthought2.
Using sinteractive (preferred)
For debugging purposes, instead of running directly on a login node, it is recommended to request a compute node first with the sinteractive command.
    [moulik@login-1 ~/Slurm]> sinteractive -h
    Usage: sinteractive [-c NUMCPUS] [-J JOBNAME] [-a ACCOUNT] [ -t TIME ] \
                        [ -d | -S ] [ -s SHELL ] [ -x ] \
                        [ -f FEATURE_LIST ] [ -g GRES_LIST]

    Optional arguments:
      -a: Account to charge. Defaults to your default account (ved-lab-hi)
      -c: number of CPU cores to request (default: 1)
      -d: use the debug partition. -t is ignored and Wall time is set to 15 minutes
      -J: job name (default: interactive)
      -s: shell to use. Defaults to your default shell (/bin/tcsh)
      -S: use the scavenger partition. Not advised.
      -t: Wall time limit in minutes (default: 60 minutes).
      -x: Reserve the nodes in exclusive mode, Exclusive mode means no other
          jobs are allowed on the node you reserve, which means it might take
          longer to allocate and your account will be charged more. Default is
          shared mode.
      -f: Only reserve nodes matching FEATURE_LIST constraints. See salloc man page
          for full description.
      -g: Reserve the generic consumable resources specified by GRES_LIST.

    Maximum allowed walltime for sinteractive is 480 minutes.
    Maximum number of CPUs for sinteractive is 20 cpus.
To request 1 core for 30 minutes:
    sinteractive -c 1 -t 30
and to request access to a GPU-enabled node:
    sinteractive -c 1 -t 30 -g gpu:1
To make sure you have accessed a node with NVIDIA GPUs, you can type the following command to list the GPU configuration:
    [moulik@compute-b17-4 ~/Slurm]> lspci | grep -i nvidia
    03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1)
    83:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1)
Using salloc
Another way to request a node is with the salloc command. Please note that salloc will launch the job with the following defaults: your default project account; ntasks=1 (1 CPU core); memory=1G; time=24 hours; and no other resources such as a GPU card. To specify resources:
    login-1:~ salloc --account=ved-lab-hi --partition=debug --time=15
    salloc: Granted job allocation 5227967
    salloc: Waiting for resource configuration
    salloc: Nodes compute-b28-49 are ready for job
The above command successfully requested a compute node for 15 minutes (--partition=debug gives a higher priority in the queue, but limits the time to 15 minutes; drop this option if a longer time is needed). To log in to the node that was granted:
    ssh -Y compute-b28-49.deepthought2.umd.edu
Closing sessions
Interactive jobs will remain active until you exit or the job is canceled. It is your responsibility to cancel any interactive session that is not being used. After you are done with debugging and have entered exit to close your interactive session, make sure you are no longer using any resources by first checking if any interactive job is still running:
    [moulik@login-1 ~/Slurm]> squeue -u moulik
       JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    10347469 high-prio interact   moulik PD       0:00      1 (Priority)
    10347471 high-prio test_spa   moulik PD       0:00      1 (Priority)
    10347475     debug     tcsh   moulik  R       5:32      1 compute-b28-49
To kill the interactive session, note the JOBID from the output above and cancel it:
    scancel 10347475
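To avoid reading job IDs off the table by hand, the running jobs can be filtered out of squeue output with a short helper. This is only a sketch: the function name running_job_ids is hypothetical, and it assumes input lines of the form "JOBID NAME STATE" with no spaces in the job name.

```bash
#!/bin/bash
# Print the IDs of running jobs from "JOBID NAME STATE" lines on standard input.
# Hypothetical helper; assumes job names contain no whitespace.
running_job_ids() {
    awk '$3 == "RUNNING" { print $1 }'
}

# Assumed usage on a login node (requires Slurm; -h drops the header and
# -o selects job ID, job name, and the full state name):
#   squeue -u "$USER" -h -o "%i %j %T" | running_job_ids
```

Review the printed IDs before passing any of them to scancel, since the list includes every running job of yours, not only interactive sessions.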
Running Jobs
The usual way of running a job is to create a script file that is submitted to the scheduling system with the sbatch command. Extensive details on this can be found at DIT's info on running jobs. In summary, your script will consist of at least the following lines at the top:
    #!/bin/bash
    #SBATCH -N 2
    #SBATCH -t 2:00:00
indicating the number of nodes you want, and the maximum running time. The script is then placed in the queue with sbatch -A <account> script_name.
Keep in mind that each of the nodes has 20 cores, and by default using any core on a node will result in being charged for usage of the entire node, so optimize your configuration accordingly (e.g. it would be a waste to request 22 cores, since you would be charged for 40 cores). So, for example, using 10 nodes for a whole day would charge 4,800 hours (10 nodes x 20 cores x 24 hours) to the department's account. If you are requesting more than one core but fewer than all the cores on a node on the Deepthought clusters, you should consider using the --share flag. The default --exclusive flag will result in your account being charged for all cores on the node whether you use them or not. Not sharing also lowers our Fairshare score, leading to delays in scheduling jobs.
    #SBATCH --share
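The whole-node billing arithmetic above can be checked with a few lines of shell. This is a sketch under the assumptions stated in the text (20 cores per node, whole nodes charged in exclusive mode); the function names are hypothetical.

```bash
#!/bin/bash
# Estimate core-hours charged when whole nodes are billed.
# Assumes 20 cores per node (from the text above) and exclusive, whole-node billing.
CORES_PER_NODE=20

# Number of nodes needed to hold the requested cores (integer ceiling division).
nodes_needed() {
    local cores="$1"
    echo $(( (cores + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Core-hours charged for a request of <cores> cores running for <hours> hours.
charged_core_hours() {
    local cores="$1" hours="$2"
    echo $(( $(nodes_needed "$cores") * CORES_PER_NODE * hours ))
}
```

For example, charged_core_hours 22 1 gives 40 (two full nodes for one hour), and charged_core_hours 200 24 gives 4800, matching the 10-node, one-day figure quoted above.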
Most of the nodes currently have at least 30GB of scratch space, some have as much as 250GB available, and a few have as little as 1GB available. Scratch space is currently mounted as /tmp and will be cleared once your job completes. The following example specifies a scratch space requirement of 5GB. Note however that if you do this, the scheduler will set a file size limit of 5GB. If you then try to create a file larger than that, your job will automatically be killed, so be sure to specify a size large enough for your needs. Note that the disk space size must be given in MB (5GB = 5120MB).
    #SBATCH --tmp=5120
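Because the size has to be given in MB, a tiny helper can do the conversion when you prefer to think in GB. The function name tmp_directive_gb is hypothetical, and the sketch assumes 1GB = 1024MB, which is consistent with the 5GB = 5120MB example above.

```bash
#!/bin/bash
# Print an #SBATCH --tmp directive for a scratch requirement given in GB.
# Hypothetical helper; assumes 1 GB = 1024 MB, as in the 5GB example above.
tmp_directive_gb() {
    local gb="$1"
    echo "#SBATCH --tmp=$(( gb * 1024 ))"
}
```

The printed line can be pasted near the top of a batch script alongside the other #SBATCH directives.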
Checking Jobs
To view a list of all jobs you have running, you can use the squeue command, for example:
    login-1:~ squeue -u moulik
      JOBID  PARTITION   NAME    USER ST       TIME  NODES NODELIST(REASON)
    2736589 ved-lab-hi  test1  moulik  R   20:14:29      1 compute-b20-4
    2736588 ved-lab-hi  test2  moulik  R   20:15:45      1 compute-b20-2
If the jobs are taking a while to get scheduled, the Reason column in the squeue output can give you a clue:
- If there is no reason, the scheduler hasn't attended to your submission yet.
- Resources means your job is waiting for an appropriate compute node to open.
- Priority indicates your priority is lower relative to others being scheduled.
- There are other Reason codes; see the SLURM squeue documentation for full details.
Your priority is partially based on your FairShare score and determines how quickly your job is scheduled relative to others on the cluster. To see your FairShare score, enter the command sshare -u <username>. Your effective score is the value in the last column, and, as a rule of thumb, can be assessed as lower priority ≤ 0.5 ≤ higher priority.
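The 0.5 rule of thumb can be applied in a script with a small comparison helper. This is a sketch only: the function name fairshare_level is hypothetical, and treating a score of exactly 0.5 as "higher" is an assumption, since the rule above leaves that boundary case open.

```bash
#!/bin/bash
# Classify a FairShare score using the 0.5 rule of thumb described above.
# Hypothetical helper; a score of exactly 0.5 is treated as "higher" (assumption).
fairshare_level() {
    # awk handles the floating-point comparison that shell arithmetic cannot
    awk -v s="$1" 'BEGIN { if (s + 0 >= 0.5) print "higher"; else print "lower" }'
}
```

The score itself still has to be read from the last column of the sshare -u <username> output and passed in, for example fairshare_level 0.73.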
Deleting Jobs
The scancel command is used to delete jobs. Examples:
    scancel 232323              # delete job 232323
    scancel -u username         # delete all jobs belonging to user
    scancel --name=JobName      # delete job with the name JobName
    scancel --state=PENDING     # delete all PENDING jobs
    scancel --state=RUNNING     # delete all RUNNING jobs
    scancel --nodelist=cn0005   # delete any jobs running on node cn0005
Email Notification
To receive email notification of your job finishing (or crashing) you can set the --mail-type= and --mail-user= parameters at the top of your job's batch script, for example:
    #!/bin/bash
    #SBATCH -n 20
    #SBATCH -t 1:00:00
    #SBATCH --mail-type=END
    #SBATCH --mail-type=FAIL
    #SBATCH --mail-user=moulik@umd.edu
Sample SLURM Scripts
The Deepthought HPC clusters use a batch scheduling system called Slurm to handle the queuing, scheduling, and execution of jobs. This scheduler is used in many recent HPC clusters throughout the world. Below are a number of sample scripts that can be used as templates for building your own SLURM submission scripts for use on Deepthought2. If you choose to copy one of these sample scripts, please make sure you understand what each of the sbatch directives does before using it to submit your jobs. Otherwise, you may not get the result you want and may waste valuable computing resources.
Basic, single-processor job
This script can serve as the template for many single-processor applications. The --mem flag (or --mem-per-cpu) can be used to request the appropriate amount of memory for your job. Please make sure to test your application and set this value to a reasonable number based on actual memory use. The %j in the --output line (-o can also be used) tells SLURM to substitute the job ID in the name of the output file. You can also add a -e or --error line with an error file name to separate the output and error logs.
Sample script single_processor_job.sh:
    #!/bin/sh
    #SBATCH --job-name=serial_job_test    # Job name
    #SBATCH --mail-type=ALL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
    #SBATCH --mail-user=<email_address>   # Where to send mail
    #SBATCH --ntasks=1                    # Run on a single CPU
    #SBATCH --mem=600mb                   # Memory limit
    #SBATCH --time=00:05:00               # Time limit hrs:min:sec
    #SBATCH --output=serial_test_%j.out   # Standard output and error log

    pwd; hostname; date

    module load python

    echo "Running plot script on a single CPU core"

    # Run your program with correct path and command line options
    ./YOURPROGRAM INPUT
    #python /homes/moulik/plot_template.py

    date
Threaded or multi-processor job
This script can serve as a template for applications that are capable of using multiple processors on a single server or physical computer. These applications are commonly referred to as threaded, OpenMP, PTHREADS, or shared memory applications. While they can use multiple processors, they cannot make use of multiple servers and all the processors must be on the same node.
These applications require shared memory and can only run on one node; as such it is important to remember the following:
- You must set --nodes=1, and then set --cpus-per-task to the number of OpenMP threads you wish to use.
- You must make the application aware of how many processors to use. How that is done depends on the application:
  - For some applications, set OMP_NUM_THREADS to a value less than or equal to the number of cpus-per-task you set.
  - For some applications, use a command line option when calling that application.
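Rather than hard-coding the thread count, it can be derived from the allocation so the two values cannot drift apart. The sketch below relies on the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside a job when --cpus-per-task is given; the function name `set_omp_threads` and the fallback to one thread are illustrative assumptions.

```shell
# Hypothetical helper: take the OpenMP thread count from the Slurm
# allocation. SLURM_CPUS_PER_TASK is only set when --cpus-per-task was
# requested, so fall back to a single thread otherwise.
set_omp_threads() {
    OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
    export OMP_NUM_THREADS
}

set_omp_threads
echo "Using $OMP_NUM_THREADS OpenMP thread(s)"
```

Calling this near the top of the batch script, before launching the program, keeps OMP_NUM_THREADS equal to the cpus-per-task value you requested.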
Download the [{{#filelink: parallel_job.sh}} multi_processor_job.sh] script {{#fileanchor: parallel_job.sh}}
#!/bin/sh
#SBATCH --job-name=parallel_job_test  # Job name
#SBATCH --mail-type=ALL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>   # Where to send mail
#SBATCH --nodes=1                     # Use one node
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --mem=600mb                   # Total memory limit
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.out      # Standard output and error log
pwd; hostname; date
echo "Running prime number generator program on $SLURM_CPUS_ON_NODE CPU cores"
module load gcc/5.2.0
# Run your program with correct path and command line options
./YOURPROGRAM INPUT
date
Another example, setting OMP_NUM_THREADS:
Download the [{{#filelink: parallel_job2.sh}} multi_processor_job2.sh] script {{#fileanchor: parallel_job2.sh}}
#!/bin/sh
#SBATCH --job-name=parallel_job_test  # Job name
#SBATCH --mail-type=ALL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>   # Where to send mail
#SBATCH --nodes=1                     # Use one node
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --mem=600mb                   # Total memory limit
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.out      # Standard output and error log
export OMP_NUM_THREADS=4
# Load required modules; for example, if your program was
# compiled with Intel compiler, use the following
module load intel
# Run your program with correct path and command line options
./YOURPROGRAM INPUT
MPI job
This script can serve as a template for MPI, or message passing interface, applications. These are applications that can use multiple processors that may, or may not, be on multiple servers.
Our testing has found that it is best to be very specific about how you want your MPI ranks laid out across nodes and even sockets (multi-core CPUs). SLURM and OpenMPI have some conflicting behavior if you leave too much to chance. Please refer to the full SLURM sbatch documentation, but the following directives are the main directives to pay attention to:
-c, --cpus-per-task=<ncpus>
- Advise the Slurm controller that ensuing job steps will require ncpus number of processors per task.
-m, --distribution=arbitrary|<block|cyclic|plane=<options>[:block|cyclic|fcyclic]>
- Specify alternate distribution methods for remote processes.
- We recommend -m cyclic:cyclic, which tells SLURM to distribute tasks cyclically over nodes and sockets.
-N, --nodes=<minnodes[-maxnodes]>
- Request that a minimum of minnodes nodes be allocated to this job.
-n, --ntasks=<number>
- Number of tasks (MPI ranks)
--ntasks-per-node=<ntasks>
- Request that ntasks be invoked on each node
--ntasks-per-socket=<ntasks>
- Request the maximum ntasks be invoked on each socket
The following example requests 24 tasks, each with one core. It further specifies that these should be split evenly into 2 nodes, and within the nodes, the 12 tasks should be evenly split on the two sockets. So each CPU on the two nodes will have 6 tasks, each with its own dedicated core. The distribution option will ensure that MPI ranks are distributed cyclically on nodes and sockets.
SLURM is very flexible and allows users to be very specific about their resource requests. Thinking about your application and doing some testing will be important to determine the best request for your specific use.
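The arithmetic behind the 24-task example described above can be checked before submitting. The following is only a sketch: `check_layout` is a hypothetical helper, and it encodes two simple assumptions, that nodes times tasks per node should equal the total task count, and that the per-socket maximum times the number of sockets must be able to hold the tasks placed on a node.

```shell
# check_layout NODES TASKS_PER_NODE TASKS_PER_SOCKET SOCKETS_PER_NODE NTASKS
# Hypothetical sanity check for an MPI layout request.
check_layout() {
    nodes=$1; per_node=$2; per_socket=$3; sockets=$4; ntasks=$5
    if [ $((nodes * per_node)) -ne "$ntasks" ]; then
        echo "mismatch: $nodes nodes x $per_node tasks/node != $ntasks tasks"
        return 1
    fi
    if [ $((sockets * per_socket)) -lt "$per_node" ]; then
        echo "mismatch: $sockets sockets x $per_socket tasks/socket < $per_node tasks/node"
        return 1
    fi
    echo "layout ok: $ntasks tasks on $nodes nodes, $per_node per node"
}

# The example above: 2 nodes, 12 tasks per node, 6 per socket, 2 sockets, 24 tasks
check_layout 2 12 6 2 24
```

A request that fails this check is not necessarily rejected by Slurm, but it usually means the directives do not describe the layout you had in mind.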
Download the [{{#filelink: mpi_job.sh}} mpi_job.sh] script {{#fileanchor: mpi_job.sh}}
#!/bin/sh
#SBATCH --job-name=mpi_job_test       # Job name
#SBATCH --mail-type=ALL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>   # Where to send mail
#SBATCH --ntasks=24                   # Number of MPI ranks
#SBATCH --cpus-per-task=1             # Number of cores per MPI rank
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks-per-node=12          # How many tasks on each node
#SBATCH --ntasks-per-socket=6         # How many tasks on each CPU or socket
#SBATCH --distribution=cyclic:cyclic  # Distribute tasks cyclically on nodes and sockets
#SBATCH --mem-per-cpu=600mb           # Memory per processor
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=mpi_test_%j.out      # Standard output and error log
pwd; hostname; date
echo "Running prime number generator program on $SLURM_JOB_NUM_NODES nodes with $SLURM_NTASKS tasks, each with $SLURM_CPUS_PER_TASK cores."
module load intel/2016.0.109 openmpi/1.10.2
srun --mpi=pmi2 /ufrc/data/training/SLURM/prime/prime_mpi
date
Hybrid MPI/Threaded job
This script can serve as a template for hybrid MPI/Threaded applications. These are MPI applications where each MPI rank is threaded and can use multiple processors.
Our testing has found that it is best to be very specific about how you want your MPI ranks laid out across nodes and even sockets (multi-core CPUs). SLURM and OpenMPI have some conflicting behavior if you leave too much to chance. Please refer to the full SLURM sbatch documentation, as well as the information in the MPI example above.
The following example requests 8 tasks, each with 4 cores. It further specifies that these should be split evenly across 2 nodes, and within each node, the 4 tasks should be split evenly between the two sockets. So each CPU on the two nodes will have 2 tasks, each with 4 cores. This script does not set --distribution; add --distribution=cyclic:cyclic, as in the MPI example above, to distribute MPI ranks cyclically on nodes and sockets.
Download the [{{#filelink: hybrid_pthreads_job.sh}} hybrid_pthreads_job.sh] script {{#fileanchor: hybrid_pthreads_job.sh}}
#!/bin/sh
#SBATCH --job-name=hybrid_job_test    # Job name
#SBATCH --mail-type=ALL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>   # Where to send mail
#SBATCH --ntasks=8                    # Number of MPI ranks
#SBATCH --cpus-per-task=4             # Number of cores per MPI rank
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks-per-node=4           # How many tasks on each node
#SBATCH --ntasks-per-socket=2         # How many tasks on each CPU or socket
#SBATCH --mem-per-cpu=100mb           # Memory per core
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=hybrid_test_%j.out   # Standard output and error log
pwd; hostname; date
module load intel/2016.0.109 openmpi/1.10.2 raxml/8.2.8
srun --mpi=pmi2 raxmlHPC-HYBRID-SSE3 -T $SLURM_CPUS_PER_TASK \
    -f a -m GTRGAMMA -s /ufrc/data/training/SLURM/dna.phy -p $RANDOM \
    -x $RANDOM -N 500 -n dna
date
The following example requests 8 tasks, each with 8 cores. It further specifies that these should be split evenly across 4 nodes, and within each node, the 2 tasks should be split, one on each of the two sockets. So each CPU on the four nodes will have 1 task with 8 cores. As in the previous script, --distribution is not set here; add --distribution=cyclic:cyclic to distribute MPI ranks cyclically on nodes and sockets.
Also note setting OMP_NUM_THREADS so that OpenMP knows how many threads to use per task.
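For hybrid jobs it helps to work out what each node is actually being asked for, since cores and memory both multiply. The sketch below uses a hypothetical helper, `node_footprint`, and compares the result by eye against the 20 cores and 128 GB per compute node listed in the Overview; the numbers passed in are those of the two hybrid examples in this section.

```shell
# node_footprint TASKS_PER_NODE CPUS_PER_TASK MEM_PER_CPU_MB
# Hypothetical helper: prints "<cores> <memory_mb>" requested on each node.
node_footprint() {
    cores=$(( $1 * $2 ))
    echo "$cores $(( cores * $3 ))"
}

# First hybrid example: 4 tasks per node, 4 cores each, 100 MB per core
node_footprint 4 4 100     # prints "16 1600"

# Second hybrid example: 2 tasks per node, 8 cores each, 2000 MB per core
node_footprint 2 8 2000    # prints "16 32000"
```

Both requests stay under 20 cores per node. With --mem-per-cpu, memory scales with every extra thread, so raising --cpus-per-task also raises the memory request.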
Download the [{{#filelink: hybrid_OpenMP_job.sh}} hybrid_OpenMP_job.sh] script {{#fileanchor: hybrid_OpenMP_job.sh}}
#!/bin/bash
#SBATCH --job-name=LAMMPS
#SBATCH --output=LAMMPS_%j.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email_address>
#SBATCH --nodes=4                # Number of nodes
#SBATCH --ntasks=8               # Number of MPI ranks
#SBATCH --ntasks-per-node=2      # Number of MPI ranks per node
#SBATCH --ntasks-per-socket=1    # Number of tasks per processor socket on the node
#SBATCH --cpus-per-task=8        # Number of OpenMP threads for each MPI process/rank
#SBATCH --mem-per-cpu=2000mb     # Per processor memory request
#SBATCH --time=4-00:00:00        # Walltime in hh:mm:ss or d-hh:mm:ss
date
hostname
module load intel/2016.0.109 openmpi/1.10.2
export OMP_NUM_THREADS=8
srun --mpi=pmi2 /path/to/app/lmp_gator2 < in.Cu.v.24nm.eq_xrd
- Note that MPI gets -np from SLURM automatically.
- Note there are many directives available to control processor layout.
  - Some to pay particular attention to are:
    - --nodes if you care exactly how many nodes are used
    - --ntasks-per-node to limit number of tasks on a node
    - --distribution one of several directives (see also --contiguous, --cores-per-socket, --mem_bind, --ntasks-per-socket, --sockets-per-node) to control how tasks, cores and memory are distributed among nodes, sockets and cores. While SLURM will generally make appropriate decisions for setting up jobs, careful use of these directives can significantly enhance job performance and users are encouraged to profile application performance under different conditions.
Array job
Note that we take the simplest 'single-threaded' process example from above and extend it to an array of jobs. Modify the following script using the parallel, mpi, or hybrid job layout as needed.
Download the [{{#filelink: array_job.sh}} array_job.sh] script {{#fileanchor: array_job.sh}}
#!/bin/sh
#SBATCH --job-name=array_job_test     # Job name
#SBATCH --mail-type=ALL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>   # Where to send mail
#SBATCH --nodes=1                     # Use one node
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --mem-per-cpu=1gb             # Memory per processor
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.out      # Standard output and error log
#SBATCH --array=1-5                   # Array range
pwd; hostname; date
echo This is task $SLURM_ARRAY_TASK_ID
date
Note the use of %A for the master job ID of the array, and the %a for the task ID in the output filename.
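A common use of SLURM_ARRAY_TASK_ID is to pick a different input for each array task. The sketch below assumes a plain text file with one input per line; the file name `inputs.txt` and the helper `nth_line` are hypothetical and only serve to illustrate the pattern.

```shell
# nth_line FILE N
# Hypothetical helper: print line N (1-based) of FILE.
nth_line() {
    sed -n "${2}p" "$1"
}

# Inside the array job script this could be used as follows, where
# inputs.txt (an assumed file name) lists one input file per line:
#   INPUT=$(nth_line inputs.txt "$SLURM_ARRAY_TASK_ID")
#   ./YOURPROGRAM "$INPUT"
```

With --array=1-5, task 1 reads the first line of the list, task 2 the second, and so on, so the list needs at least as many lines as the upper end of the array range.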
GPU job
#!/bin/bash
#SBATCH --job-name=gpuMemTest
#SBATCH --output=gpuMemTest.out
#SBATCH --error=gpuMemTest.err
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --mail-type=ALL
#SBATCH --mail-user=moulik@umd.edu
#SBATCH --account=ved-lab
#SBATCH --gres=gpu:12
module load cuda/8.0
cudaMemTest=/homes/moulik/CODES/cuda_memtest
# CUDA_VISIBLE_DEVICES is a comma-separated list; turn it into a space-separated one
cudaDevs=$(echo $CUDA_VISIBLE_DEVICES | sed -e 's/,/ /g')
# Start one memory test per visible GPU in the background
for cudaDev in $cudaDevs
do
    echo cudaDev = $cudaDev
    $cudaMemTest --num_passes 1 --device $cudaDev > gpuMemTest.out.$cudaDev 2>&1 &
done
# Wait for all background tests to finish before the job exits
wait
Software
By default, very little software is automatically available; you have to specify the software you want by using a series of module commands before compiling the code and within your sbatch scripts. For a detailed explanation and a list of modules that *could* be available on Deepthought2, see the DIT Deepthought Software Guide. NOTE: That list is the list of modules available before Deepthought2. To get a clean installation on the supercomputer and remove old unused code, the IT team decided not to move modules over to the Deepthought2 compute nodes until users request them. Because of this, a module might load on the login nodes but not on the compute nodes: the login nodes use a different filesystem and contain all of the previously available modules, whereas the compute nodes do not. Clicking on any of the possibly available modules in the DIT Deepthought Software Guide will tell you if it is available on Deepthought2. Any modules not available can be requested. If you have IT create new modules that may be of use to others in Geology, please update the following list here.
When specifying modules to load, they should always be specified in the following order
- intel (or nothing if using gfortran)
- openmpi
- netcdf and/or hdf4/5
- netcdf-fortran
- other stuff
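Because a module that loads on a login node may be missing on a compute node, it can be useful to make a batch script stop early when a load fails. The sketch below is one possible guard: `require_modules` is a hypothetical helper, and it assumes that the site's `module load` command returns a non-zero exit status on failure, which some older versions of the modules package may not do.

```shell
# require_modules MODULE...
# Hypothetical helper: load modules in the order given and stop at the
# first one that fails. Assumes "module load" reports failure through
# its exit status.
require_modules() {
    for m in "$@"; do
        if ! module load "$m"; then
            echo "module '$m' could not be loaded on $(uname -n)" >&2
            return 1
        fi
    done
}

# Usage inside an sbatch script, following the order listed above:
#   require_modules intel openmpi/intel/1.8.1 netcdf/4.3.2 netcdf-fortran || exit 1
```

Passing the modules in the recommended order keeps the ordering and the failure check in one place.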
Confirmed working software
The following list of software and versions is confirmed to be on the Deepthought 2 compute nodes:
- hdf4/4.2.10
- hdf5/1.8.13
- intel/2013.1.039
- netcdf/4.3.2
- netcdf-fortran/4.4.1
- nco/4.4.6
- openmpi/gnu/1.6.5 and openmpi/intel/1.8.1
- python/2.7.8
Other packages built for us can be seen Here.
Sample gfortran environment
The following is an example of a configuration known to be working with the gfortran compiler
#!/bin/bash
module load openmpi/gnu/1.6.5
module load netcdf/4.3.2
module load netcdf-fortran
export NETCDF=$NETCDF_FORTRAN_ROOT
export LD_LIBRARY_PATH=$NETCDF_LIBDIR:$NETCDF_FORTRAN_LIBDIR:$LD_LIBRARY_PATH
export FC=mpif90
export F77=mpif90
export LDFLAGS="$(nc-config --flibs --libs)"
export CPFLAGS="$(nc-config --fflags)"
Sample ifort environment
The following is an example of a configuration known to be working with the intel compiler. Pay special attention to the ulimit command at the bottom; this is required to get any large program working with intel on these computers. Those using a csh shell instead will have to use the command limit stacksize unlimited
#!/bin/bash
module load intel
module load openmpi/intel/1.8.1
module load netcdf/4.3.2
module load netcdf-fortran
export NETCDF=$NETCDF_FORTRAN_ROOT
export LD_LIBRARY_PATH=$NETCDF_LIBDIR:$NETCDF_FORTRAN_LIBDIR:$LD_LIBRARY_PATH
export FC=mpif90
export F77=mpif90
export LDFLAGS="$(nc-config --flibs --libs)"
export CPFLAGS="$(nc-config --fflags)"
ulimit -s unlimited
Sample Python environment
Most modules were built using python/2.7.8 compiled with gcc/4.6.1 compiler. The GNU compiler (gcc/4.6.1) is what one gets by default if no other compiler (gcc, intel, pgi, sunsuite) is loaded.
#!/bin/tcsh
module load openmpi/1.6.5
module load python/3.5.1
module load cuda/7.5.18
python ./YOURCODE.py