Slurm Quick Start
Revision as of 12:07, 29 November 2022
This page explains how to configure and run a batch job on the Slurm cluster from the head node.
1. Configuring SLURM
Slurm has two kinds of daemons: a slurmctld controller running on one node, and a slurmd on each of the compute nodes, where jobs can run in parallel.
To start the Slurm controller on a machine, run this command as root:
$ systemctl start slurmctld.service
After running this, we can verify that the Slurm controller is running by viewing its log:
$ tail /var/log/slurmctld.log
The compute nodes must be configured to register with the controller. You can see information about the available nodes using this command:
$ sinfo
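The node list that sinfo reports comes from slurm.conf on the controller. A minimal fragment might look like the following (the cluster name, controller hostname, and CPU count are assumptions; the pnode[01-64] names and the debug partition match the examples on this page):

```
# /etc/slurm/slurm.conf (fragment) -- hostname and CPU values are hypothetical
ClusterName=cluster
SlurmctldHost=headnode
NodeName=pnode[01-64] CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=pnode[01-64] Default=YES MaxTime=INFINITE State=UP
```

The same file must be present on the controller and on every compute node.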
If the nodes show up with STATE "down", run the following command to put them in the idle state (then check again with sinfo):
$ scontrol update nodename=pnode[01-64] state=idle
2. Running a Slurm Batch Job on Multiple Nodes
We can create Python scripts and run them in parallel on the cluster nodes. In this simple example, each task prints its task number.
Create a Python Script
First we create a Python script that prints the task number it receives:
#!/usr/bin/python

# import sys library (needed to read command line arguments)
import sys

# print the task number passed as the first argument
print('Hello! I am a task number: ', sys.argv[1])
We will save this Python script as hello-parallel.py
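Before submitting anything, the script can be checked with an ordinary local run; no Slurm is needed, and the argument 1 stands in for the task ID that Slurm will supply on the cluster:

```shell
# Recreate hello-parallel.py and run it once by hand; the "1" plays the
# role of the task ID that Slurm will pass as the first argument.
cat > hello-parallel.py <<'EOF'
#!/usr/bin/python
import sys
print('Hello! I am a task number: ', sys.argv[1])
EOF
python3 hello-parallel.py 1   # prints: Hello! I am a task number:  1
```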
Create Slurm Script
Next, we need to create a Slurm script to run the python program we just created:
#!/bin/bash
# Example of running python script with a job array
#SBATCH -J hello
#SBATCH -p debug
#SBATCH --array=1-10   # how many tasks in the array
#SBATCH -c 1           # one CPU core per task
#SBATCH -t 10:00
#SBATCH -o hello-%j-%a.out

# Run python script with a command line argument
srun python3 hello-parallel.py $SLURM_ARRAY_TASK_ID
We will save this Slurm script as hello-parallel.slurm
The first few lines of this file (those beginning with #SBATCH) configure the parameters for running the Python script on the cluster.
For example, -J sets the job name, -p selects the partition the cluster nodes belong to (for us the partition is named debug), --array sets how many tasks we want, and -c sets the number of CPU cores per task.
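On the cluster, Slurm exports SLURM_ARRAY_TASK_ID with a different value in each task's environment, which is how the srun line in the script passes a distinct number to every copy of the Python program. Outside the cluster you can mimic this with a plain shell loop (a local sketch, not a real Slurm run):

```shell
# Simulate three array tasks: set SLURM_ARRAY_TASK_ID by hand,
# as Slurm would do once per task on the cluster.
for id in 1 2 3; do
  SLURM_ARRAY_TASK_ID="$id" sh -c 'echo "task $SLURM_ARRAY_TASK_ID running"'
done
```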
For more details on other sbatch command line options, please visit the Slurm sbatch documentation (https://slurm.schedmd.com/sbatch.html)
Run Script and Check Output
Now we are ready to run the script on the cluster, as an array of 10 tasks distributed over the cluster nodes. To submit the Slurm script, simply give the command:
$ sbatch hello-parallel.slurm
This should run our Python script as 10 tasks on different nodes and generate output files in the same directory. The output files will have the extension .out.
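The file names follow the -o pattern hello-%j-%a.out from the Slurm script, where %j is the job ID and %a is the array task ID. With a hypothetical job ID of 12345, the tasks would produce names like the following (the loop only illustrates the naming; the real files are written by Slurm):

```shell
# Illustrate the hello-%j-%a.out naming for a hypothetical job ID 12345
for a in 1 2 3; do
  echo "hello-12345-${a}.out"
done
```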
We can view an output file using the less command, for example:
$ less hello-<jobid>-<taskid>.out
For more information: Slurm Documentation (https://slurm.schedmd.com/)