Slurm Quick Start

From mathpub

This page explains how to configure and run a batch job on the Slurm cluster from the head node.

1. Configuring Slurm

Slurm has two kinds of entities: a controller node running slurmctld, and multiple compute nodes running slurmd, on which jobs run in parallel.

To start the Slurm controller on a machine, run this command as root:

$ systemctl start slurmctld.service

After running this, we can verify that the Slurm controller is running by viewing its log:

$ tail /var/log/slurmctld.log

The compute nodes should be configured to register with the controller. You can see information about the available nodes with:

$ sinfo




If the nodes show up with STATE "down", run the following command to set them to the idle state (then check again with sinfo):

$ scontrol update nodename=pnode[01-64] state=idle
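When checking node states from a script, sinfo's parseable output is easier to work with than the default table; `sinfo -N -h -o "%n %t"` prints one "hostname state" pair per line. A minimal sketch that filters such output for down nodes (the captured text and node names below are hypothetical):

```python
# Sketch: find nodes reported as "down" in sinfo-style "%n %t" output.
# The example text below is hypothetical; on the cluster you would capture
# the real output of: sinfo -N -h -o "%n %t"

def down_nodes(sinfo_text):
    """Return hostnames whose state is 'down' (including 'down*')."""
    nodes = []
    for line in sinfo_text.strip().splitlines():
        name, state = line.split()
        if state.rstrip("*") == "down":
            nodes.append(name)
    return nodes

example = """\
pnode01 idle
pnode02 down*
pnode03 alloc
pnode04 down
"""

print(down_nodes(example))  # ['pnode02', 'pnode04']
```

The trailing "*" that sinfo appends to unresponsive nodes is stripped before comparing, so both "down" and "down*" are caught.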


2. Running a Slurm Batch Job on Multiple Nodes

We can create Python scripts and run them in parallel on the cluster nodes. In this simple example, each task prints its task number.

Create a Python Script

First, we create a Python script that prints its task number:

#!/usr/bin/env python3
# sys gives access to command-line arguments
import sys
# print the task number passed as the first argument
print('Hello! I am task number:', sys.argv[1])

We will save this Python script as hello-parallel.py.

Create Slurm Script

Next, we need to create a Slurm script to run the python program we just created:

#!/bin/bash

# Example of running python script with a job array

#SBATCH -J hello                        # job name
#SBATCH -p debug                        # partition to run on
#SBATCH --array=1-10                    # how many tasks in the array
#SBATCH -c 1                            # one CPU core per task
#SBATCH -t 10:00                        # time limit (10 minutes)
#SBATCH -o hello-%A-%a.out              # output file (%A = job ID, %a = task ID)


# Run python script with a command line argument
srun python3 hello-parallel.py $SLURM_ARRAY_TASK_ID

We will save this Slurm script as hello-parallel.slurm.

The lines beginning with #SBATCH configure the parameters for running the Python script on the cluster.

For example, -J specifies the job name, -p the partition the cluster nodes belong to (ours is named debug), --array the range of task indices, and -c the number of CPU cores per task.
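In a real job array, each task usually uses its task ID to select its own piece of the work. A minimal sketch of that pattern, assuming a hypothetical list of input files (the task ID arrives as the command-line argument that the sbatch script passes via $SLURM_ARRAY_TASK_ID):

```python
# Sketch: map a 1-based array task ID to one input per task.
# The input file names are hypothetical placeholders.
import sys

inputs = [f"data-{i:02d}.csv" for i in range(1, 11)]  # hypothetical inputs

def input_for_task(task_id):
    """Return the input file assigned to a 1-based array task ID."""
    return inputs[task_id - 1]

# Read the task ID from the command line (as in hello-parallel.py);
# default to 1 so the sketch also runs standalone.
task_id = int(sys.argv[1]) if len(sys.argv) > 1 and sys.argv[1].isdigit() else 1
print("task", task_id, "processes", input_for_task(task_id))
```

With --array=1-10, ten copies of this script would each see a different task ID and therefore process a different input file, without any coordination between tasks.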

For more details on the other sbatch command line options, please see the Slurm Sbatch Documentation.

Run Script and Check Output

Now we are ready to run the script on the cluster as a job array of 10 tasks. To submit the Slurm script, simply give the command:

$ sbatch hello-parallel.slurm

This runs our Python script as 10 separate tasks, possibly on different nodes, and generates one output file per task in the same location. The output files have the extension .out.
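Once the array finishes, the per-task output files can be collected programmatically. A sketch using glob; here we create dummy output files in a temporary directory so the example is self-contained (the job ID 1234 is hypothetical), whereas on the cluster you would glob the real files produced by the -o option:

```python
# Sketch: gather the per-task .out files of a finished job array.
import glob
import os
import tempfile

# Create dummy outputs so the sketch runs anywhere; job ID 1234 is made up.
tmp = tempfile.mkdtemp()
for task in range(1, 11):
    with open(os.path.join(tmp, f"hello-1234-{task}.out"), "w") as f:
        f.write(f"output of task {task}\n")

# Collect every output file matching the naming pattern.
outputs = sorted(glob.glob(os.path.join(tmp, "hello-*.out")))
print(len(outputs), "output files found")  # 10 output files found
```

The same glob pattern works on the cluster from the job's submission directory, since sbatch writes the -o files there by default.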

We can view the output files using the less command, for example:

$ less hello-*.out





For more information, see the Slurm Documentation.