SENECA

General Information

Hardware and Networking

Access

You must use an SSH key to log into the Seneca cluster. Password logins are disabled. See the Initial Login via SSH Key section for how to generate an SSH key pair on the CAC Portal and log into the cluster for the first time.

Initial Login via SSH Key

After being added to a project with access to the Seneca cluster, you can:

  1. Log into the CAC Portal by clicking on the Login button in the upper right corner.

  2. Choose Cornell University or Weill Cornell Medical College as your organizational login and complete the login using your Cornell NetID or Weill CWID and password (not your CAC account password).

  3. On the Portal dashboard, click on the Generate SSH key pair link in the Manage your CAC login credentials section in the upper right corner.

  4. Click on the Generate a new SSH key button to generate a new ssh key pair.

  5. Click on the Download your SSH private key button to download the private key to your computer.

  6. On your computer, make sure the private key file is readable and writable to you only: chmod 600 <private key file>

  7. SSH to the Seneca cluster using the private key you just downloaded: ssh -i <path to the private key file> <NetID>@seneca-login1.cac.cornell.edu

  8. On seneca-login1, you can add additional public SSH keys to your ~/.ssh/authorized_keys file, one key per line. The key generated by the CAC Portal is marked by the CAC_SSO comment in the authorized_keys file; keys that do not end in CAC_SSO will be left alone by the CAC Portal. A combined example session is shown after this list.
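Putting steps 6 through 8 together, a minimal first-login session might look like the sketch below. The key file name seneca_cac_key and the extra public key my_laptop_key.pub are placeholders for your own files.

```
# Step 6: restrict permissions on the downloaded private key.
chmod 600 ~/Downloads/seneca_cac_key

# Step 7: log into the Seneca login node with that key.
ssh -i ~/Downloads/seneca_cac_key <NetID>@seneca-login1.cac.cornell.edu

# Step 8 (run on seneca-login1): append an additional public key.
# Keys without the CAC_SSO comment are left alone by the CAC Portal.
cat my_laptop_key.pub >> ~/.ssh/authorized_keys
```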

Storage

Home Directories

Write your job scratch data to /tmp to avoid heavy I/O on your NFS-mounted $HOME!

Explanation: /tmp is a local directory found on each compute node. It is faster to use /tmp because reads and writes do not have to go across the network and do not compete with other users of a shared network drive (such as the one that holds everyone's /home).

To look at files in /tmp while your job is running, you can ssh to the login node, then do a further ssh to the compute node that you were assigned. Then you can cd to /tmp on that node and inspect the files in there with cat or less.
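For example, a session along the following lines should work; the node name c0001 and the output file name are illustrative only (squeue shows the node actually assigned to your job).

```
# On the login node: find which compute node your job is running on.
squeue -u $USER

# ssh to that node (here c0001) and inspect your scratch files.
ssh c0001
cd /tmp
ls -l
less my_output.log
```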

Note: if your application produces thousands of output files that you need to save, it is far more efficient to put them all into a single tar or zip file before copying it to $HOME as the final step, as sketched below.
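A minimal sketch of that final step, assuming your job wrote its output under /tmp (the directory and file names are illustrative):

```
# At the end of your job script: bundle the many small output files
# into one archive, then make a single copy across the network to $HOME.
cd /tmp
tar czf results.tar.gz output/
cp results.tar.gz $HOME/
```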

BeeGFS Parallel File System

Local Scratch

Scheduler

Users gain access to compute nodes using the Slurm scheduler. The Slurm Quick Start Guide is a great place to start. For more detailed explanations, see CAC's Slurm page.

Partitions

Currently the Seneca cluster has one partition:

Partition   Nodes       Resources
gpu_only    c000[1-4]   Each node has 128 CPU threads, 1 TB RAM, 4 Nvidia H100 GPUs
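To confirm the partition layout from a login node, the standard Slurm status commands can be used; for example (the node name c0001 comes from the table above):

```
# List the gpu_only partition and its nodes.
sinfo -p gpu_only

# Show per-node details (CPUs, memory, GPU gres) for one node.
scontrol show node c0001
```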

Submit a Job

  • Interactive Job: srun --pty <options> /bin/bash

After Slurm allocates a node as specified by your options, you will be given a shell prompt on the compute node to run jobs interactively.

  • Batch Job: sbatch <options> <job script>

Slurm will run the job script on the allocated node(s) as specified by your options.

The following options are relevant on the Seneca cluster. Required options are denoted by *.

  • -p, or --partition*: Partition. Currently gpu_only is the only partition; more partitions will be added in the future.

  • -A, or --account*: Slurm account/CAC project against which this job will be billed.

Qatar projects are always allowed to submit to the gpu_only partition. All other projects must have a positive balance at job submission time or the request will be rejected.

  • --gres*: Requested GPU resource

For the gpu_only partition, at least 1 GPU must be requested like this:

--gres=gpu:<type>:<count>

For example, to request 2 NVIDIA H100 GPUs: --gres=gpu:h100:2

  • --time*: Time limit in the HH:MM:SS format

Maximum time limit for the gpu_only partition is 72 hours (3 days).

If you need to run more than 72 hours in the gpu_only partition, email your request to help@cac.cornell.edu.

  • --qos: longrun for long-running jobs (time limit >72 hours)

If approved to run more than 72 hours, use the --qos=longrun option to request time limit longer than 72 hours. For example:

```
# OK: 3 days (72h) or less
--time=48:00:00

# OK: >72h with longrun QOS, if your project is approved to run longer than 72 hours
--time=96:00:00 --qos=longrun

# FAIL: >72h without longrun QOS
--time=96:00:00
```

Here are some minimal examples:

  • An interactive job with 4 CPU threads/2 physical CPU cores and 1 Nvidia H100 GPU with a time limit of 2 hours:

    srun --pty -p gpu_only --account=abc123_0002 --gres=gpu:h100:1 --time=02:00:00 -c 4 /bin/bash

  • The following job script (gpu_job.sh) can be submitted using the sbatch gpu_job.sh command to run on 4 CPU threads/2 physical CPU cores and 1 Nvidia H100 GPU with a time limit of 2 hours:

```
#!/bin/bash
#SBATCH --job-name=my_gpu_job
#SBATCH --account=abc123_0002
#SBATCH --partition=gpu_only
#SBATCH --gres=gpu:h100:1
#SBATCH --time=02:00:00
#SBATCH -c 4

module load python/3.10
python my_gpu_script.py
```
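Once the script is saved as gpu_job.sh, a typical submit-and-monitor workflow uses standard Slurm commands:

```
# Submit the batch job; Slurm prints the assigned job ID.
sbatch gpu_job.sh

# Check the job's state (PD = pending, R = running).
squeue -u $USER

# Cancel the job if needed, using the ID printed by sbatch.
scancel <jobid>
```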

Software