SENECA
General Information
- Login node: seneca-login1.cac.cornell.edu (access via ssh)
- OpenHPC deployment running Rocky Linux 9
- Scheduler: slurm 25.05.3
- Please send questions and problem reports to: help@cac.cornell.edu
Hardware and Networking
Servers
| Name | CPU Threads | RAM | GPUs | Public IP | Private IP | IPoIB IP |
|---|---|---|---|---|---|---|
| seneca | 64 | 1 TB | N/A | 128.84.3.170 | 192.168.16.254 | 192.168.17.254 |
| seneca-login1 | 64 | 1 TB | N/A | 128.84.3.171 | 192.168.16.253 | 192.168.17.253 |
| Compute Nodes | | | | | | |
| c0001 | 128 | 1 TB | 4 x Nvidia H100 80 GB HBM3 | N/A | 192.168.16.1 | 192.168.17.1 |
| c0002 | 128 | 1 TB | 4 x Nvidia H100 80 GB HBM3 | N/A | 192.168.16.2 | 192.168.17.2 |
| c0003 | 128 | 1 TB | 4 x Nvidia H100 80 GB HBM3 | N/A | 192.168.16.3 | 192.168.17.3 |
| c0004 | 128 | 1 TB | 4 x Nvidia H100 80 GB HBM3 | N/A | 192.168.16.4 | 192.168.17.4 |
| BeeGFS Parallel File System | | | | | | |
| bgfs-meta1 | 64 | 256 GB | N/A | N/A | 192.168.16.249 | 192.168.17.249 |
| bgfs-meta2 | 64 | 256 GB | N/A | N/A | 192.168.16.248 | 192.168.17.248 |
| bgfs-storage1 | 64 | 256 GB | N/A | N/A | 192.168.16.247 | 192.168.17.247 |
| bgfs-storage2 | 64 | 256 GB | N/A | N/A | 192.168.16.246 | 192.168.17.246 |
Networks
The cluster nodes are connected by the following networks:
- Public network: The cluster head node and login node are connected to the public `128.84.3.0/24` network via 25 Gb Ethernet.
- Private network: All cluster nodes are connected to a private `192.168.16.0/24` network via 25 Gb Ethernet.
- IPoIB network: All cluster nodes are connected to a private `192.168.17.0/24` IPoIB network. This network is used for BeeGFS file system access.
In addition, cluster nodes are connected by an NDR InfiniBand interconnect (400 Gb/s NDR).
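If you want to see which of these networks a given node is attached to, a generic Linux sketch is to list its interface addresses (interface names will vary by node; this is just an illustration, not a Seneca-specific tool):
```bash
# List interface addresses in brief form. Addresses in 128.84.3.0/24 are on the
# public network, 192.168.16.0/24 is the private Ethernet network, and
# 192.168.17.0/24 is the IPoIB network used by BeeGFS.
ip -br addr show
```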
Access
You must use an SSH key to log into the Seneca cluster; password logins are disabled. See the Initial Login via SSH Key section below for how to generate an SSH key pair on the CAC Portal and log into the cluster for the first time.
Initial Login via SSH Key
After being added to a project with access to the Seneca cluster, you can:
- Log into the CAC Portal by clicking on the `Login` button in the upper right corner.
- Choose `Cornell University` or `Weill Cornell Medical College` as your organizational login and complete the login using your Cornell NetID or Weill CWID and password (not your CAC account password).
- On the Portal dashboard, click on the `Generate SSH key pair` link in the `Manage your CAC login credentials` section in the upper right corner.
- Click on the `Generate a new SSH key` button to generate a new SSH key pair.
- Click on the `Download your SSH private key` button to download the private key to your computer.
- On your computer, make sure the private key file is readable and writable by you only: `chmod 600 <private key file>`
- SSH to the Seneca cluster using the private key you just downloaded: `ssh -i <path to the private key file> <NetID>@seneca-login1.cac.cornell.edu`
- On seneca-login1, you can add additional public SSH keys to your `~/.ssh/authorized_keys` file, one key per line. The key generated by the CAC Portal is marked by the `CAC_SSO` comment in the `authorized_keys` file. Keys that do not end in `CAC_SSO` will be left alone by the CAC Portal.
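To avoid typing the full `ssh -i ...` command every time, you can optionally add a host alias to the `~/.ssh/config` file on your own computer. This is a generic OpenSSH sketch; the alias name and the private key path are placeholders you would adjust:
```bash
# Run on your local machine (not on the cluster).
mkdir -p ~/.ssh && chmod 700 ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host seneca
    HostName seneca-login1.cac.cornell.edu
    User <NetID>
    IdentityFile ~/path/to/cac_private_key
EOF
# Afterwards, "ssh seneca" is equivalent to the full "ssh -i ..." command above.
```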
Storage
Home Directories
- Path: `~`
Each user has a home directory hosted on the head node and exported to the login node and compute nodes via NFS. The default home directory quota is 100 GB. If your job produces a lot of file I/O, use the BeeGFS Parallel File System or Local Scratch for scratch space and copy the results back to your home directory if desired. Your job will run faster and will not degrade home directory and scheduler performance for other cluster users.
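A simple way to see how much of the 100 GB quota you are currently using is to measure the size of your home directory (this reports directory size only, not an official quota report):
```bash
# Total size of your home directory, human-readable
du -sh ~
```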
BeeGFS Parallel File System
- Path: `/mnt/beegfs/<institution>/<project name>`
`<institution>` is one of:
- `ithaca`: for projects from the Ithaca campus
- `qatar`: for projects from the Weill Cornell Medicine Qatar campus
- `weill`: for projects from the Weill Cornell Medicine NYC campus
Each project has a directory on the BeeGFS parallel file system shared by all its members. The default quota is 10 TB per project.
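As a sketch of typical usage, assuming a hypothetical Ithaca project named `abc123_0002` (the project name used in the scheduler examples below; substitute your own institution and project name; the per-user subdirectory layout is just one possible convention):
```bash
# Stage job data in the project's BeeGFS directory
PROJECT_DIR=/mnt/beegfs/ithaca/abc123_0002
mkdir -p "$PROJECT_DIR/$USER/run01"        # per-user subdirectory (convention, not a requirement)
cp input.dat "$PROJECT_DIR/$USER/run01/"
du -sh "$PROJECT_DIR"                      # rough usage check against the 10 TB project quota
```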
Local Scratch
- Path: `/tmp`
Each login node and compute node has a local scratch disk mounted on `/tmp`.
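Note that `/tmp` is local to each node, so anything written there must be copied off before the job ends. A minimal sketch of using node-local scratch from within a job script (the directory name is arbitrary):
```bash
# Create a unique scratch directory so concurrent jobs do not collide
SCRATCH=$(mktemp -d /tmp/${USER}_job_XXXXXX)
cd "$SCRATCH"
# ... run the I/O-heavy part of the job here, writing output into $SCRATCH ...
cp -r "$SCRATCH"/. "$SLURM_SUBMIT_DIR/"    # copy results back (SLURM_SUBMIT_DIR is set by Slurm)
rm -rf "$SCRATCH"                          # clean up node-local scratch when done
```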
Scheduler
Users gain access to compute nodes through the Slurm scheduler. The Slurm Quick Start Guide is a great place to start. For more detailed explanations, see CAC's Slurm page.
Partitions
Currently the Seneca cluster has 1 partition:
| Partition | Nodes | Resources |
|---|---|---|
| gpu_only | c000[1-4] | Each node has 128 CPU threads, 1 TB RAM, 4 Nvidia H100 GPUs |
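To check the current state of the nodes in this partition, the standard Slurm `sinfo` command can be used, for example:
```bash
# Show node states and limits for the gpu_only partition
sinfo -p gpu_only
# Longer per-node listing including CPU and memory details
sinfo -p gpu_only -N -l
```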
Submit a Job
- Interactive job: `srun --pty <options> /bin/bash`
  After Slurm allocates a node as specified by your options, you will be given a login prompt on the compute node to run jobs interactively.
- Batch job: `sbatch <options> <job script>`
  Slurm will run the job script on the allocated node(s) as specified by your options.
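After submitting, standard Slurm commands can be used to monitor or cancel jobs; for example (the job ID 12345 is a placeholder):
```bash
squeue -u $USER           # list your pending and running jobs
scontrol show job 12345   # detailed information about a specific job
sacct -j 12345            # accounting information, including finished jobs
scancel 12345             # cancel a job
```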
The following options are relevant on the Seneca cluster. Required options are denoted by *.
- `-p`, or `--partition`*: Partition. Currently `gpu_only` is the only partition; more partitions will be added in the future.
- `-A`, or `--account`*: Slurm account/CAC project against which this job will be billed.
  If you do not specify `--account`, your Slurm DefaultAccount will be used automatically. You can view your DefaultAccount with: `sacctmgr show user $USER format=User,DefaultAccount`
  Qatar projects are always allowed to submit to the gpu_only partition. All other projects must have a positive balance at job submission time or the request will be rejected.
- `--gres`*: Requested GPU resource.
  For the gpu_only partition, at least 1 GPU must be requested like this: `--gres=gpu:<type>:<count>`
  For example, to request 2 Nvidia H100 GPUs: `--gres=gpu:h100:2`
- `--time`*: Time limit in the HH:MM:SS format.
  The maximum time limit for the gpu_only partition is 72 hours (3 days). If you need to run more than 72 hours in the gpu_only partition, email your request to help@cac.cornell.edu.
- `--qos`: `longrun` for long-running (time limit > 72 hours) jobs.
  If approved to run more than 72 hours, use the `--qos=longrun` option to request a time limit longer than 72 hours. For example:
  ```
  # OK: 3 days (72h) or less
  --time=48:00:00
  # OK: >72h with longrun QOS if your project is approved to run longer than 72 hours
  --time=96:00:00 --qos=longrun
  # FAIL: >72h without longrun QOS
  --time=96:00:00
  ```
Here are some minimal examples:
- An interactive job with 4 CPU threads (2 physical CPU cores) and 1 Nvidia H100 GPU, with a time limit of 2 hours:
  `srun -p gpu_only --account=abc123_0002 --gres=gpu:h100:1 --time=02:00:00 -c 4 --pty /bin/bash`
- The following job script (`gpu_job.sh`) can be submitted using the `sbatch gpu_job.sh` command to run on 4 CPU threads (2 physical CPU cores) and 1 Nvidia H100 GPU, with a time limit of 2 hours:
  ```
  #!/bin/bash
  #SBATCH --job-name=my_gpu_job
  #SBATCH --account=abc123_0002
  #SBATCH --partition=gpu_only
  #SBATCH --gres=gpu:h100:1
  #SBATCH --time=02:00:00
  #SBATCH -c 4

  module load python/3.10
  python my_gpu_script.py
  ```
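To confirm that the requested GPUs were actually allocated, a quick check can be added inside the job script or run in an interactive session. This sketch assumes `nvidia-smi` is available on the compute nodes' default path; `CUDA_VISIBLE_DEVICES` is typically set by Slurm for `--gres=gpu:...` allocations:
```bash
# Show which GPUs Slurm assigned to this job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L    # list the GPUs visible to the job
```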