SENECA

General Information

  • Login node: seneca-login1.cac.cornell.edu (access via ssh)
    • OpenHPC deployment running Rocky Linux 9
    • Scheduler: Slurm 25.05.3
  • Please send any questions and report problems to: cac-help@cornell.edu

How To Login

  • To get started, login to the login node seneca-login1.cac.cornell.edu via ssh.

    ssh cacuser@seneca-login1.cac.cornell.edu <-- substitute your cacid for cacuser

  • You will be prompted for your CAC account password. If you need to change your password, go to the CAC portal.

  • If you are unfamiliar with Linux and ssh, we suggest reading the Linux Tutorial and looking into how to Connect to Linux before proceeding.
  • NOTE: Users should not run codes on the login node. Users who do so will be notified and have their privileges revoked.
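
  • Optional: if you connect frequently, an entry in your local ~/.ssh/config (standard OpenSSH, on your own machine) saves typing:

    # ~/.ssh/config -- substitute your cacid for cacuser
    Host seneca
        HostName seneca-login1.cac.cornell.edu
        User cacuser

    After that, ssh seneca is all you need.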

Hardware and Networking

Partitions

"Partition" is the term used by slurm for designated groups of compute nodes

  • gpu_only partition
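
    To see the nodes in this partition and their resources (CPUs, memory, GPUs), one option is sinfo's format flags:

    sinfo -p gpu_only -N -o "%N %c %m %G"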

Running Jobs / Slurm Scheduler

CAC's Slurm page explains what Slurm is and how to use it to run your jobs. Please take the time to read this page, giving special attention to the parts that pertain to the types of jobs you want to run.

The Slurm Quick Start guide is a great place to start.

  • NOTE: Users should not run codes on the login node.
    A few Slurm commands to get familiar with initially:
    
    sinfo -l
    scontrol show nodes
    scontrol show partition
    
    Submit a job: sbatch testjob.sh
    
    scontrol show job [job id]
    scancel [job id]
    
    squeue -u userid
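
    The testjob.sh used above is not shown; a minimal version for this cluster might look like the following (the account is a placeholder, and the gpu_only requirements are detailed in the next section):

    #!/bin/bash
    # Substitute your project account for hwh57_0002
    #SBATCH --job-name=testjob
    #SBATCH --account=hwh57_0002
    #SBATCH --partition=gpu_only
    #SBATCH --gres=gpu:h100:1
    #SBATCH --time=00:05:00

    hostname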
    

How to submit a GPU job

  • Know Your Account

    You must specify a valid Slurm account (--account=) for your job.

    Use the account given to you for your project (e.g., hwh57_0002 for Qatar sysadmins).
    Your account must also be permitted in gpu_only:

    • Projects in qatar: always allowed.
    • Projects in ithaca / weill: allowed only if your project's balance is positive (balance > 0) in the project list.
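
    If you are unsure which accounts you are associated with, Slurm's accounting tool can list them (a quick check, assuming sacctmgr queries are open to users):

    sacctmgr show associations user=$USER format=Account,Partition,QOS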

  • Request the GPU Partition Explicitly

    When using srun, Slurm ignores any #SBATCH --partition line in a script,
    so you must give the partition on the command line.

    Example:

    -p gpu_only    or:    --partition=gpu_only

  • Request a GPU Resource

    Jobs in gpu_only must include a valid --gres specification in the format:

    --gres=gpu:<type>:<count>

    For example, to request one NVIDIA H100 GPU:

    --gres=gpu:h100:1

    This applies to both job types:

    Interactive jobs (srun CLI) → use --gres on the command line.
    Batch jobs (sbatch script) → include:

    #SBATCH --gres=gpu:h100:1
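
    Once the job starts, you can sanity-check the allocation from inside it (assuming the NVIDIA driver utilities are installed on the compute nodes):

    nvidia-smi                   # lists the GPUs visible to your job
    echo $CUDA_VISIBLE_DEVICES   # Slurm typically sets this for --gres=gpu jobs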
    
  • Time Limits and QOS Rules

    The default maximum time in gpu_only is 72 hours (3 days). For jobs longer
    than 72 hours you must request approval from IT; once approved, you will be
    able to use the QOS longrun:

    --qos=longrun

    Without --qos=longrun, jobs over 72h will be rejected. Examples:

    # OK: 3 days (72h) or less
    --time=48:00:00
    
    # OK: >72h with longrun QOS
    --time=96:00:00 --qos=longrun
    
  • Minimal Examples

    Interactive job:

    srun -p gpu_only --account=hwh57_0002 --gres=gpu:h100:1 --time=02:00:00 /bin/bash

    Batch job script (gpu_job.sh):

   
    #!/bin/bash
    #SBATCH --job-name=my_gpu_job
    #SBATCH --account=hwh57_0002
    #SBATCH --partition=gpu_only
    #SBATCH --gres=gpu:h100:1
    #SBATCH --time=02:00:00

    module load python/3.10
    python my_gpu_script.py

    Submit/Run with:
    sbatch gpu_job.sh
  • Common Rejection Reasons

    Your job will be denied if:

    • No --account specified.
    • Partition not specified on srun CLI (for interactive jobs).
    • Account not allowed in gpu_only (balance ≤ 0 or wrong group).
    • No valid --gres=gpu:<type>:<count>.
    • Requested time > 72h without --qos=longrun.
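
    One way to catch most of these before your job sits in the queue is sbatch's --test-only flag, which validates the script and options without actually submitting:

    sbatch --test-only gpu_job.sh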

Write your job's scratch data to /tmp to avoid heavy I/O on your NFS-mounted $HOME!

Explanation: /tmp refers to a local directory that is found on each compute node. It is faster to use /tmp because when you read and write to it, the I/O does not have to go across the network, and it does not have to compete with the other users of a shared network drive (such as the one that holds everyone's /home).

To look at files in /tmp while your job is running, you can ssh to the login node, then do a further ssh to the compute node that you were assigned. Then you can cd to /tmp on that node and inspect the files in there with cat or less.
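
For example (the node name comes from squeue's NODELIST column):

    squeue -u $USER    # note which node your job is running on
    ssh [nodename]     # from the login node
    cd /tmp && ls -l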

Note: if your application produces thousands of output files that you need to save, it is far more efficient to put them all into a single tar or zip file before copying it to $HOME as the final step.
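
A sketch of this whole pattern as a batch script; myapp and the file names are placeholders, and the #SBATCH options follow the GPU rules above:

    #!/bin/bash
    # Substitute your project account for hwh57_0002
    #SBATCH --job-name=tmp_scratch_demo
    #SBATCH --account=hwh57_0002
    #SBATCH --partition=gpu_only
    #SBATCH --gres=gpu:h100:1
    #SBATCH --time=02:00:00

    # Work in a job-specific directory under node-local /tmp
    SCRATCH=/tmp/$USER/$SLURM_JOB_ID
    mkdir -p "$SCRATCH"
    cd "$SCRATCH"

    # ... run your application here, writing its output into $SCRATCH ...
    myapp --output-dir "$SCRATCH"

    # Bundle everything into one archive and copy it back as the final step
    tar czf "$HOME/results_$SLURM_JOB_ID.tar.gz" -C "$SCRATCH" .

    # Clean up the node-local scratch space
    rm -rf "$SCRATCH"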

Software

Software List