CADES OR DGXs

Check eligibility to access the CADES Open Research (OR) DGX machines

The DGXs are available to users in the UNIX groups that correspond to the birthright and CCSD Slurm accounts. These UNIX groups are cades-birthright and cades-ccsd.

For example, for a fictional user 'abc' on any CADES machine:

* [abc@or-slurm-login07 ~]# groups abc
    abc : users cades-ccsd cades-birthright ucams
* [abc@or-slurm-login07 ~]#

Here, membership in cades-ccsd gives user 'abc' access through the CCSD Slurm account, and membership in cades-birthright gives user 'abc' access through the Birthright Slurm account.

Please check your eligibility to access the DGX resources in Open Research.
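
The eligibility check can also be scripted. The following is a minimal sketch, not a CADES-provided tool: dgx_accounts is a hypothetical helper name, and the labels it prints (CCSD, Birthright) are descriptive only and may not match the exact Slurm account names.

```shell
# Hypothetical helper (not a CADES tool): report which DGX Slurm accounts
# a list of UNIX group names unlocks. Pass the group names as arguments.
dgx_accounts() {
    local g
    local found=1                    # exit status 1 until a matching group is seen
    for g in "$@"; do
        case "$g" in
            cades-ccsd)       echo "CCSD";       found=0 ;;
            cades-birthright) echo "Birthright"; found=0 ;;
        esac
    done
    return "$found"                  # 0 if at least one DGX group matched
}
```

To check your own account, run: dgx_accounts $(id -Gn). A non-zero exit status means neither group was found.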

Access to CADES or DGX

As mentioned earlier, the DGXs are open to users with birthright or CCSD access for submitting Slurm jobs.

Login access to these systems is gated by the Slurm submit nodes. Users can connect with:

* ssh or-dgx-login01.ornl.gov 
     or 
* ssh or-dgx-login02.cades.ornl.gov

Users may also log in directly to the two compute nodes, although this access might be restricted to users who already have running jobs.

* ssh ucamsID@dgx2-a.ornl.gov 
     or 
* ssh ucamsID@dgx2-b.ornl.gov

The two nodes are in separate queues, so you can choose one or the other by queue name:

For CCSD users:

* #SBATCH -p dgx2a
     or
* #SBATCH -p dgx2b

The max wall time for the dgx2a and dgx2b queues is unlimited.

For birthright users:

* #SBATCH -p dgx2a-birthright
     or
* #SBATCH -p dgx2b-birthright

The max wall time for the birthright queues is 72 hours, and CCSD jobs can preempt birthright jobs.
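
The four queue names above follow a simple pattern, which can be captured in a small helper. This is only a sketch: dgx_partition is a hypothetical function name, and the arguments 'ccsd' and 'birthright' are labels chosen for illustration, not necessarily the Slurm account names.

```shell
# Hypothetical helper: map an access type (ccsd or birthright) and a node
# letter (a or b) to one of the queue names listed above.
dgx_partition() {
    local access="$1"
    local node="$2"
    case "$node" in
        a|b) ;;
        *) echo "dgx_partition: node must be 'a' or 'b'" >&2; return 1 ;;
    esac
    case "$access" in
        ccsd)       echo "dgx2${node}" ;;
        birthright) echo "dgx2${node}-birthright" ;;
        *) echo "dgx_partition: unknown access type: $access" >&2; return 1 ;;
    esac
}
```

For example, a birthright user could submit to the first node with: sbatch -p "$(dgx_partition birthright a)" --time=72:00:00 job.sh, where job.sh stands in for your own batch script and the time limit matches the 72-hour maximum.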

Software on DGX

Software on the DGXs is delivered by container. A directory called /containers holds several containers built for common software such as TensorFlow. Beyond that, users are expected to build their own software for now.

Storage on DGX

The DGXs mount NFS and Lustre in the standard way for the CADES OR condo. In addition, there is very fast local storage at /localscratch/data/ that is meant for fast I/O during job execution. Please note that data on /localscratch/data/ should be copied to Lustre or NFS for long-term storage.

So to summarize:

* NFS                    /home/nfs/*                  Persistent long after job completion.
* Lustre                 /lustre/or-scratch/*         Persistent long after job completion.
* Fast Local Scratch     /localscratch/data/*         Self-serve, persistence not guaranteed for extended periods beyond a job's lifecycle.
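
Because local scratch is not guaranteed to persist, a job would typically copy its results to Lustre or NFS before it exits. The following is a minimal sketch of that stage-out step; stage_out is a hypothetical helper name, and any directory layout you use under the mount points above is your own choice.

```shell
# Hypothetical helper: copy everything in a scratch directory to a
# persistent directory (for example on Lustre or NFS), preserving
# timestamps and permissions.
stage_out() {
    local src="$1"
    local dest="$2"
    if [ ! -d "$src" ]; then
        echo "stage_out: source directory not found: $src" >&2
        return 1
    fi
    # "$src/." copies the contents of src, including dotfiles, into dest.
    mkdir -p "$dest" && cp -a "$src/." "$dest/"
}
```

A batch script could call this as its last step, for example: stage_out /localscratch/data/<your-dir> /lustre/or-scratch/<your-dir>, with both placeholders replaced by your own directories.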