Overview of CADES Kernel
CADES Kernel will be the set of resources and services deployed in the NCCS Open data enclave to serve ORNL researchers and their close collaborators. One of the first pieces to be deployed will be an 18,000-core cluster called Baseline. It will mount a 2.3 PB partition of the Wolf GPFS filesystem for fast parallel active storage. Each user will be given a 30 GB home area hosted on the NCCS NFS NetApp. CADES users will be able to apply for allocations on the Nearline persistent storage system. Baseline will be offered at no additional cost to users and will be open to all ORNL research and technical staff.
Baseline Cluster Fact Sheet
Baseline HPC: The Baseline HPC footprint is a single large (~18k-core) HPC cluster available to all users in the lab who have an NCCS Open account. This resource will initially be fair-share scheduled across all users. Anyone who needs more time will follow a well-defined request process before the Institutional Compute and Data Advisory Council, similar to the process for submitting proposals for time on Summit.
Baseline Compute Nodes
Nodes | Cores-per-node | Processor | Memory | GPU | GPU Memory | Chassis |
---|---|---|---|---|---|---|
68 | 128 | 2X AMD 7713 | 256 GB | N/A | N/A | 17 |
72 | 128 | 2X AMD 7713 | 512 GB | N/A | N/A | 18 |
Login Nodes
Baseline has 4 login nodes with the same hardware configuration as the higher-memory compute nodes.
Nodes | Cores-per-node | Processor | Memory | GPU | GPU Memory | Chassis |
---|---|---|---|---|---|---|
4 | 128 | 2X AMD 7713 | 512 GB | N/A | N/A | 1 |
Scheduling Policy
Baseline's scheduling policy will be a modified fair-share policy. There will be limits on maximum walltime. This policy will be updated as needed to keep job throughput moving.
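For illustration only, and assuming a Slurm-based batch scheduler (the scheduler, partition name, node counts, and walltime value below are assumptions rather than confirmed Baseline settings), a job script that respects a maximum-walltime limit might look like this sketch:

```bash
#!/bin/bash
# Minimal sketch of a batch job script, assuming a Slurm scheduler.
# The partition name and walltime below are placeholders; check the
# actual Baseline limits and partition names before submitting.
#SBATCH -J example-job    # job name
#SBATCH -p batch          # assumed partition name
#SBATCH -N 2              # number of nodes
#SBATCH -n 256            # total tasks (128 cores per node x 2 nodes)
#SBATCH -t 02:00:00       # walltime request; must stay under the policy limit

srun ./my_application input.dat
```

A script like this would be submitted with `sbatch`, and the fair-share policy would determine its priority relative to other users' jobs.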
The CADES team will consider reservation requests for urgent deadlines or real-time experiments. If reservations become a disruption to other users, we reserve the right to limit them. The Resource Utilization Council (RUC) will help decide reservation approvals.
Storage
Baseline has three different types of associated storage, with each option optimized for a different stage of the data lifecycle.
NFS User Home Area Storage
Upon login, users land in their personal home area, which is hosted on a Network File System (NFS) provided by NetApp. Each user has a hard quota of 30 GB for their home area.
User home areas are designed to be the place where users keep actively used application codes, scripts, and starting data for applications. The home area is read-only from the compute nodes, meaning that the compute nodes cannot write data directly to the NFS home areas; instead, a fast parallel filesystem is provided that is specifically optimized for parallel data I/O.
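For example, assuming a job can see both filesystems (the scratch path below is a placeholder, not a confirmed Wolf mount point), input data would be staged from the home area to scratch and the application run from the scratch directory:

```bash
# Hedged example: stage input data from the read-only NFS home area to
# GPFS scratch before running. The scratch path is a placeholder; use the
# path provided for your project on Wolf.
SCRATCH=/gpfs/wolf/<project>/scratch/$USER   # placeholder path
mkdir -p "$SCRATCH/run01"
cp "$HOME/inputs/config.dat" "$SCRATCH/run01/"
cd "$SCRATCH/run01"
./my_application config.dat   # all output lands on the writable scratch area
```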
Fast Parallel Scratch Storage
CADES users share 2.3 PB of the Wolf General Parallel File System (GPFS), which sits in the NCCS Open Science enclave. This is the scratch area for active data generated by user applications. GPFS is ideal for the parallel reads and writes done by HPC codes, and it is fast and efficient. Wolf GPFS is not designed for long-term storage: it is not backed up, and files older than 90 days are continually purged. Users will want to move valuable data and codes off of Wolf as soon as they are no longer needed for an active simulation or analysis campaign.
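As a rough sketch (not an official purge tool, and using the same placeholder scratch path as above), a user might periodically check for files approaching the 90-day window and copy anything still needed off of Wolf:

```bash
# Hedged example: list scratch files not modified in the last 80 days,
# then copy a small results directory back to the home area before the
# 90-day purge removes it. The scratch path is a placeholder.
SCRATCH=/gpfs/wolf/<project>/scratch/$USER
find "$SCRATCH" -type f -mtime +80 -ls            # files at risk of purge
cp -r "$SCRATCH/run01/results" "$HOME/results/"   # small results back to home
```

Larger datasets that must be retained would be candidates for the persistent and archival storage options described below, rather than the 30 GB home area.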
Persistent and Archival Storage
We are planning a persistent storage option that CADES users will access through an allocation process.
Future Kernel Systems
In 2023 CADES is planning to procure a pay-per-allocation cluster with a hardware architecture similar to Baseline's.
- Divisions and programs would buy time on the system rather than purchasing the nodes directly.
- Users would only pay for the time that they use.
- CADES is in the process of gathering requirements, designing this system, and developing its cost model. Please let us know your needs. To meet with the program manager, please email Suzanne Parete-Koon at paretekoonst@ornl.gov.
- More details will be available after the design of the system is finalized.