Research Computing Guidance

Budget Guidance for HPC

Whether you are starting a new research project at Mines or continuing one, this document will help you find the resources to get your High-Performance Computing (HPC) workload done. Free compute cycles are available through programs such as RMACC (via CU Boulder's Alpine) and the NSF ACCESS program; these can provide exploratory compute cycles before a grant is awarded and may continue to be used throughout the project.

Mines' on-premise HPC systems are operated under an Operational Expenditure (OpEx) cost model, in which researchers pay for the CPU-hours they use. This document aims to help you write future research grants that use HPC resources at Mines.

Schedule a consultation appointment at RC.mines.edu

Life Cycle of Research Computing Projects

Life Cycle of an HPC Project

Stage 1: Identify area for exploratory research to expand current scientific knowledge base

  • Choice of Software

    • Vet software for security and other restrictions

    • Open Source vs Commercial

      • Commercial: Licensing requirements

        • FlexLM server

        • Site-wide, single user, multi-user, etc.

      • Open Source

        • Optimized build for compute architecture

    • Self-authored software

      • Identify the required software languages, compilers, libraries, etc. (a minimal environment-recording sketch follows this list)
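
For self-authored software, it helps to record the exact environment the code was verified against. Below is a minimal Python sketch that prints the interpreter, platform, and library versions; the listed packages (numpy, scipy, mpi4py) are illustrative assumptions, not a prescribed stack.

    import importlib
    import platform
    import sys

    # Record the interpreter and platform so the build environment is reproducible.
    print(f"Python   : {sys.version.split()[0]}")
    print(f"Platform : {platform.platform()}")
    print(f"Compiler : {platform.python_compiler()}")

    # Illustrative dependency list; replace with your code's actual requirements.
    for pkg in ("numpy", "scipy", "mpi4py"):
        try:
            mod = importlib.import_module(pkg)
            print(f"{pkg:<8}: {getattr(mod, '__version__', 'unknown')}")
        except ImportError:
            print(f"{pkg:<8}: not installed")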

Stage 2: Development of scripts, scaling & benchmarking

  • Identify workload and workflows

    • Common Parallel Workflows:

      • Multi-node (e.g. MPI)

      • Multi-threaded (e.g. OpenMP)

      • Single or multi-GPU (NVIDIA using CUDA, AMD using OpenCL, etc.)

      • Embarrassingly parallel (e.g. parameter sweeps, Markov Chain Monte Carlo simulations; a minimal sketch follows this list)

    • Other considerations

      • High memory (relative to CPU requirements)

      • High data I/O (e.g. machine learning)
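
To make the embarrassingly parallel case concrete, here is a minimal parameter-sweep sketch using Python's standard multiprocessing module; the model function and parameter grid are placeholders for a real simulation.

    from multiprocessing import Pool

    def run_model(params):
        """Placeholder for one independent simulation run (hypothetical model)."""
        a, b = params
        return a * b  # stand-in for the real computation

    if __name__ == "__main__":
        # Every parameter combination is independent (no communication between
        # tasks), which is what makes the sweep embarrassingly parallel.
        grid = [(a, b) for a in range(4) for b in range(4)]
        with Pool(processes=4) as pool:
            results = pool.map(run_model, grid)
        print(results)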

Stage 2 (alternative): Use a published benchmark (if you have little or no experience with the software) or perform manual benchmarking (if you have an established codebase or simulation input deck)

  • Run test problem across multiple hardware configurations:

    • Local workstation or laptop

    • Freely Accessible HPC

    • Performance Analysis Consultation Checklist

      • A working test problem of model/code

        • What compute resources were used to verify the code? (no. of cores, memory, etc.)

      • Model can use multiple CPUs and/or nodes

      • Anticipated number of runs of model

        • Parameter matrix

    • Outcomes: Performance Analysis report

      • Weak scaling (problem size grows in proportion to the number of processors)

      • Strong scaling (fixed problem size -> increasing number of processors)

      • The scaling analysis estimates the number of core-hours per job the user will need (a minimal sketch follows this list)
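
The sketch below shows how strong-scaling timings translate into speedup, parallel efficiency, and a core-hours-per-job estimate; the timing numbers are illustrative, not measured on any Mines system.

    # Illustrative strong-scaling timings: wall-clock hours for a fixed problem
    # size run on increasing core counts (the numbers are made up).
    timings = {1: 64.0, 8: 9.1, 16: 5.0, 32: 3.1}

    t1 = timings[1]
    for cores, t in sorted(timings.items()):
        speedup = t1 / t
        efficiency = speedup / cores
        core_hours = cores * t
        print(f"{cores:>3} cores: speedup {speedup:5.1f}, "
              f"efficiency {efficiency:4.0%}, {core_hours:6.1f} core-hours/job")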

Stage 3: Pre-Funded HPC options

  • No funding:

    • CU Boulder’s Alpine via RMACC

    • Open Science Grid (if the user has an embarrassingly parallel workflow)

    • NSF ACCESS Program

      • A proposal is required, but a funded grant is not

Stage 3 (grant-supported HPC): Perform baseline parallel and efficiency analysis on the pre-funded HPC choice

  • Use the results as an approximate scale -> estimate the budget (a minimal sketch follows this list)

  • Secure funding

  • Secure an allocation on a paid/NSF cluster, or fund AWS HPC
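
As a rough budgeting sketch: the job count and configuration below are hypothetical, while the $0.02/CPUh rate is the Wendian figure from the parameter list at the end of this document.

    # Hypothetical campaign sized from the scaling analysis above:
    # 500 production runs at 32 cores for roughly 3.1 hours each.
    jobs = 500
    cores_per_job = 32
    hours_per_job = 3.1
    rate_per_core_hour = 0.02  # Wendian rate (see "Parameters to Determine")

    core_hours = jobs * cores_per_job * hours_per_job
    cost = core_hours * rate_per_core_hour
    print(f"Total: {core_hours:,.0f} core-hours -> ${cost:,.2f}")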

Stage 4: Full-scale Computing & Visualization

  • Setting up computing environment

    • Request software install

    • Project Allocation

    • User access account

    • Shared storage and project storage

  • Run production batch jobs (a minimal submission sketch follows this list)

  • Data management

  • Visualization setups
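
Production work on HPC systems is typically submitted through a batch scheduler. The sketch below assumes a Slurm scheduler; it writes and submits a hypothetical job script in which the account name, resource counts, and executable are all placeholders.

    import subprocess
    import textwrap

    # Hypothetical Slurm job script: the account, core counts, and executable
    # are placeholders; substitute your project's actual values.
    script = textwrap.dedent("""\
        #!/bin/bash
        #SBATCH --job-name=prod_run
        #SBATCH --nodes=2
        #SBATCH --ntasks-per-node=16
        #SBATCH --time=04:00:00
        #SBATCH --account=my_project
        srun ./my_model input.deck
    """)

    with open("job.sbatch", "w") as f:
        f.write(script)

    # Submit to the scheduler (requires a Slurm cluster).
    subprocess.run(["sbatch", "job.sbatch"], check=True)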

Stage 5: Post-processing & continued training of new researchers

  • Data insights, data sharing, project publications

  • Training for new researchers

Stage 6: Project close-out and data transfer to long term storage

  • Metadata tagging

  • Archive ready datasets

Parameters to Determine

  1. Nodes x Cores: total number of cores used per simulation

  2. Total jobs: number of jobs needed to complete the research project

  3. Core Run Time (Cores * Hours): work done by each CPU core, measured in core-hours (CPUh)

  4. Storage of Input and Output: input data sets (multi-use or single-use)

  5. Cost / (Core-Hour): Wendian = $0.02/CPUh; Apex (AWS cloud) = $0.04/CPUh and up, with GPUs costing considerably more (e.g. up to $96/hour)

  6. Storage during & after simulation: S3 bucket, Orebits, fast NVMe

  7. Job Parametric Study: number of simulations needed to assess the science domain

  8. GPU-Hour (GPUh): increased cost for specialized resources. On Wendian, one V100 GPU-hour = 6 core-hours, or $0.12 (a worked example follows this list)
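
As a worked example tying these rates together, the sketch below converts a hypothetical set of GPU jobs into equivalent core-hours and dollars using the Wendian figures above; the job count and run times are made up.

    CPU_RATE = 0.02           # $/core-hour on Wendian
    GPU_EQUIV_CORE_HOURS = 6  # one V100 GPU-hour is billed as 6 core-hours
    GPU_RATE = GPU_EQUIV_CORE_HOURS * CPU_RATE  # = $0.12 per GPU-hour

    # Hypothetical parameter study: 40 single-GPU jobs of 8 hours each.
    gpu_hours = 40 * 1 * 8.0
    print(f"{gpu_hours:.0f} GPU-hours = {gpu_hours * GPU_EQUIV_CORE_HOURS:.0f} "
          f"equivalent core-hours -> ${gpu_hours * GPU_RATE:,.2f}")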