NEMO Status and Usage

NEMO: normal operation with limitations

Published: 12 Apr 2024 by HPC Team Freiburg

After maintenance, we noticed problems with Omni-Path communication on the GPU node and were unable to resolve them. For the time being, we have decided to run this machine with 10 Gbit/s Ethernet only. This affects only jobs computed on the GPU machine with the 8x Nvidia V100 cards.

NEMO Queue

The following diagrams show the current status of the NEMO queue. These diagrams give only a rudimentary estimate of the current status of the queue. The start of each job depends on the current cluster utilization, the resource requirements of the job and the historical usage (fairshare).

The outer line of this graph shows the available cores for jobs in %. All nodes, including GPU nodes and nodes for interactive jobs, are used for this calculation. The inner line shows all idle jobs and the cores they require as a percentage (see NEMO Job Cores for absolute numbers). Fairshare, job priorities and special job requirements are not considered in this image.

This diagram shows the current cluster utilization and how many resources will be free in the next 6h, 12h, 24h and 48h. Each job reserves resources for a certain time, which is specified when the job is submitted. The remaining runtime of each job is calculated from this reservation and displayed in this diagram. Normally, jobs end earlier than the reserved time, so only the worst case is shown here.
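The worst-case forecast described above can be sketched as follows. This is only an illustration, not the code that generates the diagram: the function name, the job list and the total of 200 cores are made up, and the sketch assumes that no queued job starts in the meantime.

```python
def free_cores_forecast(total_cores, running_jobs, horizons=(6, 12, 24, 48)):
    """Worst-case number of free cores at each horizon (in hours).

    running_jobs is a list of (cores, remaining_hours) tuples, where
    remaining_hours is the reserved time minus the time already elapsed.
    Assumption: every job runs until its reservation ends and no new
    job starts, so the result is a lower bound on freed resources.
    """
    forecast = {}
    for hours in horizons:
        # A job still blocks its cores if its reservation ends after the horizon.
        still_busy = sum(cores for cores, remaining in running_jobs if remaining > hours)
        forecast[hours] = total_cores - still_busy
    return forecast


# Hypothetical example: 200 cores in total, four running jobs.
jobs = [(20, 4.0), (40, 10.0), (20, 30.0), (100, 72.0)]
print(free_cores_forecast(200, jobs))  # {6: 40, 12: 80, 24: 80, 48: 100}
```

Because jobs normally finish before their reservation ends, the real number of free cores at each horizon is usually higher than this estimate.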

These are the current job core numbers in the NEMO queue. Jobs are usually blocked only temporarily, either because some dependencies are not met or because a user exceeds a usage limit. They are moved to the idle job queue automatically once the running jobs are finished. Please run showq -b -v or checkjob -v JobID on the cluster to determine the reason for the block.

These are the current job numbers in the NEMO queue. Jobs are usually blocked only temporarily, either because some dependencies are not met or because a user exceeds a usage limit. They are moved to the idle job queue automatically once the running jobs are finished. Please run showq -b -v or checkjob -v JobID on the cluster to determine the reason for the block.

This is the NEMO job queue. (*) EST in the queue shows the expected start time for the job with the highest priority. Jobs are usually blocked only temporarily, either because some dependencies are not met or because a user exceeds a usage limit. They are moved to the idle job queue automatically once the running jobs are finished. Please run showq -b -v or checkjob -v JobID on the cluster to determine the reason for the block.

NEMO Usage Statistics

These numbers are only an estimate. The VRE number is calculated by adding up all calculations from groups that use VREs. Calculations that these groups run on bare metal are still counted as VRE usage. However, we assume that almost all jobs computed by these groups are executed using VREs.

NEMO Project Usage

These numbers are only an estimate. We assume that four of our standard Intel nodes (Broadwell) together consume between 700 and 1000 W including the chassis. The average consumption is around 800 to 850 W, depending on the workload of the cores and the application. With four machines per chassis, one machine therefore consumes about 200 W. The scheduler can allocate 20 physical cores per machine, which means that one core consumes about 10 W. These numbers do not take into account storage, switches, racks and other required infrastructure. Example: If you use a whole node with 20 cores for 10 hours, you will consume 20 * 10 W * 10 h = 2000 Wh = 2 kWh.
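The estimate above can be reproduced with a few lines of Python. This is a minimal sketch that uses the rounded figures from the text (800 W per chassis, four nodes per chassis, 20 cores per node); the constant and function names are made up for illustration, and the result excludes storage, network and other infrastructure.

```python
# Rounded figures from the estimate above; these are assumptions, not measurements.
WATTS_PER_CHASSIS = 800   # four Broadwell nodes including the chassis
NODES_PER_CHASSIS = 4
CORES_PER_NODE = 20       # cores the scheduler can allocate per node


def watts_per_core():
    """Approximate power draw of a single core in watts."""
    return WATTS_PER_CHASSIS / NODES_PER_CHASSIS / CORES_PER_NODE


def job_energy_kwh(cores, hours):
    """Approximate energy used by a job in kWh (compute nodes only)."""
    return cores * watts_per_core() * hours / 1000


print(watts_per_core())        # 10.0
print(job_energy_kwh(20, 10))  # 2.0
```

For example, a job that uses 40 cores for 24 hours would come to about 9.6 kWh under the same assumptions.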