CEEC Progress Towards More Energy-Efficient CFD

Energy efficiency has become a central challenge for modern high-performance computing (HPC) systems, where escalating computational demands and architectural complexity lead to significant energy footprints. Here we present our experience in measuring, analyzing, and optimizing energy consumption across major European HPC systems. Through case studies with the representative CFD applications waLBerla, FLEXI/GALÆXI, Neko, and NekRS, we evaluate energy-to-solution and time-to-solution on diverse architectures, including CPU- and GPU-based partitions of LUMI, MareNostrum5, MeluXina, and JUWELS Booster. Our results highlight the advantages of accelerators and mixed-precision techniques for reducing energy consumption while maintaining computational accuracy.
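The two metrics used throughout this post are straightforward to define. As a minimal sketch (assuming node power is sampled periodically, which is how most of the measurement tools below operate), time-to-solution is wall-clock runtime and energy-to-solution is the time integral of sampled power:

```python
# Sketch of the two core metrics, assuming a trace of (timestamp, power)
# samples for the job. The sampling setup here is illustrative.

def time_to_solution(timestamps):
    """Wall-clock runtime: last sample time minus first (seconds)."""
    return timestamps[-1] - timestamps[0]

def energy_to_solution(timestamps, power_watts):
    """Trapezoidal integration of power samples (W) over time (s) -> joules."""
    energy = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        energy += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return energy

# Example: a 10 s run at a constant 300 W consumes 3000 J.
ts = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
pw = [300.0] * len(ts)
print(time_to_solution(ts))        # 10.0
print(energy_to_solution(ts, pw))  # 3000.0
```

A faster run is not automatically a cheaper one: a GPU node may draw several times the power of a CPU node, so energy-to-solution only improves when the speedup outweighs the higher power draw.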

Although this content was originally designed as a poster for EuroHPC Summit 2026 in Cyprus, the event has unfortunately been postponed indefinitely. We hope this blog post can serve as at least a small substitute for the lost opportunity for knowledge exchange.

Lighthouse Case Codes

Within the scope of the CEEC project, the waLBerla multiphysics framework is employed to develop a fully resolved, coupled fluid-particle numerical model, referred to as lighthouse case 4 (LHC4). This model is designed to investigate the phenomenon of piping erosion, which poses a significant threat to geotechnical structures such as offshore wind turbine foundations and dams. waLBerla demonstrates clear benefits of GPU acceleration for coupled fluid-particle simulations when evaluated using node-level SLURM energy measurements. On LUMI, GPU nodes significantly reduce both time-to-solution and energy-to-solution compared to CPU nodes. The study also reveals kernel-dependent behavior: while most components favor GPUs, certain reduction operations remain comparatively energy-efficient on CPUs, highlighting the importance of fine-grained energy analysis.

Alt text: Two side-by-side bar charts compare performance for the “Partition” workload on CPU vs GPU. The legend shows LUMI-C on an AMD EPYC 7763 (blue hatched) and LUMI-G on an AMD MI250X (orange hatched). Left chart, Time-to-solution (seconds): CPU 220.47 s, GPU 13.02 s. Right chart, Energy-to-solution (kJ): CPU 154.74 kJ, GPU 23.67 kJ. Overall, the GPU is much faster and uses much less energy than the CPU for this workload.
Alt text: Two side-by-side bar charts compare energy consumption (GWh) for LUMI-C (AMD) under two PPC values: 7760 (blue diagonal-hatched bars) and 2500 (orange cross-hatched bars), across three stages: Initial, Mapping, and Reconstruction. In the left chart, PPC 7760 is much higher at every stage (Initial 140.6 vs 6.7; Mapping 30.39 vs 1.94; Reconstruction 30.59 vs 1.99). In the right chart, PPC 7760 is higher for Initial and Mapping (Initial 110.6 vs 10.0; Mapping 19.95 vs 5.85), but in Reconstruction PPC 2500 slightly exceeds PPC 7760 (5.48 vs 4.49).
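The node-level measurements above come from SLURM's energy accounting. A typical workflow is to query `sacct -j <jobid> --format=JobID,Elapsed,ConsumedEnergy` after the job finishes; the `ConsumedEnergy` field reports joules with an optional SI suffix (K/M/G). A small post-processing sketch (the sample values below are illustrative, not measured data):

```python
# Sketch of post-processing SLURM accounting output. The sacct command and
# the K/M/G suffix convention are standard SLURM behaviour (energy accounting
# requires an acct_gather_energy plugin on the cluster); the input strings
# here are illustrative placeholders.

def parse_consumed_energy(field):
    """Convert a sacct ConsumedEnergy field (joules, optional SI suffix) to joules."""
    field = field.strip()
    scale = {"K": 1e3, "M": 1e6, "G": 1e9}
    if field and field[-1] in scale:
        return float(field[:-1]) * scale[field[-1]]
    return float(field)

# Illustrative values in the spirit of the waLBerla CPU-vs-GPU comparison.
cpu_joules = parse_consumed_energy("154.74K")
gpu_joules = parse_consumed_energy("23.67K")
print(cpu_joules / 1e3, "kJ on CPU;", gpu_joules / 1e3, "kJ on GPU")
print("GPU energy saving factor:", cpu_joules / gpu_joules)
```

When per-kernel resolution is needed, as in the reduction-operation analysis above, node-level accounting must be complemented with finer-grained on-node counters, since `sacct` reports only whole-job totals.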

In the context of CEEC, the open-source, high-order accurate flow solver FLEXI and its GPU counterpart GALÆXI are utilized to examine the shock buffet on a 3D wing under transonic flight conditions, referred to as lighthouse case 1 (LHC1). Shock buffet leads, among other effects, to increased structural fatigue of the wing over time, with implications for aircraft safety and efficiency. Energy measurements for FLEXI/GALÆXI across multiple European systems show that GPUs consistently outperform CPUs in energy-to-solution when the simulation is large enough to provide sufficient workload per device. Using the energy-normalized performance index (EPID) and energy ratio metrics, the study reveals that energy efficiency degrades faster than parallel efficiency at scale, especially on CPUs. GPU partitions deliver superior energy efficiency and runtime, emphasizing the need for workload-aware scaling and accelerator-centric design in high-order discontinuous Galerkin solvers.

Two side-by-side line charts compare “Parallel efficiency” (blue) and “Energy efficiency” (orange) as scaling increases. Left chart plots efficiency versus number of CPUs on a log scale (about 10^2 to 10^4): parallel efficiency declines gradually from near 1.0 to about 0.75, while energy efficiency drops steeply from near 1.0 to about 0.15. Right chart plots efficiency versus number of devices on a log scale (about 10^1 to 10^2): parallel efficiency stays nearly flat around 0.9–0.95, while energy efficiency starts near 1.0, plunges to about 0.2 mid-range, then partially recovers to roughly 0.5–0.6 at higher device counts.
Side-by-side bar charts compare three systems for a “Partition” workload: MN5-GPP (Intel Sapphire Rapids Xeon 8480+), MN5-ACC (NVIDIA H100), and LUMI-G (AMD MI250X). Left chart (Time-to-solution, ×10^3 s): MN5-GPP 30.61, MN5-ACC 12.81, LUMI-G 9.62 (lowest). Right chart (Energy-to-solution, MJ): MN5-GPP 32.16, MN5-ACC 15.67, LUMI-G 13.34 (lowest). Overall, LUMI-G is fastest and uses the least energy; MN5-GPP is slowest and uses the most energy.
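The two scaling metrics behind these charts can be sketched as follows. This is a hedged sketch assuming a strong-scaling setup: parallel efficiency weighs speedup against the increase in resources, and energy efficiency compares energy-to-solution against the baseline run. The exact EPID definition from the study is not reproduced here, and the numbers below are hypothetical, not measured:

```python
# Hypothetical strong-scaling series illustrating why energy efficiency can
# degrade faster than parallel efficiency: doubling cores does not halve
# runtime, and overhead/idle power makes total energy grow.

def parallel_efficiency(t_base, n_base, t, n):
    """Speedup over the baseline run divided by the increase in resources."""
    return (t_base * n_base) / (t * n)

def energy_efficiency(e_base, e):
    """Baseline energy-to-solution divided by energy-to-solution at scale."""
    return e_base / e

runs = [
    # (cores, time_s, energy_kJ) -- illustrative values only
    (128, 1000.0, 100.0),
    (256, 550.0, 120.0),
    (512, 320.0, 170.0),
]
n0, t0, e0 = runs[0]
for n, t, e in runs:
    print(n, round(parallel_efficiency(t0, n0, t, n), 2),
          round(energy_efficiency(e0, e), 2))
```

In this toy series, parallel efficiency falls to about 0.78 at 512 cores while energy efficiency falls to about 0.59, mirroring the qualitative trend in the left chart above.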

Neko is a portable simulation framework based on high-order Spectral Element Methods (SEM) on hexahedral meshes, mainly focusing on incompressible flow simulations. In this mixed-precision study, we focus on solving the Poisson equation for the pressure, one of the two equations in the incompressible Navier-Stokes formulation, using the preconditioned Conjugate Gradient (PCG) method. For Neko, energy measurements collected with the Energy Aware Runtime (EAR) highlight the strong potential of mixed-precision solvers to reduce energy consumption. By relaxing solver tolerances and combining mixed precision with spectral element discretization, Neko achieves lower energy-to-solution and runtimes faster by up to 1.3x compared to double precision. The results demonstrate a direct link between numerical accuracy requirements, solver configuration, and energy efficiency, supporting adaptive precision as a key strategy for sustainable CFD simulations.

Line chart of residual versus iterations (0–1000) on a logarithmic y-axis (about 10^5 down to 10^-10), titled “E = 8192, N = 7 (28 MPI ranks, gfortran, -O2)”. Three Jacobi solver runs are compared: Global fp64 (light blue dotted) steadily decreases, reaching about 10^-9 by roughly 600–650 iterations; fp64+fp32 (fp64 GS) (orange) also steadily decreases, reaching about 10^-9 by about 800 iterations; Global fp32 (dark blue) drops initially to about 10^-2 around 350–400 iterations but then reverses and grows/oscillates upward to around 10^1 by 700–1000 iterations, indicating loss of convergence in fp32 while fp64 and mixed precision converge.
Two stacked line charts compare time-to-solution versus solver tolerance for mixed-precision (orange squares) and double-precision (blue triangles). Both charts use a logarithmic tolerance axis labeled 10^-10 through 10^-4 and a y-axis labeled “Time-to-solution (s)”. In the top chart, double-precision starts slower (about 10 s at 10^-10) and drops as tolerance loosens, converging with mixed-precision near 10^-6; mixed-precision stays lower and relatively flat, decreasing from roughly 7.5 s to about 6.5 s by 10^-4. In the bottom chart, both methods speed up as tolerance loosens, but mixed-precision remains faster across all tolerances, falling from about 9.2 s at 10^-10 to about 7.5 s at 10^-4, while double-precision drops from about 10.7 s to about 8.3 s.
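The behaviour in the residual chart, where pure fp32 stalls and diverges while the fp64/fp32 combination converges like fp64, is the classic motivation for mixed precision. A minimal sketch of the idea (iterative refinement with an fp32 inner solver and fp64 residual correction; this is an illustration of the principle, not Neko's actual solver):

```python
import numpy as np

# Sketch of mixed precision via iterative refinement: do the bulk of the
# arithmetic in float32, but compute residuals and apply corrections in
# float64, so the attainable accuracy is set by the fp64 outer loop.
# Not Neko's implementation; a small Jacobi inner solve stands in for
# the preconditioned solver.

def solve_fp32(a32, b32, iters=50):
    """Cheap inner solve in float32: plain Jacobi sweeps."""
    d = np.diag(a32)
    off = a32 - np.diagflat(d)
    x = np.zeros_like(b32)
    for _ in range(iters):
        x = (b32 - off @ x) / d
    return x

def mixed_precision_solve(a, b, tol=1e-10, max_outer=50):
    """Outer fp64 refinement loop around the fp32 inner solver."""
    a32 = a.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(max_outer):
        res = b - a @ x                    # residual in fp64
        if np.linalg.norm(res) < tol * np.linalg.norm(b):
            break
        dx = solve_fp32(a32, res.astype(np.float32))
        x = x + dx.astype(np.float64)      # correction applied in fp64
    return x

# Small diagonally dominant SPD system for demonstration.
a = np.array([[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = mixed_precision_solve(a, b)
print(np.linalg.norm(b - a @ x))  # residual far below fp32 round-off
```

The point about relaxed tolerances follows directly: if the application only needs a residual of 10^-6, the `tol` parameter can be loosened, trading unneeded accuracy for fewer iterations and hence less energy.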

During the CEEC project, the GPU-accelerated high-order SEM code NekRS was employed to perform ultra-high-resolution Large Eddy Simulations (LES) for investigating both stable and convective (unstable) Atmospheric Boundary Layer (ABL) dynamics, collectively referred to as LHC5. For validation and model inter-comparison, the Global Energy and Water Cycle Experiment (GEWEX) Atmospheric Boundary Layer Study (GABLS) benchmark was used to represent the stably stratified ABL. NekRS energy analysis using LLview on GPU-based systems shows that energy-to-solution scales proportionally with problem size when GPUs are efficiently utilized. High bandwidth efficiency enables excellent energy performance at scale, but underutilization due to communication overhead leads to rising energy ratios. The results underline that NekRS is primarily bandwidth- and communication-limited, and that maintaining high parallel efficiency is critical for sustaining energy efficiency in large-scale LES simulations.

Bar chart of energy consumption (kWh) versus number of devices for three configurations (512³, 1024³, 2048³). For 512³, energy use stays under 1 kWh and rises slightly from 0.8 (24 devices) to 0.99 (96 devices). For 1024³, energy use is about 8 kWh across 192–384 devices (8.1 at 192, 7.83 at 288, 8.32 at 384). For 2048³, energy use is much higher and increases with device count, from 70 kWh (760 devices) to 82 kWh (2040 devices).
Line chart on a black background showing energy ratio (y-axis, about 0.7 to 1.3) versus number of devices (x-axis, logarithmic, with labels around 10^2 and 10^5). Three colored series (red, purple, orange; likely different problem sizes such as 512, 1024, and 2048) are each plotted with a solid line labeled “Parallel Reference” and a dashed line labeled “Energy Ratio.” For all three series, the solid “Parallel Reference” values decrease as device count increases (from about 1.0 down to roughly 0.7–0.75), while the dashed “Energy Ratio” values increase slightly above 1.0 with more devices (up to around 1.1–1.2).
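The energy-ratio trend in these charts can be reproduced from the quoted bar-chart values. As a sketch (the precise definition used in the study may differ; here the ratio is energy-to-solution at scale divided by the baseline run's energy, so values above 1.0 flag the rising energy cost of underutilization):

```python
# Energy ratio relative to the smallest-device-count run of the same problem
# size. The 1024^3 values are the ones quoted from the NekRS bar chart above.

def energy_ratio(e_base, e):
    return e / e_base

runs_1024 = [(192, 8.1), (288, 7.83), (384, 8.32)]  # (devices, kWh)
e_base = runs_1024[0][1]
for devices, e in runs_1024:
    print(devices, round(energy_ratio(e_base, e), 3))
```

For the 1024^3 case the ratio dips slightly at 288 devices and rises above 1.0 at 384 devices, the point at which communication overhead starts to outweigh the benefit of additional GPUs.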

Conclusion

Porting CFD applications to GPUs, one of the goals of the CEEC project, consistently delivered better energy-to-solution than CPU-based systems. The application case studies within CEEC showed that our codes are well optimized and demonstrate excellent scalability. However, the results also show that underutilization of computational resources can have detrimental effects on both performance and energy consumption. Furthermore, mixed-precision techniques have been shown to provide an effective balance between computational accuracy and energy efficiency, representing a promising direction for sustainable exascale applications. These findings emphasize that optimization should not be limited to runtime reduction alone, but must equally consider the compute/storage precision and energy implications of numerical and architectural choices.

For further reading, check out our related publication: https://dl.acm.org/doi/10.1145/3784828.3785161