NEWS
2023
In mid-November, CEA/DAM teams took part in SuperComputing 23 in Denver (USA), the largest international event dedicated to HPC. The event was an opportunity for CEA to showcase its technical resources, its expertise in HPC, AI, quantum computing and scientific computing, and its upcoming projects. The CEA stand hosted numerous contacts and discussions in a pivotal year: the CCRT (Centre de Calcul Recherche et Technologie) is celebrating 20 years of industrial partnerships, and the Jules Verne project for an Exascale machine, to be installed at the TGCC (Très Grand Centre de Calcul du CEA) from 2025, has been selected by EuroHPC.
CEA faces a significant challenge in monitoring its computing center EXA1. The tens of thousands of servers and other pieces of equipment that make up the center generate thousands of indicators per second. Detecting a malfunction or anomaly in this volume of data is nearly impossible for human operators.
To streamline supervision, the Military Applications Division of CEA has developed a control system capable of monitoring all event flows within the computing center. This system can ingest and analyze a very large volume of monitoring data, including system events, performance indicators, and energy consumption metrics, and it identifies anomalous situations that require special attention. With the help of Artificial Intelligence (AI) algorithms, the system can precisely identify the components implicated in a malfunction, which saves valuable time in the search for the causes and consequences of the disturbance. It can also detect issues that are difficult to spot and even anticipate them.
These two graphs represent the usage of electrical power for computing (top graph) and the electrical power supplied by one of the inverters (bottom graph). Normally, the electrical power levels fluctuate over time depending on the calculations being performed.
In this case, the overall graph did not reveal any noticeable anomaly. Detecting one would have required an operator to inspect the power consumption of each component individually. The AI algorithm, however, highlights a specific abnormal overconsumption on inverter number 2, indicated by the green segment on the second graph. The system promptly alerts operators to this abnormality, drawing attention to a potential issue that would otherwise have gone unnoticed. In this particular instance, the abnormal behavior was attributed to a wiring error.
Therefore, the system not only saves time by automatically detecting deviations from expected patterns but also provides targeted information to operators, indicating the need for further investigation specifically regarding inverter number 2.
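The article does not describe the algorithm itself, so the following Python sketch is only an illustration of the idea, not the ML methodology used by CEA. It assumes a simple statistical stand-in (a robust z-score based on the median and the median absolute deviation) and entirely synthetic data: the names `flag_anomalies`, `compute_kw` and `inverter_kw`, the 25% nominal share and the threshold of 5 are all hypothetical. It shows how an overconsumption that is hidden by the normal fluctuations of the overall power can stand out once each inverter is compared with the computing load it serves.

```python
import math
from statistics import median


def robust_z_scores(values):
    """Score each value by its distance to the median, scaled by the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        scale = 1e-9  # avoid dividing by zero on a perfectly flat signal
    return [(v - med) / scale for v in values]


def flag_anomalies(compute_power, inverter_power, threshold=5.0):
    """Return the sample indices where the inverter's share of the
    computing power deviates strongly from its usual share."""
    shares = [p / c for p, c in zip(inverter_power, compute_power)]
    scores = robust_z_scores(shares)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Synthetic data (assumption): the computing power fluctuates with the workload.
compute_kw = [100 + 20 * math.sin(i / 5) for i in range(60)]


def inverter_kw(i, extra_share=0.0):
    """Power of one inverter: about 25% of the load, plus small jitter."""
    jitter = 0.002 * ((i * 7) % 5 - 2)
    return compute_kw[i] * (0.25 + jitter + extra_share)


inverter_1 = [inverter_kw(i) for i in range(60)]
# Inverter 2 draws an extra 4% of the load during samples 40 to 49.
inverter_2 = [inverter_kw(i, 0.04 if 40 <= i < 50 else 0.0) for i in range(60)]

print(flag_anomalies(compute_kw, inverter_1))  # []
print(flag_anomalies(compute_kw, inverter_2))  # [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
```

In this toy example the extra draw on inverter 2 is only a few kilowatts, much smaller than the swings of the overall power, yet its share of the load clearly departs from its usual value. A production system would presumably rely on learned models and many more signals than this single ratio.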
Given these promising results, the system will now be applied to other aspects of the computing center, such as hardware failure detection, optimization of computing and storage resource utilization, and identification of security vulnerabilities. In the longer term, AI could be used to automatically determine the appropriate management policy for the computing center: high-performance mode, energy-saving mode (useful in the event of an energy crisis), or degraded mode following a failure.
In conclusion, the Military Applications Division of CEA has developed an AI-based system to monitor its supercomputer EXA1, a significant advance in the management of large-scale computing centers. The system detects operational anomalies and saves time in identifying their origins. It paves the way for further applications in critical infrastructure monitoring, particularly in the areas of energy and cybersecurity, and it could also be used to improve energy efficiency and reduce operational costs in computing centers.
Source:
“ML-based methodology for HPC facilities supervision”, Laetitia Anton, Sophie Willemot, Sebastien Gougeaud (CEA, France) and Soraya Zertal (Univ of Versailles, France), ISC/MODA 2023
URL: https://link.springer.com/chapter/10.1007/978-3-031-40843-4_23
The EuroHPC European Joint Undertaking today announced that it has selected the Jules Verne consortium, driven by France, for the second European Exascale supercomputer.
The Jules Verne consortium brings together France, represented by the Grand Equipement National de Calcul Intensif (GENCI) as hosting entity, in collaboration with CEA as hosting site, and the Netherlands, represented by SURF, the Dutch national HPC and e-Science support center.
Read more (CEA)
EuroHPC website
ISC High Performance 23, Hamburg
Forum TERATEC 23, Paris
Teams from CEA/DAM took part in "ISC High Performance 2023" in Hamburg (Germany), as well as in the Teratec forum in Paris, to present their know-how and activities in High Performance Computing, to discuss partnership opportunities around CEA's Very Large Computing Center (TGCC) and the CCRT (Computing Centre for Research and Technology), and to meet the whole scientific and industrial HPC community.
NumPEx: launch of an investment plan for exascale
Co-led by the CEA, the CNRS and Inria, the NumPEx priority research program and equipment (Numerical for the exascale, targeting the future range of "exascale" supercomputers) was launched on February 24, 2023 at CEA's Very Large Computing Center (TGCC) by Sylvie Retailleau, Minister of Higher Education and Research, and Jean-Noël Barrot, Minister Delegate in charge of the Digital Transition and Telecommunications.
Read more (in French)