- We are working on a lightweight, secure application monitoring approach.
- The users' group holds bi-weekly telecons, and a face-to-face meeting is planned for fall 2019. Join now!
- Telecon notes and call-in information are available on the GitHub wiki.
- Sandia-UIUC collaboration on AI for Supercomputer Diagnostics
- 2017: ISC High Performance (ISC) Gauss Award winner: Diagnosing Performance Variations in HPC Applications Using Machine Learning, which uses LDMS monitoring data as the basis for machine-learning-based performance diagnosis
- LDMS wins 2015 R&D 100 award! LDMS Video
- 2015: ASCR awarded Resilience project Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection and Impact
OVIS/LDMS can be obtained from github.com/ovis-hpc
- LDMS v4 is available at the GitHub site!
- The current distribution includes only the OVIS/LDMS monitoring, transport, and storage components.
Upcoming HPC Monitoring and Analysis Conference Events
- Workshop on Monitoring and Analysis for HPC Systems Plus Applications (HPCMASPA), held in conjunction with IEEE Cluster 2019 in September 2019 in Albuquerque, NM, USA.
- Monitoring Large-Scale HPC Systems -- collaboration and resource site for HPC Monitoring
- Includes materials from SC18 BoF: Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights