For the bibliography, see https://github.com/ovis-hpc/ovis-publications/wiki.
Downloads
Currently, only LDMS is released. Source code and documentation are hosted at github.com/ovis-hpc.
OVIS Research Data
Public data sets from OVIS-related tools (LDMS, syslog, baler) are indexed here.
The data files are not available on this server due to the data volume.
- Name: skybridge-2019-1
- Format: compressed CSV and other files
- URL: TBD
- Report: Two weeks in the life of skybridge SAND2019-4915.pdf
- Notes: 42 GB; until a URL is posted here, contact the author to arrange a transfer
Quick Docs
For an overview of the LDMS architecture and data, see:
- Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
- A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker. IEEE/ACM Int’l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC14), New Orleans, LA. Nov 2014.
- LDMS Version 3 Tutorial and Demo Material
- J. Brandt, T. Tucker, A. Gentile, N. Naksinehaboon, and N. Taerat. Sandia National Laboratories, SAND2017-5153 O, May 2017.
- github.com/ovis-hpc – Release with wiki and man pages for plugins and commands.
For an overview of Baler and its application, see:
- Baler: Deterministic, lossless log message clustering tool
- N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun. In: Computer Science – Research and Development, Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3. Int’l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.
- New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
- J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.
Dataset Releases
The ASCR-funded exascale resilience project "Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection and Impact" releases system datasets in support of resilience research.
Cielo Fault Injection Dataset 2016
S. Jha, V. Formicola, A. Bonnie, M. Mason, D. Chen, F. Deng, A. Gentile, J. Brandt, L. Kaplan, J. Repik, J. Enos, M. Showerman, A. Greiner, Z. Kalbarczyk, R. Iyer, and W. Kramer. LA-UR-19-22749, SAND2019-3531 O, Mar 2019.
Mutrino Dataset 2/15-6/16 (12/16 Release) (About)
J. Brandt, A. Gentile, and J. Repik. SAND2016-12310 O, Dec 2016.
[Online]: http://portal.nersc.gov/project/m888/resilience/datasets/mutrino/mutrino1yr-v122016.tgz
Mutrino Dataset 2/15-5/15 (About)
J. Brandt, A. Gentile, and J. Repik. SAND2016-2449 O, Mar 2016.
[Online]: http://portal.nersc.gov/project/m888/resilience/datasets/mutrino/logs.051715.cr.tgz