Publications and presentations



Quick Docs

For an overview of the LDMS architecture and data, see:

Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker
IEEE/ACM Int'l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC14) New Orleans, LA. Nov 2014.

LDMS Version 3 Tutorial and Demo Material
J. Brandt, T. Tucker, A. Gentile, N. Naksinehaboon, and N. Taerat
Sandia National Laboratories, SAND2017-5153 O, May 2017.

github.com/ovis-hpc - Release with wiki and man pages for plugins and commands.


For an overview of Baler and its application, see:

Baler: Deterministic, lossless log message clustering tool
N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun
In: Computer Science - Research and Development
Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3
Int'l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.

New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.

Publications and Presentations

NOTE: Publications prior to Sept 2011 refer to a different, now-deprecated architecture for data collection and transport (i.e., they do NOT use LDMS).

2019

Extracting Actionable System-Application Performance Factors
J. Brandt, A. Gentile, and J. Cook
Minisymposium on Modeling Resource Utilization and Contention in HPC System-Application Interactions -- Minisymposium Organizer
at the SIAM Conf. on Computational Science and Engineering (CSE 19), Feb-Mar 2019.

2018

Platform Independent Run Time HPC Monitoring, Analysis, and Feedback at Any-Scale -- Featured Presentation at DOE Booth
J. Brandt
SC18, Nov 2018.

Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights -- BoF Session Organizer
SC18, Nov 2018.

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. Leung, M. Egele, and A. Coskun
IEEE Transactions on Parallel and Distributed Systems (Sep 2018) doi: 10.1109/TPDS.2018.2870403

A Methodology for Characterizing the Correspondence Between Real and Proxy Applications
O. Aaziz, J.M. Cook, J. Cook, T. Juedeman, D. Richards, and C. Vaughan
IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Large-Scale System Monitoring Experiences and Recommendations -- Invited Peer-Reviewed Submission
V. Ahlgren et al. (29 authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Characterizing Supercomputer Traffic Networks Through Link-Level Analysis
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, and R. Iyer
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Modeling Expected Application Runtime for Characterizing and Assessing Job Performance
O. Aaziz, J. Cook, and M. Tanash
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Taxonomist: Application Detection through Rich Monitoring Data -- Best Artifact Award
E. Ates, O. Tuncer, A. Turk, V. J. Leung, J. Brandt, M. Egele and A. K. Coskun
24th Int'l European Conference on Parallel and Distributed Computing (Euro-Par), Turin, Italy, Aug 2018.
Artifact

Integrating Low-latency Analysis into HPC System Monitoring
R. Izadpanah, N. Naksinehaboon, J. Brandt, A. Gentile, and D. Dechev
47th Int'l Conference on Parallel Processing (ICPP), Eugene, OR, Aug 2018.

Cray System Monitoring: Successes, Requirements, Priorities
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, J. Greenseid, A. Greiner, B. Hadri, Y. He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams. (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
Cray Users Group (CUG), Stockholm, Sweden. May 2018.

Supporting Failure Analysis with Discoverable, Annotated Log Datasets
S. Leak, A. Greiner, A. Gentile, and J. Brandt
Cray Users Group (CUG), Stockholm, Sweden. May 2018.

Automated Analysis and Effective Feedback -- BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile
Cray Users Group (CUG), May 2018.

Runtime HPC System and Application Performance Assessment and Diagnostics
J. Brandt, A. Gentile, Jon Cook, B. Allan, Jeanine Cook, O. Aaziz, T. Tucker, N. Naksinehaboon, N. Taerat, E. Ates, O. Tuncer, M. Egele, A. Turk, and A. Coskun
Conference on Data Analysis (CODA), Santa Fe, NM, March 2018.

Continuous Performance Tracking for Kokkos using LDMS
J. Brandt, S. Hammond, T. Tucker, A. Gentile, and J. Cook
Programming Models and CoDesign Meeting, Albuquerque, NM. Feb 2018.

2017

Systems Monitoring Data in Action -- BoF Session Organizer
SC17, 12:15pm-1:15pm Thurs Nov 16 2017

Holistic Measurement Driven System Assessment
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, G. Bauer, J. Enos, M. Showerman, L. Kaplan, B. Bode, A. Greiner, A. Bonnie, M. Mason, R. Iyer, and W. Kramer
Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Sept 2017.

Diagnosing Performance Variations in HPC Applications Using Machine Learning -- Gauss Award Winner
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun
ISC High Performance 2017 (ISC), Jun 2017.

Discovering Metrics of Network Contention
JOWOG-34, Los Alamos, NM, June 2017.

LDMS Version 3 Tutorial and Demo Material
J. Brandt, T. Tucker, A. Gentile, N. Naksinehaboon, and N. Taerat
Sandia National Laboratories, SAND2017-5153 O, May 2017.

Understanding Fault Scenarios and Impacts Through Fault Injection Experiments in Cielo
V. Formicola, S. Jha, F. Deng, D. Chen (UIUC), A. Bonnie, M. Mason (LANL), J. Brandt, A. Gentile (SNL), L. Kaplan, J. Repik (Cray), J. Enos, M. Showerman (NCSA), A. Greiner (NERSC), Z. Kalbarczyk, R. Iyer, and W. Kramer (UIUC)
Cray Users Group (CUG), May 2017.

Runtime Collection and Analysis of System Metrics for Production Monitoring of Trinity Phase II
A. DeConinck, H. Nam, D. Morton, A. Bonnie, C. Lueninghoener (LANL), J. Brandt, A. Gentile, K. Pedretti, A. Agelastos, C. Vaughan, S. Hammond, B. Allan (SNL), M. Davis and J. Repik (Cray)
Cray Users Group (CUG), May 2017.

Holistic Systems Monitoring and Analysis -- BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile
Cray Users Group (CUG), May 2017.

Contention and Congestion: Challenges and Approaches to Understanding Application Impact
A. Gentile, J. Brandt, A. Agelastos, J. Lamb, K. Ruggirello, and J. Stevenson
Minisymposium on Understanding Performance Variability due to Application-Data Center Interaction -- Minisymposium Organizer
at the SIAM Conf. on Computational Science and Engineering (CSE 17), Feb 2017.

2016

Data Analytics Support for HPC System Management -- Panelist
SC16, Fri 18th Nov 2016 10:30-noon.

Monitoring Large Scale HPC Systems: Understanding, Diagnosis and Attribution of Performance Variation and Issues -- BoF Session Organizer
SC16, 5:15pm-7pm Wed Nov 16 2016

Discovery, Interpretation, and Communication of Meaningful Information in HPC Monitoring Data
Univ. of Central Florida, Oct 2016.

Holistic Measurement Driven Resilience
Chaos Community Day Seattle, WA. Aug. 2016

Continuous Whole-System Monitoring Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
Parallel Computing (2016), Elsevier B. V., http://dx.doi.org/10.1016/j.parco.2016.05.009

Large-Scale Persistent Numerical Data Source Monitoring System Experiences
J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Chicago, IL. May 2016.

Design and Implementation of a Scalable HPC Monitoring System
S. Sanchez, A. Bonnie, G. Van Heule, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Chicago, IL. May 2016.

Network Performance Counter Monitoring and Analysis on the Cray XC Platform
J. Brandt, E. Froese, A. Gentile, L. Kaplan, B. Allan, and E. Walsh
Cray Users Group (CUG), May 2016.

Dynamic Model Specific Register (MSR) Data Collection as a System Service
G. H. Bauer, J. Brandt, A. Gentile, A. Kot, and M. Showerman
Cray Users Group (CUG), May 2016.

Design and Implementation of a Scalable HPC Monitoring System for Trinity
A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, and M. Mason (LANL), J. Brandt, A. Gentile, B. Allan, and A. Agelastos (SNL), M. Davis and M. Berry (Cray)
Cray Users Group (CUG), May 2016.

Addressing the Challenges of "Systems Monitoring" Data Flows -- BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile
Cray Users Group (CUG), May 2016.

Smart HPC Centers: Data, Analysis, Feedback, and Response
J. Brandt, A. Gentile, C. Martin, B. Allan, and K. Devine
Minisymposium on Improving Performance, Throughput, and Efficiency of HPC Centers through Full System Data Analytics -- Minisymposium Organizer
at the SIAM Conf. on Parallel Processing for Scientific Computing (PP 16), Paris, France. Apr 2016.

Monitoring High Speed Network Fabrics: Experiences and Needs
J. Brandt, A. Gentile, B. Allan, S. Lefantzi, and M. Aguilar
at Open Fabrics Alliance Workshop, Monterey, CA. Apr 2016

Monitoring Large Scale HPC Platforms: Issues, Approaches, and Experiences
Univ. of Central Florida, Jan 2016.

2015

LDMS receives 2015 R&D100 award (Sandia ceremony)

HPC Monitoring, Understanding, and Performance: Where Less is Less -- Featured Presentation at DOE Booth
J. Brandt
at IEEE/ACM Int'l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC15) Austin, TX. Nov 2015.

LDMS Demo at DOE Booth SC15 Nov 2015.

Monitoring Large-Scale HPC Systems: Data Analytics and Insights - BOF Session Organizer
at IEEE/ACM Int'l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC15) Austin, TX. Nov 2015.

Infrastructure for In Situ System Monitoring and Application Data Analysis
J. Brandt, K. Devine, and A. Gentile
In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization (ISAV 2015)
at IEEE/ACM Int'l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC15), Austin, TX. Nov 2015.

New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.

Extending LDMS to Enable Performance Monitoring in Multi-Core Applications
S. Feldman, D. Zhang, D. Dechev, and J. Brandt
Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.

Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
IEEE Int'l. Conf. on Cluster Computing (CLUSTER), Chicago, IL. Sept 2015.

Lightweight Distributed Metric Service Overview
JOWOG-34
LLNL, Livermore, CA. July 2015.

Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity -- Best Paper Finalist
J. Brandt, D. DeBonis, A. Gentile, J. Lujan, C. Martin, D. Martinez, S. Olivier, K. Pedretti, N. Taerat, and R. Velarde
Cray User's Group (CUG), Chicago, IL. April 2015.

Scalable Integrated High-Fidelity Continuous Monitoring
at System Monitoring of Cray Systems BoF
at Cray User's Group (CUG), Chicago, IL. April 2015.

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping -- Invited Minisymposium Presentation
J. Brandt, K. Devine, A. Gentile, and K. Pedretti
Minisymposium on Topology Mapping and Locality
at the SIAM Conf. on Computational Science and Engineering (CSE 15), Salt Lake City, UT. Mar 2015.

2014

Extreme-scale HPC Monitoring
In Sandia National Laboratories HPC Annual Report 2014.

Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker
IEEE/ACM Int'l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC14) New Orleans, LA. Nov 2014.

Monitoring Large-Scale HPC Systems: Issues and Approaches - BOF Session Organizer
IEEE/ACM Int'l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC14) New Orleans, LA. Nov 2014.

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping
J. Brandt, K. Devine, A. Gentile, and K. Pedretti
1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Madrid, Spain. Sept 2014.

Monitoring Application Resource Utilization on the Intel PHI Coprocessor - Minitalk
J. Brandt and A. Gentile
1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Madrid, Spain. Sept 2014.

Memory Reliability and Performance Degradation - Minitalk (Extended Abstract)
Benjamin Allan
1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA)
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Madrid, Spain. Sept 2014.

Lightweight Distributed Metric Service (LDMS): Fast, scalable run-time performance monitoring of HPC systems
JOWOG-34
AWE, Aldermaston, UK, Jun 2014

Large Scale System Monitoring and Analysis on Blue Waters Using OVIS -- Best Paper Finalist
M. Showerman, J. Enos, J. Fullop (NCSA), P. Cassella (Cray), N. Naksinehaboon, N. Taerat, T. Tucker (OGC), J. Brandt, A. Gentile, and B. Allan (SNL)
Cray User's Group (CUG), Lugano, Switzerland. May 2014.

Large Scale HPC Monitoring
New Mexico State University, Las Cruces, NM. April 2014.

2013

Lightweight Data Metric Service (LDMS): Run-time Resource Utilization Monitoring
JOWOG-34
SNL, Albuquerque, NM, Aug 2013

High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6
J. Brandt, T. Tucker, A. Gentile, D. Thompson, V. Kuhns, and J. Repik
Cray User's Group (CUG), Napa Valley, CA. May 2013.

2012

Filtering Log Data: Finding Needles in the Haystack
L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile
42nd Annual IEEE/IFIP Int'l. Conf. on Dependable Systems and Networks (DSN). Boston, MA June 2012.

Report of Experiments and Evidence for ASC L2 Milestone 4467 - Demonstration of a Legacy Application's Path to Exascale
B. Barrett, R. Barrett, J. Brandt, R. Brightwell, M. Curry, N. Fabian, K. Ferreira, A. Gentile, S. Hemmert, S. Kelly, R. Klundt, J. Laros, V. Leung, M. Levenhagen, G. Lofstead, K. Moreland, R. Oldfield, K. Pedretti, A. Rodrigues, D. Thompson, T. Tucker, L. Ward, J. Van Dyke, C. Vaughan, and K. Wheeler
SAND2012-1750. Sandia National Laboratories. March 2012.

2011

OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|11 Seattle, WA, November 2011.
- Exhibit ASC Booth 803 -- Demos & talk
- OVIS at Petascale Systems Management BOF -- Invited Panelist

Develop Feedback System for Intelligent Dynamic Resource Allocation to Improve Application Performance
J. Brandt, A. Gentile, D. Thompson and T. Tucker
SAND2011-6301. Sandia National Laboratories. September 2011.

Framework for Enabling System Understanding
J. Brandt, F. Chen, A. Gentile, C. Leangsuksun, J. Mayo, P. Pebay, D. Roe, N. Taerat, D. Thompson, and M. Wong
4th Workshop on Resiliency (Resilience) in High Performance Computing
at Euro-Par 2011, Bordeaux, France. August 2011.

Baler: Deterministic, lossless log message clustering tool
N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun
In: Computer Science - Research and Development
Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3
Int'l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.

2010

OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|10 New Orleans, LA, November 2010.
- Exhibit ASC Booth Demos
- Exhibit ASC Booth talk: OVIS 3: Scalable Data Collection and Analysis for Large Scale HPC System Understanding

Scalable HPC Monitoring and Analysis for Understanding and Automated Response -- Invited Presentation
HPC Resilience Summit 2010: Workshop on Resilience for Exascale HPC
at the Los Alamos Computer Science Symposium, Santa Fe, NM. October 2010.

OVIS 3.2 User's Guide (NB: Deprecated)
J. Brandt, A. Gentile, C. Houf, J. Mayo, P. Pebay, D. Roe, D. Thompson, and M. Wong
SAND 2010-7109, Sandia National Laboratories, October 2010.

Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis
New Mexico State University, NM. October 2010.

Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis -- Invited Presentation
European Grid Initiative (EGI) Technical Forum 2010 Amsterdam, Netherlands. September 2010.

Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases
P. Pébay, D. Thompson, and J. Bennett
IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Heraklion, Greece. September 2010.

A Framework for Graph-Based Synthesis, Analysis, and Visualization of HPC Cluster Job Data
J. Brandt, V. De Sapio, A. Gentile, P. Kegelmeyer, J. Mayo, P. Pebay, D. Roe, D. Thompson, and M. Wong
SAND2010-2400, Sandia National Laboratories, August 2010.

The OVIS analysis architecture (NB: Deprecated)
J. M. Brandt, V. De Sapio, A. C. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. H. Wong
Sandia Report SAND2010-5107, Sandia National Laboratories, July 2010.

The Python command line interface to the OVIS analysis functionality (NB: Deprecated)
J. M. Brandt, A. C. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. H. Wong
Sandia Report SAND2010-4289, Sandia National Laboratories, June 2010.

Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
1st Int'l Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)
at the 40th Annual IEEE/IFIP Int'l. Conf. on Dependable Systems and Networks (DSN) Chicago, IL. June 2010.

Scalable Modeling and Analysis for Resilience
J. Brandt, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
JOWOG-34
LLNL, Livermore CA. May 2010.

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids
at the 10th IEEE Int'l. Symposium on Cluster, Cloud, and Grid Computing (CCGRID) Melbourne, Australia. May 2010.

Combining Virtualization, Resource Characterization, and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
6th Workshop on System Management Techniques, Processes, and Services (SMTPS) - Special Focus on Cloud Computing
at the 24th IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Atlanta, GA. April 2010.

Scalable Information Fusion for Fault Tolerance in Large-Scale HPC -- Invited Minisymposium Presentation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Minisymposium on Vertically Integrated Fault Tolerance for Large-Scale Scientific Computing
at the SIAM Conf. on Parallel Processing and Scientific Computing (PP10), Seattle, WA. Feb 2010.

2009

OVIS and XCR Researchers at LaTech

OVIS in HPC: Information Fusion for Resilience
Louisiana Tech University, Ruston, LA. Host: Box Leangsuksun. December 2009.

Failure Prediction and Resilience in Large-Scale HPC Platforms
SC|09 Portland, OR, November 2009.
- Exhibit Presentation and Demo

Advanced ParaView Visualization
K. Moreland, J. Ahrens, D. DeMarle, D. Thompson, P. Pébay and N. Fabian
Peer-reviewed tutorial on the use of statistics engines at IEEE VisWeek 2009, Atlantic City, NJ. October 2009.

Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box -- Invited Presentation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Workshop on Resiliency for Petascale HPC
at the Los Alamos Computer Science Symposium (LACSS 2009), Santa Fe, NM. October 2009.

Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Workshop on Resiliency in High Performance Computing (Resilience)
at the 18th ACM Int'l. Symposium on High Performance Distributed Computing (HPDC) Munich, Germany. June 2009.

Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing -- Best Paper Award
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
5th Workshop on System Management Techniques, Processes, and Services (SMTPS) - Special Focus on Cloud Computing
at the 23rd IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Rome, Italy. May 2009.

OVIS 2.0 User's Guide (Deprecated)
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND 2009-2329, Sandia National Laboratories, April 2009

OVIS: Scalable Real-time Analysis of Very Large Datasets
Overview viewgraph. 2009.

2008

OVIS2: Whole System Monitoring and Analysis - Toward Understanding and Prediction
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
SC|08 Austin, TX. November 2008.
- Exhibit Presentation and Demo

Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing -- Invited Presentation
H. Adalsteinsson, J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pebay, D. Thompson, and M. Wong
Workshop on Resiliency for Petascale HPC
at the Los Alamos Computer Science Symposium (LACSS 2008), Santa Fe, NM. October 2008.

OVIS: Scalable, Real-time Statistical Analysis of Very Large Datasets
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
2008 Sandia Workshop on Data Mining and Data Analysis
Extended abstract, SAND Report 2008-6109, Sandia National Laboratories, September 2008.

Using Probabilistic Characterization to Reduce Runtime Faults on HPC Systems
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
Workshop on Resiliency in High-Performance Computing (Resilience)
at the 8th IEEE Symposium on Cluster Computing and the Grid (CCGRID) Lyon, France, May 2008.

OVIS-2: A Robust Distributed Architecture for Scalable RAS
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
4th Workshop on System Management Techniques, Processes, and Services (SMTPS)
at the 22nd IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Miami, FL, April 2008.

2007

OVIS-2: A Distributed Framework for Scalable Monitoring and Analysis of Large Computational Clusters
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
SC|07 Reno, NV, November 2007.
- Exhibit Presentation and Demo

2006

Monitoring Computational Clusters with OVIS
J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong
SAND Report 2006-7939, Sandia National Laboratories, December 2006.

OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, J. Ortega, P. P. Pébay, D. C. Thompson, and M. H. Wong
SC|06 Tampa, FL, November 2006.
- Exhibit Presentation and Demo

OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, D. J. Hale, and P. P. Pébay
The 2nd Workshop on System Monitoring Tools for Large-Scale Parallel Systems (SMTPS)
at the 20th IEEE Int'l. Parallel and Distributed Processing Symposium (IPDPS) Rhodes, Greece, April 2006.

Distributed, Intelligent RAS System for Large Computational Clusters: FactSheet
J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong
Fact sheet, Sandia National Laboratories, April 2006.

2005

Bayesian Inference for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, D. J. Hale, Y. M. Marzouk, and P. P. Pébay
SC|05 Seattle, Washington, November 2005.
- Exhibit Presentation, Demo, and Flier
- Conference Poster

Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay
at IEEE Int'l. Conf. on Cluster Computing (CLUSTER) Boston MA, September 2005.

Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay
SAND Report 2005-4558, Sandia National Laboratories, July 2005.

2004

Detection of System Abnormalities Through Behavioral Analysis of ASC Codes
J. M. Brandt and A. C. Gentile
SC|04 Exhibit, Pittsburgh, PA, November 2004.
- Exhibit Demo

2003

Distributed Intelligent RAS System for Large Computational Clusters
J. M. Brandt, N. M. Berry, R. A. Yao, B. M. Tsudama, and A. C. Gentile
SC|03, Phoenix, AZ November 2003.
- Exhibit Demo
- Conference Poster


Dataset Releases

The ASCR-funded exascale resilience project "Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection and Impact" releases system datasets in support of resilience research.

2016

Mutrino Dataset 2/15-6/16 (12/16 Release) (About)
J. Brandt, A. Gentile, and J. Repik
SAND2016-12310 O, Dec 2016
[Online]: http://portal.nersc.gov/project/m888/resilience/datasets/mutrino/mutrino1yr-v122016.tgz

Mutrino Dataset 2/15-5/15 (About)
J. Brandt, A. Gentile, and J. Repik
SAND2016-2449 O, Mar 2016
[Online]: http://portal.nersc.gov/project/m888/resilience/datasets/mutrino/logs.051715.cr.tgz