Publications & Presentations Archive

This website contains archival information. For updates, see https://github.com/ovis-hpc/ovis-wiki/wiki

Note: Publications prior to Sept 2011 refer to a different and now deprecated architecture for data collection and transport (i.e., they do NOT use LDMS).

2020

Measuring Congestion in High-Performance Datacenter Networks
S. Jha, A. Gentile, J. Brandt, A. Patke, B. Lim, G. Bauer, M. Showerman, L. Kaplan, Z. Kalbarczyk, W. Kramer, and R. Iyer at The 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI). Feb 2020.

2019

Enabling Machine Learning-based HPC Performance Diagnostics in Production Environments – Panel Organizer
SC19, Fri Nov 22, 2019, 8:30 AM.

Proxy or Imposter? A Method and Case Study to Determine the Answer
O. Aaziz, J. Cook, C. Vaughan, and D. Richards. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l Conference on Cluster Computing (CLUSTER), Sep 2019.

Standardized Environment for Monitoring Heterogeneous Architectures
C. Brown, B. Schwaller, N. Gauntt, B. Allan and K. Davis. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l Conference on Cluster Computing (CLUSTER), Sep 2019.

A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
S. Jha, A. Patke, J. Brandt, A. Gentile, M. Showerman, E. Roman, Z. Kalbarczyk, and R. Iyer. 26th Symposium on High Performance Interconnects (HOTI), Aug 2019.

HPAS: An HPC Performance Anomaly Suite for Reproducing Performance Variations
E. Ates, Y. Zhang, B. Aksar, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun. Int’l Conf. on Parallel Processing (ICPP), Aug 2019.

Production Application Performance Data Streaming for System Monitoring
R. Izadpanah, B. Allan, D. Dechev, and J. Brandt. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS). Vol 4 Issue 2, Jun 2019 doi: 10.1145/3319496

Exploring New Monitoring and Analysis Capabilities on Cray’s Software Preview System
J. Brandt, C. Brown, S. Donoho, A. Gentile, J. Greenseid, W. Kramer, P. Langer, A. Rashid, K. Rehm, and M. Showerman. Cray User Group (CUG) 2019, May 2019.

Extracting Actionable System-Application Performance Factors
J. Brandt, A. Gentile, and J. Cook. Minisymposium on Modeling Resource Utilization and Contention in HPC System-Application Interactions – Minisymposium Organizer at the SIAM Conf. on Computational Science and Engineering (CSE 19), Feb-Mar 2019.

Holistic Measurement Driven System Assessment (HMDSA) — poster
Bill Kramer, Greg Bauer, Brett Bode, Mike Showerman, Jeremy Enos, Aaron Saxton, Saurabh Jha, Zbigniew Kalbarczyk, and Ravishankar Iyer (NCSA/UIUC) and James Brandt and Ann Gentile (SNL). 2019 Exascale Computing Project Annual Meeting, Jan 2019, and HMDSA Project Website

2018

Platform Independent Run Time HPC Monitoring, Analysis, and Feedback at Any-Scale — Featured Presentation at DOE Booth
J. Brandt. SC18, Nov 2018.

Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights — BoF Session Organizer. SC18, Nov 2018.

An Efficient Latch-free Database Index Based on Multi-dimensional Lists
K. Lamar, R. Izadpanah, J. Brandt, and D. Dechev. 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC). Nov 2018.

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. Leung, M. Egele, and A. Coskun. IEEE Transactions on Parallel and Distributed Systems (Sep 2018) doi: 10.1109/TPDS.2018.2870403

A Methodology for Characterizing the Correspondence Between Real and Proxy Applications
O. Aaziz, J.M. Cook, J. Cook, T. Juedeman, D. Richards, and C. Vaughan. IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Large-Scale System Monitoring Experiences and Recommendations — Invited Peer-Reviewed Submission
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, M. Gienger, J. Greenseid, A. Greiner, B. Hadri, Y. (Helen) He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray). Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Characterizing Supercomputer Traffic Networks Through Link-Level Analysis
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, and R. Iyer. Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Modeling Expected Application Runtime for Characterizing and Assessing Job Performance
O. Aaziz, J. Cook, and M. Tanash. Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sep 2018.

Taxonomist: Application Detection through Rich Monitoring Data — Best Artifact Award
E. Ates, O. Tuncer, A. Turk, V. J. Leung, J. Brandt, M. Egele and A. K. Coskun. 24th Int’l European Conference on Parallel and Distributed Computing (Euro-Par), Turin, Italy, Aug 2018. Artifact

Integrating Low-latency Analysis into HPC System Monitoring
R. Izadpanah, N. Naksinehaboon, J. Brandt, A. Gentile, and D. Dechev. 47th Int’l Conference on Parallel Processing (ICPP), Eugene, OR, Aug 2018.

Cray System Monitoring: Successes, Requirements, Priorities
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, J. Greenseid, A. Greiner, B. Hadri, Y. He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams. (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray). Cray Users Group (CUG), Stockholm, Sweden. May 2018.

Supporting Failure Analysis with Discoverable, Annotated Log Datasets
S. Leak, A. Greiner, A. Gentile, and J. Brandt. Cray Users Group (CUG), Stockholm, Sweden. May 2018.

Automated Analysis and Effective Feedback — BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile. Cray Users Group (CUG), May 2018.

Runtime HPC System and Application Performance Assessment and Diagnostics
J. Brandt, A. Gentile, Jon Cook, B. Allan, Jeanine Cook, O. Aaziz, T. Tucker, N. Naksinehaboon, N. Taerat, E. Ates, O. Tuncer, M. Egele, A. Turk, and A. Coskun. Conference on Data Analysis (CODA), Santa Fe, NM, March 2018.

Continuous Performance Tracking for Kokkos using LDMS
J. Brandt, S. Hammond, T. Tucker, A. Gentile, and J. Cook. Programming Models and CoDesign Meeting, Albuquerque, NM. Feb 2018.

2017

Systems Monitoring Data in Action — BoF Session Organizer. SC17, 12:15pm-1:15 pm Thurs Nov 16 2017

Holistic Measurement Driven System Assessment
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, G. Bauer, J. Enos, M. Showerman, L. Kaplan, B. Bode, A. Greiner, A. Bonnie, M. Mason, R. Iyer, and W. Kramer. Workshop on Monitoring and Analysis of High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Sept 2017.

Diagnosing Performance Variations in HPC Applications Using Machine Learning – Gauss Award Winner
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun. ISC High Performance 2017 (ISC), Jun 2017.

Discovering Metrics of Network Contention
JOWOG-34, Los Alamos, NM, June 2017.

LDMS Version 3 Tutorial and Demo Material
J. Brandt, T. Tucker, A. Gentile, N. Naksinehaboon, and N. Taerat. Sandia National Laboratories, SAND2017-5153 O, May 2017.

Understanding Fault Scenarios and Impacts Through Fault Injection Experiments in Cielo
V. Formicola, S. Jha, F. Deng, D. Chen (UIUC), A. Bonnie, M. Mason (LANL), J. Brandt, A. Gentile (SNL), L. Kaplan, J. Repik (Cray), J. Enos, M. Showerman (NCSA), A. Greiner (NERSC), Z. Kalbarczyk, R. Iyer, and W. Kramer (UIUC). Cray Users Group (CUG), May 2017.

Runtime Collection and Analysis of System Metrics for Production Monitoring of Trinity Phase II (and slides)
A. DeConinck, H. Nam, D. Morton, A. Bonnie, C. Lueninghoener (LANL), J. Brandt, A. Gentile, K. Pedretti, A. Agelastos, C. Vaughan, S. Hammond, B. Allan (SNL), M. Davis and J. Repik (Cray). Cray Users Group (CUG), May 2017.

Holistic Systems Monitoring and Analysis — BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile. Cray Users Group (CUG), May 2017.

Contention and Congestion: Challenges and Approaches to Understanding Application Impact
A. Gentile, J. Brandt, A. Agelastos, J. Lamb, K. Ruggirello, and J. Stevenson. Minisymposium on Understanding Performance Variability due to Application-Data Center Interaction – Minisymposium Organizer at the SIAM Conf. on Computational Science and Engineering (CSE 17), Feb 2017.

2016

Data Analytics Support for HPC System Management – Panelist
SC16, Fri 18th Nov 2016 10:30-noon.

Monitoring Large Scale HPC Systems: Understanding, Diagnosis and Attribution of Performance Variation and Issues — BoF Session Organizer. SC16, 5:15pm-7pm Wed Nov 16 2016

Discovery, Interpretation, and Communication of Meaningful Information in HPC Monitoring Data
Univ. of Central Florida, Oct 2016.

Holistic Measurement Driven Resilience
Chaos Community Day Seattle, WA. Aug. 2016

Continuous Whole-System Monitoring Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson. Parallel Computing (2016), Elsevier B. V., http://dx.doi.org/10.1016/j.parco.2016.05.009

Large-Scale Persistent Numerical Data Source Monitoring System Experiences
J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS) Chicago, IL. May 2016.

Design and Implementation of a Scalable HPC Monitoring System
S. Sanchez, A. Bonnie, G. Van Heule, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS) Chicago, IL. May 2016.

Network Performance Counter Monitoring and Analysis on the Cray XC Platform
J. Brandt, E. Froese, A. Gentile, L. Kaplan, B. Allan, and E. Walsh. Cray Users Group (CUG), May 2016.

Dynamic Model Specific Register (MSR) Data Collection as a System Service
G. H. Bauer, J. Brandt, A. Gentile, A. Kot, and M. Showerman. Cray Users Group (CUG), May 2016.

Design and Implementation of a Scalable HPC Monitoring System for Trinity
A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, and M. Mason (LANL), J. Brandt, A. Gentile, B. Allan, and A. Agelastos (SNL), M. Davis and M. Berry (Cray). Cray Users Group (CUG), May 2016.

Addressing the Challenges of "Systems Monitoring" Data Flows — BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile. Cray Users Group (CUG), May 2016.

Smart HPC Centers: Data, Analysis, Feedback, and Response
J. Brandt, A. Gentile, C. Martin, B. Allan, and K. Devine. Minisymposium on Improving Performance, Throughput, and Efficiency of HPC Centers through Full System Data Analytics – Minisymposium Organizer at the SIAM Conf. on Parallel Processing for Scientific Computing (PP 16), Paris, France. Apr 2016.

Monitoring High Speed Network Fabrics: Experiences and Needs
J. Brandt, A. Gentile, B. Allan, S. Lefantzi, and M. Aguilar. Open Fabrics Alliance Workshop, Monterey, CA. Apr 2016.

Monitoring Large Scale HPC Platforms: Issues, Approaches, and Experiences
Univ. of Central Florida, Jan 2016.

2015

HPC Monitoring, Understanding, and Performance: Where Less is Less — Featured Presentation at DOE Booth
J. Brandt. IEEE/ACM Int’l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC15), Austin, TX. Nov 2015.

LDMS Demo at DOE Booth SC15 Nov 2015.

LDMS receives 2015 R&D100 award (Sandia ceremony)

Monitoring Large-Scale HPC Systems: Data Analytics and Insights – BOF Session Organizer at IEEE/ACM Int’l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC15), Austin, TX. Nov 2015.

Infrastructure for In Situ System Monitoring and Application Data Analysis
J. Brandt, K. Devine, and A. Gentile. In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization (ISAV 2015) at IEEE/ACM Int’l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC15), Austin, TX. Nov 2015.

New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.

Extending LDMS to Enable Performance Monitoring in Multi-Core Applications
S. Feldman, D. Zhang, D. Dechev, and J. Brandt. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Chicago, IL. Sept 2015.

Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson. IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Chicago, IL. Sept 2015.

Lightweight Distributed Metric Service Overview
JOWOG-34. LLNL, Livermore, CA. July 2015.

Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity – Best Paper Finalist
J. Brandt, D. DeBonis, A. Gentile, J. Lujan, C. Martin, D. Martinez, S. Olivier, K. Pedretti, N. Taerat, and R. Velarde. Cray User’s Group (CUG), Chicago, IL. April 2015.

Scalable Integrated High-Fidelity Continuous Monitoring
System Monitoring of Cray Systems BoF at Cray User’s Group (CUG), Chicago, IL. April 2015.

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping — Invited Minisymposium Presentation
J. Brandt, K. Devine, A. Gentile, and K. Pedretti. Minisymposium on Topology Mapping and Locality at the SIAM Conf. on Computational Science and Engineering (CSE 15), Salt Lake City, UT. Mar 2015.

2014

Extreme-scale HPC Monitoring
In Sandia National Laboratories HPC Annual Report 2014.

Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications. A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker. IEEE/ACM Int’l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC14), New Orleans, LA. Nov 2014.

Monitoring Large-Scale HPC Systems: Issues and Approaches – BOF Session Organizer
IEEE/ACM Int’l. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC14), New Orleans, LA. Nov 2014.

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping. J. Brandt, K. Devine, A. Gentile, and K. Pedretti. 1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Madrid, Spain. Sept 2014.

Monitoring Application Resource Utilization on the Intel PHI Coprocessor – Minitalk. J. Brandt and A. Gentile. 1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Madrid, Spain. Sept 2014.

Memory Reliability and Performance Degradation – Minitalk (Extended Abstract). Benjamin Allan. 1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Madrid, Spain. Sept 2014.

Lightweight Distributed Metric Service (LDMS): Fast, scalable run-time performance monitoring of HPC systems
JOWOG-34. AWE, Aldermaston, UK, Jun 2014

Large Scale System Monitoring and Analysis on Blue Waters Using OVIS — Best Paper Finalist
M. Showerman, J. Enos, J. Fullop (NCSA), P. Cassella (Cray), N. Naksinehaboon, N. Taerat, T. Tucker (OGC), J. Brandt, A. Gentile, and B. Allan (SNL). Cray User’s Group (CUG), Lugano, Switzerland. May 2014.

Large Scale HPC Monitoring
New Mexico State University, Las Cruces, NM. April 2014.

2013

Lightweight Data Metric Service (LDMS): Run-time Resource Utilization Monitoring
JOWOG-34. SNL, Albuquerque, NM, Aug 2013

High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6
J. Brandt, T. Tucker, A. Gentile, D. Thompson, V. Kuhns, and J. Repik. Cray User’s Group (CUG), Napa Valley, CA. May 2013.

2012

Filtering Log Data: Finding Needles in the Haystack. L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile. 42nd Annual IEEE/IFIP Int’l. Conf. on Dependable Systems and Networks (DSN). Boston, MA June 2012.

Report of Experiments and Evidence for ASC L2 Milestone 4467 – Demonstration of a Legacy Application’s Path to Exascale. B. Barrett, R. Barrett, J. Brandt, R. Brightwell, M. Curry, N. Fabian, K. Ferreira, A. Gentile, S. Hemmert, S. Kelly, R. Klundt, J. Laros, V. Leung, M. Levenhagen, G. Lofstead, K. Moreland, R. Oldfield, K. Pedretti, A. Rodrigues, D. Thompson, T. Tucker, L. Ward, J. Van Dyke, C. Vaughan, and K. Wheeler. SAND2012-1750. Sandia National Laboratories. March 2012.

2011

OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis. SC|11 Seattle, WA, November 2011.

  • Exhibit ASC Booth 803 — Demos & talk
  • OVIS at Petascale Systems Management BOF — Invited Panelist

Develop Feedback System for Intelligent Dynamic Resource Allocation to Improve Application Performance. J. Brandt, A. Gentile, D. Thompson and T. Tucker. SAND2011-6301. Sandia National Laboratories. September 2011.

Framework for Enabling System Understanding
J. Brandt, F. Chen, A. Gentile, C. Leangsuksun, J. Mayo, P. Pebay, D. Roe, N. Taerat, D. Thompson, and M. Wong. 4th Workshop on Resiliency (Resilience) in High Performance Computing at Euro-Par 2011, Bordeaux, France. August 2011.

Baler: Deterministic, lossless log message clustering tool. N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun. In: Computer Science – Research and Development. Volume 26, Numbers 3-4, 285-295, DOI: 10.1007/s00450-011-0155-3. Int’l. Supercomputing Conference (ISC). Hamburg, Germany. June 2011.

2010

OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis. SC|10 New Orleans, LA, November 2010.

  • Exhibit ASC Booth Demos
  • Exhibit ASC Booth talk: OVIS 3: Scalable Data Collection and Analysis for Large Scale HPC System Understanding

Scalable HPC Monitoring and Analysis for Understanding and Automated Response — Invited Presentation. HPC Resilience Summit 2010: Workshop on Resilience for Exascale HPC at the Los Alamos Computer Science Symposium, Santa Fe, NM. October 2010.

OVIS 3.2 User’s Guide (NB: Deprecated). J. Brandt, A. Gentile, C. Houf, J. Mayo, P. Pebay, D. Roe, D. Thompson, and M. Wong. SAND 2010-7109, Sandia National Laboratories, October 2010.

Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis. New Mexico State University, NM. October 2010.

Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis — Invited Presentation. European Grid Initiative (EGI) Technical Forum 2010 Amsterdam, Netherlands. September 2010.

Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases. P. Pébay, D. Thompson, and J. Bennett. IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Heraklion, Greece. September 2010.

A Framework for Graph-Based Synthesis, Analysis, and Visualization of HPC Cluster Job Data. J. Brandt, V. De Sapio, A. Gentile, P. Kegelmeyer, J. Mayo, P. Pebay, D. Roe, D. Thompson, and M. Wong. SAND2010-2400, Sandia National Laboratories, August 2010.

The OVIS analysis architecture (NB: Deprecated). J. M. Brandt, V. De Sapio, A. C. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. H. Wong. Sandia Report SAND2010-5107, Sandia National Laboratories, July 2010.

The Python command line interface to the OVIS analysis functionality (NB: Deprecated). J. M. Brandt, A. C. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. H. Wong. Sandia Report SAND2010-4289, Sandia National Laboratories, June 2010.

Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. 1st Int’l Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS) at the 40th Annual IEEE/IFIP Int’l. Conf. on Dependable Systems and Networks (DSN) Chicago, IL. June 2010.

Scalable Modeling and Analysis for Resilience. J. Brandt, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. JOWOG-34. LLNL, Livermore CA. May 2010.

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids at the 10th IEEE Int’l. Symposium on Cluster, Cloud, and Grid Computing (CCGRID) Melbourne, Australia. May 2010.

Combining Virtualization, Resource Characterization, and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. 6th Workshop on System Management Techniques, Processes, and Services (SMTPS) – Special Focus on Cloud Computing at the 24th IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS) Atlanta, GA. April 2010.

Scalable Information Fusion for Fault Tolerance in Large-Scale HPC – Invited Minisymposium Presentation. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. Minisymposium on Vertically Integrated Fault Tolerance for Large-Scale Scientific Computing at the SIAM Conf. on Parallel Processing for Scientific Computing (PP10), Seattle, WA. Feb 2010.

2009

OVIS in HPC: Information Fusion for Resilience. Louisiana Tech University, Ruston, LA. Host: Box Leangsuksun. December 2009.

OVIS and XCR Researchers at LaTech

Failure Prediction and Resilience in Large-Scale HPC Platforms. SC|09 Portland, OR, November 2009. Exhibit Presentation and Demo.

Advanced ParaView Visualization. K. Moreland, J. Ahrens, D. DeMarle, D. Thompson, P. Pébay and N. Fabian. Peer-reviewed tutorial on the use of statistics engines at the IEEE VisWeek 2009, Atlantic City, NJ. October 2009.

Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box — Invited Presentation. J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. Workshop on Resiliency for Petascale HPC at the Los Alamos Computer Science Symposium (LACSS 2009), Santa Fe, NM. October 2009.

Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study. J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. Workshop on Resiliency in High Performance Computing (Resilience) at the 18th ACM Int’l. Symposium on High Performance Distributed Computing (HPDC) Munich, Germany. June 2009.

Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing — Best Paper Award. J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. 5th Workshop on System Management Techniques, Processes, and Services (SMTPS) – Special Focus on Cloud Computing at the 23rd IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS) Rome, Italy. May 2009.

OVIS 2.0 User’s Guide (Deprecated). J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. SAND 2009-2329, Sandia National Laboratories, April 2009.

OVIS: Scalable Real-time Analysis of Very Large Datasets. Overview viewgraph. 2009.

2008

OVIS2: Whole System Monitoring and Analysis – Toward Understanding and Prediction. J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong. SC|08 Austin, TX. November 2008. Exhibit Presentation and Demo.

Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing — Invited Presentation
H. Adalsteinsson, J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pebay, D. Thompson, and M. Wong. Workshop on Resiliency for Petascale HPC at the Los Alamos Computer Science Symposium (LACSS 2008), Santa Fe, NM. October 2008.

OVIS: Scalable, Real-time Statistical Analysis of Very Large Datasets. J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong. 2008 Sandia Workshop on Data Mining and Data Analysis. Extended abstract, SAND Report 2008-6109, Sandia National Laboratories, September 2008.

Using Probabilistic Characterization to Reduce Runtime Faults on HPC Systems. J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong. Workshop on Resiliency in High-Performance Computing (Resilience) at the 8th IEEE Symposium on Cluster Computing and the Grid (CCGRID) Lyon, France, May 2008.

OVIS-2: A Robust Distributed Architecture for Scalable RAS. J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong. 4th Workshop on System Management Techniques, Processes, and Services (SMTPS) at the 22nd IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS) Miami, FL, April 2008.

2007

OVIS-2: A Distributed Framework for Scalable Monitoring and Analysis of Large Computational Clusters
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong. SC|07 Reno, NV, November 2007. Exhibit Presentation and Demo.

2006

Monitoring Computational Clusters with OVIS. J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong. SAND Report 2006-7939, Sandia National Laboratories, December 2006.

OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters. J. M. Brandt, A. C. Gentile, J. Ortega, P. P. Pébay, D. C. Thompson, and M. H. Wong. SC|06 Tampa, FL, November 2006. Exhibit Presentation and Demo.

OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters. J. M. Brandt, A. C. Gentile, D. J. Hale, and P. P. Pébay. The 2nd Workshop on System Monitoring Tools for Large-Scale Parallel Systems (SMTPS) at the 20th IEEE Int’l. Parallel and Distributed Processing Symposium (IPDPS) Rhodes, Greece, April 2006.

Distributed, Intelligent RAS System for Large Computational Clusters: Factsheet. J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong. Fact sheet, Sandia National Laboratories, April 2006.

2005

Bayesian Inference for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, D. J. Hale, Y. M. Marzouk, and P. P. Pébay. SC|05 Seattle, Washington, November 2005.

  • Exhibit Presentation, Demo, and Flyer
  • Conference Poster

Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay at IEEE Int’l. Conf. on Cluster Computing (CLUSTER) Boston MA, September 2005.

Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay. SAND Report 2005-4558, Sandia National Laboratories, July 2005.

2004

Detection of System Abnormalities Through Behavioral Analysis of ASC Codes
J. M. Brandt and A. C. Gentile. SC|04 Exhibit, Pittsburgh, PA, November 2004. Exhibit Demo.

2003

Distributed Intelligent RAS System for Large Computational Clusters
J. M. Brandt, N. M. Berry, R. A. Yao, B. M. Tsudama, and A. C. Gentile. SC|03, Phoenix, AZ November 2003.

  • Exhibit Demo
  • Conference Poster