Difference between revisions of "Main Page"

From OVISWiki
 
(9 intermediate revisions by 3 users not shown)
Line 11: Line 11:
  
 
=== Analysis and Visualization ===
 
 +
OVIS data can be used for understanding system state and resource utilization.
 +
The [http://github.com/ovis-hpc/ovis current release version] of OVIS enables in-transit calculations of functions of metrics at an aggregator before storing or forwarding data to additional consumers. A more flexible analysis and visualization pipeline is in development.
  
 +
OVIS has been used for investigation of network congestion evolution in large-scale systems.
 +
 +
[[Image:BW_Cube_still.png|thumb|300px| Investigation of network congestion evolution on NCSA's Blue Waters Gemini Network (27,648 compute nodes)]]
 +
 +
Additional features in development include association of application phases and performance in conjunction with system state data.
 +
 +
<!--
 
[[Image:Screenshot.png||thumb|500px|OVIS 3.2 screenshot]]
 
  
Line 33: Line 42:
 
pane with information relevant to that job, and dropping a job onto the 3D display highlights
 
 
system values on only those components participating in the job.
 
 +
 +
-->
  
 
=== Log Message Analysis ===
 
 
<!-- OVIS includes prototype capabilities for log message searching. Additionally, OVIS analyses include the [[Baler_public|Baler]] tool for log message clustering.-->
 
OVIS includes prototype capabilities for log message searching. Additionally, OVIS analyses include the Baler tool for log message clustering.
+
OVIS analyses include the Baler tool for log message clustering.
 
   
 
   
 
=== Decision Support ===
 
Line 51: Line 62:
 
* <font color="green" size="+1">[[Baler_public|Baler home page]] </font>  
 
 
<br><br>
 
 
 
 
 
== OVIS in HPC ==
 
 
In the area of high-performance computing, the long-term goal of OVIS is to enable efficient and reliable computational clusters. We envision a system-wide integration of resource managers (e.g., scheduler), applications, and system resource analysis capabilities. Run-time information on resource utilization and predictive capabilities for anticipated resource needs and component failure can be used by schedulers and applications in order to better allocate resources. For example, information on reliable (or unreliable) system components can be used by the scheduler in making job allocation assignments and further used by applications in order to invoke fault-tolerance mechanisms.
 
 
The OVIS tool for Intelligent Scalable Real-Time Monitoring for Large Computational Clusters was created to address the piece of this goal involving resource analysis and failure prediction.
 
 
=== OVIS: A Tool for Intelligent, Scalable, Real-Time Monitoring of Large Computational Clusters ===
 
 
Traditional cluster monitoring approaches consider each node in isolation, using manufacturer-specified extreme limits as thresholds to avoid failure. The OVIS tool for monitoring and analysis of large computational platforms instead uses a statistical approach. Leveraging the fact that a cluster is composed of a large number of similar components, OVIS statistically characterizes the behavior of each component in the context of the behavior of the entire set of components. Abnormal or outlier behaviors can indicate problems much earlier than threshold crossings.
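
The contrast between per-node thresholds and ensemble statistics can be illustrated with a minimal sketch. It is not OVIS code: the z-score test, the function names, the cutoff of 3.0, and the temperature readings are all assumptions chosen for illustration, and OVIS's actual statistical models may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List


def threshold_violations(readings: Dict[str, float], limit: float) -> List[str]:
    """Traditional check: flag nodes whose value exceeds a fixed limit."""
    return sorted(node for node, value in readings.items() if value > limit)


def ensemble_outliers(readings: Dict[str, float], z_limit: float = 3.0) -> List[str]:
    """Ensemble check (illustrative z-score test, assumed for this sketch):
    flag nodes that deviate strongly from the rest of the cluster."""
    values = list(readings.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every node reports the same value, so nothing stands out
    return sorted(
        node for node, value in readings.items()
        if abs(value - mu) / sigma > z_limit
    )


# Hypothetical temperatures (degrees C): 20 similar nodes and one warm node.
readings = {f"node{i:02d}": 40.0 + 2.0 * (i % 2) for i in range(20)}
readings["node20"] = 55.0

print(threshold_violations(readings, limit=85.0))  # fixed manufacturer-style limit
print(ensemble_outliers(readings))                 # comparison against the ensemble
```

In this made-up data set the warm node is far below the fixed limit, so only the ensemble comparison flags it.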
 
 
== OVIS 3 ==
 
 
[[Image:Screenshot.png|left|thumb|500px|OVIS 3.2 screenshot]]
 
 
OVIS 3 includes a 3D visual display of deterministic information about state variables
 
(e.g., temperature, CPU utilization, fan speed), user-generated derived variables
 
(e.g., aggregated memory errors over the life span of a job), and their aggregate statistics.
 
Visual consideration of the cluster as a comparative ensemble, rather than singleton nodes,
 
is a convenient and useful method for tuning cluster set-up and determining the effects
 
of real-time changes in the cluster configuration and its environment.
 
 
OVIS 3 includes a variety of statistical tools to dynamically infer models for the
 
normal behavior of a system and to determine bounds on the probability of values evinced
 
in the system. OVIS stores data in a distributed database to provide scalability and fault
 
tolerance. Statistical analyses are then performed in a distributed parallel fashion.
 
 
OVIS 3 includes prototype capabilities for job log searching that can be used to search
 
for events of interest. The OVIS interface has been designed to be highly interactive,
 
where, for example, selection of a job of interest automatically populates an analysis
 
pane with information relevant to that job, and dropping a job onto the 3D display highlights
 
system values on only those components participating in the job.
 
 
OVIS has been used in [[publications_and_presentations| research work]] for runtime invocation of system response to analytically
 
discovered conditions of interest.
 
 
OVIS 3 is currently available for [[downloads_and_documentation| download]].
 
 
 
[[Image:OvisInterface.png|left|thumb|500px|OVIS 2.0 screenshot]]
 
 
<br><br>
 
== Beyond HPC ==
 
 
The OVIS project extends its techniques in large-scale data exploration and statistical data analysis to areas where statistical techniques and scalable data-handling and analysis are required. This includes areas where multiples of components and/or multiples of comparable datasets are appropriate. OVIS has been investigating application to the areas of chemical sensor analysis and large-scale network analysis.
 
 
-->
 
 
== Acknowledgements ==
 
OVIS is a project of [http://www.sandia.gov/ Sandia National Laboratories], Albuquerque NM, 87123
 
and collaborative partner [http://www.opengridcomputing.com/ Open Grid Computing], Austin TX.
 
 
 
SAND 2006-2519W
 

Latest revision as of 12:43, 16 February 2018


OVIS is a modular system for HPC data collection, transport, storage, analysis, visualization, and response. The OVIS project seeks to enable more effective use of High Performance Computational Clusters via greater understanding of applications' use of resources, including the effects of competition for shared resources; discovery of abnormal system conditions; and intelligent response to conditions of interest.

Data Collection, Transport, and Storage

The Lightweight Distributed Metric Service (LDMS) is the OVIS data collection and transport system. LDMS provides capabilities for lightweight run-time collection of high-fidelity data. Data can be accessed on-node or transported off-node. LDMS can also store data using a variety of storage options.

Analysis and Visualization

OVIS data can be used for understanding system state and resource utilization. The current release version of OVIS enables in-transit calculations of functions of metrics at an aggregator before storing or forwarding data to additional consumers. A more flexible analysis and visualization pipeline is in development.
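
As a rough illustration of what an in-transit calculation at an aggregator might look like, the sketch below converts raw counter samples into per-node rates before handing them to downstream consumers. It does not use the LDMS API: the sample layout, the function names, and the choice of a rate as the derived function are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical sample layout: (node name, timestamp in seconds, counter value).
Sample = Tuple[str, float, float]
Consumer = Callable[[List[Sample]], None]


def counter_rates(samples: Iterable[Sample]) -> List[Sample]:
    """Derive per-node rates (units per second) from successive counter samples."""
    last_seen: Dict[str, Tuple[float, float]] = {}
    derived: List[Sample] = []
    for node, timestamp, value in samples:
        if node in last_seen:
            prev_timestamp, prev_value = last_seen[node]
            elapsed = timestamp - prev_timestamp
            if elapsed > 0:  # skip duplicate or out-of-order samples
                derived.append((node, timestamp, (value - prev_value) / elapsed))
        last_seen[node] = (timestamp, value)
    return derived


def aggregate_and_forward(samples: Iterable[Sample],
                          consumers: Iterable[Consumer]) -> List[Sample]:
    """Compute the derived metric once, then pass it to every consumer
    (for example a store or a further aggregator)."""
    derived = counter_rates(samples)
    for consume in consumers:
        consume(derived)
    return derived


# Usage with made-up data: two nodes sampled 10 seconds apart.
stored: List[Sample] = []
aggregate_and_forward(
    [("n1", 0.0, 100.0), ("n2", 0.0, 50.0), ("n1", 10.0, 300.0), ("n2", 10.0, 50.0)],
    consumers=[stored.extend],
)
print(stored)
```

Computing the derived value once at the aggregator means each consumer receives the rate directly and does not have to recompute it from raw counters.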

OVIS has been used for investigation of network congestion evolution in large-scale systems.

Investigation of network congestion evolution on NCSA's Blue Waters Gemini Network (27,648 compute nodes)

Additional features in development include association of application phases and performance in conjunction with system state data.


Log Message Analysis

OVIS analyses include the Baler tool for log message clustering.

Decision Support

The OVIS project includes research work in determining intelligent response to conditions of interest. This includes dynamic application (re-)mapping based upon application needs and resource state and invocation of resiliency responses upon discovery of potential pre-failure and/or abnormal conditions.

Collaborative Analysis Support

Shaun, a cluster supporting collaboration in HPC data analytics, is coming soon.