Very light secure app monitoring approach

From OVISWiki
Revision as of 09:33, 3 December 2019 by Baallan

We have a variety of strategies in development for monitoring applications with LDMS. Each addresses a subset of use cases, and none of the current ones is conservative enough (or easily configured to be conservative enough) for many app users, developers or administrators. I started a GitHub wiki page cataloging the approaches and hope others will fill in any gaps it may have.

A new approach: For some Sandia apps and admins we want/need job monitoring with the properties listed below. I propose that this requires development of a new (relatively simple) sampler and of a lightweight application interface library.


   Provides users/admins/developers/analysts with clues about application state in a format which is human and machine readable.
   Low frequency: expected sampling interval of the order of minutes, not seconds.
   Low local sampling overhead: either the data is small in parse-time terms, or the sampler does not even parse it (defer parse to store or analysis).
   Application users/developers do NOT use code that connects to the system ldmsd (which runs as root) or to any other network database.
   Adding "app instrumentation" in any form does not introduce a dependency on network file systems without explicit opt-in by the user at runtime.
   App users/developers do not depend on using binary shared memory constructs; any communication with an ldmsd sampler is via an ASCII text file that is also useful to human admins and users.
   No use of LD_PRELOAD tricks or just-in-time binary instrumentation.
   App developers can write simple calls to an LDMS-provided file API that manages one or more size-limited metrics files. Definition of the file content via the API must be distributable (and logically incremental) throughout the app code.
   App logic can choose to emit app configuration information.
   App logic can choose to emit app progress information as a set of counters (not a stream of events). The set can evolve in content as the code runs.
   App logic may emit data on only an arbitrary subset of the run nodes, or perhaps only on the launch node (which in principle may not be where the app actually runs in parallel).
   "App instrumentation" might be implemented as a separate code which, given the arguments of the soon to be launched real application, parses them and emits data before the real app runs.
   Compatible with multiple jobs on the same compute node.
   Not tied to a specific system resource manager in any way.
   An admin-configured ldmsd sampler can automatically discover user data in canonical locations.
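
The discovery and "do not even parse it" properties above can be illustrated with a minimal sketch. It is not an ldmsd sampler plugin; it only shows the scan a sampler might perform. The function name discover_job_files, the "*.toml" file suffix, and the 64 KiB cap are assumptions made for illustration. The layout of one directory per job under a canonical root follows the /dev/shm/jobmon/$JOBID/ proposal on this page.

```python
from pathlib import Path

MAX_BYTES = 64 * 1024  # hypothetical per-file size cap enforced by the sampler


def discover_job_files(root, max_bytes=MAX_BYTES):
    """Return {job_id: {file_name: raw_text}} for each job directory under root.

    The text is passed through unparsed, so parsing can be deferred to the
    store or to later analysis. Several jobs on one node simply appear as
    several subdirectories.
    """
    found = {}
    root = Path(root)
    if not root.is_dir():
        return found  # no job on this node has emitted anything yet
    for job_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = {}
        for f in sorted(job_dir.glob("*.toml")):
            try:
                if f.stat().st_size > max_bytes:
                    continue  # ignore files that exceed the size limit
                files[f.name] = f.read_text(encoding="ascii", errors="replace")
            except OSError:
                continue  # file vanished or is unreadable; skip it
        if files:
            found[job_dir.name] = files
    return found
```

A sampler running every few minutes could call discover_job_files("/dev/shm/jobmon") and forward the raw text. A real sampler would also need to honor whatever locking convention the writers use.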

Data desired (though maybe not all collected by this method):

   job id
   path and timestamp of binary used
   command line options present
   environment variables present (perhaps filtered by whitelist/blacklist)
   job id of predecessor job (if a continuation of simulation time)
   all potentially relevant parameters from input file(s)
   a hint about how to automatically detect if the job is not progressing
   any user-supplied tags
   what libraries are loaded, if binary is not statically linked and stripped.
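
Several of the items above (job id, binary path and timestamp, command line options, filtered environment) can be gathered at launch with very little code. The following is a rough sketch under stated assumptions: the function name collect_launch_info, the environment variable name JOBID, and the allowlist contents are all hypothetical. The job id variable is a parameter so that the sketch is not tied to any particular resource manager.

```python
import os
from datetime import datetime, timezone

# Hypothetical allowlist; a site would choose its own variables.
ENV_ALLOW = ("OMP_NUM_THREADS", "LD_LIBRARY_PATH")


def collect_launch_info(argv, environ, allow=ENV_ALLOW, job_id_var="JOBID"):
    """Collect a few launch-time facts as a plain dictionary."""
    binary = os.path.abspath(argv[0])
    try:
        # Modification time of the binary, as an ISO 8601 UTC timestamp.
        mtime = datetime.fromtimestamp(
            os.path.getmtime(binary), tz=timezone.utc
        ).isoformat()
    except OSError:
        mtime = ""  # binary not found at that path
    return {
        "job_id": environ.get(job_id_var, ""),
        "binary_path": binary,
        "binary_mtime": mtime,
        "argv": list(argv[1:]),
        # Only variables on the allowlist are recorded.
        "env": {k: environ[k] for k in allow if k in environ},
    }
```

An application or launch wrapper might call collect_launch_info(sys.argv, os.environ) once at startup. Predecessor job ids, input file parameters, and loaded libraries are not covered by this sketch.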

Seemingly satisfactory implementation method: Provide a C library and scripting tools (Python and/or binary programs) to incrementally create and update structured text files in /dev/shm/jobmon/$JOBID/. Format the text files as TOML. Use a lock file for consistency (which pretty much forces use of a memory-based file system).
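
To make the proposal concrete, here is a minimal Python sketch of what the application-side writer could look like. It is not the proposed C library, and every name in it (CounterFile, lock_file, the "progress" file name, the 4096-byte limit) is an assumption for illustration. It keeps a set of integer counters, writes them as flat TOML key/value lines, and guards each rewrite with a lock file created via O_CREAT | O_EXCL.

```python
import os
import re
import time
from contextlib import contextmanager
from pathlib import Path

BARE_KEY = re.compile(r"[A-Za-z0-9_-]+")  # characters allowed in a TOML bare key


@contextmanager
def lock_file(path, timeout=5.0, poll=0.01):
    """Hold an exclusive lock by creating path; remove it on exit."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(path)


class CounterFile:
    """A size-limited TOML file of integer counters for one job (hypothetical API)."""

    def __init__(self, job_id, name="progress", root="/dev/shm/jobmon",
                 max_bytes=4096):
        self.dir = Path(root) / str(job_id)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.path = self.dir / f"{name}.toml"
        self.lock = self.dir / f"{name}.lock"
        self.max_bytes = max_bytes
        self.counters = {}

    def set(self, key, value):
        """Add or update one counter; callable from anywhere in the app."""
        if not BARE_KEY.fullmatch(key):
            raise ValueError(f"not a TOML bare key: {key!r}")
        self.counters[key] = int(value)

    def flush(self):
        """Rewrite the whole file under the lock, enforcing the size limit."""
        text = "".join(f"{k} = {v}\n" for k, v in sorted(self.counters.items()))
        if len(text.encode("ascii")) > self.max_bytes:
            raise ValueError("metrics file would exceed its size limit")
        with lock_file(self.lock):
            self.path.write_text(text, encoding="ascii")
```

An application might create CounterFile(os.environ["JOBID"]) once, call set("timestep", n) wherever convenient, and call flush() every few minutes. Because the set of counters is just a dictionary, it can grow as the code runs. A reader would have to use the same lock file convention for the consistency guarantee to hold.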