Very light secure app monitoring approach
We have a variety of strategies to monitor applications with LDMS in development. Each addresses a subset of use cases, and none of the current ones is conservative enough (or easily configured to be conservative enough) for many app users, developers or administrators. I started a github wiki page cataloging the approaches and hope others fill in any gaps it may have. https://github.com/ovis-hpc/ovis/wiki/Proposal-4
A new approach: For some Sandia apps and admins we want/need job monitoring with the properties listed below. I propose that this requires new (relatively simple) sampler development and lightweight application interface library development.
- Provides users/admins/developers/analysts with clues about application state in a format which is human and machine readable.
- Low frequency: expected sampling interval of the order of minutes, not seconds.
- Low local sampling overhead: either the data is small in parse-time terms, or the sampler does not even parse it (defer parse to store or analysis).
- Application users/developers do NOT use code that connects to the system ldmsd (which runs as root) or to any other network database.
- Adding "app instrumentation" in any form does not introduce a dependency on network file systems without explicit opt-in by the user at runtime.
- App users/developers do not depend on using binary shared memory constructs -- any communication with an ldmsd sampler is via an ASCII text file that is also useful to human admins and users.
- No use of LD_PRELOAD tricks or just-in-time binary instrumentation.
- App developers can write simple calls to an ldms-provided file api that manages one or more size-limited metrics files. The API must allow the file content to be defined incrementally, from calls distributed throughout the app code.
- App logic can choose to emit app configuration information.
- App logic can choose to emit app progress information as a set of counters (not a stream of events). The set can evolve in content as the code runs.
- App logic may emit data only on an arbitrary subset of the run nodes, or maybe even only on the launch node (which in principle may not be where the app actually runs in parallel).
- "App instrumentation" might be implemented as a separate code which, given the arguments of the soon to be launched real application, parses them and emits data before the real app runs.
- Compatible with multiple jobs on the same compute node.
- Not tied to a specific system resource manager in any way.
- An admin-configured ldmsd sampler can automatically discover user data in canonical locations.
Data desired (though maybe not all collected by this method): Much of this is contemplated in ongoing university work. Italic bits are not.
- job id of predecessor job (if a continuation of simulation time)
- all potentially relevant parameters from input file(s)
- a hint about how to automatically detect if the job is not progressing
- any user-supplied tags
- Job id
- path and timestamp of binary used
- command line options present
- environment variables present (perhaps filtered by whitelist/blacklist)
- what libraries are loaded, if binary is not statically linked and stripped.
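The whitelist/blacklist filtering of environment variables mentioned above could work roughly as follows. This is a minimal sketch, not a proposed interface: the function name `filter_env` and the glob-style patterns are assumptions made for illustration.

```python
import fnmatch


def filter_env(env, allow=("*",), deny=()):
    """Keep variables whose name matches an allow pattern and no deny pattern."""
    kept = {}
    for name, value in env.items():
        allowed = any(fnmatch.fnmatchcase(name, pat) for pat in allow)
        denied = any(fnmatch.fnmatchcase(name, pat) for pat in deny)
        if allowed and not denied:
            kept[name] = value
    return kept


# Hypothetical policy: keep scheduler and OpenMP settings, drop token-like names.
sample_env = {
    "SLURM_JOB_ID": "12345",
    "OMP_NUM_THREADS": "4",
    "SLURM_TOKEN": "not-for-logging",
    "HOME": "/home/bob",
}
selected = filter_env(sample_env, allow=("SLURM_*", "OMP_*"), deny=("*TOKEN*",))
for name in sorted(selected):
    print(f"{name}={selected[name]}")
```

The deny list wins over the allow list here, which seems the conservative choice when the data may leave the node.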
Seemingly satisfactory implementation method:
- Provide C library and scripting (python and/or binary programs) to incrementally create and update structured text files in /dev/shm/jobmon/$JOBID/
- Format the text files as TOML.
- Use a lock file for consistency (which pretty much forces use of a memory based file system).
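The three points above (incremental structured text, TOML, lock file) might combine as in the sketch below. Everything named here is an assumption for illustration: the file name `config.toml`, the `.lock` and `.tmp` suffixes, and the use of exclusive file creation as the locking scheme. A temporary directory stands in for /dev/shm/jobmon/$JOBID/, only flat integer and string values are handled, and keys are assumed to be valid bare TOML keys.

```python
import json
import os
import tempfile
import time


def toml_line(key, value):
    """Format one key/value pair; this sketch handles only ints and strings."""
    if isinstance(value, int) and not isinstance(value, bool):
        return f"{key} = {value}"
    # json.dumps approximates TOML basic-string escaping for ordinary text.
    return f"{key} = {json.dumps(str(value), ensure_ascii=False)}"


def write_locked(path, values, timeout=5.0):
    """Write `values` to `path` as flat TOML while holding `path + '.lock'`."""
    lock = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail while another process holds the lock.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock}")
            time.sleep(0.01)
    try:
        os.close(fd)
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for key in sorted(values):
                f.write(toml_line(key, values[key]) + "\n")
        os.replace(tmp, path)  # rename within one directory is atomic on POSIX
    finally:
        os.unlink(lock)


# Demo in a temporary directory rather than the real /dev/shm location.
job_dir = os.path.join(tempfile.mkdtemp(), "12345")
os.makedirs(job_dir)
config_path = os.path.join(job_dir, "config.toml")
write_locked(config_path, {"app_name": "bobsMDversion", "steps": 1000})
with open(config_path, encoding="utf-8") as f:
    print(f.read(), end="")
```

A sampler reading the file would take the same lock before reading. Exclusive creation is cheap on a memory-based file system but is not reliable on some network file systems, which is consistent with the remark that a lock file pretty much forces a memory-based location.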
What a C/python api might look like (pseudocode)
Functions will be provided to populate and update files such as
    /dev/shm/jobmon/$JOBID.config
    /dev/shm/jobmon/$JOBID.progress
    /dev/shm/jobmon/$JOBID.env
which are in a defined format. Here $JOBID is an application-supplied string which should start with the SLURM-assigned job id and may then include rank or other app-defined key material. An ldmsd sampler can then scan /dev/shm/jobmon/ for new files and collect data safely.
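The scanning step could look something like the sketch below. It is written in Python only to illustrate the logic; a real ldmsd sampler would be a C plugin, and the function name `scan_jobmon`, the `seen` bookkeeping, and the 64 KiB read cap are all assumptions. "Safely" is interpreted here as: do not follow symlinks, cap the bytes read, and do not parse the content on the compute node. Lock-file handling is left out for brevity.

```python
import os
import tempfile

SUFFIXES = (".config", ".progress", ".env")


def scan_jobmon(root, seen, max_bytes=65536):
    """Return {path: raw_text} for files that are new or changed since last scan."""
    collected = {}
    for entry in os.scandir(root):
        if not entry.name.endswith(SUFFIXES):
            continue
        try:
            if not entry.is_file(follow_symlinks=False):
                continue  # skip directories and symlinks planted by users
            st = entry.stat(follow_symlinks=False)
            stamp = (st.st_mtime_ns, st.st_size)
            if seen.get(entry.path) == stamp:
                continue  # unchanged since the previous scan
            with open(entry.path, "rb") as f:
                # Read at most max_bytes; parsing is deferred to store or analysis.
                raw = f.read(max_bytes)
        except OSError:
            continue  # the file vanished or became unreadable between steps
        seen[entry.path] = stamp
        collected[entry.path] = raw.decode("utf-8", errors="replace")
    return collected


# Demo against a temporary directory standing in for /dev/shm/jobmon/.
demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, "12345.progress"), "w") as f:
    f.write("step = 1\n")
demo_seen = {}
for path, text in scan_jobmon(demo_root, demo_seen).items():
    print(os.path.basename(path), repr(text))
```

Calling `scan_jobmon` again with the same `seen` dictionary returns nothing until a file's size or modification time changes, which suits a sampling interval of minutes.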
The library could follow a singleton pattern (global) or support thread-level or object-level use. For now, assume a singleton behind the ldms_watch_ prefix, with the app responsible for calling it only in rank 0 or equivalent. More object-oriented usage, arrays, and per-rank usage should be obvious extensions.
    void ldms_watch_init($JOBID, app_name, app_family) // e.g. ("12345", "bobsMDversion", "lammps")
    void ldms_watch_continuation(previous_job) // tell us the prior job in simulation time
    void ldms_watch_final() // dump any outstanding data

    // collecting configuration strings
    void ldms_watch_config_init(max_data_bytes) // define maximum file size
    void ldms_watch_config_add_value(key, value) // add a string
    void ldms_watch_config_add_kvlist(map) // add a list of string key:value pairs
    void ldms_watch_config_add_group(group_name) // name a subset of strings
    void ldms_watch_config_add_group_value(group_name, key, value) // add value to a group
    void ldms_watch_config_add_group_kvlist(group_name, map) // add list to a group
    void ldms_watch_config_write() // write all config strings to file

    // collecting progress (rolling unsigned iteration counters)
    void ldms_watch_progress_init(max_data_bytes) // define maximum file size
    point = ldms_watch_progress_add("point_name") // name a progress point counter
    ldms_watch_progress_update(point) // update the counter
    ldms_watch_progress_write() // write all counters to file
    ldms_watch_progress_schedule_write(interval, offset) // schedule automatic writing of counters in an independent thread

    // log significant environment variables
    void ldms_watch_env(envvar) // add named environment variable to dump
    void ldms_write_env() // dump all named variables to file
The progress and config init functions accept a maximum file size. If later in the code too many values or progress points are added such that the file size will be exceeded, the excess adds are ignored and an error counter in the output becomes non-zero. It's trivial to adjust this scheme to force error-handling on the application writer if desired and to accept max=0 as a flag that the app writer wants no limitations. The LDMS sampler may, however, impose limitations such that app data is lost.
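One way the size limit and error counter could behave for progress points is sketched below. The class `ProgressSketch` is a hypothetical stand-in, not the proposed library. It assumes ASCII counter names, one `name = value` line per counter, and a budget that reserves 20 digits per counter so that a rolling unsigned 64-bit value can never push the file over the limit after the add was accepted.

```python
MAX_DIGITS = 20  # digits in 2**64 - 1, the largest rolling unsigned counter value


class ProgressSketch:
    """Hypothetical stand-in for the progress calls; not the real library."""

    def __init__(self, max_data_bytes):
        self.max_data_bytes = max_data_bytes  # 0 is read as "no limit"
        self.counters = {}
        self.errors = 0  # adds ignored because the file would grow too large

    @staticmethod
    def _worst_case_bytes(names):
        # Budget each "name = value\n" line as if the value had MAX_DIGITS digits.
        return sum(len(name) + len(" = ") + MAX_DIGITS + 1 for name in names)

    def add(self, name):
        """Register a progress point; return None if it does not fit."""
        if name in self.counters:
            return name
        names = ["errors", *self.counters, name]
        if self.max_data_bytes and self._worst_case_bytes(names) > self.max_data_bytes:
            self.errors += 1  # excess add is ignored, only counted
            return None
        self.counters[name] = 0
        return name

    def update(self, point):
        """Increment a counter, wrapping like an unsigned 64-bit integer."""
        if point is not None:
            self.counters[point] = (self.counters[point] + 1) % 2**64

    def render(self):
        """Return the text that a write call would place in the progress file."""
        lines = [f"errors = {self.errors}"]
        lines += [f"{name} = {count}" for name, count in self.counters.items()]
        return "\n".join(lines) + "\n"


progress = ProgressSketch(max_data_bytes=64)
step = progress.add("step")              # fits within the 64-byte budget
checkpoint = progress.add("checkpoint")  # does not fit, so it is ignored
progress.update(step)
progress.update(checkpoint)              # harmless no-op for an ignored point
print(progress.render(), end="")
```

Returning `None` for an ignored point keeps later update calls harmless, so the application never has to check for errors unless it wants to.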
Feedback from lammps developers
- Include a (bounded) log recording capability rather than exclusively a parameter tabulation; the parameters that will be of interest in the future are not known in advance.
- Telling lammps folk "lammps-hours" without any configuration info is also of great interest. We could get this from the BU pid sampler and offline introspection of the binaries detected (nm $binary | grep -i lammps).
- Expect the case of a single slurm job with hundreds of separate app runs inside, including N-core simultaneous serial runs.
- Make sure it can be defaulted off at the developer's discretion.
- Turning it on via an environment variable is acceptable.
- No immediate bite on the "progress" feature
- Provide us a demo code + proposed instrumentation library.
- Performance concerns:
- memory competition (allow strict upper bound)
- cleanup of files when a run ends but after ldms has sampled them. Whose job?
- Impact of instrumentation on runtime performance.
- per node vs per rank reporting.
- Lammps can spawn vasp-- how is that accounted?
- Not per-thread reporting.
- How will this behave in non-sandia environments?
- lammps can periodically replay some or all of an input file; consider mode where only first N log lines are kept.
- Some users may have finer-grained privacy concerns (hiding numerical parameters, novel class names)
  - this is beyond the scope of a logging library, but in scope for the application writers to set (or tune) policy of log lines written.
  - anonymization of logs (obscuring or removing parameters, names, etc) is frequently handled with file filters in post processing rather than app logic.
  - all university and gov't systems already have defined privacy expectations and information release policies.
- Need to run design/examples by key non-sandia lammps players.
- Use cases:
  - One job, one long (possibly parallel) binary execution.
  - One job, one binary execution per core simultaneous.
  - One job, possibly hundreds of binary executions per node or per core simultaneous or overlapping.
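Two of the feedback items above (default off with an environment variable to turn it on, and a mode where only the first N log lines are kept) can be sketched together. The variable name `LDMS_WATCH_ENABLE` and the class `BoundedLog` are hypothetical names chosen for illustration, not part of any existing interface.

```python
import os


def enabled_from_env(environ, var="LDMS_WATCH_ENABLE"):
    """Default off: only an explicit "1" in the (hypothetical) variable enables it."""
    return environ.get(var, "0") == "1"


class BoundedLog:
    """Keep the first max_lines lines; count later lines instead of storing them."""

    def __init__(self, max_lines, enabled):
        self.max_lines = max_lines
        self.enabled = enabled
        self.lines = []
        self.dropped = 0

    def record(self, line):
        if not self.enabled:
            return  # instrumentation is off: no memory use, no output
        if len(self.lines) < self.max_lines:
            self.lines.append(line)
        else:
            self.dropped += 1  # strict upper bound on memory is preserved


log = BoundedLog(max_lines=2, enabled=enabled_from_env(os.environ))
for i in range(5):
    log.record(f"input line {i}")
print(len(log.lines), "kept,", log.dropped, "dropped")
```

Keeping the first N lines rather than the last N means a periodically replayed input file cannot overwrite the initial configuration record, and the drop counter shows that something was omitted.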