The primary LDMS store plugins support:
Based on librabbitmq 0.8; no changes planned. Tracking of librabbitmq updates is expected.
- Remove hard-coded limit on number of instances.
- Extended flush options to manage latency and debugging issues.
Ideas under discussion:
- Timestamping of set arrival at the store (under consideration).
- Possible inclusion of final agg time in the writeout
- It will cost virtually nothing (but storage) to add an option to include a store-processing-date stamp.
- This has been requested by (possibly among others) users concerned with data validation for collective (multinode) statistics. This may be better addressed by changes elsewhere in LDMS. For example, the LDMS core might provide a service that collects the current RTC from all nodes in the aggregation over as short a time as possible and publishes it as a data set (the sampler API does not support this collection activity). A custom "store" could then generate a warning if any host clock is off by more than the length of time the scan took (or a small multiple thereof). The usage model of this on Cray's odd clock systems is unclear.
- Duplicate set instance detection (the same set instance arriving by distinct aggregation paths is silently stored twice).
- This can be handled by individual stores keeping a hash of the set instance names and the last timestamp (or N timestamps in a network of aggregators with potentially out-of-order delivery) of data stored for that set instance. Any set whose timestamp is detected as already stored is a duplicate. As LDMS is also in the process of adding multiple set instance transport and collection, putting this logic in individual stores is redundant and error-prone. The LDMS core can/should ensure delivery of exactly one copy from a given origin and date to the stores; this requires a bit of state per store per set instance, and while we're at it this state should include a void pointer for use of the store plugin. This would eliminate most lookups stores may be performing.
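The per-store bookkeeping described above (a hash of set instance names mapped to the last N timestamps stored) can be sketched as follows. This is an illustration only, not LDMS code: the class name `DuplicateFilter`, the `depth` parameter, and the representation of a timestamp as a `(sec, usec)` tuple are all assumptions.

```python
from collections import defaultdict, deque


class DuplicateFilter:
    """Remember the last `depth` timestamps stored per set instance name.

    Hypothetical helper: a real store would key on whatever identifies
    the set instance and its sample timestamp in LDMS.
    """

    def __init__(self, depth=4):
        # depth > 1 tolerates out-of-order delivery over distinct
        # aggregation paths; older timestamps are evicted automatically.
        self._seen = defaultdict(lambda: deque(maxlen=depth))

    def is_duplicate(self, instance_name, timestamp):
        """Return True if (instance_name, timestamp) was already recorded."""
        recent = self._seen[instance_name]
        if timestamp in recent:
            return True
        recent.append(timestamp)
        return False
```

A store would call `is_duplicate()` before writing and skip the write when it returns True. Note that a duplicate arriving after its timestamp has been evicted from the window is not detected, which is one reason the bounded per-store approach is weaker than exactly-once delivery from the core.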
- Conflicting schema detection (set instances with schema of the same name and different content, resulting in storage loss or silent error).
- The schema conflict detection can be reduced to making a metadata checksum at set creation and performing consistency checks in any storage plugin (such as csv) where a full set consistency constraint exists.
- Storage policies/transforms which cherry-pick named metrics must search instances by metric name every time (until LDMS starts managing a void pointer per policy instance per set instance) or must also enforce full set consistency.
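The checksum-based consistency check described above could look roughly like this. It is a minimal sketch under stated assumptions: the metadata is modeled as an ordered list of `(metric name, type name)` pairs, SHA-256 is an arbitrary choice of digest, and `schema_checksum` and `SchemaRegistry` are hypothetical names, not LDMS APIs.

```python
import hashlib


def schema_checksum(metrics):
    """Digest an ordered list of (metric_name, type_name) pairs.

    Order matters: a store such as csv depends on column order, so two
    schemas with the same metrics in different order must not match.
    """
    h = hashlib.sha256()
    for name, mtype in metrics:
        # NUL separators keep ("ab", "c") distinct from ("a", "bc").
        h.update(name.encode("utf-8") + b"\0" + mtype.encode("utf-8") + b"\0")
    return h.hexdigest()


class SchemaRegistry:
    """Per-store record of the first checksum seen for each schema name."""

    def __init__(self):
        self._known = {}

    def check(self, schema_name, checksum):
        """Return True if checksum agrees with the first one seen."""
        first = self._known.setdefault(schema_name, checksum)
        return first == checksum
```

With the checksum computed once at set creation, the store-side check is a single dictionary lookup and string comparison per set instance.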
- Handle the arrival of a new schema and the deletion of the old schema. This is needed for dvs and perhaps more for papi. Handle N schemas.
- See also conflicting schema and duplicate set detection. This all gets very easy if we stop looking at store plugins as singletons without per-set-instance state.
- Check handling start/stop/load/unload. Multiple instance support?
- File permissions and naming
- File owner/permissions set at create has been added to 3.4.7 and 4.x.
- Also want YYYYMMDD naming convention instead of epoch.
- CSV has code to handle user-defined-templates for filenames at close/rename; it can be extended to file creation.
- Users want this ability at the start of the file, not just at close/rename.
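As an illustration of date-based naming applied at file creation, the sketch below renders a filename from a template. The template syntax here is an assumption made for the example (strftime directives plus a literal `{schema}` placeholder) and is not the template format the CSV store actually uses; `render_filename` is a hypothetical helper.

```python
import time


def render_filename(template, schema, epoch):
    """Expand strftime directives (UTC), then the {schema} placeholder.

    Assumed template syntax for illustration only.
    """
    stamp = time.strftime(template, time.gmtime(epoch))
    return stamp.replace("{schema}", schema)
```

For example, `render_filename("{schema}.%Y%m%d", "meminfo", 0)` yields `meminfo.19700101`. Whether names should follow UTC or local time is a site decision that this sketch does not settle.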
- Rollover at sub-day intervals: option 1 is thought to be sufficient for now. A fixed name would also be in alignment with production system usage, so it should be considered.
- This could be done instantly just using rollover option 1 with an interval less than 86400 seconds. This would drift unless we add some interval/offset semantics (but in seconds). This has been implemented as rolltype 5.
- Presumably the user wants something more cron-like (e.g. 2 minutes past every third hour since midnight). This would entail supporting a config file with cron syntax and the names of the schemas to roll on different schedules.
- It might be better to just refactor the stores to work on fixed filenames and accept a command (with template) via ldmsctl to perform a log-close-and-rename-and-reopen. Actual cron or logrotate can then be used in the standard ways admins know.
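The interval/offset semantics mentioned above, which avoid drift by aligning rollovers to a fixed grid rather than to the previous rollover, can be sketched as follows. This is not taken from the rolltype 5 source; `next_rollover` and its parameters are hypothetical, and alignment is to the epoch (UTC), not local midnight.

```python
def next_rollover(now, interval, offset=0):
    """Return the first epoch second strictly after `now` that satisfies
    (t - offset) % interval == 0. All arguments are integer seconds.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    elapsed = (now - offset) % interval
    return now + (interval - elapsed)
```

With `interval=10800` and `offset=120` this gives "2 minutes past every third hour" relative to 00:00 UTC. Because the result depends only on the clock and not on when the last rollover actually ran, a late rollover does not shift subsequent ones.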
- LDMS core managed state pointers (void *) per client(transform policy/store policy)
- Lack of these is making the store and transform APIs very difficult to finish.
- When a set instance is assigned to be used in a plugin instance (of which there may be more than one per plugin), then associated with the (set-instance, plugin-instance) pair must also be a void * storage slot that the plugin is allowed to populate.
- The plugin can hang anything it needs off that void * (udata of the right flavor).
- The most obvious use is cached data needed to resolve the problems listed above: the generation numbers and checksums of the schema and instance last seen by the store instance for that set instance.
- The next most obvious use is a cache of the metadata generation number and the metric indices wanted for the transform or policy, when the schema might vary in content under the same name.
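The idea of a core-managed slot per (set-instance, plugin-instance) pair, and the metric-index cache it enables, can be modeled as below. In C the slot would be a `void *`; here an arbitrary Python object stands in for it. Everything named in this sketch (`StateSlots`, `wanted_indices`, the dictionary layout) is hypothetical and only illustrates the usage pattern.

```python
class StateSlots:
    """Core-owned table of opaque state, one slot per
    (set instance, plugin instance) pair."""

    def __init__(self):
        self._slots = {}

    def get(self, set_instance, plugin_instance):
        return self._slots.get((set_instance, plugin_instance))

    def put(self, set_instance, plugin_instance, state):
        self._slots[(set_instance, plugin_instance)] = state


def wanted_indices(slots, plugin, set_instance, meta_gn, metric_names, wanted):
    """Return indices of the wanted metrics, searching by name only when
    the metadata generation number differs from the cached one."""
    state = slots.get(set_instance, plugin)
    if state is None or state["meta_gn"] != meta_gn:
        indices = [metric_names.index(w) for w in wanted if w in metric_names]
        state = {"meta_gn": meta_gn, "indices": indices}
        slots.put(set_instance, plugin, state)
    return state["indices"]
```

The plugin pays for the by-name search once per metadata generation instead of once per update, which is the lookup cost the notes above aim to eliminate.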
Problems observed or suspected: XYZ
SOS is in rapid development, and the corresponding store is tracking it.
Production use of the flatfile store has led to a number of requested changes (below). These changes are sufficiently complicated that an alternately named store (store_var) is in development. The flatfile store will remain unchanged, so that existing production script use can continue per site until admins have time to switch.
- Flush controls to manage latency.
- Output only on change of metric.
- Optionally with heartbeat metric output on specified long interval.
- Output only of specific metrics.
- Excluding output of specific metrics.
- Including producername, job id, and component id, for single-job and single-node use cases.
- Output of rate, delta, or integral delta values.
- Periodic output at frequency lower than arrival, optionally with selectable statistics on suppressed data.
- Statistics: min, max, avg, miss-count, nonzero-count, min-nonzero, sum, time-weighted sum, dt
- Metric name aliasing.
- Rounded to nearest second time stamps (when requested by the user, who is also using long intervals).
- Check and log message (once) if a rail limit is observed.
- Rename file after close/rollover following a template string.
- Generation of splunk input schema.
- Handling of array metrics.
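The request above for periodic output at a frequency lower than arrival, with statistics on the suppressed data, can be sketched for a single scalar metric as follows. This is not the store_var implementation; `WindowStats` is a hypothetical name, and only a subset of the listed statistics is computed (miss-count and time-weighted sum are omitted because they need a definition of the expected arrival interval that the notes do not give).

```python
class WindowStats:
    """Accumulate samples and emit one summary per `period` seconds."""

    def __init__(self, period):
        self.period = period
        self.start = None
        self.values = []

    def add(self, t, value):
        """Record a sample; return a summary dict when the window closes,
        otherwise None."""
        if self.start is None:
            self.start = t
        self.values.append(value)
        if t - self.start < self.period:
            return None
        v = self.values
        nonzero = [x for x in v if x != 0]
        summary = {
            "min": min(v),
            "max": max(v),
            "sum": sum(v),
            "avg": sum(v) / len(v),
            "nonzero_count": len(nonzero),
            "min_nonzero": min(nonzero) if nonzero else None,
            "dt": t - self.start,
        }
        # The next window starts at the time this one closed.
        self.start = t
        self.values = []
        return summary
```

A store would keep one such accumulator per metric per set instance and write a row only when `add()` returns a summary.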