LANL-LDMS

This website contains archival information. For updates, see https://github.com/ovis-hpc/ovis-wiki/wiki

LDMS

These are instructions for building LDMS v2 on a non-Cray platform from a repository that is no longer supported. For current releases, visit github.com/ovis-hpc.

Getting OVIS

Keyless access:

git clone git://hekili.ca.sandia.gov/git-ovis/repositories/ovispublic.git
cd ovispublic
git checkout mergebranch

Building LDMS

The source has a main directory called ovis with the following code subdirectories:

  • lib – some support codes
  • sos – the sos store
  • ldms – the main ldms code

Libevent2 is a prerequisite. If it is not installed on your system, it can be found at libevent.org. A script illustrating how to build and install libevent is in README.libevent2. Ubuntu 12 includes a compatible libevent package.

Example scripts for configure and make are in the packaging subdirectory. Copy and edit the one closest to your target platform. These must be run from the top-level directory. They build into a local directory called .build-all. The install directory can be changed by setting prefix (the scripts currently use /opt/ovis).
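As a sketch, the overall flow on a recent platform looks like the following. This is an illustrative command fragment using the script names described on this page, not a tested recipe; adjust the packaging script for your platform.

```shell
# Illustrative build sequence for a non-Cray platform.
# Script names come from the ovis source tree described above.
cd ovis
./autogen.sh                   # generate the configure scripts
./packaging/make-all-top.sh    # configure and build into .build-all
# Binaries install under the prefix set in the packaging script
# (currently /opt/ovis).
```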

By default, a selection of samplers is built that read from /proc sources (e.g., /proc/stat, /proc/net/dev, /proc/meminfo, etc.).

Complete OVIS build examples for our platforms are generally machine-specific variants of ./packaging/configure-all.sh.

Platform specific hints:

  • TOSS2/Redhat 6.x/Centos 6.x/Ubuntu 12: start with ./autogen.sh and then packaging/make-all-top.sh
  • Redhat 5.x: You must first build newer autotools before autogen.sh will work; see packaging/glory-buildauto.sh for the general idea. After autotools are installed, install libevent2, run autogen.sh, then see make-all-glory.sh or make-all-wtb.sh for examples. All major systems will soon migrate to a 6.x-based OS, and we do not expect to support 5.x for long.

Testing LDMS in User Mode

After building ldms (as a non-root user), a basic test script is generated in .build-all/ldms/scripts/ldms_usertest.sh. It tests ldmsd and aggregation on localhost. Before the test will work, you must be able to run "ssh localhost ls" without errors or typing your password/passphrase; "ssh-agent bash" followed by ssh-add may be needed. The output will end with:

ldms_ls on host 3: localhost3/vmstat localhost3/meminfo localhost2/vmstat localhost2/meminfo localhost1/vmstat localhost1/meminfo

Running LDMS in the TLCC Environment: Quick Start

This section describes how to configure and run LDMS daemons (ldmsd) to perform the following tasks:

  1. collect data
  2. aggregate data from multiple ldmsds, and
  3. store collected data to files.

There are four basic configurations that will be addressed:

  1. configuring ldmsd
  2. configuring a collector plugin on a running ldmsd
  3. configuring a ldmsd to aggregate information from other ldmsds, and
  4. configuring a flat file storage plugin on an ldmsd.

With respect to collectors and aggregators, these configurations may be performed in any order.

While a complete listing of flags and parameters can be seen by running ldmsd with the --help flag, this document describes only the flags and parameters required for a basic setup.

There are no run scripts provided in the current release; the commands here can be used in the creation of such.

Start a ldmsd on pinto1

  • Currently ldmsd must be run as root if using the default path of /var/run/ for the unix domain socket. This can be changed using the environment variable LDMSD_SOCKPATH as described below.
  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
export PATH=/opt/ovis/sbin/:$PATH
  • A script can be made to start ldmsd and its collectors on a host; such a script would execute the command below:
  • Running: <path to executable>/ldmsd -x <transport>:<listen port> -P <# worker threads to start> -S <unix domain socket path/name> -l <log file path/name>
    • transport is one of: sock, rdma, ugni (ugni is Cray specific for using RDMA over the Gemini network)
    • # worker threads defaults to 1 if not specified by -P and should not exceed the core count
    • The unix domain socket is used by ldmsctl to communicate configuration information
      • Note that the default path for this is /var/run/. To change this the environment variable LDMSD_SOCKPATH must be set to the desired path (e.g. export LDMSD_SOCKPATH=/tmp/)
    • The default is to run as a background process but the -F flag can be specified for foreground
    • Examples:
/opt/ovis/sbin/ldmsd -x rdma:60000 -S /var/run/ldmsd/metric_socket_1 -l /opt/ovis/logs/1

Same but sending stdout and stderr to /dev/null

/opt/ovis/sbin/ldmsd -x rdma:60000 -S /var/run/ldmsd/metric_socket_1 -l /opt/ovis/logs/1  > /dev/null 2>&1
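As mentioned above, a script can wrap the environment setup and the ldmsd invocation. A minimal sketch, using the example paths from this page with the sock transport on port 60000; the start_ldmsd function name and the DRYRUN preview switch are illustrative additions, not part of LDMS:

```shell
#!/bin/bash
# Sketch of a start script for ldmsd (paths are the example values from
# this page; adjust for your site). Set DRYRUN=1 to print the command
# instead of executing it.
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
export PATH=/opt/ovis/sbin/:$PATH

start_ldmsd() {
    # Transport:port, unix socket, and log file mirror the examples above;
    # sock is used here instead of rdma for portability.
    local cmd="/opt/ovis/sbin/ldmsd -x sock:60000 -S /var/run/ldmsd/metric_socket_1 -l /opt/ovis/logs/1"
    if [ -n "$DRYRUN" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Preview the command without starting the daemon:
DRYRUN=1 start_ldmsd
```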

Configure a collector

  • Option 1 (on pinto1)
    • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
    • Example:
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_1
ldmsctl>
  • Now configure "meminfo" collector plugin to collect every second
    • Note: The unit of time specified by interval= is usec hence interval=1000000 defines a one second interval.
ldmsctl> load name=meminfo
ldmsctl> config name=meminfo component_id=1 set=pinto1/meminfo
ldmsctl> start name=meminfo interval=1000000
ldmsctl> quit

NOTE0: "ldmsctl> help" will print out info about the ldms commands and options.
NOTE1: You can use "stop name=meminfo" followed by "start name=meminfo interval=xxx" to change collection intervals.
NOTE2: Different plugins may have additional configuration parameters. Use help within ldmsctl to see these.
NOTE3: The ldmsctl command "info" (ldmsctl> info) will output all config information to that ldmsd's log file.
  • Configure "vmstat" collector plugin to collect every second (1000000 usec)
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_1
ldmsctl> load name=vmstat
ldmsctl> config name=vmstat component_id=1 set=pinto1/vmstat
ldmsctl> start name=vmstat interval=1000000
ldmsctl> quit
  • Option 2 (remote to pinto1 via bash script)
    • Write bash script (e.g. meminfo_collect.sh)
#!/bin/bash
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
LDMSCTL=/opt/ovis/sbin/ldmsctl
  • In bash script configure "meminfo" collector plugin example to collect every second (1000000 usec)
echo load name=meminfo | $LDMSCTL -S /var/run/ldmsd/metric_socket_1
echo config name=meminfo component_id=1 set=pinto1/meminfo | $LDMSCTL -S /var/run/ldmsd/metric_socket_1
echo start name=meminfo interval=1000000 | $LDMSCTL -S /var/run/ldmsd/metric_socket_1
  • Make meminfo_collect.sh executable
chmod +x meminfo_collect.sh
  • Execute meminfo_collect.sh remotely
pdsh -w pinto1 <path>/meminfo_collect.sh
  • At this point the ldmsd collector should be checked using the utility ldms_ls
    • See Using ldms_ls below

Configure an aggregator

  • Start a ldmsd on a node/service node using "sock" as the listening transport (a bugfix currently in testing removes the requirement to use "sock" here)
    • See Start a ldmsd above
  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
    • Example:
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_X
ldmsctl>
  • Now configure ldmsd to collect metric sets from pinto1 and pinto2 every second (1000000 usec) (assumes they listen on port 60020)
ldmsctl> add host=pinto1 type=active interval=1000000 xprt=sock port=60020 sets=pinto1/meminfo
ldmsctl> add host=pinto2 type=active interval=1000000 xprt=sock port=60020 sets=pinto2/meminfo
ldmsctl> quit

Note0: Sets must be specified on the "add host" line; you can add hosts with sets to an aggregator even if those sets are not yet present on the host.
Note1: There is currently no "remove", so dropping a host from the list or changing its parameters requires stopping the ldmsd, restarting it, and re-adding hosts with the appropriate parameters.
Note2: There is no requirement that aggregator intervals match collection intervals.
Note3: Because the collection and aggregation processes operate asynchronously, there is the potential for duplicate data collection as well as missed samples. Duplicates are handled by the storage plugins, which compare generation numbers and do not store duplicates. Missed samples imply either a loss of fidelity (if collecting counter data) or occasional missing data points (if collecting differences of counter values or non-counter values).
  • Remote script based configuration would be done in the same manner as that for collection above
  • At this point the ldmsd collector should be checked using the utility ldms_ls
    • In this case you should see metric sets for both pinto1 and pinto2 displayed when you query the aggregator ldmsd
    • See Using ldms_ls below
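The "add host" lines above follow a fixed pattern, so a small helper can generate them for a list of collector hosts before piping them to ldmsctl. A minimal sketch; gen_add_cmds is a hypothetical helper (not part of LDMS), and the host names, port, and interval are the example values from this page:

```shell
#!/bin/bash
# Sketch: emit "add host=..." commands for a list of collector hosts.
# gen_add_cmds is an illustrative helper; port/interval/host values
# mirror the examples on this page.
gen_add_cmds() {
    local port=$1 interval=$2
    shift 2
    local h
    for h in "$@"; do
        echo "add host=$h type=active interval=$interval xprt=sock port=$port sets=$h/meminfo"
    done
}

# Preview the generated commands; to apply them, pipe each line into
# /opt/ovis/sbin/ldmsctl -S <socket> as shown earlier.
gen_add_cmds 60020 1000000 pinto1 pinto2
```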

Configure a store_csv storage plugin

  • Start a ldmsd on a node/service node that has access to a storage device, using "sock" as the listening transport (a bugfix currently in testing removes the requirement to use "sock" here)
    • See Start a ldmsd above
  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
    • Example:
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_Y
ldmsctl>
  • Now configure this ldmsd to aggregate metric sets from the aggregator ldmsds running on pinto3 and pinto4 every second (1000000 usec) (assumes they listen on port 60020); the intervals don't need to match
ldmsctl> add host=pinto3 type=active interval=1000000 xprt=sock port=60020
ldmsctl> add host=pinto4 type=active interval=1000000 xprt=sock port=60020
  • Now configure ldmsd to store metric sets being retrieved from pinto3 and pinto4
  • Example using store_csv plugin:
ldmsctl> load name=store_csv
ldmsctl> config name=store_csv path=<path to file where data will be stored>
ldmsctl> store name=store_csv comp_type=node set=meminfo container=<shortname of file where the data will be stored>
ldmsctl> quit
  • Go to data store and verify files have been created and are being written to
cd <path where data will be stored>/node/<container>
ls -ltr
  • The data can now be consumed.

NOTES:

  • Additional metric sets can be stored in csv stores as well. These require only additional store lines and should each use a separate container, e.g.:
store name=store_csv comp_type=node set=vmstat container=vmstat
  • You can optionally use "hosts" and "metrics" in the store command to down select what is stored.
  • The format has been changed to include a CompId for every metric that is being stored. There is now the ability to associate a different CompId with each metric, but this is beyond the scope of the quick start.
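The interactive store_csv setup above can also be scripted in the same one-command-per-echo style used for the remote collector configuration. A sketch of such a configuration fragment; the ldmsctl and socket paths are the example values from this page, and STOREPATH is a placeholder you must set for your site:

```shell
#!/bin/bash
# Configuration fragment (sketch): configure store_csv on a running ldmsd.
# LDMSCTL and SOCK are the example values from this page; STOREPATH is a
# placeholder for your site's storage location.
LDMSCTL=/opt/ovis/sbin/ldmsctl
SOCK=/var/run/ldmsd/metric_socket_Y
STOREPATH=/your/storage/path

echo "load name=store_csv" | $LDMSCTL -S $SOCK
echo "config name=store_csv path=$STOREPATH" | $LDMSCTL -S $SOCK
echo "store name=store_csv comp_type=node set=meminfo container=meminfo" | $LDMSCTL -S $SOCK
echo "store name=store_csv comp_type=node set=vmstat container=vmstat" | $LDMSCTL -S $SOCK
```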

Using ldms_ls

  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
export PATH=/opt/ovis/sbin/:$PATH
  • Use ldms_ls to query the ldmsd on host pinto1, listening on port 60020 and using the sock transport, for the metric sets being served by that ldmsd
ldms_ls -h pinto1 -x sock -p 60020
  • Should return:
pinto1/meminfo
pinto1/vmstat (if configured)
  • Use ldms_ls to query ldmsd on host pinto1 listening on port 60020 using the sock transport for the names and contents of metric sets being served by that ldmsd
ldms_ls -h pinto1 -x sock -p 60020 -l
  • Should return: Set names (pinto1/meminfo in this case) as well as all names and values associated with each set respectively
ldms_ls -h pinto1 -x sock -p 60020 -l
pinto1/meminfo: consistent, last update: Wed Jul 31 21:51:08 2013 [246540us]
U64 33084652         MemTotal
U64 32092964         MemFree
U64 0                Buffers
U64 49244            Cached
U64 0                SwapCached
U64 13536            Active
U64 39844            Inactive
U64 5664             Active(anon)
U64 13540            Inactive(anon)
U64 7872             Active(file)
U64 26304            Inactive(file)
U64 2996             Unevictable
U64 2988             Mlocked
U64 0                SwapTotal
U64 0                SwapFree
U64 0                Dirty
U64 0                Writeback
U64 7164             AnonPages
U64 6324             Mapped
U64 12544            Shmem
U64 84576            Slab
U64 3948             SReclaimable
U64 80628            SUnreclaim
U64 1608             KernelStack
U64 804              PageTables
U64 0                NFS_Unstable
U64 0                Bounce
U64 0                WritebackTmp
U64 16542324         CommitLimit
U64 73764            Committed_AS
U64 34359738367      VmallocTotal
U64 3467004          VmallocUsed
U64 34356268363      VmallocChunk
U64 0                HugePages_Total
U64 0                HugePages_Free
U64 0                HugePages_Rsvd
U64 0                HugePages_Surp
U64 2048             Hugepagesize
U64 565248           DirectMap4k
U64 5726208          DirectMap2M
U64 27262976         DirectMap1G
  • For a non-existent set
ldms_ls -h pinto1 -x sock -p 60020 -l pinto1/procnfs
ldms_ls: No such file or directory
ldms_ls: lookup failed for set 'pinto1/procnfs'
  • Note: You will need to ctrl-c ldms_ls if the lookup fails. This will be fixed.
  • Adding -v will also output metadata information:
ldms_ls -h pinto1 -x sock -p 60020 -v
  • If the collectors and frequency are fixed an init script can be utilized for starting and stopping the collectors, aggregators, and store ldmsds

To stop a ldmsd

pdsh -w pinto1 killall ldmsd

This should be followed by:

pdsh -w pinto1 ps auxw | grep ldmsd

to ensure the daemon was killed.
  • If using an init script:
pdsh -w pinto1 /sbin/service <name> stop