LANL-LDMS

LDMS

These are instructions for building LDMS v2 on a non-Cray platform from a repository that is no longer supported. For the current code, visit github.com/ovis-hpc.

Getting OVIS

Keyless access:
git clone git://hekili.ca.sandia.gov/git-ovis/repositories/ovispublic.git
cd ovispublic
git checkout mergebranch

Building LDMS

The source has a main directory called ovis with the following code subdirectories:

  • lib - some support codes
  • sos - the sos store
  • ldms - the main ldms code

Libevent2 is a prerequisite. If it is not installed on your system, it can be found at [libevent.org]. A script that illustrates building and installing libevent is in README.libevent2. Ubuntu 12 includes a compatible libevent package.

Example scripts for configure and make are in the packaging subdirectory. You may copy and edit the one closest to your target platform. These must be run from the top-level directory. They will build into a local directory called .build-all. The install directory can be set by setting prefix (current value is /opt/ovis).

By default, a selection of samplers is built; these typically read from /proc sources (e.g., /proc/stat, /proc/net/dev, /proc/meminfo).

Complete OVIS build examples for our platforms generally include a machine-specific variation on ./packaging/configure-all.sh.

Platform specific hints:

  • TOSS2/Redhat 6.x/Centos 6.x/Ubuntu 12: start with ./autogen.sh and then packaging/make-all-top.sh
  • Redhat 5.x: You must first build a bunch of build tools before autogen.sh may work; see packaging/glory-buildauto.sh for the general idea. After autotools are installed, install libevent2, run autogen.sh, then see make-all-glory.sh or make-all-wtb.sh for examples. All major systems will soon be migrated to 6.x-based OS and we do not expect to support 5.x for long.

Testing LDMS in user mode

After building ldms (as a non-root user), a basic test script is generated in .build-all/ldms/scripts/ldms_usertest.sh. It tests ldmsd and aggregation on localhost. Before the test will work, you must be able to "ssh localhost ls" without errors or typing your password/passphrase. "ssh-agent bash" followed by ssh-add may be needed. The output will end with:

ldms_ls on host 3:
localhost3/vmstat
localhost3/meminfo
localhost2/vmstat
localhost2/meminfo
localhost1/vmstat
localhost1/meminfo
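
The passwordless ssh prerequisite can be checked before running the test script. The following is a minimal sketch and is not part of the OVIS distribution; the function name check_ssh_localhost and the SSH override variable are hypothetical. It relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password or passphrase.

```shell
#!/bin/bash
# Sketch: verify that "ssh localhost ls" works without a prompt.
# SSH is a hypothetical override hook so the check can be exercised with a stub.
SSH=${SSH:-ssh}

check_ssh_localhost() {
    # BatchMode=yes makes OpenSSH fail rather than ask for a password/passphrase.
    if "$SSH" -o BatchMode=yes localhost ls > /dev/null 2>&1; then
        echo "ok: passwordless ssh to localhost works"
    else
        echo "fail: try 'ssh-agent bash' followed by ssh-add" >&2
        return 1
    fi
}
```

If check_ssh_localhost succeeds, .build-all/ldms/scripts/ldms_usertest.sh should be able to reach localhost without interaction.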


Running LDMS in the TLCC environment: Quick Start

This section describes how to configure and run LDMS daemons (ldmsd) to perform the following tasks:

  1. collect data
  2. aggregate data from multiple ldmsds, and
  3. store collected data to files.

There are four basic configurations that will be addressed:

  1. configuring ldmsd
  2. configuring a collector plugin on a running ldmsd
  3. configuring a ldmsd to aggregate information from other ldmsds, and
  4. configuring a flat file storage plugin on an ldmsd.

The order in which these configurations should be performed does not matter with respect to collectors and aggregators.

While a complete listing of flags and parameters can be seen by running ldmsd with the --help directive, this document describes the flags and parameters required for running a basic setup.

There are no run scripts provided in the current release; the commands here can be used in the creation of such.

Start a ldmsd on pinto1

  • Currently ldmsd must be run as root if using the default path of /var/run/ for the unix domain socket. This can be changed using the environment variable LDMSD_SOCKPATH as described below.
  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
export PATH=/opt/ovis/sbin/:$PATH
  • A script can be made to start ldmsd and collectors on a host where that script contains the information to execute the command below:
  • Running: <path to executable>/ldmsd -x <transport>:<listen port> -P <# worker threads to start> -S <unix domain socket path/name> -l <log file path/name>
    • transport is one of: sock, rdma, ugni (ugni is Cray specific for using RDMA over the Gemini network)
    • # worker threads defaults to 1 if not specified by -P and should not exceed the core count
    • The unix domain socket is used by ldmsctl to communicate configuration information
      • Note that the default path for this is /var/run/. To change this the environment variable LDMSD_SOCKPATH must be set to the desired path (e.g. export LDMSD_SOCKPATH=/tmp/)
    • The default is to run as a background process but the -F flag can be specified for foreground
    • Examples:
/opt/ovis/sbin/ldmsd -x rdma:60000 -S /var/run/ldmsd/metric_socket_1 -l /opt/ovis/logs/1

Same but sending stdout and stderr to /dev/null

/opt/ovis/sbin/ldmsd -x rdma:60000 -S /var/run/ldmsd/metric_socket_1 -l /opt/ovis/logs/1  > /dev/null 2>&1
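
Since no run scripts are provided, the command above can be wrapped in a small start function. This is a sketch under stated assumptions, not a supported script: the names start_ldmsd, OVIS_PREFIX, and DRY_RUN are hypothetical, and only the ldmsd flags described above (-x, -P, -S, -l) are used. With DRY_RUN=1 (the default in this sketch) the command line is printed rather than executed.

```shell
#!/bin/bash
# Sketch of a start wrapper for ldmsd; all variable names are hypothetical.
OVIS_PREFIX=${OVIS_PREFIX:-/opt/ovis}
DRY_RUN=${DRY_RUN:-1}

# start_ldmsd <transport> <port> <socket path> <log file> [worker threads]
start_ldmsd() {
    local xprt=$1 port=$2 sock=$3 log=$4 threads=${5:-1}
    case $xprt in
        sock|rdma|ugni) ;;
        *) echo "unknown transport: $xprt" >&2; return 1 ;;
    esac
    # Reuse the positional parameters to hold the full command line.
    set -- "$OVIS_PREFIX/sbin/ldmsd" -x "$xprt:$port" -P "$threads" -S "$sock" -l "$log"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        # Environment needed by ldmsd, as listed above.
        export LD_LIBRARY_PATH=$OVIS_PREFIX/lib/:$LD_LIBRARY_PATH
        export LDMS_XPRT_LIBPATH=$OVIS_PREFIX/lib/
        export LDMSD_PLUGIN_LIBPATH=$OVIS_PREFIX/lib/
        "$@" > /dev/null 2>&1
    fi
}
```

For example, start_ldmsd rdma 60000 /var/run/ldmsd/metric_socket_1 /opt/ovis/logs/1 prints the first example command with an explicit -P 1; setting DRY_RUN=0 would run it instead.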

Configure a collector

  • Option 1 (on pinto1)
    • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
    • Example:
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_1
ldmsctl>
  • Now configure "meminfo" collector plugin to collect every second
    • Note: The unit of time specified by interval= is usec hence interval=1000000 defines a one second interval.
ldmsctl> load name=meminfo
ldmsctl> config name=meminfo component_id=1 set=pinto1/meminfo
ldmsctl> start name=meminfo interval=1000000
ldmsctl> quit

NOTE0: The ldmsctl command "help" prints information about the ldms commands and options:
ldmsctl> help
NOTE1: Use stop name=meminfo followed by start name=meminfo interval=xxx to change the collection interval.
NOTE2: Different plugins may have additional configuration parameters. Use help within ldmsctl to see these.
NOTE3: The ldmsctl command "info" writes all configuration information to that ldmsd's log file, e.g.:
ldmsctl> info
  • Configure "vmstat" collector plugin to collect every second (1000000 usec)
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_1
ldmsctl> load name=vmstat
ldmsctl> config name=vmstat component_id=1 set=pinto1/vmstat
ldmsctl> start name=vmstat interval=1000000
ldmsctl> quit
  • Option 2 (remote to pinto1 via bash script)
    • Write bash script (e.g. meminfo_collect.sh)
#!/bin/bash
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
LDMSCTL=/opt/ovis/sbin/ldmsctl
  • In the bash script, configure the "meminfo" collector plugin to collect every second (1000000 usec)
echo load name=meminfo | $LDMSCTL -S /var/run/ldmsd/metric_socket_1
echo config name=meminfo component_id=1 set=pinto1/meminfo | $LDMSCTL -S /var/run/ldmsd/metric_socket_1
echo start name=meminfo interval=1000000 | $LDMSCTL -S /var/run/ldmsd/metric_socket_1
  • Make meminfo_collect.sh executable
chmod +x meminfo_collect.sh
  • Execute meminfo_collect.sh remotely
pdsh -w pinto1 <path>/meminfo_collect.sh
  • At this point the ldmsd collector should be checked using the utility ldms_ls
    • See Using ldms_ls below
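
When several samplers are configured the same way, the load/config/start triplet can be generated rather than typed. The sketch below is illustrative only; sec_to_usec and collector_cmds are hypothetical helper names, and the set naming (host/plugin) simply follows the examples above. It also makes the seconds-to-microseconds conversion for interval= explicit.

```shell
#!/bin/bash
# Sketch: emit the ldmsctl commands that load, configure, and start samplers.

# Convert whole seconds to the microseconds expected by interval=.
sec_to_usec() {
    echo $(( $1 * 1000000 ))
}

# collector_cmds <host> <component_id> <interval seconds> <plugin>...
collector_cmds() {
    local host=$1 comp_id=$2 usec plugin
    usec=$(sec_to_usec "$3")
    shift 3
    for plugin in "$@"; do
        echo "load name=$plugin"
        echo "config name=$plugin component_id=$comp_id set=$host/$plugin"
        echo "start name=$plugin interval=$usec"
    done
}
```

Calling collector_cmds pinto1 1 1 meminfo vmstat emits the six commands used in Option 1. Each emitted line can then be piped to ldmsctl one at a time, as the echo lines in meminfo_collect.sh do.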

Configure an aggregator

  • Start a ldmsd on a node/service node using "sock" as the listening transport (a bugfix currently in testing removes the need to use "sock" here)
    • See Start a ldmsd above
  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
    • Example:
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_X
ldmsctl>
  • Now configure ldmsd to collect metric sets from pinto1 and pinto2 every second (1000000 usec) (assumes they listen on port 60020)
ldmsctl> add host=pinto1 type=active interval=1000000 xprt=sock port=60020 sets=pinto1/meminfo
ldmsctl> add host=pinto2 type=active interval=1000000 xprt=sock port=60020 sets=pinto2/meminfo
ldmsctl> quit
Note0: Sets must be specified on the "add host" line; you can add hosts with sets to an aggregator even if those sets are not yet present on the host.
Note1: There is currently no "remove" command, so dropping a host from the list or changing its parameters requires stopping and restarting the ldmsd and then adding the hosts again with the appropriate parameters.
Note2: There is no requirement that aggregator intervals match collection intervals
Note3: Because the collection and aggregation processes operate asynchronously there is the potential for duplicate data collection as well as missed samples. The first is handled by the storage plugins by comparing generation numbers and not storing duplicates. The second implies either a loss in fidelity (if collecting counter data) or a loss of data points here and there (if collecting differences of counter values or non counter values).
  • Remote script based configuration would be done in the same manner as that for collection above
  • At this point the ldmsd aggregator should be checked using the utility ldms_ls
    • In this case you should see metric sets for both pinto1 and pinto2 displayed when you query the aggregator ldmsd
    • See Using ldms_ls below
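
For more than a couple of hosts, the "add host" lines can be generated in a loop. This is a minimal sketch; agg_add_cmds is a hypothetical helper, and it assumes every host serves a set named host/set, as in the pinto1/meminfo examples above.

```shell
#!/bin/bash
# Sketch: emit one ldmsctl "add host=..." line per collector host.

# agg_add_cmds <xprt> <port> <interval usec> <set name> <host>...
agg_add_cmds() {
    local xprt=$1 port=$2 interval=$3 set_name=$4 host
    shift 4
    for host in "$@"; do
        # Sets must be named on the add line; here each host serves <host>/<set name>.
        echo "add host=$host type=active interval=$interval xprt=$xprt port=$port sets=$host/$set_name"
    done
}
```

Calling agg_add_cmds sock 60020 1000000 meminfo pinto1 pinto2 reproduces the two add lines shown above.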

Configure a store_csv storage plugin

  • Start a ldmsd on a node/service node that has access to a storage device using "sock" as the listening transport (a bugfix currently in testing removes the need to use "sock" here)
    • See Start a ldmsd above
  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
    • Example:
/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_Y
ldmsctl>
  • Now configure ldmsd to aggregate metric sets from the aggregator ldmsds running on pinto3 and pinto4 every second (1000000 usec); this assumes they listen on port 60020. The intervals do not need to match.
ldmsctl> add host=pinto3 type=active interval=1000000 xprt=sock port=60020
ldmsctl> add host=pinto4 type=active interval=1000000 xprt=sock port=60020
  • Now configure ldmsd to store metric sets being retrieved from pinto3 and pinto4
  • Example using store_csv plugin:
ldmsctl> load name=store_csv
ldmsctl> config name=store_csv path=<path to file where data will be stored> 
ldmsctl> store name=store_csv comp_type=node set=meminfo container=<shortname of file where the data will be stored>
ldmsctl> quit
  • Go to data store and verify files have been created and are being written to
cd <path where data will be stored>/node/<container>
ls -ltr
  • Can now consume this data.


NOTES:

  • Additional metric sets can be stored in csv stores as well. These require only additional store lines, each with its own container, for example:
store name=store_csv comp_type=node set=vmstat container=vmstat
  • You can optionally use "hosts" and "metrics" in the store command to down select what is stored.
  • The format has been changed to include a CompId for every metric that is being stored. There is now the ability to associate a different CompId with each metric, but this is beyond the scope of the quick start.
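
The store configuration for several metric sets can be generated in the same way. The sketch below is an illustration, not a supported tool: store_cmds is a hypothetical helper, and it assumes each set is stored in a container with the same name as the set, as in the vmstat line above.

```shell
#!/bin/bash
# Sketch: emit the ldmsctl commands that set up store_csv for several sets.

# store_cmds <store path> <set>...
store_cmds() {
    local path=$1 set_name
    shift
    echo "load name=store_csv"
    echo "config name=store_csv path=$path"
    for set_name in "$@"; do
        # One store line per metric set, each with its own container.
        echo "store name=store_csv comp_type=node set=$set_name container=$set_name"
    done
}
```

With a store path of your choosing, store_cmds <path> meminfo vmstat emits the load and config lines followed by one store line per set.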

Using ldms_ls

  • Set environment variables
export LD_LIBRARY_PATH=/opt/ovis/lib/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/opt/ovis/lib/
export LDMSD_PLUGIN_LIBPATH=/opt/ovis/lib/
export PATH=/opt/ovis/sbin/:$PATH
  • Use ldms_ls to query the ldmsd on host pinto1 listening on port 60020 using the sock transport for the metric sets being served by that ldmsd
ldms_ls -h pinto1 -x sock -p 60020
  • Should return:
pinto1/meminfo
pinto1/vmstat (if configured)
  • Use ldms_ls to query ldmsd on host pinto1 listening on port 60020 using the sock transport for the names and contents of metric sets being served by that ldmsd
ldms_ls -h pinto1 -x sock -p 60020 -l
  • Should return: Set names (pinto1/meminfo in this case) as well as all names and values associated with each set respectively
ldms_ls -h pinto1 -x sock -p 60020 -l
pinto1/meminfo: consistent, last update: Wed Jul 31 21:51:08 2013 [246540us]
U64 33084652         MemTotal
U64 32092964         MemFree
U64 0                Buffers
U64 49244            Cached
U64 0                SwapCached
U64 13536            Active
U64 39844            Inactive
U64 5664             Active(anon)
U64 13540            Inactive(anon)
U64 7872             Active(file)
U64 26304            Inactive(file)
U64 2996             Unevictable
U64 2988             Mlocked
U64 0                SwapTotal
U64 0                SwapFree
U64 0                Dirty
U64 0                Writeback
U64 7164             AnonPages
U64 6324             Mapped
U64 12544            Shmem
U64 84576            Slab
U64 3948             SReclaimable
U64 80628            SUnreclaim
U64 1608             KernelStack
U64 804              PageTables
U64 0                NFS_Unstable
U64 0                Bounce
U64 0                WritebackTmp
U64 16542324         CommitLimit
U64 73764            Committed_AS
U64 34359738367      VmallocTotal
U64 3467004          VmallocUsed
U64 34356268363      VmallocChunk
U64 0                HugePages_Total
U64 0                HugePages_Free
U64 0                HugePages_Rsvd
U64 0                HugePages_Surp
U64 2048             Hugepagesize
U64 565248           DirectMap4k
U64 5726208          DirectMap2M
U64 27262976         DirectMap1G
  • For a non-existent set
ldms_ls -h pinto1 -x sock -p 60020 -l pinto1/procnfs
ldms_ls: No such file or directory
ldms_ls: lookup failed for set 'pinto1/procnfs'
  • Note: If the lookup fails, ldms_ls must be interrupted with ctrl-c. This will be fixed.
  • To output metadata information, use -v:
ldms_ls -h pinto1 -x sock -p 60020 -v
  • If the collectors and frequency are fixed an init script can be utilized for starting and stopping the collectors, aggregators, and store ldmsds
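
The ldms_ls listing can also be checked mechanically, for example from a start-up script. This is a minimal sketch; missing_sets is a hypothetical helper that reads the plain ldms_ls listing (one set name per line, as shown above) on standard input and reports any expected set that is absent.

```shell
#!/bin/bash
# Sketch: report which expected metric sets are absent from an ldms_ls listing.

# missing_sets <expected set>...   (listing is read from stdin)
missing_sets() {
    local listing expected rc=0
    listing=$(cat)
    for expected in "$@"; do
        # -x matches the whole line, -F treats the set name as a fixed string.
        if ! printf '%s\n' "$listing" | grep -qxF -- "$expected"; then
            echo "missing: $expected"
            rc=1
        fi
    done
    return $rc
}
```

A typical use would be to pipe ldms_ls -h pinto1 -x sock -p 60020 into missing_sets pinto1/meminfo pinto1/vmstat; the function returns non-zero if any set is missing.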

To stop a ldmsd

pdsh -w pinto1 killall ldmsd

Follow this with a process listing to confirm that the daemon was killed:
pdsh -w pinto1 ps auxw | grep ldmsd
  • If using an init script:
pdsh -w pinto1 /sbin/service <name> stop
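
The kill and the follow-up check can be combined into one function. This is a sketch under stated assumptions: stop_ldmsd and the PDSH override variable are hypothetical, and the check simply looks for any line mentioning ldmsd in the remote ps output, as the manual step above does.

```shell
#!/bin/bash
# Sketch: stop ldmsd on a host and confirm that it is gone.
# PDSH is a hypothetical override hook; by default the real pdsh is used.
PDSH=${PDSH:-pdsh}

# stop_ldmsd <host>
stop_ldmsd() {
    local host=$1
    "$PDSH" -w "$host" killall ldmsd
    # Any remaining line mentioning ldmsd means a daemon survived.
    if "$PDSH" -w "$host" ps auxw | grep -q ldmsd; then
        echo "ldmsd still running on $host" >&2
        return 1
    fi
    echo "ldmsd stopped on $host"
}
```

Because killall only sends a signal, a daemon that is slow to exit may still appear in the first check; repeating the check after a short pause may be needed.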