LDMS FAQ

Checking Out LDMS Distribution

  • We will be checking out to a directory called "Source"
  • Make "Source" directory in your home directory
mkdir Source
  • Change your working directory to Source
cd Source
  • Download LDMS code into a directory called "ovispublic"
git clone https://github.com/ovis-hpc/ovis.git ovispublic
  • The code is in the master (only) branch

Building LDMS and Support Libs

  • (If applicable) Select the GNU compiler
module load PrgEnv-gnu
  • Select autotools
    • On RHEL/CentOS 6, the autotools and related macro packages installed in /usr work well (assuming the rpms for all required devel prerequisites are installed). Locally built versions of autotools may be missing macros such as PKG_CHECK_MODULES, in which case configure fails on unexpanded autoconf macros. If this happens, adjust your PATH to use the /usr autotools.

Top Level Build

This builds the binaries, libraries, include files, and man pages and places them under LDMS_install in your top level directory.

  • Run ./autogen.sh
  • If running RHEL/CentOS 6.x (and older Fedora):
    • Download libevent-2.0.21-stable.tar.gz and place it in the top level directory of your git checkout. (Older libevent versions have bugs that may affect LDMS.) packaging/make-all-top.sh will then unpack and build it in a .build-event/ subdirectory.
    • Edit ./packaging/make-all-top.sh and set LOCALEVENT=1 at the top, since only libevent1 is available by default on these platforms.
    • Change expected_event2_prefix=/usr to expected_event2_prefix=$prefix
  • Run ./packaging/make-all-top.sh
    • This checks for gcc 4.6 or later. Some earlier versions can work; change export CC=gcc46 to export CC=gcc to try.
    • This builds a default configuration with samplers for generic *nix systems. You may have to disable features that are not supported on your system, such as InfiniBand (--disable-sysclassib) and RDMA (--disable-rdma) support, or that you do not want (e.g., --disable-authentication).
    • This should build all binaries and libraries and place them in LDMS_install which is relocatable.
    • Man pages are in LDMS_install/share/man.
  • Skip to Running LDMS: Quick Start below
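
The RHEL/CentOS 6 edits to packaging/make-all-top.sh described above can be scripted. The following is a minimal sketch, not part of the LDMS distribution: patch_make_all_top is a hypothetical helper name, and the sed patterns assume that the script contains lines reading exactly LOCALEVENT=0 and expected_event2_prefix=/usr. Inspect the file afterwards to confirm that both edits were applied.

```shell
# Hypothetical helper: apply the two RHEL/CentOS 6 edits to make-all-top.sh.
# Assumes the file has lines that read exactly "LOCALEVENT=0" and
# "expected_event2_prefix=/usr"; any other spelling is left untouched.
patch_make_all_top() {
    script="$1"
    tmp="$script.tmp.$$"
    # In the replacement, $prefix is literal text; the single quotes stop
    # the shell from expanding it.
    sed -e 's|^LOCALEVENT=0$|LOCALEVENT=1|' \
        -e 's|^expected_event2_prefix=/usr$|expected_event2_prefix=$prefix|' \
        "$script" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back into the original file so it keeps its mode (executable bit).
    cat "$tmp" > "$script" && rm -f "$tmp"
}
```

Usage, from the top level directory of the git checkout:
patch_make_all_top ./packaging/make-all-top.sh
grep -n -e LOCALEVENT= -e expected_event2_prefix= ./packaging/make-all-top.sh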

Notes on build details:

  • Debian/Ubuntu 12.04 and later supply a compatible libevent2 through their package manager. On these and similar Debian-based systems, leave LOCALEVENT set to 0 at the top of packaging/make-all-top.sh and make sure libevent2 is installed through the package manager.
  • If you change CC and CXX definitions in ./packaging/make-all-top.sh and LOCALEVENT is set to 1, then 'rm -rf .build-event' before rerunning make-all-top.sh to ensure a consistent and complete rebuild.

Building Individual Components Separately

Build libevent-2.0

This will build libevent-2.0.21-stable and install into /tmp/opt/ovis

  • Make the directory you will build into
mkdir -p /tmp/opt/ovis
  • Download and untar libevent-2.0.21-stable.tar.gz
  • Change your working directory to the libevent source directory
cd libevent-2.0.21-stable 
  • Run autogen.sh
./autogen.sh
  • Create a configure script (the following is an example configure.sh for build in /tmp/opt/ovis)
../configure --prefix=/tmp/opt/ovis/libevent-2.0_build --libdir=/tmp/opt/ovis/lib64
  • Make your configure.sh executable
chmod +x configure.sh
  • Create a build directory
mkdir build
  • Move to your new build directory
cd build
  • Run your configure script
../configure.sh
  • Build and install binaries and libraries (to /tmp/opt/ovis/libevent-2.0_build and /tmp/opt/ovis/lib64)
make && make install
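
The steps above ask you to create configure.sh by hand. One way to do this is with a here-document; the following is a sketch under stated assumptions, where write_configure_sh and PREFIX_ROOT are hypothetical names and the default install root is the /tmp/opt/ovis path used in this section.

```shell
# Hypothetical helper: write an executable configure.sh into a source directory.
# PREFIX_ROOT is an illustrative variable; its default matches the text above.
PREFIX_ROOT="${PREFIX_ROOT:-/tmp/opt/ovis}"

write_configure_sh() {
    # $1: source directory in which to create configure.sh
    # ${PREFIX_ROOT} is expanded now, when the file is written.
    cat > "$1/configure.sh" <<EOF
#!/bin/sh
# Run this from the build/ subdirectory so that ../configure is the
# configure script in the source directory.
../configure --prefix=${PREFIX_ROOT}/libevent-2.0_build --libdir=${PREFIX_ROOT}/lib64
EOF
    chmod +x "$1/configure.sh"
}
```

Usage, from inside libevent-2.0.21-stable after running autogen.sh:
write_configure_sh .
The same pattern applies to the configure.sh files in the sections below, with the configure arguments changed accordingly.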

Build LDMS Support Libraries

  • Change your current working directory to the "lib" directory
cd ~/Source/ovispublic/lib
  • Run autogen.sh
./autogen.sh
  • Write a configure script (named configure.sh) to build libraries in /tmp/opt/ovis like the following example:
../configure    --prefix=/tmp/opt/ovis/ovis.lib --libdir=/tmp/opt/ovis/lib64 \
               --disable-zap \
               --disable-rpath \
               LDFLAGS="-L/tmp/opt/ovis/lib64" \
               CPPFLAGS="-I/tmp/opt/ovis/libevent-2.0_build/include"
  • Make your configure.sh script executable
chmod +x configure.sh
  • Make a build directory
mkdir build
  • Change your current working directory to your new build directory
cd build
  • Run your configure script from your build directory
../configure.sh
  • Build and install libraries into /tmp/opt/ovis/lib64
make && make install
  • Note: Ignore doxygen errors on install

Build LDMS

  • Change your working directory to ~/Source/ovispublic/ldms
cd ~/Source/ovispublic/ldms
  • Run autogen.sh
./autogen.sh
  • Write a configure script to build LDMS and install binaries and libraries in /tmp/opt/ovis
    • Example configure.sh for build in /tmp/opt/ovis
../configure --prefix=/tmp/opt/ovis/ldms.usr --libdir=/tmp/opt/ovis/lib64 \
   --with-ovis-lib=/tmp/opt/ovis/ovis.lib \
   --with-libevent=/tmp/opt/ovis/libevent-2.0_build \
   --disable-mmap --disable-perfevent \
   --enable-libevent --disable-mysql --disable-mysqlbulk \
   --disable-rpath \
   CFLAGS="-g -O0" \
   LDFLAGS="-L/tmp/opt/ovis/lib64" \
   CPPFLAGS="-I/tmp/opt/ovis/libevent-2.0_build/include -I/tmp/opt/ovis/ovis.lib/include"
  • Make configure.sh executable
chmod +x configure.sh
  • Make a build directory
mkdir build
  • Change your current working directory to your new build directory
cd build
  • Set LD_LIBRARY_PATH to include your new libraries installed in /tmp/opt/ovis/lib64
export LD_LIBRARY_PATH=/tmp/opt/ovis/lib64:$LD_LIBRARY_PATH
  • Run your new configure script
../configure.sh
  • Ignore "chmod: cannot access `scripts/ldms_usertest.sh': No such file or directory"
    • This is QC code that we are not building here
  • Build and install binaries and libraries to /tmp/opt/ovis
make && make install
  • Note: Ignore doxygen errors on install

Running LDMS: Quick Start

This section describes how to configure and run LDMS daemons (ldmsd) to perform the following tasks:

  1. collect data
  2. aggregate data from multiple ldmsds, and
  3. store collected data to files.

There are four basic configurations that will be addressed:

  1. configuring ldmsd
  2. configuring a collector plugin on a running ldmsd
  3. configuring a ldmsd to aggregate information from other ldmsds, and
  4. configuring a store_csv storage plugin on an ldmsd.

The order in which these configurations should be performed does not matter with respect to collectors and aggregators.

While a complete listing of flags and parameters can be seen by running ldmsd with the --help directive or in the man pages, this document describes the flags and parameters required for running a basic setup.

There are no run scripts provided in the current release; the commands here can be used to create them.

NOTE: Paths in these subsections are those used in Building Individual Components Separately (i.e., /tmp/opt/ovis based paths). If you have done the top-level build, adjust accordingly.

Start a ldmsd on your host

  • Currently, ldmsd must be run as root if using the default path of /var/run/ for the unix domain socket. This can be changed using the environment variable LDMSD_SOCKPATH as described below.
  • Set environment variables
export LD_LIBRARY_PATH=/tmp/opt/ovis/lib64/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/tmp/opt/ovis/lib64/
export LDMSD_PLUGIN_LIBPATH=/tmp/opt/ovis/lib64/
export PATH=/tmp/opt/ovis/sbin/:$PATH
  • A script can be written to start ldmsd and collectors on a host; it needs to contain the information required to execute the command below:
    • Sample scripts are included in your ~/ldms_scripts/ directory.
  • Running: <path to executable>/ldmsd -x <transport>:<listen port> -P <# worker threads to start> -S <unix domain socket path/name> -l <log file path/name>
    • transport is one of: sock, rdma, ugni (ugni is Cray specific for using RDMA over the Gemini network)
    • # worker threads defaults to 1 if not specified by -P and should not exceed the core count
    • The unix domain socket is used by ldmsctl to communicate configuration information to an ldmsd
      • Note0: The default path for this is /var/run/ldmsd/. To change this the environment variable LDMSD_SOCKPATH must be set to the desired path (e.g. export LDMSD_SOCKPATH=/tmp/run/ldmsd)
      • Note1: If authentication checking is enabled you need to specify the location of the shared secret in your environment
export LDMS_AUTH_FILE=/home/foo/mysecret
        • Format of file needs to be: secretword=$%&123foo
      • Note2: Instead of specifying the log file path using the "-l" flag you can use the "-q" flag and not write to a log file. This is typically used on compute nodes for production.
    • The default is to run as a background process but the -F flag can be specified for foreground
    • Examples:
/tmp/opt/ovis/sbin/ldmsd -x sock:60000 -S /var/run/ldmsd/metric_socket -l /tmp/opt/ovis/logs/1

Same but sending stdout and stderr to /dev/null

/tmp/opt/ovis/sbin/ldmsd -x sock:60000 -S /var/run/ldmsd/metric_socket -l /tmp/opt/ovis/logs/1  > /dev/null 2>&1
  • Start 2 instances of ldmsd on your vm
    • Note: Make sure to use different socket names and listen on different ports.
/tmp/opt/ovis/sbin/ldmsd -x sock:60000 -S /var/run/ldmsd/metric_socket_vm1_1 -l /tmp/opt/ovis/logs/vm_1  > /dev/null 2>&1
/tmp/opt/ovis/sbin/ldmsd -x sock:60001 -S /var/run/ldmsd/metric_socket_vm1_2 -l /tmp/opt/ovis/logs/vm_2  > /dev/null 2>&1
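
Since no run scripts are provided, the start commands above can be wrapped in a small shell function. This is a minimal sketch, not a supported script: start_ldmsd, LDMSD and DRY_RUN are hypothetical names, and only the -x, -S and -l flags described above are used. It creates the log directory first, because ldmsd may fail to start if that directory does not exist.

```shell
# Hypothetical start wrapper; LDMSD, DRY_RUN and start_ldmsd are illustrative names.
LDMSD="${LDMSD:-/tmp/opt/ovis/sbin/ldmsd}"

start_ldmsd() {
    # $1: <transport>:<listen port>  $2: unix domain socket path  $3: log file path
    if [ "$#" -ne 3 ]; then
        echo "usage: start_ldmsd <xprt:port> <socket> <logfile>" >&2
        return 2
    fi
    if [ -n "$DRY_RUN" ]; then
        # Print the command instead of running it.
        echo "$LDMSD -x $1 -S $2 -l $3"
        return 0
    fi
    # The log directory must exist before ldmsd starts.
    mkdir -p "$(dirname "$3")" || return 1
    "$LDMSD" -x "$1" -S "$2" -l "$3" > /dev/null 2>&1
}
```

Usage for the two instances above (set DRY_RUN=1 first to only print the commands):
start_ldmsd sock:60000 /var/run/ldmsd/metric_socket_vm1_1 /tmp/opt/ovis/logs/vm_1
start_ldmsd sock:60001 /var/run/ldmsd/metric_socket_vm1_2 /tmp/opt/ovis/logs/vm_2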

Configure collectors on host "vm1" directly via ldmsctl

  • Set environment variables
export LD_LIBRARY_PATH=/tmp/opt/ovis/lib64/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/tmp/opt/ovis/lib64/
export LDMSD_PLUGIN_LIBPATH=/tmp/opt/ovis/lib64/
export PATH=/tmp/opt/ovis/sbin:$PATH
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
#Example:
/tmp/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_vm1_1
ldmsctl>
  • Now configure "meminfo" collector plugin to collect every second. Note: interval=<# usec>, e.g., interval=1000000 defines a one second interval.
ldmsctl> load name=meminfo
ldmsctl> config name=meminfo component_id=1 set=vm1_1/meminfo
ldmsctl> start name=meminfo interval=1000000
ldmsctl> quit

NOTE1: At the ldmsctl> prompt typing "help" will print out info about the ldmsctl commands and options
NOTE2: You can use stop name=meminfo followed by start name=meminfo interval=xxx to change collection intervals
NOTE3: For synchronous operation include "offset=<#usec>" in start line (e.g. start name=meminfo interval=xxx offset=yyy)
       This will cause the sampler to target collection times that are multiples of the interval, shifted by yyy usec (e.g. every 5 seconds with an offset of
       0 usec would ideally result in collections at 00:00:00, 00:00:05, 00:00:10, etc. whereas with an offset of 100,000 usec
       it would be 00:00:00.1, 00:00:05.1, 00:00:10.1, etc)
NOTE4: Different plugins may have additional configuration parameters. Use help within ldmsctl to see these
NOTE5: At the ldmsctl> prompt typing "info" will output all config information to that ldmsd's log file.
  • Configure "vmstat" collector plugin to collect every second (1000000 usec)
/tmp/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_vm1_1
ldmsctl> load name=vmstat
ldmsctl> config name=vmstat component_id=1 set=vm1_1/vmstat
ldmsctl> start name=vmstat interval=1000000
ldmsctl> quit
  • At this point the ldmsd collector should be checked using the utility ldms_ls
    • See Using ldms_ls below
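
The interval and offset arithmetic described in NOTE3 can be worked through in the shell. This sketch only illustrates the timing described there and is not LDMS's scheduling code; next_sample_usec is a hypothetical helper, all values are in usec, and it assumes 0 <= offset < interval.

```shell
# Illustrative arithmetic only; this is not LDMS's scheduler code.
# Prints the first target collection time strictly after "now" (all in usec).
# Assumes 0 <= offset < interval.
next_sample_usec() {
    now="$1"; interval="$2"; offset="$3"
    echo $(( (now - offset + interval) / interval * interval + offset ))
}

# 5 second interval, current time 12.3 s:
next_sample_usec 12300000 5000000 0        # prints 15000000 (00:00:15)
next_sample_usec 12300000 5000000 100000   # prints 15100000 (00:00:15.1)
```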

Configure collectors on host "vm1" via bash script

  • The following is an example bash script named "collect.sh"
#!/bin/bash
export LD_LIBRARY_PATH=/tmp/opt/ovis/lib64/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/tmp/opt/ovis/lib64/
export LDMSD_PLUGIN_LIBPATH=/tmp/opt/ovis/lib64/
LDMSCTL=/tmp/opt/ovis/sbin/ldmsctl
# Configure "meminfo" collector plugin to collect every second (1000000 usec) on vm1_2
echo load name=meminfo | $LDMSCTL -S /var/run/ldmsd/metric_socket_vm1_2
echo config name=meminfo component_id=2 set=vm1_2/meminfo | $LDMSCTL -S /var/run/ldmsd/metric_socket_vm1_2
echo start name=meminfo interval=1000000 | $LDMSCTL -S /var/run/ldmsd/metric_socket_vm1_2
# Configure "vmstat" collector plugin to collect every second (1000000 usec) on vm1_2
echo load name=vmstat | $LDMSCTL -S /var/run/ldmsd/metric_socket_vm1_2
echo config name=vmstat component_id=2 set=vm1_2/vmstat | $LDMSCTL -S /var/run/ldmsd/metric_socket_vm1_2
echo start name=vmstat interval=1000000 | $LDMSCTL -S /var/run/ldmsd/metric_socket_vm1_2
  • Make collect.sh executable
chmod +x collect.sh
  • Execute collect.sh (Note: When executing this across many nodes you would use pdsh to execute the script on all nodes in parallel)
./collect.sh
  • At this point the ldmsd collector should be checked using the utility ldms_ls
    • See Using ldms_ls below
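
The repeated load/config/start lines in collect.sh can be factored into a function. This is a sketch under stated assumptions: sampler_commands, configure_sampler and SOCK are hypothetical names, and it emits only the ldmsctl commands already shown above, one ldmsctl invocation per command as in collect.sh.

```shell
# Hypothetical refactoring of collect.sh; function and variable names are illustrative.
LDMSCTL="${LDMSCTL:-/tmp/opt/ovis/sbin/ldmsctl}"
SOCK="${SOCK:-/var/run/ldmsd/metric_socket_vm1_2}"

sampler_commands() {
    # $1: plugin name  $2: component id  $3: set name prefix  $4: interval in usec
    echo "load name=$1"
    echo "config name=$1 component_id=$2 set=$3/$1"
    echo "start name=$1 interval=$4"
}

configure_sampler() {
    # Send each command to ldmsctl separately, as collect.sh does.
    sampler_commands "$@" | while IFS= read -r line; do
        echo "$line" | "$LDMSCTL" -S "$SOCK"
    done
}
```

Usage, equivalent to the body of collect.sh once the environment variables are exported:
for plugin in meminfo vmstat; do configure_sampler "$plugin" 2 vm1_2 1000000; done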

Configure an aggregator

  • See Start a ldmsd above for more information
  • Set environment variables
export LD_LIBRARY_PATH=/tmp/opt/ovis/lib64/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/tmp/opt/ovis/lib64/
export LDMSD_PLUGIN_LIBPATH=/tmp/opt/ovis/lib64/
export PATH=/tmp/opt/ovis/sbin:$PATH
  • Start a ldmsd on your vm using "sock" as the listening transport
/tmp/opt/ovis/sbin/ldmsd -x sock:60002 -S /var/run/ldmsd/metric_socket_agg -l /tmp/opt/ovis/logs/vm1_agg  > /dev/null 2>&1
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
Example:
/tmp/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_agg
ldmsctl>
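  • Now configure ldmsd to collect metric sets from vm1_1 and vm1_2 every second (1000000 usec). This example assumes the collectors were configured to listen on port 60020; substitute the ports your collector ldmsds actually listen on (60000 and 60001 for the two instances started in Start a ldmsd on your host above).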
ldmsctl> add host=vm1_1 type=active interval=1000000 xprt=sock port=60020 sets=vm1_1/meminfo
ldmsctl> add host=vm1_1 type=active interval=1000000 xprt=sock port=60020 sets=vm1_1/vmstat
ldmsctl> add host=vm1_2 type=active interval=1000000 xprt=sock port=60020 sets=vm1_2/meminfo
ldmsctl> add host=vm1_2 type=active interval=1000000 xprt=sock port=60020 sets=vm1_2/vmstat 
ldmsctl> quit
Note1: Sets must be specified on the "add host" line; you can add hosts with sets to an aggregator even if those sets are not yet present on the host.
Note2: There is currently no "remove" so if a host should be dropped from the list or have its parameters changed it requires stopping, restarting, 
and adding with appropriate parameters
Note3: There is no requirement that aggregator intervals match collection intervals
Note4: Because the collection and aggregation processes operate asynchronously there is the potential for duplicate data collection as well as missed 
samples. The first is handled by the storage plugins by comparing generation numbers and not storing duplicates. The second implies either a loss in 
fidelity (if collecting counter data) or a loss of data points here and there (if collecting differences of counter values or non counter values).
This can be handled using the synchronous option on both collector and aggregator but is not covered here.
  • A script based configuration would be done in the same manner as that for collection above
  • At this point the ldmsd aggregator should be checked using the utility ldms_ls
    • In this case you should see metric sets for both vm1_1 and vm1_2 displayed when you query the aggregator ldmsd using ldms_ls
    • See Using ldms_ls below
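
For a script based aggregator configuration, the add lines can be generated rather than typed. The following sketch prints one add line per set in the form used above; add_host_commands is a hypothetical helper and it assumes the sock transport and type=active.

```shell
# Hypothetical generator for the aggregator "add host" lines shown above.
add_host_commands() {
    # $1: host  $2: port  $3: interval in usec  remaining arguments: set names
    host="$1"; port="$2"; interval="$3"
    shift 3
    for set_name in "$@"; do
        echo "add host=$host type=active interval=$interval xprt=sock port=$port sets=$set_name"
    done
}
```

Usage, sending each generated line to the aggregator's ldmsctl socket:
add_host_commands vm1_1 60020 1000000 vm1_1/meminfo vm1_1/vmstat | while IFS= read -r line; do echo "$line" | /tmp/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_agg; done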

Configure a store_csv storage plugin

  • Configure an ldmsd aggregator on a host that has access to a storage device using "sock" as the listening transport
    • See "Configure an aggregator" above
  • Set environment variables
export LD_LIBRARY_PATH=/tmp/opt/ovis/lib64/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/tmp/opt/ovis/lib64/
export LDMSD_PLUGIN_LIBPATH=/tmp/opt/ovis/lib64/
  • Run ldmsctl -S <unix domain socket path/name associated with target ldmsd>
Example:
/tmp/opt/ovis/sbin/ldmsctl -S /var/run/ldmsd/metric_socket_agg
ldmsctl>
  • Configure ldmsd to store metric sets being retrieved from vm1_1 and vm1_2
ldmsctl> load name=store_csv
ldmsctl> config name=store_csv path=~/stored_data  
ldmsctl> store name=store_csv comp_type=node set=meminfo container=meminfo 
ldmsctl> store name=store_csv comp_type=node set=vmstat container=vmstat
ldmsctl> quit
  • Go to data store and verify files have been created and are being written to
cd ~/stored_data/node/<container>
ls -ltr
  • You can now utilize this data.
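
Checking that a store file "is being written to" can be done by comparing its size at two points in time. This is a minimal sketch; is_growing and store_file_size are hypothetical helpers, and the file path you pass in depends on your store_csv path and container settings.

```shell
# Hypothetical helpers to check that a store file keeps growing.
store_file_size() {
    # Print the size of file $1 in bytes.
    wc -c < "$1" | tr -d ' '
}

is_growing() {
    # $1: file to watch  $2: seconds to wait between the two size checks
    # Returns 0 if the file grew, 1 if it did not, 2 if it could not be read.
    before=$(store_file_size "$1") || return 2
    sleep "$2"
    after=$(store_file_size "$1") || return 2
    [ "$after" -gt "$before" ]
}
```

Usage: call is_growing with the path of a store file and a wait of a few collection intervals; the function prints nothing and reports the result through its exit status.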

NOTES:

  • You can optionally use "hosts" and "metrics" in the store command to down select what is stored.
  • The format has been changed to include a CompId for every metric that is being stored. There is now the ability to associate a different CompId with each metric, but this is beyond the scope of the quick start.

Using ldms_ls to display sets/metrics from an ldmsd

  • Set environment variables
export LD_LIBRARY_PATH=/tmp/opt/ovis/lib64/:$LD_LIBRARY_PATH
export LDMS_XPRT_LIBPATH=/tmp/opt/ovis/lib64/
export LDMSD_PLUGIN_LIBPATH=/tmp/opt/ovis/lib64/
export PATH=/tmp/opt/ovis/sbin/:$PATH
  • Query ldmsd on host vm1 listening on port 60000 using the sock transport for metric sets being served by that ldmsd
ldms_ls -h vm1 -x sock -p 60000
  • Should return:
vm1_1/meminfo
vm1_1/vmstat
  • Query ldmsd on host vm1 listening on port 60000 using the sock transport for the names and contents of metric sets being served by that ldmsd
    • Should return: Set names (vm1_1/meminfo and vm1_1/vmstat in this case) as well as all names and values associated with each set respectively
      • Only vm1_1/meminfo shown here
ldms_ls -h vm1 -x sock -p 60000 -l
vm1_1/meminfo: consistent, last update: Wed Jul 31 21:51:08 2013 [246540us]
U64 33084652         MemTotal
U64 32092964         MemFree
U64 0                Buffers
U64 49244            Cached
U64 0                SwapCached
U64 13536            Active
U64 39844            Inactive
U64 5664             Active(anon)
U64 13540            Inactive(anon)
U64 7872             Active(file)
U64 26304            Inactive(file)
U64 2996             Unevictable
U64 2988             Mlocked
U64 0                SwapTotal
U64 0                SwapFree
U64 0                Dirty
U64 0                Writeback
U64 7164             AnonPages
U64 6324             Mapped
U64 12544            Shmem
U64 84576            Slab
U64 3948             SReclaimable
U64 80628            SUnreclaim
U64 1608             KernelStack
U64 804              PageTables
U64 0                NFS_Unstable
U64 0                Bounce
U64 0                WritebackTmp
U64 16542324         CommitLimit
U64 73764            Committed_AS
U64 34359738367      VmallocTotal
U64 3467004          VmallocUsed
U64 34356268363      VmallocChunk
U64 0                HugePages_Total
U64 0                HugePages_Free
U64 0                HugePages_Rsvd
U64 0                HugePages_Surp
U64 2048             Hugepagesize
U64 565248           DirectMap4k
U64 5726208          DirectMap2M
U64 27262976         DirectMap1G
  • For a non-existent set
ldms_ls -h vm1 -x sock -p 60000 -l vm1_1/foo
ldms_ls: No such file or directory
ldms_ls: lookup failed for set 'vm1_1/foo'
  • Display metadata about sets contained by vm1 ldmsd listening on port 60000 (the -v flag outputs metadata information)
ldms_ls -h vm1 -x sock -p 60000 -v
  • Note: A script can be utilized for starting and stopping the collectors, aggregators, and store ldmsds as presented above
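
The ldms_ls -l output shown above has one metric per line in the form type, value, name, so a single value can be pulled out with awk. This is a sketch that assumes that three-column layout; metric_value is a hypothetical helper.

```shell
# Hypothetical helper: print the value of one metric from "ldms_ls -l" output.
# Assumes metric lines have exactly three fields: type, value, name.
metric_value() {
    # $1: metric name; the ldms_ls output is read from stdin
    awk -v name="$1" 'NF == 3 && $3 == name { print $2 }'
}
```

Usage:
ldms_ls -h vm1 -x sock -p 60000 -l vm1_1/meminfo | metric_value MemFree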

To stop a ldmsd

  • On vm1 to kill all ldmsds
killall ldmsd
  • On vm1 to kill a specific ldmsd
ps aux | grep ldmsd
  • Identify the PID of the particular ldmsd you want to kill (e.g. pid 9999)
kill 9999
  • Follow either of the above with a ps to make sure appropriate ldmsd(s) were killed
  • Note: You can utilize a script to perform the ldmsd process termination
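
When several ldmsds run on one host, the PID of a specific one can be picked out by the unix domain socket it was started with. This is a sketch, assuming the -S argument is visible in the process arguments as in the start commands above; pid_for_socket is a hypothetical helper that reads lines of "PID command arguments" on stdin.

```shell
# Hypothetical helper: print the PID of the ldmsd started with "-S <socket>".
# Reads lines of the form "<pid> <command> <args...>" on stdin.
pid_for_socket() {
    # $1: unix domain socket path given to ldmsd with -S
    awk -v sock="$1" '
        $2 ~ /(^|\/)ldmsd$/ {
            for (i = 3; i < NF; i++)
                if ($i == "-S" && $(i + 1) == sock) { print $1; break }
        }'
}
```

Usage:
ps -eo pid=,args= | pid_for_socket /var/run/ldmsd/metric_socket_vm1_1
Check that a PID was printed before passing it to kill.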

Troubleshooting

What causes the following error: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes?

Running as a user with "max locked memory" set too low

  • The following is an example of trying to run ldms_ls as a user with "max locked memory" set to 32k:
ldms_ls -h <hostname> -x rdma -p <portnum> 
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
   This will severely limit memory registrations.
RDMA: recv_buf reg_mr failed: error 12
ldms_ls: Cannot allocate memory
  • Running the same command as root with "max locked memory" set to 32k works fine -- needs further investigation
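
The "max locked memory" limit of the current shell can be inspected with ulimit -l, which in bash reports the value in kbytes (32 for the 32768 bytes in the warning above) or "unlimited". The following sketch only compares that value against a threshold; memlock_too_low is a hypothetical helper and the threshold to use is site-specific, not an LDMS requirement.

```shell
# Hypothetical check of the "max locked memory" limit.
# $1: limit as printed by "ulimit -l" (kbytes, or "unlimited")
# $2: minimum acceptable limit in kbytes (site-specific assumption)
# Returns 0 if the limit is below the minimum, 1 otherwise.
memlock_too_low() {
    [ "$1" = "unlimited" ] && return 1
    [ "$1" -lt "$2" ]
}
```

Usage, with 64 as an arbitrary example threshold:
memlock_too_low "$(ulimit -l)" 64 && echo "max locked memory is below 64 kbytes"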

Why doesn't my ldmsd start ?

Possible options:

  • Check for an existing /var/run/ldmsd/metric_socket or similar
    • Sockets can be left if an ldmsd did not clean up upon termination
      • kill -9 may leave the socket hanging around.
  • The port you are trying to use may already be in use on the node
    • The following shows the logfile output of such a case:
Tue Sep 24 08:36:54 2013: Started LDMS Daemon version 2.1.0
Tue Sep 24 08:36:54 2013: Listening on transport ugni:60020
Tue Sep 24 08:36:54 2013: EV_WARN: Can't change condition callbacks once they have been initialized.
Tue Sep 24 08:36:54 2013: Error 12 listening on the 'ugni' transport.
Tue Sep 24 08:36:54 2013: LDMS Daemon exiting...status 7
  • If using the -l flag make sure that your log directory exists prior to running
  • If writing to a store with this particular ldmsd make sure that your store directory exists prior to running
  • If you are running on a Cray with transport ugni using a user space PTag, check that you called aprun with the -p flag
    • aprun -N 1 -n <number of nodes> -p <ptag name> run_my_ldmsd.sh
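
Some of the checks above (leftover socket, missing log directory, missing store directory) can be run before starting ldmsd. This is a minimal sketch; preflight_ldmsd is a hypothetical helper and it does not cover the port-in-use or ptag cases.

```shell
# Hypothetical pre-start check for some of the conditions listed above.
# Prints one line per problem found; returns 0 if none were found, 1 otherwise.
preflight_ldmsd() {
    # $1: unix domain socket path  $2: log file path  $3 (optional): store directory
    rc=0
    if [ -e "$1" ]; then
        echo "socket path already exists (possibly left by an earlier ldmsd): $1"
        rc=1
    fi
    if [ ! -d "$(dirname "$2")" ]; then
        echo "log directory does not exist: $(dirname "$2")"
        rc=1
    fi
    if [ -n "$3" ] && [ ! -d "$3" ]; then
        echo "store directory does not exist: $3"
        rc=1
    fi
    return "$rc"
}
```

Usage:
preflight_ldmsd /var/run/ldmsd/metric_socket_agg /tmp/opt/ovis/logs/vm1_agg ~/stored_data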

How can I find what process is using the port?

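netstat -anp | grep <listen port>
  • The -p flag prints the PID and program name that own each socket; run as root to see processes owned by other users.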

Why aren't all my hosts/sets adding to the aggregator?

Possible options:

  • Running multiple ldmsctls on the same host from a script: multiple ldmsctls running concurrently may sometimes collide in creating ports. They should clean up after themselves, and this usually isn't an issue. (Open question: are they supposed to retry after a failure?)
  • Use the -m flag on the aggregator to use more memory when adding a lot of hosts
  • Use the -P flag on the aggregator to start more worker threads


What is the syntax for chaining aggregators and storing?

add host=chama-rps1 type=active interval=1000000 xprt=sock port=60020 sets=foo/meminfo,foo/vmstat,foo/procnetdev
add host=chama-rps1 type=active interval=1000000 xprt=sock port=60020 sets=bar/meminfo,bar/vmstat,bar/procnetdev
load name=store_csv
config name=store_csv path=/projects/ovis/ClusterData/chama/storecsv
store name=store_csv comp_type=node set=vmstat container=vmstat
store name=store_csv comp_type=node set=meminfo container=meminfo

NOTES:

  • you can do the add host more than once, but only for different prefix on the sets (foo vs bar)
  • syntax for add host is sets plural with comma separation
  • syntax for store is only 1 set at a time
  • csv file will be <path>/<comp_type>/<container>
  • do not mix containers across sets
  • Also: you cannot put all the foo and bar sets in the same add host line (as of 01/09/2013)

Why is my aggregator not responding?

  • Running a ldmsd aggregator as a user but trying to aggregate from a ldmsd that uses a system ptag can result in the aggregator hanging (alive but not responding and not writing to the store). The following is the logfile output of such an aggregator:
Tue Sep 24 08:42:40 2013: Connected to host 'nid00081:60020'
Tue Sep 24 08:42:42 2013: cq_thread_proc: Error 11  monitoring the CQ.