SysUsage v2.10 Installation and configuration instructions


REQUIREMENTS:
-------------

	You must have a modern Perl installation. Version 5.6.0 or later is fine.

	You need to install rrdtool. You can find it at:

		http://people.ee.ethz.ch/~oetiker/

	To use the RRDs Perl module you must run the following command
	when compiling the rrdtool distribution:

		make site-perl-install

	You also need sar to collect statistics. sar is part of the sysstat
	package. You can find it here:

		http://freshmeat.net/projects/sysstat/

	If you plan to use threshold warning reports you must have the
	Net::SMTP Perl module. It is available from CPAN (http://search.cpan.org).

	If you want to send check messages to Nagios you need to install
	nsca-2.7.2.tar.gz or a more recent version. You can get it at:
	http://sourceforge.net/project/sysusage/

	If you want to monitor your hard drive temperature you must
	install a small utility called hddtemp. You can download it from
	http://download.savannah.gnu.org/releases/hddtemp/.
	Run it to check whether your hard drive has a temperature sensor.
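
	Before installing, it can help to check which of the tools listed
	above are already present. The following is only a minimal sketch in
	Python, not part of SysUsage: the helper names are hypothetical and it
	only looks in the current PATH (hddtemp is often installed in an sbin
	directory that may not be in a normal user's PATH).

```python
import shutil

# Tool names taken from the requirements above; the helper names below
# are hypothetical and not part of SysUsage.
REQUIRED_TOOLS = ("perl", "rrdtool", "sar")
OPTIONAL_TOOLS = ("hddtemp", "sensors")


def find_tools(names):
    """Map each tool name to its full path, or None if it is not in PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the tool names that could not be found in PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED_TOOLS):
        print("missing required tool:", name)
    for name in missing_tools(OPTIONAL_TOOLS):
        print("missing optional tool:", name)
```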


INSTALLATION:
-------------

	Simply run the install Perl program and answer the questions.

	By default it copies the Perl programs into /usr/bin and writes
	the HTML output to /var/www/html/sysusage/.
	The configuration file is /etc/sysusage.cfg and all RRD (round-robin)
	databases generated with rrdtool are stored under /var/lib/sysusage.

	Add this line to your crontab to run the monitoring every minute:

		*/1 * * * * /usr/bin/sysusage

	Add this other line to your crontab to draw the graphs every 5 minutes:

		*/5 * * * * /usr/bin/sysusagegraph

	If you have changed the default installation path of the configuration
	file you need to use the -c option with the path to the configuration
	file. Example:

		*/1 * * * * /usr/bin/sysusage -c /usr/local/etc/sysusage.cfg

	The same applies to sysusagegraph.
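
	As an illustration, the two crontab entries can be generated for any
	configuration path. This is a small Python sketch with a hypothetical
	helper name; the schedules, program names and the -c option come from
	the text above, and /usr/bin is the default install directory.

```python
def cron_entries(config_path=None, bin_dir="/usr/bin"):
    """Build the two crontab lines, adding -c when a config path is given."""
    suffix = f" -c {config_path}" if config_path else ""
    return [
        f"*/1 * * * * {bin_dir}/sysusage{suffix}",       # collect every minute
        f"*/5 * * * * {bin_dir}/sysusagegraph{suffix}",  # draw every 5 minutes
    ]


if __name__ == "__main__":
    for line in cron_entries("/usr/local/etc/sysusage.cfg"):
        print(line)
```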


CONFIGURATION
-------------

	You can edit the configuration file sysusage.cfg by hand. Here is the
	format of the file with all default values. The file has three
	sections. The first one sets the general parameters of the application,
	the second sets the parameters used to send SMTP and Nagios alerts
	when a threshold is exceeded, and the last configures the types of
	system information you want to monitor.

		[GENERAL]
		DEBUG       = 0
		DATA_DIR    = /var/lib/sysusage
		PID_FILE    = /etc
		DEST_DIR    = /var/www/html/sysusage
		SAR_BIN     = /usr/bin/sar
		UPTIME      = /usr/bin/uptime
		HOSTNAME    = /bin/hostname
		INTERVAL    = 60
		SKIP        = 12:00/14:00 20:00/06:00
		HDDTEMP_BIN = /usr/local/sbin/hddtemp
		SENSORS_BIN = /usr/bin/sensors

		[ALARM]
		WARN_MODE   = 0
		ALARM_PROG  = /usr/bin/sysusagewarn
		SMTP        = localhost
		FROM        = root@localhost
		TO          = root@localhost
		NAGIOS      = /usr/local/nagios/bin/submit_check_result
		UPPER_LEVEL = 1
		LOWER_LEVEL = 2

		[MONITOR]
		load:threshold_max_value
		cpu:threshold_max_value
		wait:threshold_max_value
		mem:threshold_max_value
		swap:threshold_max_value
		share:threshold_max_value
		sock:threshold_max_value
		io:threshold_max_value
		file:threshold_max_value
		page:threshold_max_value
		pcrea:threshold_max_value
		pswap:threshold_max_value
		net:threshold_max_value
		err:threshold_max_value
		disk:threshold_max_value
		proc:proc_name:threshold_max_value:threshold_min_value
		queue:path_queue_dir:threshold_max_value
		hddtemp:device:threshold_max_value
		dev:device:threshold_max_value
		work:threshold_max_value
		sensors:pattern:threshold_max_value
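
	To make the layout above concrete, here is a minimal Python sketch of
	how such a file could be read. It is not the parser used by sysusage:
	the function name is hypothetical, and skipping lines that start with
	'#' is an assumption. GENERAL and ALARM lines are split on the first
	'=', MONITOR lines on ':'.

```python
def parse_config(text):
    """Parse sysusage.cfg-style text into (general, alarm, monitors)."""
    settings = {"GENERAL": {}, "ALARM": {}}
    monitors = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # '#' comments are an assumption
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].upper()
            continue
        if section == "MONITOR":
            # e.g. "proc:httpd:50:0" -> ["proc", "httpd", "50", "0"]
            monitors.append(line.split(":"))
        elif section in settings and "=" in line:
            key, value = line.split("=", 1)
            settings[section][key.strip()] = value.strip()
    return settings["GENERAL"], settings["ALARM"], monitors


if __name__ == "__main__":
    sample = "[GENERAL]\nINTERVAL = 60\n[MONITOR]\nload:10\n"
    print(parse_config(sample))
```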

	Section GENERAL

		DEBUG	= 0|1
			This option sets debug mode. If set to 1, sysusage
			and sysusagegraph only show what they would do
			but don't create or send anything.

		DATA_DIR  = /path/to/rrdfiles
			This option sets the output directory for all
			RRDTOOL databases.

		PID_FILE  = /path/to/piddir
			sysusage and sysusagegraph store the pid of the running
			process in a file under this directory to prevent
			simultaneous runs.

		DEST_DIR  = /path/to/html_output
			Set the path to the directory where all HTML and graph
			files should be created.

		SAR_BIN   = /path/to/sar_binary
			sysusage uses sar, part of the sysstat distribution, to
			grab system information, so it needs to know where it is.

		UPTIME    = /path/to/uptime_binary
			sysusagegraph reports the current uptime of the system
			using the uptime command. This option sets the path to
			the uptime binary.

		HOSTNAME  = /path/to/hostname_binary
			All scripts of the SysUsage distribution need to know the
			name of the host. They use the hostname command for that.

		INTERVAL  = pull_interval_in_second
			All RRDTOOL inputs use the given interval, in seconds, to
			store monitored values. Graph construction also uses
			this interval to render things properly. By default
			SysUsage uses an interval of 60 seconds to get more
			detailed statistics. You can change this but it is not
			recommended. If you change it, adjust your crontab to
			the same value.

		SKIP      = HH:MM/HH:MM HH:MM/HH:MM ...
			You can define here time ranges during which alarms
			should not be sent. The value is a list of
			begin_time/end_time pairs separated by spaces or tabs.
			For example, if you don't want alarms reported during
			the night, you can write: 20:00/06:00

		HDDTEMP_BIN = /path/to/hddtemp_binary
			You can monitor your hard drive temperature if you have
			installed the hddtemp utility. This option sets the path
			to the hddtemp binary.

		SENSORS_BIN = /path/to/sensors_binary
			You can monitor your device temperature if you have
			installed the lm_sensors utility. This option sets the
			path to the sensors binary.
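
	The SKIP value described above can span midnight (20:00/06:00). The
	Python sketch below shows one way to test whether a given time falls
	inside a SKIP range. It is an illustration only: the function names are
	hypothetical, and treating the end time as exclusive is an assumption,
	not documented sysusage behaviour.

```python
from datetime import time


def _to_time(text):
    """Convert 'HH:MM' into a datetime.time."""
    hours, minutes = text.split(":")
    return time(int(hours), int(minutes))


def parse_skip(value):
    """Turn 'HH:MM/HH:MM HH:MM/HH:MM' into a list of (begin, end) times."""
    ranges = []
    for item in value.split():  # split() handles both spaces and tabs
        begin, end = item.split("/")
        ranges.append((_to_time(begin), _to_time(end)))
    return ranges


def in_skip_range(now, ranges):
    """Return True if `now` (a datetime.time) falls inside any range."""
    for begin, end in ranges:
        if begin <= end:
            if begin <= now < end:
                return True
        elif now >= begin or now < end:
            # begin > end means the range wraps past midnight
            return True
    return False


if __name__ == "__main__":
    ranges = parse_skip("12:00/14:00 20:00/06:00")
    print(in_skip_range(time(23, 30), ranges))
```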
			
	Section ALARM

		WARN_MODE   = 0|1
			Disables (0) or enables (1) alert messages when a
			threshold is exceeded.

		ALARM_PROG  = /path/to/sysusagewarn
			Sets the path to the external program responsible for
			sending alarm messages. You can replace it with your own;
			take a look at the sysusagewarn usage to see which
			command line options sysusage passes to it.

		SMTP        = smtp.server.net
			Name or IP address of the SMTP server to contact.
			Default is none => no SMTP message is sent.

		FROM        = sender@localhost
			Sender email address to use in the SMTP message.

		TO          = destination@localhost
			Destination email address where the alarm message will
			be sent.

		NAGIOS      = /usr/local/nagios/bin/submit_check_result
			Path to the external nsca program used to send check
			messages to Nagios. Setting this activates Nagios check
			reports. See the NAGIOS CONFIGURATION section near the
			end of this file for how to configure Nagios.

		UPPER_LEVEL = 1
			Nagios check level to send when a high threshold limit
			is reached. Default is 1 => WARNING. For example, when
			the load average is too high, this could cause a loss of
			performance :-)

		LOWER_LEVEL = 2
			Nagios check level to send when a low threshold limit
			is reached. Default is 2 => CRITICAL. For example, when
			a monitored daemon is down, which could be critical :-)



	Section MONITOR

	This section has two different formats. The first one is used to
	specify most of the monitoring targets:

		type:threshold_max:threshold_min

	'type'

		Type of system information you want to monitor. It can take
		16 different values:

		load  => monitor load average
		cpu   => monitor cpu(s) total/system/user usage
		wait  => monitor cpu(s) iowait/idle/steal usage
		mem   => monitor memory usage
		swap  => monitor swap usage
		share => monitor /dev/shm usage
		sock  => monitor number of open sockets
		io    => monitor I/O requests and block usage
		page  => monitor I/O page usage
		pswap => monitor I/O page swap usage
		pcrea => monitor number of processes created per second
		file  => monitor percentage of open files relative to file-max
		net   => monitor I/O network bytes on all network interfaces
		err   => monitor bad packets, drops and collisions on interfaces
		disk  => monitor disk space usage
		work  => monitor amount of memory needed for current workload

		Note: recent versions of sysstat no longer report the percentage
		of open files, so sysusage tries to compute it using
		/proc/sys/fs/file-max. If that file doesn't exist it simply
		reports the number of open files.
		%file-sz is no longer displayed by sar since the upper limit
		for the number of open files self-scales with 2.6 kernels.

	'threshold_max'

		This is the maximum threshold value. Any value equal to or
		greater than this one generates an SMTP and/or Nagios alert if
		you have enabled it.

	'threshold_min'

		This is the minimum threshold value. Any value equal to or lower
		than this one generates an SMTP and/or Nagios alert if you have
		enabled it. A minimum threshold should probably only be used
		with the 'proc' monitoring type. If you set it to 0 you will be
		warned if any of the monitored processes are down.
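
	The two threshold rules above can be summarised by the following
	Python sketch. The function name and its return values are hypothetical
	and only illustrate the comparison ("equal to or greater than" the
	maximum, "equal to or lower than" the minimum); this is not the code
	used by sysusage.

```python
def check_threshold(value, threshold_max=None, threshold_min=None):
    """Return 'max', 'min' or None depending on which limit is reached."""
    if threshold_max is not None and value >= threshold_max:
        return "max"  # high limit reached or exceeded
    if threshold_min is not None and value <= threshold_min:
        return "min"  # low limit reached, e.g. a monitored process is down
    return None


if __name__ == "__main__":
    # 'proc' entry with threshold_max 50 and threshold_min 0, no process found
    print(check_threshold(0, threshold_max=50, threshold_min=0))
```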

	The second format is used to monitor running processes, queue
	directories, hard drive temperature, sensors and devices. It has the
	following format:

		type:target:threshold_max_value:threshold_min_value

	'type'

		Type of system information you want to monitor. It can take
		these different values:

			proc    => monitor number of running processes
			queue   => monitor number of files in a directory
			hddtemp => monitor hard drive temperature
			sensors => monitor device (cpu temp, fan speed, etc.)
			dev     => monitor CPU usage per device (ex: sda)

		For sensors see the 'SENSORS' chapter for usage.

	'target'

		If type is 'proc' this is the name of the process to
		monitor. If type is 'queue' this is the path of the
		directory to monitor. If type is 'hddtemp' this is the
		hard drive device, ex: /dev/sda. If type is 'dev' this is
		the device name, ex: sda. If type is 'sensors' this is the
		pattern to match (see 'SENSORS').

	'threshold_max'

		This is the maximum threshold value. Any value equal to or
		greater than this one generates an SMTP and/or Nagios alert if
		you have enabled it.

	'threshold_min'

		This is the minimum threshold value. Any value equal to or lower
		than this one generates an SMTP and/or Nagios alert if you have
		enabled it. A minimum threshold should probably only be used
		with the 'proc' monitoring type. If you set it to 0 you will be
		warned if any of the monitored processes are down.

	There is a special case for disk usage monitoring that allows the
	exclusion of some mount points. This is useful if you have hard links
	or some special devices you don't need to monitor.

		disk:ThresholdMax:exclusion

	where exclusion is a semicolon (;) separated list of mount points to
	exclude from monitoring.
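
	As an illustration of the exclusion list, the Python sketch below
	splits a 'disk' entry and filters a list of mount points. The function
	names and the sample paths are hypothetical, and exact string matching
	of mount points is an assumption; sysusage itself may match them
	differently.

```python
def parse_disk_entry(entry):
    """Split 'disk:ThresholdMax:exclusion' into (threshold, excluded list)."""
    fields = entry.split(":")
    if fields[0] != "disk" or len(fields) < 2:
        raise ValueError("not a disk entry: %r" % entry)
    threshold_max = float(fields[1])
    excluded = []
    if len(fields) > 2:
        # the exclusion list is separated by semicolons
        excluded = [mount for mount in fields[2].split(";") if mount]
    return threshold_max, excluded


def monitored_mounts(mounts, excluded):
    """Keep only the mount points that are not in the exclusion list."""
    return [mount for mount in mounts if mount not in excluded]


if __name__ == "__main__":
    limit, skip = parse_disk_entry("disk:90:/mnt/cdrom;/media/usb")
    print(limit, monitored_mounts(["/", "/home", "/mnt/cdrom"], skip))
```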


NAGIOS CONFIGURATION
--------------------

	SysUsage sends check messages to Nagios through an external command
	(submit_check_result), so you need to create the host in Nagios and
	associate with it every sysusage service that you want to monitor. The
	service names correspond to the type of monitoring. For example, if you
	have enabled the alarm on memory usage the service sent is 'mem'. There
	is also a special case for monitoring types with multiple instances,
	like network monitoring: you need to create one service per instance.
	For example type 'net' will have 'net_eth0' and 'net_lo', and more if
	you have more network interfaces. To check that your sysusage alarm
	messages are understood by Nagios, take a look at the nagios.log file
	(default: /usr/local/nagios/var/nagios.log).

	To automatically clear an alarm reported to Nagios, SysUsage sends
	an OK result each time it runs if everything is correct for the
	monitored type.
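
	The naming and level rules above can be sketched in Python as follows.
	This is only an illustration with hypothetical function names: the
	service names and the levels 0 (OK), UPPER_LEVEL and LOWER_LEVEL come
	from this document, while the tab-separated line is the input format
	read by the send_nsca client for a passive service check. How sysusage
	itself builds and submits the message is not shown here.

```python
def nagios_service(monitor_type, instance=None):
    """Service name: 'mem', or 'net_eth0' for a multi-instance type."""
    return f"{monitor_type}_{instance}" if instance else monitor_type


def nagios_level(exceeded, upper_level=1, lower_level=2):
    """Map 'max', 'min' or None to the Nagios return code to send."""
    if exceeded == "max":
        return upper_level  # default 1 => WARNING
    if exceeded == "min":
        return lower_level  # default 2 => CRITICAL
    return 0                # OK, which clears a previous alarm


def nsca_line(host, service, level, message):
    """Format one passive service check result for send_nsca."""
    return "\t".join([host, service, str(level), message]) + "\n"


if __name__ == "__main__":
    # 'myhost' and the message text are placeholder values
    print(nsca_line("myhost", nagios_service("net", "eth0"),
                    nagios_level("max"), "net threshold reached"), end="")
```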

SENSORS
-------
Monitoring of sensors output is based on regular expressions. To make this
clear, here is an example.

Sensors output on my server:

	adt7463-i2c-0-2d
	Adapter: SMBus I801 adapter at 1480
	V1.5:        +3.23 V  (min =  +0.00 V, max =  +3.32 V)   
	VCore:       +1.24 V  (min =  +1.10 V, max =  +1.49 V)   
	V3.3:        +3.33 V  (min =  +2.80 V, max =  +3.78 V)   
	V5:          +4.99 V  (min =  +4.25 V, max =  +5.75 V)   
	V12:         +0.11 V  (min =  +0.00 V, max = +15.94 V)   
	CPU_Fan:       0 RPM  (min =    0 RPM)
	fan2:       10671 RPM  (min = 8095 RPM)
	fan3:          0 RPM  (min =    0 RPM)
	fan4:          0 RPM  (min =    0 RPM)
	CPU Temp:    +69.5C  (low  =  +2.0C, high = +91.0C)  
	Board Temp:  +32.5C  (low  =  +2.0C, high = +83.0C)  
	Remote Temp: +31.2C  (low  =  +2.0C, high = +58.0C)  
	cpu0_vid:   +1.338 V

	adt7463-i2c-0-2e
	Adapter: SMBus I801 adapter at 1480
	V1.5:        +3.21 V  (min =  +0.00 V, max =  +3.32 V)   
	VCore:       +1.28 V  (min =  +1.10 V, max =  +1.49 V)   
	V3.3:        +3.32 V  (min =  +2.80 V, max =  +3.78 V)   
	V5:          +4.95 V  (min =  +0.00 V, max =  +6.64 V)   
	V12:         +0.11 V  (min =  +0.00 V, max = +15.94 V)   
	CPU_Fan:    10843 RPM  (min = 8095 RPM)
	fan2:          0 RPM  (min =    0 RPM)
	fan3:       9642 RPM  (min = 8095 RPM)
	fan4:          0 RPM  (min =    0 RPM)
	CPU Temp:    +57.2C  (low  =  +2.0C, high = +91.0C)  
	Board Temp:  +35.2C  (low  =  +2.0C, high = +91.0C)  
	Remote Temp: +35.8C  (low  =  +2.0C, high = +58.0C)  
	cpu0_vid:   +1.338 V


Depending on which sensors kernel modules are loaded you may get more or less
output than that. To monitor the CPU, board and remote temperatures I need to
add the following lines to sysusage.cfg:

	sensors:CPU Temp:75
	sensors:Board Temp:45
	sensors:Remote Temp:45

This will create 3 graphs: one based on lines matching 'CPU Temp', another with
lines matching 'Board Temp' and the last with lines matching 'Remote Temp'.
As I have 2 CPUs, each graph has 2 values. You cannot report more than
2 values per graph; this is hard coded into sysusage. So if you have more
CPUs you will not see more than 2 values. Here an alarm is sent when the
temperature reaches or exceeds the given value.

To monitor fan speed, I just add lines like these to the configuration file:

	sensors:fan2:11000:8095
	sensors:fan3:11000:8095

This will create 2 graphs, for fan 2 and fan 3, with an alarm sent when the
speed exceeds 11000 RPM or is lower than 8095 RPM.
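
The pattern matching described in this chapter can be sketched in Python as
follows. This is an illustration only, not the code used by sysusage: the
function name is hypothetical, and taking the first number after the colon as
the reading is an assumption based on the sample output above. The limit of 2
mirrors the two values per graph mentioned earlier.

```python
import re


def sensor_values(output, pattern, limit=2):
    """Return up to `limit` readings from sensors lines matching `pattern`."""
    values = []
    for line in output.splitlines():
        if not re.search(pattern, line):
            continue
        # Only look after the colon so digits in labels (fan2, V1.5) are ignored
        _label, _sep, rest = line.partition(":")
        match = re.search(r"[-+]?\d+(?:\.\d+)?", rest)
        if match:
            values.append(float(match.group()))
    return values[:limit]


if __name__ == "__main__":
    sample = (
        "CPU Temp:    +69.5C  (low  =  +2.0C, high = +91.0C)\n"
        "fan2:       10671 RPM  (min = 8095 RPM)\n"
        "CPU Temp:    +57.2C  (low  =  +2.0C, high = +91.0C)\n"
    )
    print(sensor_values(sample, "CPU Temp"))
```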


That's all...

--
Gilles Darold
