Secure Windows Monitoring with Zenoss

Starting with version 2.3.x, Zenoss can monitor computers running Microsoft Windows with a variety of data collection protocols: SNMP, WMI over DCOM/MS-RPC and Perfmon over MS-RPC.

In Zenoss Core, the status of Windows services and the Windows Event Log are monitored using Windows Management Instrumentation (WMI) queries over the DCOM/MS-RPC protocol. In the implementation of MS-RPC that Zenoss is based upon, authentication credentials are sent to the remote server using the Windows Challenge / Response (NTLM) mechanism. With this mechanism, the actual password is never sent across the network. Instead, the server produces a “challenge” value, and the client must compute a response to it using the password; only that response crosses the network.

NTLM authentication is the same mechanism that Windows devices themselves use for client/server communications, such as file sharing and remote administration.
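
The challenge/response idea described above can be illustrated with a minimal sketch. This is not the NTLM algorithm itself (real NTLM uses different hash functions and message formats); it is a simplified stand-in built on HMAC-SHA256, and the function names make_challenge, compute_response and verify are hypothetical, chosen only for this illustration.

```python
import hashlib
import hmac
import os


def make_challenge():
    """Server side: generate a fresh random challenge for each login attempt."""
    return os.urandom(8)


def compute_response(password, challenge):
    """Client side: derive a response from the password and the challenge.

    Only this response is sent back; the password itself never leaves
    the client. (Simplified stand-in, NOT the real NTLM computation.)
    """
    key = hashlib.sha256(password.encode("utf-8")).digest()
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(stored_password, challenge, response):
    """Server side: recompute the expected response and compare safely."""
    expected = compute_response(stored_password, challenge)
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    challenge = make_challenge()
    response = compute_response("s3cret", challenge)
    print(verify("s3cret", challenge, response))
```

Because the challenge changes on every attempt, a response captured from the network cannot simply be replayed later against a different challenge.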

Starting with version 2.3.x, Zenoss Enterprise gathers Perfmon data using the remote Windows registry API over the MS-RPC protocol. This technique is both more efficient and more secure than the one used in earlier versions. The same authentication mechanism used by Zenoss’s WMI library is used here, providing the same level of security.

Prior to version 2.3.x, Zenoss Enterprise used a different mechanism to collect Perfmon data from Windows devices. This mechanism used a utility known as winexe to remotely execute commands on the Windows device (in this case, the typeperf.exe Windows utility). Unfortunately, the winexe utility sends the username and password used for authentication across the network in clear text, providing a less than ideal configuration for security.

Zenoss users monitoring Windows devices should be running version 2.3.3 or newer for the best possible security when communicating with those devices.

Getting a Native Code Stack Trace from a Zenoss Daemon

Zenoss uses the Python programming language for the vast majority of its code, and all of the daemons and commands that run are Python scripts. Several daemons also make use of native code (i.e. code written in languages like C or C++ that must be compiled into object files and organized into libraries) to perform functions such as remote Windows and SNMP connectivity.

Occasionally, one of these daemons crashes in these native libraries rather than in the actual Python code. When this happens, the Python interpreter is unable to produce the relatively friendly stack trace that it would for pure Python code. For example, a crash in a Python script would produce something that looks familiar to most programmers:

Traceback (most recent call last):
  File "test.py", line 5, in ?
    z = y / x
ZeroDivisionError: integer division or modulo by zero

By contrast, if you have a crash inside of a native library you likely would not see anything more than Bus Error or a similar message, and often nothing at all — the daemon process will just exit. For example, here we have a dynamic library written in C with a single function: doit. This function will attempt to access a NULL pointer when called, which results in the following output:

$ python -i
Python 2.4.4 (#1, Feb 23 2009, 09:17:03) 
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctypes import *
>>> from ctypes.util import *
>>> lib = CDLL(find_library('test'))
>>> lib.doit()
Bus error
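
From the outside, a crash like this shows up only as a process killed by a signal, with no Python traceback. The following is a minimal sketch (written for a modern Python 3 on a POSIX system, not the Python 2.4 shown above) of how a supervising script could tell the two cases apart. The helper name run_and_report is hypothetical, and the child process here only simulates a native crash by sending itself SIGSEGV rather than loading a real faulty library.

```python
import signal
import subprocess
import sys

# Child program that simulates a native-code crash by delivering
# SIGSEGV to itself (an assumption standing in for a buggy C library).
CRASHING_CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_and_report(argv):
    """Run a command and describe how it ended.

    On POSIX, subprocess reports death-by-signal as a negative return
    code whose absolute value is the signal number.
    """
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode < 0:
        signum = -proc.returncode
        name = signal.Signals(signum).name
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % proc.returncode


if __name__ == "__main__":
    print(run_and_report([sys.executable, "-c", CRASHING_CHILD]))
    print(run_and_report([sys.executable, "-c", "pass"]))
```

This only tells you that a native crash happened, not where; for the location you still need the debugger.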

To get a stack trace from the native code, the Python interpreter must be run from the GNU debugger, or gdb. All of the Zenoss daemons share a common architecture in how they are started, so the process for running the daemon from within gdb will be similar no matter which daemon you use.

  1. Determine your ZENHOME directory location by running echo $ZENHOME from the shell prompt. In the remainder of this example, we will assume it is set to /Users/cgibbons/zenoss.
  2. Pick the daemon you want to run. In this example, we will use zenwin — the daemon responsible for monitoring the state of services on Windows devices.
  3. Look at the actual daemon script:
    $ cat $ZENHOME/bin/zenwin
    #! /usr/bin/env bash
    #############################################################################
    # This program is part of Zenoss Core, an open source monitoring platform.
    # Copyright (C) 2007, Zenoss Inc.
    #
    # This program is free software; you can redistribute it and/or modify it
    # under the terms of the GNU General Public License version 2 as published by
    # the Free Software Foundation.
    #
    # For complete information please visit: http://www.zenoss.com/oss/
    #############################################################################
     
    . $ZENHOME/bin/zenfunctions
     
    PRGHOME=$ZENHOME/Products/ZenWin
    PRGNAME=zenwin.py
    CFGFILE=$CFGDIR/zenwin.conf
     
    generic "$@"
  4. Run gdb on the Python interpreter:
    $ gdb python
    GNU gdb 6.3.50-20050815 (Apple version gdb-962) (Sat Jul 26 08:14:40 UTC 2008)
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .... done
     
    (gdb)
  5. Set the program arguments. Note how the zenwin script above is used to build the actual argument string:
    (gdb) set args /Users/cgibbons/zenoss/Products/ZenWin/zenwin.py --configfile=/Users/cgibbons/zenoss/etc/zenwin.conf run -v10 -c
  6. Finally, run the daemon process within the debugger:
    (gdb) run
    Starting program: /Users/cgibbons/zenoss/bin/python /Users/cgibbons/zenoss/Products/ZenWin/zenwin.py --configfile=/Users/cgibbons/zenoss/etc/zenwin.conf run -v10 -c

The daemon will then run as if it were started directly from the command line. Any pdb trace statements will still be activated and you can use pdb commands as expected. However, once the debugger detects a native code crash, it presents the gdb prompt, and the gdb where command can be used to view the native code stack trace. For example, if we do this with our previous doit test, we’ll see this output:

>>> lib.doit()
 
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x93ebe457 in __vfprintf ()
(gdb) where
#0  0x93ebe457 in __vfprintf ()
#1  0x93ef2da7 in vfprintf_l ()
#2  0x93f17fbb in printf ()
#3  0x003b9ffe in doit () at test.c:6
#4  0x0039476d in .LCFI1 () at /Users/cgibbons/src/zenoss/trunk/inst/build/ctypes-1.0.1/source/libffi/src/x86/darwin.S:81
#5  0x00394701 in ffi_call (cif=0xbffff1a8, fn=0x3b9fe6 <doit>, rvalue=0xa0414584, avalue=0xbffff120) at /Users/cgibbons/src/zenoss/trunk/inst/build/ctypes-1.0.1/source/libffi/src/x86/ffi_darwin.c:249
#6  0x0038f21e in _CallProc (pProc=0x3b9fe6 <doit>, argtuple=0x15a030, flags=4097, argtypes=0x0, restype=0x21b600, checker=0x0) at source/callproc.c:665
#7  0x00389f02 in CFuncPtr_call (self=0x174880, inargs=0x15a030, kwds=0x0) at source/_ctypes.c:3357
#8  0x00007e12 in PyObject_Call (func=0x174880, arg=0x15a030, kw=0x0) at Objects/abstract.c:1795
#9  0x00080dcb in do_call [inlined] () at Python/ceval.c:3776
#10 0x00080dcb in PyEval_EvalFrame (f=0x209960) at Python/ceval.c:3591
#11 0x0008327f in PyEval_EvalCodeEx (co=0x1ac5e0, globals=0x173a50, locals=0x173a50, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2741
#12 0x00083547 in PyEval_EvalCode (co=0xbffff064, globals=0xbffff064, locals=0xbffff064) at Python/ceval.c:484
#13 0x000aae36 in PyRun_InteractiveOneFlags (fp=0xbffff064, filename=0xdb4a6 "<stdin>", flags=0xbffff818) at Python/pythonrun.c:1287
#14 0x000aaf73 in PyRun_InteractiveLoopFlags (fp=0xa04175e0, filename=0xdb4a6 "<stdin>", flags=0xbffff818) at Python/pythonrun.c:706
#15 0x000abe29 in PyRun_AnyFileExFlags (fp=0xa04175e0, filename=0xdb4a6 "<stdin>", closeit=0, flags=0xbffff818) at Python/pythonrun.c:669
#16 0x000b5f8a in Py_Main (argc=0, argv=0xbffff8a4) at Modules/main.c:493
#17 0x00001d0b in _start ()
#18 0x00001c39 in start ()
(gdb)

If the native library was built with debugging symbols, a nice programmer-friendly stack trace will be generated, as in the above example. Here we can see exactly what line our doit function crashed at. Now the bug should be easy to find and fix, right?
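
The gdb steps above could also be collapsed into a single command line using gdb's --args option. Below is a minimal sketch of a helper that builds that command. The function name gdb_command is hypothetical, and the directory layout ($ZENHOME/bin/python, $ZENHOME/Products/<product>/<daemon>.py, $ZENHOME/etc/<daemon>.conf) is inferred from the zenwin example; other daemons may differ, so check the daemon's own script in $ZENHOME/bin first.

```python
import posixpath


def gdb_command(zenhome, product, daemon, extra_args=("run", "-v10", "-c")):
    """Build an argv list that starts a Zenoss daemon under gdb.

    zenhome -- value of $ZENHOME
    product -- directory under Products (e.g. "ZenWin"); assumed layout
    daemon  -- daemon name (e.g. "zenwin"); assumed to match the .py
               script and .conf file names
    """
    python = posixpath.join(zenhome, "bin", "python")
    script = posixpath.join(zenhome, "Products", product, daemon + ".py")
    config = posixpath.join(zenhome, "etc", daemon + ".conf")
    # "gdb --args PROGRAM ARG..." passes everything after PROGRAM to it.
    return ["gdb", "--args", python, script,
            "--configfile=" + config] + list(extra_args)


if __name__ == "__main__":
    print(" ".join(gdb_command("/Users/cgibbons/zenoss", "ZenWin", "zenwin")))
```

Typing run at the resulting (gdb) prompt then starts the daemon exactly as in step 6.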

new MacBook setup

I bought another Mac today, a nice 2.4 GHz 13-inch unibody MacBook. I had planned on buying a 17-inch unibody MacBook Pro, and very nearly did, but luckily sanity won out and I remembered how much of a hassle it was to carry around those giant things, even if they are “only” 6.6 lbs.

I had a 2.0 GHz 13-inch MacBook a couple of years ago when the first Intel-based models came out and I do remember the screen resolution, while not abundant, was more than adequate for browsing, e-mail and even development. And, the unibody model is only 4.5 lbs, so 2 lbs lighter will be a lot nicer to carry around. I also sprung for a spare battery to try and help get a little closer to the awesome battery life of the 17-inch model.

Of course, the obvious question is why buy another laptop when I’ve already got a nice 15-inch MacBook Pro that work provides? The answer there is easy: I don’t want to do anything personal, even development, on the work-provided machine.

Now, on to the actual system setup, documented here for posterity.

  1. After account creation, run software update and get all the latest updates installed first.
  2. Create an applecare account with an easy, but secure, password. This way the Apple store geeks can have that account should there need to be any repair work done.
  3. Change the battery lifetime display with Show -> Time.
  4. Secure the screensaver by using System Preferences -> Security -> General and checking the Require password to wake this computer from sleep or screen saver option.
  5. Disable the Front Row remote by using System Preferences -> Security -> General and checking the Disable remote control infrared receiver option.
  6. Disable the Front Row keyboard shortcut by using System Preferences -> Keyboard & Mouse -> Keyboard Shortcuts and disabling the Hide and show Front Row shortcut.
  7. Enable full keyboard shortcuts by checking the All controls option in the bottom of the same Keyboard Shortcuts screen.
  8. Enable the Use secure virtual memory option in System Preferences -> Security -> General.
  9. Encrypt my home directory with System Preferences -> Security -> FileVault.
  10. Install Growl 1.1.4 from http://growl.info/
    1. Install the GrowlSafari extra package.
    2. Install the HardwareGrowler extra package.
      1. Drag HardwareGrowler.app to /Applications
      2. Disable the HardwareGrowler dock icon by following the instructions at http://growl.info/documentation/hardwaregrowler.php
      3. Add HardwareGrowler to the start at login list by using System Preferences -> Accounts -> Login Items and dragging HardwareGrowler to the list.
    3. Enable Growl starting at login with System Preferences -> Growl and enabling the Start Growl at login option.
  11. Remove unused printer drivers by deleting the appropriate folders in the /Library/Printers folder (everything but Brother, hp and PPDs in my case).
  12. Install the XcodeTools package from the Installation DVD’s Optional Installs directory.
  13. Drag Xcode to the dock by going to /Developer/Applications and dragging the icon to the dock.
  14. Add Activity Monitor to the dock by going to /Applications/Utilities and dragging the icon to the dock. Secondary-click on the icon and enable Open at Login.
  15. Add Terminal to the dock by going to /Applications/Utilities and dragging the icon to the dock.
    1. Change the default Terminal settings by starting Terminal.app, selecting Preferences (Cmd-,) and then changing the “new window with settings” to Pro.
    2. Select the Pro scheme in the Settings tab and click default.
    3. Choose the Window tab with the Pro scheme selected, click the Background color chooser and set the opacity level to 90%.
    4. Change the window size to 80 columns and 36 rows.
  16. Customize vim by creating ~/.vimrc with the following content:

    :color elflord
    :syntax enable
    :set shiftwidth=4
    :set expandtab
    :set autoindent
    :set cindent
    :set enc=utf-8
    :set nu
    :set showmatch
    :set laststatus=2
    :set nocompatible
    :set gfn=Monaco:h15:a
  17. Enable color highlighting for ls by adding the following lines to /etc/bashrc:

    alias ls='ls -CFG'
    alias dir='ls -FGlas'
  18. Install the Safari 4 beta from http://www.apple.com/safari/download
  19. Install Firefox 3 from http://getfirefox.com/ and drag it to the dock.
  20. Install iStat pro from http://www.islayer.com/apps/istatpro/
  21. Install MySQL 5.1 x86 community edition from http://dev.mysql.com/downloads/mysql/5.1.html and be sure to install the StartupItem package as well as the preference pane.
  22. Add MySQL to the shell profile by appending the following to /etc/bashrc:

    export PATH=/usr/local/mysql/bin:$PATH
  23. Install EverNote from http://www.evernote.com/
  24. Install DropBox from http://www.getdropbox.com/
  25. Install the Windows Media Components for QuickTime from http://www.microsoft.com/windows/windowsmedia/player/wmcomponents.mspx
  26. Install Twitterrific from http://iconfactory.com/software/twitterrific
  27. Disable automatic synchronization for iPhones and iPods since this won’t be the primary iTunes machine by going to iTunes Preferences and enabling Disable automatic syncing for iPhones and iPods on the Devices tab.
  28. Install the iPhone SDK from http://developer.apple.com/
  29. Party! Or maybe just nap.

Oddities in Gathering Windows Performance Data

At Zenoss we do quite a bit of remote monitoring of computers running Windows. In the Enterprise edition of the product, we collect raw performance counter data using the conventional remote Windows Registry APIs.

We ran into an issue recently with a customer running Windows 2000 where the data from the remote server was being truncated prematurely. Since we implement our own remote API (so we can run natively on Linux and with Python, rather than requiring Windows), there was some immediate concern that we had run into a low-level bug in our protocol implementation. Thanks to the release of the Windows Communications Protocols (MCPP) last year, we have great detail on how our API layer should function.

Reviewing the MCPP in detail compared to our implementation showed no bugs against the specification, but I did notice some odd behavior. Normally when using the RegQueryValue API you specify a NULL buffer pointer and a zero-length buffer size so that the call will provide the actual size of the buffer needed. With this particular customer’s server I noticed that the call wasn’t behaving as documented in the MCPP.

An error code of ERROR_MORE_DATA was being returned. The MCPP says that when this value is returned, the server will populate the size output variable with the actual size in bytes of the needed buffer. In this case, the returned size was always the same as the input size. After some experimentation I found that if I passed in a buffer approximately 64 Kbytes larger, the call would finally succeed.

While quite odd, this is actually the documented and expected behavior in the Win32 API documentation for RegQueryValueEx, but not in the MCPP. Specifically, when using the HKEY_PERFORMANCE_DATA key, ERROR_MORE_DATA behaves differently and the caller has more responsibility for guessing an appropriate buffer size.

The following pseudo-code shows the basic flow for how RegQueryValueEx should be used for either local or remote performance data access.

size = 65536 # starting size, probably computed from a previous registry call
params.in.data = params.out.data = buffer(size)
while 1:
    params.in.size = size
    params.out.size = 0
    dcerpc_winreg_QueryValue(params)
    if params.out.result == ERROR_MORE_DATA:
        size = size + 65536 # add another 64 Kbytes of data to the buffer
        params.in.data = params.out.data = buffer(size)
        continue
    break
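
To make the retry loop above concrete, here is a small runnable sketch that exercises it against a simulated server. Everything in it is an assumption for illustration: make_fake_server and query_with_growing_buffer are hypothetical names, the fake server merely mimics the HKEY_PERFORMANCE_DATA behavior described here (it echoes the input size back with ERROR_MORE_DATA until the buffer is large enough), and the sizes in the usage example are illustrative. No real registry or RPC call is made.

```python
ERROR_SUCCESS = 0
ERROR_MORE_DATA = 234  # Win32 error code value


def make_fake_server(required_size, payload_size):
    """Return a query function simulating the performance-data quirk."""
    def query(buffer_size):
        if buffer_size < required_size:
            # The reported size is just the input size, so the caller
            # learns nothing about how big the buffer needs to be.
            return ERROR_MORE_DATA, buffer_size, b""
        # On success, less data may come back than the buffer holds.
        return ERROR_SUCCESS, payload_size, b"\0" * payload_size
    return query


def query_with_growing_buffer(query, start_size=65536, step=65536,
                              max_size=16 * 1024 * 1024):
    """Retry with a buffer grown by `step` bytes until the call succeeds."""
    size = start_size
    attempts = 0
    while size <= max_size:
        attempts += 1
        status, _reported_size, data = query(size)
        if status == ERROR_MORE_DATA:
            size += step  # add another 64 Kbytes and try again
            continue
        if status != ERROR_SUCCESS:
            raise RuntimeError("query failed with error %d" % status)
        return data, size, attempts
    raise RuntimeError("gave up: buffer would exceed %d bytes" % max_size)


if __name__ == "__main__":
    server = make_fake_server(required_size=293500, payload_size=195000)
    data, size, attempts = query_with_growing_buffer(server)
    print(len(data), size, attempts)
```

The max_size cap is a design choice added for the sketch so that a misbehaving server cannot make the loop grow the buffer forever.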

After fixing that issue I was still left with one oddity. Let’s say, for example, it took a buffer of 293,500 bytes before the RegQueryValueEx call was successful. And yet, the actual amount of returned data would only be 195,000 bytes, or something similar. This behavior seems quite different than on the other Windows operating systems we have tried so far.

This is the first time we’ve tried our data collection against a Windows 2000 server running Exchange locally. Windows 2000 has also been the source of several other key behavior differences in how performance data is returned, so my current speculation is that how the server determines what data to return varies greatly between operating system versions. We normally query the performance counter registry for only a subset of values. It may well be that on Windows 2000 a buffer large enough to retrieve all performance counters is required, even though the completed call actually uses quite a bit less.

Quirky, but another bug gone.

The Magic of a Quiet Fan

A previous post discussed using an Intel D945GCLF2 Atom-based mainboard for a little mini server setup. It’s been working great, except for that damn fan for the memory controller. It was pretty noisy at first, but quickly degraded into a crunching worthless disaster.

The product review comments at Newegg described this exact behavior so I wasn’t too surprised. My friend Matt replaced his with a $3 fan immediately and has been happy with it since.

I opted for a slightly more expensive fan that is supposed to run at less than 14 dBA – a Silenx Ixtrema Pro Series. It’s still a 40mm fan with the same depth but I can’t hear it run at all. Product reviews imply this fan won’t stay quiet after about 6 months, so we’ll see if that happens. Ideally I’d just replace the crappy heatsink with a larger one and maybe a heat pipe to the case.

Now the only noise in the system is the WD Raptor drive. I’ve noticed the drive runs at between 50 and 55 degrees C since the case is not actively cooled, so that is on the hot side (WD’s MTBF testing was done at 50 degrees). It is way overkill for this box, so I may replace it with a WD Green Power drive, or just give up and do a small SSD.

But hey, I’ve got over a month of cool looking graphs out of Zenoss now, such as my router network traffic:

Zenoss Network Performance Graph
