Information Overload

August 14th, 2012

Anyone who has been in IT for more than 2 days knows that systems monitoring is important. Heck, I’m willing to bet you know it’s important even if you’ve only been in the field for 2 hours. If a system goes down, you need to find out about it sooner than the customer does, because when the customer finds out, you want to be able to say “Yes, we’re addressing the situation now.” Especially if the customer is your boss.

There is, however, a trap you can all too easily fall into – monitoring too much. Or at least, giving your sysadmins too much information from the monitoring system. Some of this is inevitable, especially when first setting up a new monitoring system. Some of it is politically driven, and you will get alerts for log messages that really don’t mean diddly in the grand scheme of things, because auditors want to see green lights turn yellow when someone mis-types a password. Much as it pains me to admit, these are some things that you just need to learn to live with.

What you should be careful to avoid at all costs is alerting people when things are working. Especially sysadmins. Quite often, the sysadmins don’t care – at all – if things are working as expected. A typical response: “So what? I’ve got these three fires to put out, five systems to build, four developer questions, and eight help desk escalations to deal with. Could you please go away and let me do my work?”

Okay, so obviously nobody will come up to you and congratulate you when your systems are working – your boss might show his/her appreciation somehow, but that will be at a group lunch or during a staff meeting or something where you’re already blocked off from doing useful work. So what’s the concern here? Email. Automated script email output. When you put a monitoring script in place, do not, for the love of the little people, let it send email to the entire world when it finishes running and all its checks have passed. Have it sit there silently. Only when it notices a failure should it make its presence known. If it talks too much, your sysadmins will start to ignore it – just like the little boy who cried wolf. Then when there is a problem, that signal gets lost in the noise of the chattering script.
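Here’s a minimal sketch of the idea in shell – the check command and the ops address are made up, so substitute your own. Staying silent on success also plays nicely with cron, which only mails you when a job produces output:

#!/bin/sh
# Hypothetical health check: say nothing on success, speak up on failure.
# Under cron, no output on success means no email on success.
if ! /usr/local/bin/check_service --quiet; then
  echo "check_service FAILED on $(hostname) at $(date)" | \
    mail -s "ALERT: service check failed on $(hostname)" ops@example.com
fi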

Programming Woes

August 6th, 2012

In my career as a sysadmin, I have admittedly written less documentation than I perhaps should have. Some of this was because I felt like the people who would be using the programs / scripts I was writing should know enough to not need documentation, but some fraction of it was just outright laziness. I admit that.

When I did write documentation, though, I made damn sure it was accurate. Right now, I’m starting work on a new project using TurboGears2 on RHEL 6, and the upstream TG2 documentation seems quite thorough – but is horribly inaccurate in some rather annoying ways. I am admittedly not using the most recent version, since I wanted to use the version packaged with RHEL – but even looking at the documentation for 2.0, it is inaccurate for what is actually available. The first error was in how the upstream docs told me to start a new project – the TG2 documentation for version 2.0 tells me to run “paster quickstart”, but the actual software (version 2.0.3) makes me run “paster create”. I had to figure this out by reading the error message I got in my shell when I ran “paster quickstart”. I’m not sure whether to blame upstream for bad documentation (quite possible) or Red Hat for dorking around with the package (also quite possible), but the end result in either case is a bad experience and a low opinion (so far) of TurboGears2 on RHEL.

Excessive automation

July 23rd, 2012

In this line of work, automation is a Good Thing. There are so many things you can do to make your life easier with scripting that it’s ridiculous. Just looking at doing new installs of Linux machines, you can do pretty much anything you want to in the %post section of a kickstart. And I do mean anything – you could, if you wanted to, even blow away the Linux you just finished installing and lay down a copy of Windows! Why you would want to do that remains a mystery, but it’s possible.
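For instance, a trivial %post that does a little site-specific tweaking (the particulars here are invented for illustration):

%post
# Runs chrooted into the freshly installed system at the end of the install.
echo "Built by kickstart on $(date)" > /etc/motd
# Example tweak: turn off a service you never want on servers.
chkconfig cups off
%end

(Older RHEL 5 kickstarts got by without the closing %end; newer releases expect it.)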

However, there does come a point where you’re over-automating, possibly because you’re trying to control too much. Two huge examples of what I consider over-automation: 1) fixing the position of kernel stanzas within the grub.conf file, and 2) munging files to conform to “preferred” presentations when there is no functional difference between what the system generates and the “preferred look”.

First, fixing stanza locations in grub.conf. Most of the time (most being 99.9%), grubby automatically puts new kernels at the beginning of the file, since they appear at the top of the list when you’re booting. That means a new kernel is already at position 0. Munging a file like this programmatically is HARD. The only people I’m aware of who are qualified to write a parser that ensures you always generate a correct file are the people interested in writing compilers, not people writing one-off scripts to deal with grub.conf syntax. Don’t try to parse it yourself – you will get it wrong. You may get it right for the first 6 months, or even 5 years, but eventually, your script will be wrong, and by that time, you probably will have forgotten where the script is and what the heck you were doing inside it, and nobody will be able to fix it. Let the system manage the order of kernel stanzas for you – the default kernel doesn’t have to be the first one listed, that’s why there’s a “default=” option. Besides, if you’re a Linux admin, you should be able to look at the file and figure out the default kernel from the default= option when you’re blind stupid drunk and half asleep.
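To illustrate, here’s a trimmed-down grub.conf (kernel versions are just placeholders). The default= index picks the default stanza no matter where it sits in the file:

# Boot the second stanza (index 1), regardless of file order.
default=1
timeout=5
title CentOS (2.6.18-308.4.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-308.4.1.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-308.4.1.el5.img
title CentOS (2.6.18-308.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-308.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-308.el5.img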

Similar reasoning tells you not to munge system-generated files for no other reason than a preferred presentation. Example? The kickstart creates an /etc/sysconfig/network file with HOSTNAME=<FQDN>. It’s functional. It’s system-generated. It will work just fine forever and ever amen. I know lots of people prefer to see HOSTNAME=<short name> and DOMAIN=<domain>. There’s no reason for that – they are functionally equivalent. Munging that file in that way will cause breakage as soon as the variable to specify the host name changes – granted, that may not happen until version five hundred thirty-three and seven tenths, but as soon as it does happen, *BOOM* your kickstart is broken. And you’re probably not sure where it’s broken, so you’re not sure how to fix it. That’s if you’re even still at that job – moving on after a few years isn’t all that uncommon these days – so more likely your inheritors are cursing your name and your father and your father’s father unto the seventh generation for inflicting on them a kickstart that is broken in a way they can’t figure out.
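The two forms side by side (the hostname is a placeholder); per the point above they behave identically, so leave the generated one alone:

# What kickstart generates - works as-is:
HOSTNAME=web01.example.com

# The "preferred look" people munge it into - no functional gain:
HOSTNAME=web01
DOMAIN=example.com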

Like with beer, moderation is the key to automation. Automate what you can easily automate, automate what makes sense to automate, but don’t try to automate everything just for the sake of automation.

Happy Password Change Day

June 22nd, 2012

If you work at a place like my company, your passwords expire regularly. If you have a job similar to mine, you have a whole mess of systems that you have to change your password on. If you have a personality like mine, this is a really boring task that you’d rather not deal with, but you have to.

Well, I have a solution for you. It’s based on my last post, that magical Python pexpect script. It’s stripped down a little more, but I’m sure you’ll find most of it very familiar if you read through the last one. So, without further ado, I introduce to you my magically delicious password change script:

#!/usr/bin/python
#
# Change a user's password on multiple systems, ensuring that the given user
# has valid sudo access. (2 birds, 1 stone)
#
# It requires that you have sudo available on the target(s) and that you
# can run the given command under sudo. It does not require SSH keys be set
# up, since it handles the password dialogs for both SSH login and sudo
# access.
#
# Any vulgarities in this code are the result of being lazy about
# case sensitivity checks, and are not deliberate. If you decide to be
# offended, you need to get over it.

import pexpect
from optparse import OptionParser
import os
import getpass
import signal
import sys
from datetime import datetime

DEBUG = 0
jams = []
misses = []
hits = []

def getTargets(hostspec):
  global DEBUG
  if os.path.isfile(hostspec):
    if DEBUG:
      print "Reading hosts from file "+hostspec
    fh = open(hostspec, 'r')
    hosts=fh.read()
  else:
    if DEBUG:
      print "Using hosts from command line."
    hosts=hostspec
  return hosts.split()

def pullTrigger(target, oldpass, newpass, username):
  global DEBUG, jams, misses, hits
  rangeHot = "\$ "
  # First, we launch the ssh process and get logged in to the target
  # Set a 5 minute timeout on commands, not 30 seconds
  proc = pexpect.spawn("ssh "+target, timeout=300)
  while True:
    index = proc.expect(["The authenticity of host", "assword:", "Permission denied", rangeHot, pexpect.EOF, pexpect.TIMEOUT])
    if index == 0:
      proc.sendline("yes")
    elif index == 1:
      proc.sendline(oldpass)
    elif index == 2:
      jams.append(target)
      if DEBUG:
        print "Dud cartridge. Clearing chamber, proceeding with firing plan..."
      proc.kill(signal.SIGKILL)
      return
    elif index == 3:
      break
    elif index == 4:
      jams.append(target)
      if DEBUG:
        print "Cartridge jammed, clearing chamber, proceeding with firing plan."
      proc.kill(signal.SIGKILL)
      return
    elif index == 5:
      jams.append(target)
      if DEBUG:
        print "Squib load. clearing chamber, proceeding with firing plan."
      proc.kill(signal.SIGKILL)
      return

  # Go root
  if DEBUG:
    print "Becoming root inside expect spawn."
  rangeHot = becomeRoot(proc, oldpass)
  if rangeHot == "EOF":
    misses.append(target)
    if DEBUG:
      print "Missed target low. Proceeding with firing plan."
    proc.kill(signal.SIGKILL)
    return
  if rangeHot == "TIMEOUT":
    misses.append(target)
    if DEBUG:
      print "Missed target high. Proceeding with firing plan."
    proc.kill(signal.SIGKILL)
    return

  # Change password
  proc.sendline("passwd "+username)
  proc.expect(":")
  proc.sendline(newpass)
  proc.expect(":")
  proc.sendline(newpass)

  index = proc.expect([rangeHot, pexpect.EOF, pexpect.TIMEOUT])
  if index != 0:
    misses.append(target)
    if DEBUG:
      print "Missed wide left. Proceeding with firing plan."
    proc.kill(signal.SIGKILL)
    return

  # A hit! A veritable hit! O frabjous day!
  hits.append(target)
  rangeHot = exitRoot(proc)
  proc.sendline("exit")

def exitRoot(proc):
  global DEBUG
  # Quick and dirty. This should really be nicer, but I'm lazy and it's
  # almost guaranteed to work if you actually got this far.
  if DEBUG:
    print "Leaving root shell."
  proc.sendline("exit")
  proc.expect("\$ ")
  return "\$ " 


def becomeRoot(proc, passwd):
  proc.sendline("uname -s")
  index = proc.expect(["SunOS", "Linux"])
  if index == 0:
    proc.sendline("super root-shell")
  elif index == 1:
    proc.sendline("sudo su -")
  while True:
    index = proc.expect(["assword", "\# ", pexpect.EOF, pexpect.TIMEOUT])
    if index == 0:
      proc.sendline(passwd)
    elif index == 1:
      return "\# "
    elif index == 2:
      return "EOF"
    elif index == 3:
      return "TIMEOUT"

def main():
  global DEBUG, jams, misses, hits

  # Set up command line options / arguments
  parser = OptionParser()
  parser.disable_interspersed_args()
  parser.set_defaults(saveResults=True)
  parser.add_option("-H", "--hosts", dest="hostspec", help="hosts to run the command(s) on", metavar="HOSTSPEC", default="pyexphosts")
  parser.add_option ("-d", "--debug", action="store_true", dest="debug", help="print debugging messages")
 
  (options, args) = parser.parse_args()

  if options.debug:
    DEBUG=1
  
  targets = getTargets(options.hostspec)

  username = raw_input("User name to change password for: ")

  oldpass = getpass.getpass("Old password: ")
  newpass = getpass.getpass("New password: ")

  for target in targets:
    if DEBUG:
      print "Launching commands at target "+target
    pullTrigger(target, oldpass, newpass, username)

  if (len(jams)):
    print "Jams noticed:"
    for target in jams:
      print "Target "+target
  if (len(misses)):
    print "Misses noticed:"
    for target in misses:
      print "Target: "+target
  print "Done changing passwords."

if __name__ == "__main__":
  main()

Run ALL the things – everywhere!

June 20th, 2012

Yes, that meme is a bit overused and trite. That’s okay, it’s still fun. At least, I think it is, and since I’m the author, my opinion is the one that counts.

So why am I using it? Well, I came across some information I needed to collect from all of our Linux systems the other day. We have an in-house routine called ‘rrun’ that will let us launch commands on a specified set of systems, as root, on demand. Simple solution, right? Well, not really – unfortunately, the thing I needed to run wouldn’t run properly inside of the ‘rrun’ tool. What’s a poor deprived sysadmin soul to do in this situation?

Hopefully not what I did. I basically reinvented the wheel – though I think I made it better.

I remembered using an expect-based script many years ago that would ssh out to various systems and run commands for you, and thinking it was a wonderful thing. Well, I didn’t have that script any longer, and since I didn’t really want to re-learn Tcl, I looked for alternatives. I found Python’s pexpect module, which is basically a reimplementation of expect in Python.

After a bit of thinking and a lot of coding, I came up with the code you see below. If you like it, feel free to use it, though do be warned that the version I’m posting has not been extensively tested or Fred-proofed. I’ve also got some work left on refining the debugging levels and such, but that’s for later.

And yes, I did have firearms on the brain when I was writing it.  🙂

#!/usr/bin/python
#
# Clone of 'rrun', an internal program that runs a command as root on
# multiple target systems.
#
# It requires that you have sudo available on the target(s) and that you
# can run the given command under sudo. It does not require SSH keys be set
# up, since it handles the password dialogs for both SSH login and sudo
# access.
#
# Any vulgarities in this code are the result of being lazy about
# case sensitivity checks, and are not deliberate. If you decide to be
# offended, you need to get over it.

import pexpect
from optparse import OptionParser
import os
import getpass
import signal
import sys
from datetime import datetime

DEBUG = 0
jams = []
misses = []
hits = []

def getTargets(hostspec):
  global DEBUG
  if os.path.isfile(hostspec):
    if DEBUG:
      print "Reading hosts from file "+hostspec
    fh = open(hostspec, 'r')
    hosts=fh.read()
  else:
    if DEBUG:
      print "Using hosts from command line."
    hosts=hostspec
  return hosts.split()

def loadAmmunition(cmdspec):
  global DEBUG
  if os.path.isfile(cmdspec):
    if DEBUG:
      print "Reading commands from file "+cmdspec
    fh = open(cmdspec, 'r')
    commands = fh.read()
    fh.close()
  else:
    if DEBUG:
      print "Using commands from command line."
    commands = cmdspec
  return commands

def readPassword():
  return getpass.getpass("Use what password? ")

def pullTrigger(target, cmds, passwd):
  global DEBUG, jams, misses, hits
  rangeHot = "\$ "
  # First, we launch the ssh process and get logged in to the target
  # Set a 5 minute timeout on commands, not 30 seconds
  proc = pexpect.spawn("ssh "+target, timeout=300)
  while True:
    index = proc.expect(["The authenticity of host", "assword:", "Permission denied", rangeHot, pexpect.EOF, pexpect.TIMEOUT])
    if index == 0:
      proc.sendline("yes")
    elif index == 1:
      proc.sendline(passwd)
    elif index == 2:
      jams.append(target)
      if DEBUG:
        print "Dud cartridge. Clearing chamber, proceeding with firing plan..."
      proc.kill(signal.SIGKILL)
      return
    elif index == 3:
      break
    elif index == 4:
      jams.append(target)
      if DEBUG:
        print "Cartridge jammed, clearing chamber, proceeding with firing plan."
      proc.kill(signal.SIGKILL)
      return
    elif index == 5:
      jams.append(target)
      if DEBUG:
        print "Squib load. clearing chamber, proceeding with firing plan."
      proc.kill(signal.SIGKILL)
      return

  # We're logged in. Create the shell file with the commands.
  proc.sendline("echo "+cmds+" > /tmp/expectcmd.sh")
  index = proc.expect([rangeHot, pexpect.EOF, pexpect.TIMEOUT])
  if index != 0:
    misses.append(target)
    if DEBUG:
      print "Stop firing into the ceiling! Proceeding with firing plan."
    proc.kill(signal.SIGKILL)
    return

  # Go root (if indicated by sys.argv[0])
  if (sys.argv[0].endswith("rlaunch") ):
    if DEBUG:
      print "Becoming root inside expect spawn."
    rangeHot = becomeRoot(proc, passwd)
    if rangeHot == "EOF":
      misses.append(target)
      if DEBUG:
        print "Missed target low. Proceeding with firing plan."
      proc.kill(signal.SIGKILL)
      return
    if rangeHot == "TIMEOUT":
      misses.append(target)
      if DEBUG:
        print "Missed target high. Proceeding with firing plan."
      proc.kill(signal.SIGKILL)
      return

  # Execute the command, redirecting stdout/stderr
  proc.sendline("/bin/sh /tmp/expectcmd.sh > /tmp/expectcmd.out 2>/tmp/expectcmd.err")
  index = proc.expect([rangeHot, pexpect.EOF, pexpect.TIMEOUT])
  if index != 0:
    misses.append(target)
    if DEBUG:
      print "Missed wide left. Proceeding with firing plan."
    proc.kill(signal.SIGKILL)
    return

  # A hit! A veritable hit! O frabjous day!
  hits.append(target)
  if ( sys.argv[0].endswith("rlaunch") ):
    rangeHot = exitRoot(proc)
  proc.sendline("exit")

def exitRoot(proc):
  global DEBUG
  # Quick and dirty. This should really be nicer, but I'm lazy and it's
  # almost guaranteed to work if you actually got this far.
  if DEBUG:
    print "Leaving root shell."
  proc.sendline("exit")
  proc.expect("\$ ")
  return "\$ " 


def becomeRoot(proc, passwd):
  proc.sendline("sudo su -")
  while True:
    index = proc.expect(["assword", "\# ", pexpect.EOF, pexpect.TIMEOUT])
    if index == 0:
      proc.sendline(passwd)
    elif index == 1:
      return "\# "
    elif index == 2:
      return "EOF"
    elif index == 3:
      return "TIMEOUT"

def cleanBrass(target, passwd):
  global DEBUG
  rangeHot = "\$ "
  if DEBUG:
    print "Cleaning up spent brass for target "+target
  # First, we launch the ssh process and get logged in to the target
  # Set a 5 minute timeout on commands, not 30 seconds
  proc = pexpect.spawn("ssh "+target, timeout=300)
  # We don't handle certain types of things we do in pulling the trigger since
  # we already know we succeeded once so we will succeed again.
  while True:
    index = proc.expect(["The authenticity of host", "assword:", rangeHot])
    if index == 0:
      proc.sendline("yes")
    elif index == 1:
      proc.sendline(passwd)
    elif index == 2:
      break

  # Go root (if indicated by sys.argv[0])
  if (sys.argv[0].endswith("rlaunch") ):
    if DEBUG:
      print "Becoming root inside expect spawn."
    rangeHot = becomeRoot(proc, passwd)
    if rangeHot == "EOF":
      if DEBUG:
        print "Spent brass behind you, not on range."
      proc.kill(signal.SIGKILL)
      return
    if rangeHot == "TIMEOUT":
      if DEBUG:
        print "Can't find any spent brass.."
      proc.kill(signal.SIGKILL)
      return

  # Execute the command, redirecting stdout/stderr
  proc.sendline("/bin/rm -rf /tmp/expectcmd.sh /tmp/expectcmd.out /tmp/expectcmd.err")
  proc.expect(rangeHot)
  if (sys.argv[0].endswith("rlaunch") ):
    rangeHot = exitRoot(proc)
  proc.sendline("exit")


def collectTarget(target, passwd):
  global DEBUG
  if DEBUG:
    print "Collecting results from target "+target
  proc = pexpect.spawn("scp "+target+":/tmp/expectcmd.out "+target+".out")
  while True:
    index = proc.expect(["assword:", "\$ ", pexpect.EOF, pexpect.TIMEOUT])
    if index == 0:
      proc.sendline(passwd)
    elif index == 1:
      break
    elif index == 2:
      break
    elif index == 3:
      if DEBUG:
        print "Can't find target. Proceeding to next collection."
      break
  proc = pexpect.spawn("scp "+target+":/tmp/expectcmd.err "+target+".err")
  while True:
    index = proc.expect(["assword:", "\$ ", pexpect.EOF, pexpect.TIMEOUT])
    if index == 0:
      proc.sendline(passwd)
    elif index == 1:
      break
    elif index == 2:
      break
    elif index == 3:
      if DEBUG:
        print "Can't find target. Proceeding to next collection."
      break

def setupTargetFile(dirname):
  if os.path.isdir(dirname):
    d = datetime.now()
    os.rename(dirname, dirname+d.isoformat('@'))
  os.mkdir(dirname)

def main():
  global DEBUG, jams, misses, hits

  # Set up command line options / arguments
  parser = OptionParser()
  parser.disable_interspersed_args()
  parser.set_defaults(saveResults=True)
  parser.add_option("-c", "--commands", dest="cmdspec", help="one-line command or file with commands to run", metavar="CMDSPEC", default="pyexpcmds")
  parser.add_option("-H", "--hosts", dest="hostspec", help="hosts to run the command(s) on", metavar="HOSTSPEC", default="pyexphosts")
  parser.add_option("-r", "--results", dest="resdir", help="store results files in DIR", metavar="DIR", default="pyexpresults")
  parser.add_option("-R", "--no-results", dest="nolog", action="store_true")
  parser.add_option("-p", "--password", dest="passwd", help="optional password to use (if not specified, you will be prompted)", metavar="PASSWORD")
  parser.add_option ("-d", "--debug", action="store_true", dest="debug", help="print debugging messages")
  parser.add_option ("-n", "--no-clean", action="store_true", dest="nocleanup", help="Do not clean up the results files on the target systems")
 
  (options, args) = parser.parse_args()

  if options.debug:
    DEBUG=1
  
  targets = getTargets(options.hostspec)

  cmds = loadAmmunition(options.cmdspec)

  if not options.passwd:
    password = readPassword()
  else:
    password = options.passwd

  for target in targets:
    if DEBUG:
      print "Launching commands at target "+target
    pullTrigger(target, cmds, password)

  if options.nolog:
    if DEBUG:
      print "Discarding targets."
  else:
    if DEBUG:
      print "Collecting targets..."
    setupTargetFile(options.resdir)
    os.chdir(options.resdir)
    for target in hits:
      collectTarget(target, password)
    os.chdir('..')

  if not options.nocleanup:
    if DEBUG:
      print "Cleaning up spent brass from misses."
    for target in misses:
      cleanBrass(target, password)

    if DEBUG:
      print "Cleaning up spent brass from hits."
    for target in hits:
      cleanBrass(target, password)

  if DEBUG:
    if (len(jams)):
      print "Jams noticed:"
      for target in jams:
        print "Target "+target
    if (len(misses)):
      print "Misses noticed:"
      for target in misses:
        print "Target: "+target
    print "All ammuntion spent. Hope you had fun at the range!"

if __name__ == "__main__":
  main()

Making things go!

June 13th, 2012

When I started this job, it took me about a week – just under, really – to figure out some ways I could make some very quick and very effective improvements. I’ve gone over some of those in varying detail in previous posts; it’s time to detail one of them in particular that I just accomplished.

One of the pain points I identified was the distribution of sysadmin tools. Before I arrived, it had been done by scp’ing a directory to newly deployed servers. I think we can all see the problems with that — data divergence, lack of updates, far too easy to forget to update one or more systems when a given script changes… fun times. I decided Something Had To Be Done. And Quickly.

So I did something. I started out by building a package of those scripts and getting it distributed by our RHN Satellite. That was fairly easy – once I had the package, I just signed and pushed it. Then I started to tackle automating the whole “creating the RPM package” bit, which was going to be a wee bit more difficult.

I started out with an empty SVN repository. I couldn’t figure out a clean way of keeping the specfile for the package in with the source tree, so I created two main directories in the repo – packages and specs. The specs directory just has the specfile, nothing more. The packages directory has all the fun stuff. Since I didn’t want packages to bleed through to each other, I then created a new directory for the first package; let’s call it “adminscripts” (no, that’s not the actual name I used, I’m sanitizing things as I write).

Inside the adminscripts directory, I established the usual trunk-tags-branches structure so common to SVN projects. This turned out to make things much easier down the line, but I can’t claim any sort of prescience about it – I just did it out of habit and because that’s the way the smart people do things. I’ve got the usual src/ directory off the main project directory, and a Makefile at the top level, so no surprises there. Making commits to the project, updating the source tree, and all that jazz is now “industry-standard” – anyone can start contributing as long as they know how things are done in 90% of open-source projects.
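So the repository layout ends up looking like this (using the sanitized name):

repo/
  specs/
    adminscripts.spec
  packages/
    adminscripts/
      trunk/
        Makefile
        src/
      tags/
      branches/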

Now comes the first challenge – how do I start with this SVN repository and extract a tar bundle of just the source code? Well, that’s sort of simple, just check out the code and get rid of the “.svn/” directories everywhere, then bundle it up – but I don’t necessarily want to build HEAD. Hmm. Okay, let’s use the tags/ directory and check out a specific tag. This also forces an extra step on the coders to tell the build system that a given revision is ready for packaging, not entirely a bad thing. So we tag it with the release and version we want the RPM package to be, and check out that tag.

Okay, so there’s at least one important detail – the checkout needs to be renamed after removing the .svn/ directories and before being bundled, since the rpmbuild process expects a directory named %{NAME}-%{VERSION}. That’s just an ‘mv’ command, though.
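Put together, the bundling step looks something like this sketch (the repository URL and package details are stand-ins):

#!/bin/sh
# Hypothetical values - substitute your repo and package details.
REPO=http://svn.example.com/repo/packages/adminscripts
NAME=adminscripts
VERSION=1.0
RELEASE=1

# Check out the tagged revision, not HEAD.
svn checkout ${REPO}/tags/${NAME}-${VERSION}-${RELEASE} checkout
# Drop the .svn/ working-copy droppings (svn export would skip them entirely).
find checkout -type d -name .svn -prune -exec rm -rf {} +
# rpmbuild expects %{NAME}-%{VERSION}, so rename before bundling.
mv checkout ${NAME}-${VERSION}
tar czf ${NAME}-${VERSION}.tar.gz ${NAME}-${VERSION}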

So now I have a way to get a specific version-release, how do I figure out *which* version-release? Turns out that’s remarkably simple – just parse the specfile with a little “awk”. I think I mentioned in a previous post just how much I love my little friend ‘awk’… anyway. Once I have the bundle, it’s a simple process to move the bundle and specfile into place and launch an rpmbuild job.
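The version-release extraction really is a one-liner against the specfile – assuming the Version: and Release: fields hold plain values; if your Release: uses macros like %{?dist}, you’ll need a bit more glue:

awk '/^Version:/ {v=$2} /^Release:/ {r=$2} END {print v "-" r}' specs/adminscripts.spec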

But wait… I don’t want to keep rebuilding the same thing every night if there’s no need to. Which means I need to track the builds I’ve done – or at least the ones that have succeeded. I chose to use a PostgreSQL database to do so, though I could have just as easily used any other database – or probably even flat files. I also want to know who to email on build errors – oh and on successes as well, that would be cool – so I throw that into the database.

Without going into too much detail about the database layout, I log which package-version-release combinations are built and when, and also log which emails go with errors and successes for which packages. Then I glue them together with a script that parses the specfile for the “current” version-release of all packages, checks to see if a build has been done, and if not launches the build script.
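The “should I build?” check then boils down to a query before launching rpmbuild – a sketch along these lines, with an invented table layout and build-script name:

#!/bin/sh
# Hypothetical database, table, and column names throughout.
NAME=adminscripts
VERREL=$(awk '/^Version:/ {v=$2} /^Release:/ {r=$2} END {print v "-" r}' specs/${NAME}.spec)
COUNT=$(psql -At -c "SELECT count(*) FROM builds WHERE package='${NAME}' AND verrel='${VERREL}'" builddb)
if [ "$COUNT" -eq 0 ]; then
  ./buildpackage.sh "${NAME}"    # made-up build-script name
fi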

So basically, in my first month-and-change, I’ve created an end-to-end automated CI build process that goes from source code check-ins to a package ready for signing and distribution. Sure, it’s small scale and systems-oriented rather than application-oriented, but it is a major accomplishment. Plus, it can be easily extended to build applications for deployment – I designed it to be extensible that way. Does it have some limitations? Sure – but for a company this size (~300 employees) in the IT industry (our primary focus is providing web and other IT based services), it’s a pretty hefty addition to the arsenal.

VMware: Taking two steps back for every step forward

June 8th, 2012

Some of you may remember an earlier rant I went on about VMware support. I’m glad to say that it got resolved at the time, though not as smoothly as I could have hoped. Still, it got resolved, and I went on my happy way.

I’m sorry to report that VMware has failed me yet again – this time in a spectacularly embarrassing way for its engineering department. See, I’m at a new job, and the new job also uses VMware virtualization – fairly heavily. We’ve been running into some performance problems on the guest VMs related to VMware Tools either not being installed or being out of date. That is clearly our problem – and we started resolving it, by compiling VMware Tools manually, since for some reason the kernel we’re running (RHEL 5, 2.6.18-308.1.1.el5) doesn’t have precompiled modules. No biggie, it still works.

Well, compiling manually by running vmware-config-tools.pl on hundreds of boxes isn’t gonna fly, so I looked into ways of compiling once and pushing out packages from our RHN Satellite. This is when VMware started to impress me. They have a new method of distributing the vmware-tools package via YUM repositories. I found a treasure trove of RPMs at http://packages.vmware.com/tools, including a source RPM for the kmod package. Hallelujah! Oh frabjous day! My task has just been vastly simplified!

So I pulled down the source RPM for the kernel modules for the version of ESX we have and launched an “rpmbuild -bb”. The build failed. Wait, WHAT? Turns out that the source for the vmxnet and vmxnet3 modules has a conflicting definition of “struct napi_struct”. Some research led me to figure out that the kmod source for 4.0U2 was okay, since it had taken into account a port of GRO that Red Hat had done. I created diffs of those two trees, added the patches to the 4.0U1 build, and the package built. Okay, a little annoying, but understandable.

Now that I have a kernel module package, time to start pulling down all the other packages – for which, I should note, the source RPM is *not* available. So I pull them down one by one, starting with just the base “vmware-tools” package, doing a manual dependency resolution with wget and “rpm -qp --requires”. Well, I finally get to the “vmware-open-vm-tools-xorg-utilities” package, and it requires both xorg-x11-drv-vmware and xorg-x11-drv-vmmouse. Those are actually in the base RHEL channel.
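The dependency walk was nothing fancier than this, repeated per package (the filename is a stand-in):

wget http://packages.vmware.com/tools/esx/4.0u1/rhel5/x86_64/vmware-open-vm-tools-xorg-utilities-<version>.x86_64.rpm
rpm -qp --requires vmware-open-vm-tools-xorg-utilities-<version>.x86_64.rpm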

This is where VMware failed utterly and completely. The binary RPM on their site, in the directory at http://packages.vmware.com/tools/esx/4.0u1/rhel5/x86_64, has version dependencies. Specifically, it depends on:

  • xorg-x11-drv-vmware >= 10.15.2.0
  • xorg-x11-drv-vmmouse >= 12.4.3.0

The versions of these packages available from Red Hat via RHN?

  • xorg-x11-drv-vmware           10.13.0-2
  • xorg-x11-drv-vmmouse          12.4.0-2

That’s right, the VMware packages for RHEL 5, as provided by VMware, require a version of Red Hat packages that doesn’t exist! What’s worse is the xorg-x11-drv-vmmouse package doesn’t seem to exist for RHEL 6, so I can’t even try to back-port the RHEL 6 packages to RHEL 5. Which means that the past 3 hours of work in trying to generate local packages to install VMware Tools and not have to do so manually was wasted because VMware’s build system for their vmware-tools package is fundamentally broken. Did nobody at VMware bother to do any quality checking to ensure these packages can be installed? Does anybody from VMware realize just how idiotic the entire VMware organization looks to me right now?

EDIT: I’ve now found the two packages that provide the xorg-x11-drv-vmmouse and xorg-x11-drv-vmware versions required. They’re in the VMware download folder, but they have “vmware-open-vm-tools-xorg” type names. Not really too smart there, VMware… either use the same name as upstream, or require a different package name please. Don’t muddy things up like that. You look a lot less idiotic now, but you still look idiotic.

Vendor tools

June 5th, 2012

A number of the vendors that make the products I use seem to have a bit of a disconnect. They look at modern corporate computing and see Windows ruling the desktop. Which it does, no question about it – and with (for the most part) good reason. So, they focus their efforts on a Windows way of managing their products.

Don’t get me wrong here, I haven’t taken leave of my senses and embraced Microsoft. I still don’t like the Windows world, and I still think Windows does plenty of things wrong. But it does enough right that it makes sense, even to me, as the corporate desktop of choice.

Unfortunately, my desktop is not where I do most of my work. That happens on the servers, which don’t have Windows GUIs – they have command line tools. What vendors don’t seem to understand is that anything they put in the GUI should also be available in the CLI. Allow me to provide a specific example: we use Symantec NetBackup for our backups. I’ve just been given ownership of backups for all our UNIX / Linux servers, so I want to know what’s going on with them. To that end, I’m trying to write some scripts that give me the information I want on a routine basis. Thanks to another sysadmin who’s a friend, I found the “bpdbjobs” binary – and oh what a wonderful binary it is. Unfortunately, it will either give me verbose information about database entries, or it will give me headers for the very limited information in its default report. It will not give me headers for the verbose report, which is the combination I need in order for it to be useful.
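In other words (going from the behavior described above – check your own NetBackup version before relying on this):

bpdbjobs -report                  # limited columns, but with headers
bpdbjobs -report -all_columns     # everything, but no headers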

Symantec, please give me a verbose (hint: -all_columns) report from bpdbjobs with header information. Yes, I can back up, yes I can restore, but if I don’t know – and can’t find out using your tools – what I’m backing up successfully, your product still isn’t doing what it should be.

The little things…

June 1st, 2012

It always seems that it’s all the little details that trip us up and cause the biggest problems. In life and in systems administration.

Well, it’s also those little things that have the biggest impact on others, and do the most towards solving problems and making life better. Yesterday, I was given “ownership” of our corporate backups. Ultimately, this just means that I take the blame for any major casters-up events, but it also means that I can make changes where I deem appropriate.

One of the things about the backup environment is that it sends out a report every morning regarding the previous day’s backup runs, which has to be checked for errors. The report is generated by doing nothing more than running a series of commands against the backup database, so it reports all the jobs in chronological order based on start time. While this makes perfect sense, it is rather frustrating to have to page through a 2000+ line email to try and determine if there were errors. It’s far too easy to overlook the one character that has the numeric job status when you have 1,998 zeros and two sixes in that column.

Well, computers are great at repetitive tasks like “check each line of data for a non-zero value in this column”, so I decided to do some slicing and dicing. I brought out an old and trusted friend, awk, and told it to find me all the non-zero values and report them to me, then to find the machines those values are associated with and report on all jobs on those machines. Then I had it put all that information at the beginning of the nightly report email so I don’t have to scroll through the huge 2000+ line report to find the anomalies. I went ahead and let the report script put the whole big thing at the end, just like it wanted to, to avoid making it jealous of awk and getting all whiny and constipated later, but I really do like the ‘awk-ed’ section better.
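The heart of it is a two-pass awk job, something like this sketch – the field positions are assumptions for illustration, since every report layout differs:

#!/bin/sh
# Assumes: client name in field 1, numeric job status in field 5.
REPORT=/tmp/nightly_backup_report.txt

# Pass 1 flags machines with any non-zero status; pass 2 prints all
# jobs for the flagged machines. Reading the file twice is the trick.
awk '
  NR == FNR { if ($5 != 0) bad[$1] = 1; next }
  ($1 in bad)
' "$REPORT" "$REPORT" > /tmp/anomalies.txt

# Anomalies up front, full report after, just like the prose describes.
cat /tmp/anomalies.txt "$REPORT" | \
  mail -s "Nightly backup report" backup-admins@example.com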

This morning, the other person whose job it is to go through this email and find the misbehaving systems came by and thanked me for making his job easier and his life better, since he now only has to spend about 30 seconds reading this email versus 5 to 10 minutes previously.

The Devil may be in the details, but often so is Salvation.

Reflections on a new job

May 25th, 2012

At the end of my third week in the new job, I’ve come to a couple of realizations. Most recent is that I’m having to watch the clock closely so that I leave on time. I’m a contractor, so I’m hourly, not salaried. That means it’s all too easy for me to start putting in overtime at a job I enjoy, and this is just such a job. Problem is, the manager here isn’t allowed to give me overtime without the approval of his higher-ups, and unless something major goes bonkers that approval probably won’t be forthcoming. This sort of bugs me, because I’ve already found myself wanting to stay later and get something finished, but it’s ultimately not that big an issue, I just have to budget my time closely.

I’m also re-discovering why I went into IT in general – and system administration in particular – in the first place. I’m already having a large influence on how things are done, making people’s lives easier and helping them sleep better at night. At least, I think I am, which is much the same thing as far as my enjoyment of the job is concerned. I’ve also already been given opportunities for input that my former boss would never have given me, and made decisions he would never have approved of simply because he didn’t understand the logic behind them (or was too caught up in having to appear in control of everything to actually let anyone else get any work done).

In future posts, I may detail some of the automation stuff I’m putting into place. It’s not really all that ground-breaking; honestly, it’s somewhat pedestrian and rote, simply because there are so many other places that have done similar things before. Still, it’s all my stuff, customized specifically for my new job, and I take a great deal of pride in the fact that not only did I put it into place, but that it’s been well-received by the admin team here and that it has (so far in its limited lifetime) worked quite well. Not flawlessly, but quite well.

I’ve still got a good bit of work to do to add the functionality I want to what I’ve already put in place, but I’m going to launch my next big initiative next week, addressing a completely different topic than workflow automation. It’s something I see as sorely needed (again) for the future of the environment, but it’s significantly larger and more invasive than the simple installation automation I’ve done so far, so I don’t know how well it’s going to fly. I have faith that the people here will see the need for it and the logic behind it, I just don’t know how easy it will be to actually make the necessary changes to put it into place. I’m hopeful.

Wow. That’s something I haven’t been able to say about my job for a long time. It feels really good to be able to say that.