Friday, March 11, 2016

A CRS Action Script for Solarwinds Database Performance Analyzer (DPA)

I've been using Solarwinds DPA for a few years now and find it to be a great tool for monitoring Oracle databases. It's one of the few GUI tools I actually enjoy working with. The interface is clean and provides great information without being cluttered and the navigation is intuitive.

We first installed DPA on a trial basis. It went on a box that, well, let's just say it was a place I could install some trial software without having to put a lot of effort into things. It worked out and we licensed it. In place. Over time, we've become more reliant on it for monitoring our Oracle databases. But it was still on a junker, single node database server.

A junker, single node database server that was about to go out of support.

We decided to move it to a RAC cluster. We'd get gaps in our monitoring when we applied database patches, or when the node had to be rebooted, or when the node just took a dump and RAC seemed like a nice solution.

DPA is really just a tomcat application that uses an Oracle database as a repository. With Grid Infrastructure it's possible to manage a resource like tomcat within CRS. This would mean DPA could be dependent on its database resource, start automatically on a server reboot, and on node failure it would relocate to another node.

For those that don't know, CRS manages resource through an action script. This is just a shell script that CRS runs that has options for starting, stopping, checking and cleaning a resource.

I figured Solarwinds would have a canned action script. Turns out they didn't, so I rolled my own. Here it is for anyone that's interested.

First is the action script itself:

 #!/bin/sh  
   
 # Startup, shutdown, clean and check script for SolarWinds DPA (Database Performance Analyzer) processes.  
   
 . /oracle/dpa/setenv.sh  
 logfile=$DPA_LOGS/dpactl.log  
   
 usage()  
 {  
  echo "Usage: $0 start|stop|check|clean"  
  exit 1  
 }  
   
 check_dpa()  
 {  
  check=$(ps -ef | grep $DPA_HOME | grep tomcat | grep -v grep | wc -l)   
  if [ $check -eq 1 ]  
   then exit 0  
   else exit 1  
  fi  
 }  
   
 getnow()  
 {  
  now=$(date '+%m/%d/%y %H:%M:%S')  
 }  
   
 case "$1" in  
    'start') getnow  
        echo "Starting Solarwinds DPA at ${now}." | tee -a $logfile  
        echo "This will take a few moments. Please be patient..." | tee -a $logfile  
        $DPA_HOME/startup.sh | tee -a $logfile  
   
        check_dpa  
        if [ $check -eq 1 ]  
         then getnow  
            echo "Solarwinds DPA started successfully at ${now}." | tee -a $logfile  
         else echo "There was a problem starting Solarwinds DPA!" | tee -a $logfile  
        fi  
    ;;  
   
    'stop') getnow  
        echo "Stopping Solarwinds DPA at ${now}." | tee -a $logfile  
        echo "This will take a few moments. Please be patient..." | tee -a $logfile  
        $DPA_HOME/shutdown.sh | tee -a $logfile  
        EXITCODE=$?  
        check_dpa  
   
        if [ $EXITCODE == 1 ] || [ $check -gt 0 ]  
         then error "There was a problem stopping Solarwinds DPA!" | tee -a $logfile  
        fi  
   
        getnow  
        echo "Solarwinds DPA stopped successfully at ${now}." | tee -a $logfile  
        exit 0  
    ;;  
   
    'check') check_dpa  
    ;;  
   
    'clean') dpa=$(ps -ef | grep $DPA_HOME | grep tomcat | grep -v grep | awk '{print $2}')  
        if [ ! -z $dpa ]   
         then kill -9 -f $dpa  
            exit 0  
         else exit 1  
        fi  
    ;;  
   
    *) usage  
    ;;  
   
 esac  

Next is a script that sets the environment, called at the beginning of the action script. It avoids having to hard-code directory paths into the action script itself.

 export DPA_HOME=/oracle/dpa/dpa_10_1_313  
 export DPA_LOGS=/oracle/dpa/logs  

Finally, there are the commands to create the Grid/CRS resource, as well as starting, stopping and relocating it.

 $CRS_HOME/bin/crsctl add resource solarwinds_DPA -type cluster_resource -attr "ACTION_SCRIPT=/path/to/script/solarwinds_dpa.sh, DESCRIPTION='SolarWinds Database Performance Analyzer', DEGREE=1, ENABLED=1, AUTO_START=always, START_TIMEOUT=0, CHECK_INTERVAL=60, STOP_TIMEOUT=0, SCRIPT_TIMEOUT=60, RESTART_ATTEMPTS=2, OFFLINE_CHECK_INTERVAL=0, START_DEPENDENCIES='hard(ora.repo.db)', ACL='owner:oracle:rwx,pgrp:oinstall:rwx,other::r--', CARDINALITY=1, FAILURE_INTERVAL=0, FAILURE_THRESHOLD=0"  
 $CRS_HOME/bin/crsctl start resource solarwinds_DPA  
 $CRS_HOME/bin/crsctl relocate resource solarwinds_DPA -n node2  

You'll need to install DPA on all nodes of the cluster, or in a shared directory. You'll also need a directory for the startup/shutdown logs.

Tuesday, March 8, 2016

The correct answer to the interview question, "Have you ever had to restore using a backup?"

I'm constantly amazed by the answers DBA candidates give to the question, "When was the last time you had to restore using a backup?" The most common is, "Our environment is pretty stable/robust/well engineered, we've never had to do that!"

I assume it's meant to indirectly demonstrate their capabilities as a DBA, engineer or architect, and it's great that you've never had a production situation that required you to recover.

But what happens when you do have to recover.

Recently, a candidate told me that they hadn't had to recover but the process was well documented.

Right.

I wondered, "When?" Documents can be incomplete. Processes change over time. Systems are upgraded and improved and migrated to new hardware. That documentation may have been accurate when written, but the evolution of the environment may have made it confusing, misleading or even wrong. It may have made sense to the person who wrote it but it may contain assumptions or be missing steps. You may not want to discover any of this with the CEO standing over your shoulder.

Recovery is arguably the most stressful situation a DBA may encounter in their career. A slow-performing database is still running, but when you're down waiting for recovery to complete, all eyes are on the DBA. A multitude of variables add to the pressure—do I have all the files I need, did I remember to set the environment correctly, is there enough space on the recovery system?

Most DBAs who haven't had to perform a production recovery don't understand that the situation is rarely ideal. Databases, like kids and pets, only get sick after the doctor's office has closed for a long holiday weekend. (The most significant recovery I've performed began late on the Friday of Memorial Day weekend. I was alerted as I was walking out the door for a trip with my wife and kids.)

Consider that during a restore you may be at the mercy of others. You may have to rely on storage admins to make space available, network admins to open routes, system admins to build hosts from bare metal, colocation facilities to grant physical and virtual access...

Why would you leave any of that to chance?

The correct answer to the question, "Have you ever had to restore using a backup?" is (minimally) "Yes, during the last quarter I performed a restore as part of a regular readiness exercise."

If your employer doesn't require this, fight to make it their policy. At the very least you'll be able to say that you take backup and recovery seriously and demonstrate a commitment to keeping an environment stable and highly available.

DBAs that aren't practicing recovery are akin to soccer or hockey goalies who never practice defending penalty shots. The heat of the moment is not the time to be figuring it out!