Monday, February 19, 2018

Rapidly increasing numbers of scheduler jobs in Oracle (11gR2 edition)

I recently encountered an interesting issue in our Oracle 11gR2 databases (it occurs in 12c as well) where the number of scheduler jobs increases unchecked. If you're running DBMS_SCHEDULER jobs this could be impacting you. I'm a little embarrassed to admit that I didn't discover the root cause at first; instead, I noticed that queries on DBA_OBJECTS were performing poorly. Scheduler jobs are objects, and, well, after thousands of them accumulated in DBA_OBJECTS we hit a threshold where the execution plan changed on those queries. Digging backwards, I realized there were a ton of rows with object_type = 'JOB'.
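A quick count of job objects in the dictionary will tell you whether you're affected (the name pattern is the same one used in the cleanup block below):

select count(*)
  from dba_objects
 where object_type = 'JOB'
   and object_name like 'AQ$_PLSQL_NTFN%';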

The issue is described in Metalink document 2001165.1 and is caused by multiple bugs. I ended up running a slightly different solution than the one recommended, partly because the described solution isn't quite foolproof, and partly because PL/SQL seems, to me, a better, faster, and more elegant approach than the recommended method of writing dynamic SQL to a file and then executing the file.

I solved it with:

alter system set job_queue_processes=0 scope=memory sid='*';

begin
  for c1 in (select job_name, state
               from dba_scheduler_jobs
              where job_name like 'AQ$_PLSQL_NTFN_%')
  loop
    begin
      if c1.state = 'RUNNING' then
        dbms_scheduler.stop_job(c1.job_name);
      end if;
      dbms_scheduler.drop_job(c1.job_name);
    exception
      when others then null;
    end;
  end loop;
end;
/

Feel free to put something more meaningful than a "when others then null" exception handler here. I was lazy!

Then apply the recommended patch and bounce the instance. In a RAC environment the patch can be applied and instance restarts rolled across the nodes, but ideally don't revert job_queue_processes on a node until it has been restarted, else you'll see a new batch of jobs accumulating to be purged.

NOTE: I did try applying the patch in one environment prior to setting the job_queue_processes parameter and purging the orphaned records, but the issue persisted. It appears to be sensitive to the order:

  • Set job_queue_processes=0
  • Purge the records
  • Apply the patch
  • Restart the instance and restore the value of job_queue_processes (automatic if you use scope=memory)
This means the job scheduler will not be running for the duration of patching. In a single-instance system you have downtime from the restart, but in a RAC environment jobs are suspended only for the duration of the code run, patching, and restart. It is possible to run the PL/SQL above prior to setting the job_queue_processes parameter in order to clear out a backlog of jobs (which can be time consuming if you have thousands or more), but realize that the final flush of records does appear to require that the value be set to zero.

When complete, I'd recommend collecting dictionary statistics, since the large number of jobs deleted could easily skew optimizer operations (and that's how I originally discovered we were victims of the bug).
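A single call does it; for example:

exec dbms_stats.gather_dictionary_stats;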

Friday, December 8, 2017

Persistent Data Volumes for Oracle in Docker

A good use of Oracle on Docker is for building out test environments. Developers can spin up an Oracle database as part of an application stack in no time flat and get to work. That's great news for DBAs, particularly if you don't enjoy installing Oracle on dozens of development servers!

The only downside is that it's unlikely that the Docker build of Oracle will come with your data. What that means for developers is that after building out an Oracle installation, it needs to be populated with data. OK, no problem—that's what expdp and impdp are for, right?

Docker containers are ephemeral. Once you burn one down it's gone for good. And... so is the data you spent so much time populating.

The solution is to put the data in a mapped volume outside the container. (For great tutorials on building an Oracle Docker container database, see Maria Colgan's tutorial and this follow-up by Steven Feuerstein.)

I was recently asked to build a gold image for our developers to use with Oracle 11g XE database installs and to demonstrate that the external data volume was portable: in other words, that it would work on other people's machines, in different locations, and when accessed by differently named containers. I thought I'd share the experience.

The following technique is applicable to any version of Oracle on Docker (11g, 12cR1, 12cR2) but I'll be demonstrating with the 11g XE install our developers wanted because of the licensing flexibility.

The first thing I encountered was a notably different experience between building an 11g XE container and my prior work with 12c on Docker. The 11g XE install has a swap space check that, on OS X, prevents the install from completing. Ottmar Gobrecht has an excellent post describing the steps to work around the issue.

With that out of the way I was able to build the 11g XE image and get to work.

The magic is all in how you run the Docker run command. There's a -v option for specifying the mapping between a path "inside" the container and a "real" path on the host. Everything "inside" the container is lost when it's removed, but the data on the host volume "outside" the container remains. This means I can build a container, create the volume on my local host, then zip it up and distribute it to others.

This is an example of the command I ran to create my container:

 docker run --name oracle-xe \  
 -p 1521:1521 \
 --shm-size=1G \
 -e ORACLE_PWD=manager \
 -v ~/oradata/XE:/u01/app/oracle/oradata \
 oracle/database:11.2.0.2-xe

The notable part is the -v directive. It's mapping /u01/app/oracle/oradata in the container to ~/oradata/XE on the host. The container path is defined in docker-images-master/OracleDatabase/dockerfiles/<db_version>/Dockerfile.<edition>. All of the configurations for the database (including the contents of $ORACLE_HOME/dbs, $ORACLE_HOME/network/admin, and all the controlfiles, datafiles, and redo log files) will be linked to that location. In other words, everything needed to start and run a database.

The initial build of an Oracle database in Docker will spawn the database creation. There's a section in the docker-images-master/OracleDatabase/dockerfiles/<db_version>/runOracle.sh script that checks for the existence of a database:
 if [ -d $ORACLE_BASE/oradata/$ORACLE_SID ]; then  
   symLinkFiles;  
   # Make sure audit file destination exists  
   if [ ! -d $ORACLE_BASE/admin/$ORACLE_SID/adump ]; then  
    su -p oracle -c "mkdir -p $ORACLE_BASE/admin/$ORACLE_SID/adump"  
   fi;  
 fi;  


Normally, when building a database, you'd see output or log entries similar to:
Copying database files
1% complete
25% complete
Creating and starting Oracle instance
26% complete
41% complete
Completing Database Creation
42% complete
50% complete
Creating Pluggable Databases
55% complete
75% complete
Executing Post Configuration Actions
100% complete

...eventually followed by the message:
#########################
DATABASE IS READY TO USE!
#########################

After this first build I added new datafiles and imported data into the database. Then, I dropped the container:
 docker rm --force oracle-xe  


I next rebuilt the container using a different name:
 docker run --name xe \  
 >      -p 1521:1521 \ 
 >      --shm-size=1G \ 
 >      -e ORACLE_PWD=manager \ 
 >      -v ~/oradata/XE:/u01/app/oracle/oradata \ 
 >      oracle/database:11.2.0.2-xe 
 #########################   
 DATABASE IS READY TO USE!   
 #########################   

Note that the first message from the build was "DATABASE IS READY TO USE!" There was no database creation, because the runOracle.sh script recognized the existence of the database at the specified location!

I achieved the same result after moving the files to a different location (and changing the local mapping path in the -v option accordingly) and when moving to a different machine. Voila! A portable, persistent data volume for your Oracle database on Docker!
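"Distributing" the volume really is that low-tech. Here's a sketch of the kind of thing I did (paths are just examples, and depending on how your Docker host handles ownership you may need to adjust permissions so the container's oracle user can write to the files):

 # On the source machine: archive the data volume
 tar -czf xe-oradata.tar.gz -C ~/oradata XE

 # On the destination machine: extract it and point a new container at it
 mkdir -p ~/oradata
 tar -xzf xe-oradata.tar.gz -C ~/oradata
 docker run --name xe -p 1521:1521 --shm-size=1G \
  -e ORACLE_PWD=manager \
  -v ~/oradata/XE:/u01/app/oracle/oradata \
  oracle/database:11.2.0.2-xe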

Thursday, June 22, 2017

Sending messages from an Oracle database to Slack

Slack is becoming a standard tool for communication among DevOps teams. It's easily extended and integrated with other systems, providing a more flexible alerting structure.

Consider the legacy "pager duty" structure, where one or more members of a team get paged for an event. The pages are distributed but team members are disconnected, so without a second mechanism such as email or chat you may not be aware that someone else is already fixing the problem.

Slack blends the two, so it's a natural evolution to dispense with traditional pages (Slack will page you with WhatsApp-style messages anyway and can be customized) and integrate database messaging directly into Slack.

The process for configuring the integration takes less than an hour.

Set up Webhooks integration on Slack

Go to slack.com and sign up for an account if you don't already have one. It's free.

Once you're in Slack, create a new channel (or use an existing one) for messaging. I chose to create a private channel called #oracle. A private channel allows you to control membership. A public channel can be joined by anyone in your Slack domain.



With the channel created, click on the Settings gear at the top. You'll get a drop-down dialog; select "Add an app or integration".



This will take you to Slack's app and integration page. In the search box, type "Incoming Webhooks" and choose the Incoming Webhooks integration.



Click on "Add Configuration". This will take you to the configuration options, where you'll add the channel you want to send notifications to.



The next screen will provide your Webhook URL. Treat this like a password—it's the URL you'll use to send information from the database to Slack. This page also gives you basic information on how to send and format JSON to Slack.
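Before involving the database, it's worth a quick test of the webhook from the command line. Something like this (with your own webhook URL substituted) should drop a message into the channel:

curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test post from curl"}' \
  https://hooks.slack.com/services/your_webhook_url_goes_here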

 

You'll have further options to customize your channel and give it meaning on the following page:



Configure an Oracle Wallet and add the Slack Certificate

Slack uses SSL (HTTPS) so you can't just send web requests directly from Oracle. You have to set up a Wallet and import the Slack certificate.

First create a Wallet:

mkdir -p /oracle/wallet
orapki wallet create -wallet /oracle/wallet -pwd suPerStr0ngP@ssword -auto_login

Go to slack.com and export the certificate. The procedure will depend on your operating system and browser. In Safari on a Mac, navigate to slack.com and click on the lock icon beside the domain, then click and drag the certificate to your Desktop.



Copy the certificate to your database server and note the location. For this demonstration I added the certificate to /oracle/wallet. You could easily put it in a shared location, particularly if you have multiple database servers to configure, as it will save you copying the certificate to each machine.

Now, add the certificate to the Wallet:

orapki wallet add -wallet /oracle/wallet -trusted_cert -cert "/oracle/wallet/GeoTrustGlobalCA.cer" -pwd suPerStr0ngP@ssword
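You can verify that the certificate made it into the Wallet with:

orapki wallet display -wallet /oracle/wallet -pwd suPerStr0ngP@ssword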

Send JSON from Oracle to Slack

With the certificate imported into the Wallet, all we need to do now is send a POST request to the Webhooks URL with the JSON message we want to have reported in Slack.

declare
  req      utl_http.req;
  res      utl_http.resp;
  url      varchar2(4000) := 'https://hooks.slack.com/services/your_webhook_url_goes_here';
  buffer   varchar2(4000);
  content  varchar2(4000);
  db_name  v$database.name%TYPE;
begin
  utl_http.set_wallet('file:/oracle/wallet', 'suPerStr0ngP@ssword');

  select name
    into db_name
    from v$database;

  content := '{"channel": "#oracle", "username": "orabot", "text": "Test post to the #oracle channel from database ' || db_name || '", "icon_emoji": ":oracle:"}';

  req := utl_http.begin_request(url, 'POST', 'HTTP/1.1');
  utl_http.set_header(req, 'user-agent', 'mozilla/4.0');
  utl_http.set_header(req, 'content-type', 'application/json');
  utl_http.set_header(req, 'Content-Length', length(content));

  utl_http.write_text(req, content);

  res := utl_http.get_response(req);

  -- Read back Slack's response (typically just "ok") until end of body
  begin
    loop
      utl_http.read_line(res, buffer);
      dbms_output.put_line(buffer);
    end loop;
  exception
    when utl_http.end_of_body
    then utl_http.end_response(res);
  end;
end;
/

The end result is the message being delivered to Slack.

Tuesday, December 6, 2016

Adding a schema to an active SharePlex configuration

In this post I'll address how to add a new schema to an active SharePlex configuration. This procedure doesn't require you to recreate your entire replication topology. It can be used to push data from one or more schemas in your source database to an existing or new target database. It can also be used to add replication for a schema that you're already replicating to a new target.
This post assumes some familiarity with SharePlex and an understanding of the basic architecture, filesystem layout and terminology. Examples are from an installation of SharePlex 8.6.2 on RHEL 6.6 replicating an Oracle 11.2.0.4 SE database, but should be applicable to other versions as well.

Overview

There are a handful of steps to be performed:
  • Copy the active configuration to a new file
  • Edit and validate the new configuration
  • Synchronize the new objects
  • Activate the new configuration

Copy the active configuration

Never edit an active configuration file! Instead, use the sp_ctrl utility to make a copy of the configuration:
copy config prod_config1 to prod_config2

Edit the new configuration

Still working within sp_ctrl, edit the configuration with:
edit config prod_config2
This will open an editor window where you can make changes to your configuration. At this point you'll be able to add new schemas or targets to replication.
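For illustration only (the datasource, schema name, and routing host below are placeholders, and your wildcarding may differ), an added schema typically ends up as a line or two under the datasource along these lines:

datasource:o.PROD

#source object       target object        routing
NEWSCHEMA.%          NEWSCHEMA.%          targethost@o.PROD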
When complete, save the changes and validate the configuration:
verify config prod_config2
It's very important that the file is verified as valid:
sp_ctrl (crbstgspx:2100)> verify config prod_config2

Config prod_config2 is valid
View detailed config in /u01/splex/var/temp/prod_config2

And not:
Verified config file too large to display
...which indicates an issue in the configuration that needs to be addressed. Absence of an actual error or warning message is a little misleading here.

Synchronize the new objects

Export or otherwise copy the newly added objects to the target database. I'll cover methods for this in a separate post.

Activate the new configuration

Activate the new configuration through sp_ctrl:
activate config prod_config2
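Assuming the standard sp_ctrl commands, you can confirm the result afterwards: list config shows each configuration file and its state, and show config summarizes the one that's now active.
list config
show config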

Tuesday, May 24, 2016

Improving the performance of ORDER_LASTMOD_IDX on DCSPP_ORDER in ATG

I just stumbled on Metalink article 1603359.1, which recommends that ATG customers encountering contention on the ORDER_LASTMOD_IDX index on table DCSPP_ORDER drop the index. It's noted that this is especially likely in RAC and Exadata.

The indexed column, LAST_MODIFIED_DATE (a TIMESTAMP), changes frequently for an order, and every order being changed is adding or updating nearly the same value. That means every session is attempting to insert into the same index leaf block, so in RAC there is heavy competition for that block as each node attempts to capture it.

The index will also grow significantly faster than its data. Say an order goes through a series of ten updates; each requires that the "old" index entry be deleted and a new one added further "up" the index (we're talking about a timestamp, after all). An order updated ten times leaves nine empty slots behind, which will never be reclaimed since a record will never be updated to a past time.

This can certainly be dealt with by rebuilding the index periodically, but it doesn't solve the contention issue. For shops using Standard Edition, rebuilding an index comes at the cost of locking the table, effectively causing an outage during the maintenance. Dropping the index comes at the expense of poor performance of some ATG queries that search on LAST_MODIFIED_DATE.

But there's another solution—rebuilding the index with the REVERSE keyword.
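The conversion is a one-line rebuild (with the usual Standard Edition caveat: the rebuild locks the table for its duration, so schedule it accordingly):

alter index ORDER_LASTMOD_IDX rebuild reverse;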

A reverse key index spreads the activity across many "hot" blocks instead of just one. Instead of entering the timestamp in incremental order, with the least volatile components (year, month, day) occupying the most volatile spot (the leading edge of the index), a reverse key index on this column puts the most volatile portion (the seconds and milliseconds) first and the least volatile portion last.

The index won't fragment, either, since the "empty" spots can be reused. Under the default configuration we saw this index grow to 11G or more in a matter of months, while the table itself was under 3G. Since converting the index to a reverse key it has remained at a much more stable (and manageable) 1G, despite the table growing to over 5G.

Waits, in particular Global Cache waits associated with RAC attempting to obtain a cached block from another node, virtually disappeared for us.

The only downside is that any BETWEEN condition against the LAST_MODIFIED_DATE will not use the index, since a reverse key index can't be used for a range scan.

Friday, March 11, 2016

A CRS Action Script for Solarwinds Database Performance Analyzer (DPA)

I've been using Solarwinds DPA for a few years now and find it to be a great tool for monitoring Oracle databases. It's one of the few GUI tools I actually enjoy working with. The interface is clean and provides great information without being cluttered, and the navigation is intuitive.

We first installed DPA on a trial basis. It went on a box that, well, let's just say it was a place I could install some trial software without having to put a lot of effort into things. It worked out and we licensed it. In place. Over time, we've become more reliant on it for monitoring our Oracle databases. But it was still on a junker, single node database server.

A junker, single node database server that was about to go out of support.

We decided to move it to a RAC cluster. We'd get gaps in our monitoring when we applied database patches, when the node had to be rebooted, or when the node just took a dump, and RAC seemed like a nice solution.

DPA is really just a tomcat application that uses an Oracle database as a repository. With Grid Infrastructure it's possible to manage a resource like tomcat within CRS. This would mean DPA could depend on its database resource, start automatically on a server reboot, and relocate to another node on node failure.

For those who don't know, CRS manages a resource through an action script. This is just a shell script, run by CRS, with options for starting, stopping, checking, and cleaning the resource.

I figured Solarwinds would have a canned action script. Turns out they didn't, so I rolled my own. Here it is for anyone that's interested.

First is the action script itself:

 #!/bin/sh  
   
 # Startup, shutdown, clean and check script for SolarWinds DPA (Database Performance Analyzer) processes.  
   
 . /oracle/dpa/setenv.sh  
 logfile=$DPA_LOGS/dpactl.log  
   
 usage()  
 {  
  echo "Usage: $0 start|stop|check|clean"  
  exit 1  
 }  
   
 check_dpa()
 {
  # Count running DPA tomcat processes; 1 means DPA is up
  check=$(ps -ef | grep $DPA_HOME | grep tomcat | grep -v grep | wc -l)
 }
   
 getnow()  
 {  
  now=$(date '+%m/%d/%y %H:%M:%S')  
 }  
   
 case "$1" in  
    'start') getnow  
        echo "Starting Solarwinds DPA at ${now}." | tee -a $logfile  
        echo "This will take a few moments. Please be patient..." | tee -a $logfile  
        $DPA_HOME/startup.sh | tee -a $logfile  
   
        check_dpa  
        if [ $check -eq 1 ]
         then getnow
            echo "Solarwinds DPA started successfully at ${now}." | tee -a $logfile
            exit 0
         else echo "There was a problem starting Solarwinds DPA!" | tee -a $logfile
            exit 1
        fi
    ;;  
   
    'stop') getnow  
        echo "Stopping Solarwinds DPA at ${now}." | tee -a $logfile  
        echo "This will take a few moments. Please be patient..." | tee -a $logfile  
        $DPA_HOME/shutdown.sh | tee -a $logfile
        check_dpa

        # Fail if a DPA tomcat process is still running after shutdown
        if [ $check -gt 0 ]
         then echo "There was a problem stopping Solarwinds DPA!" | tee -a $logfile
            exit 1
        fi

        getnow
        echo "Solarwinds DPA stopped successfully at ${now}." | tee -a $logfile
        exit 0
    ;;  
   
    'check') check_dpa
        if [ $check -eq 1 ]
         then exit 0
         else exit 1
        fi
    ;;
   
    'clean') dpa=$(ps -ef | grep $DPA_HOME | grep tomcat | grep -v grep | awk '{print $2}')
        # Force-kill the DPA tomcat process if one was found
        if [ ! -z "$dpa" ]
         then kill -9 $dpa
            exit 0
         else exit 1
        fi
    ;;  
   
    *) usage  
    ;;  
   
 esac  

Next is a script that sets the environment, called at the beginning of the action script. It avoids having to hard-code directory paths into the action script itself.

 export DPA_HOME=/oracle/dpa/dpa_10_1_313  
 export DPA_LOGS=/oracle/dpa/logs  

Finally, there are the commands to create the Grid/CRS resource, as well as starting, stopping and relocating it.

 $CRS_HOME/bin/crsctl add resource solarwinds_DPA -type cluster_resource -attr "ACTION_SCRIPT=/path/to/script/solarwinds_dpa.sh, DESCRIPTION='SolarWinds Database Performance Analyzer', DEGREE=1, ENABLED=1, AUTO_START=always, START_TIMEOUT=0, CHECK_INTERVAL=60, STOP_TIMEOUT=0, SCRIPT_TIMEOUT=60, RESTART_ATTEMPTS=2, OFFLINE_CHECK_INTERVAL=0, START_DEPENDENCIES='hard(ora.repo.db)', ACL='owner:oracle:rwx,pgrp:oinstall:rwx,other::r--', CARDINALITY=1, FAILURE_INTERVAL=0, FAILURE_THRESHOLD=0"  
 $CRS_HOME/bin/crsctl start resource solarwinds_DPA  
 $CRS_HOME/bin/crsctl relocate resource solarwinds_DPA -n node2  

You'll need to install DPA on all nodes of the cluster, or in a shared directory. You'll also need a directory for the startup/shutdown logs.
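Once everything is in place the resource behaves like any other CRS resource, so checking on it is simply:

 $CRS_HOME/bin/crsctl status resource solarwinds_DPA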

Tuesday, March 8, 2016

The correct answer to the interview question, "Have you ever had to restore using a backup?"

I'm constantly amazed by the answers DBA candidates give to the question, "When was the last time you had to restore using a backup?" The most common is, "Our environment is pretty stable/robust/well engineered, we've never had to do that!"

I assume it's meant to indirectly demonstrate their capabilities as a DBA, engineer or architect, and it's great that you've never had a production situation that required you to recover.

But what happens when you do have to recover?

Recently, a candidate told me that they hadn't had to recover but the process was well documented.

Right.

I wondered, "When?" Documents can be incomplete. Processes change over time. Systems are upgraded and improved and migrated to new hardware. That documentation may have been accurate when written, but the evolution of the environment may have made it confusing, misleading or even wrong. It may have made sense to the person who wrote it but it may contain assumptions or be missing steps. You may not want to discover any of this with the CEO standing over your shoulder.

Recovery is arguably the most stressful situation a DBA may encounter in their career. A slow-performing database is still running, but when you're down waiting for recovery to complete, all eyes are on the DBA. A multitude of variables add to the pressure—do I have all the files I need, did I remember to set the environment correctly, is there enough space on the recovery system?

Most DBAs who haven't had to perform a production recovery don't understand that the situation is rarely ideal. Databases, like kids and pets, only get sick after the doctor's office has closed for a long holiday weekend. (The most significant recovery I've performed began late on the Friday of Memorial Day weekend. I was alerted as I was walking out the door for a trip with my wife and kids.)

Consider that during a restore you may be at the mercy of others. You may have to rely on storage admins to make space available, network admins to open routes, system admins to build hosts from bare metal, colocation facilities to grant physical and virtual access...

Why would you leave any of that to chance?

The correct answer to the question, "Have you ever had to restore using a backup?" is (minimally) "Yes, during the last quarter I performed a restore as part of a regular readiness exercise."

If your employer doesn't require this, fight to make it their policy. At the very least you'll be able to say that you take backup and recovery seriously and demonstrate a commitment to keeping an environment stable and highly available.

DBAs that aren't practicing recovery are akin to soccer or hockey goalies who never practice defending penalty shots. The heat of the moment is not the time to be figuring it out!