Monday, February 19, 2018

Rapidly increasing numbers of scheduler jobs in Oracle (11gR2 edition)

I recently encountered an interesting issue in our Oracle 11gR2 databases (it occurs in 12c also) where the number of scheduler jobs increases without check. If you're running DBMS_SCHEDULER jobs this could be impacting you. I'm a little embarrassed to admit that I didn't discover the root cause at first, instead I realized that queries on DBA_OBJECTS were performing poorly. Scheduler jobs are objects, and, well, after thousands were populated in DBA_OBJECTS we hit a threshold where the execution plan changed on these queries. Digging backwards I realized that there were a ton of rows with object_type = 'JOB'.

The issue is described in Metalink document 2001165.1 and caused by multiple bugs. I ended up running a slightly different solution than recommended. This is partly because the described solution isn't quite foolproof and partly that PL/SQL seems, to me, a better, faster, and more elegant solution than the recommended method of writing dynamic SQL to a file and then executing the file.

I solved it with:

alter system set job_queue_processes=0 scope=memory sid='*';

     begin
       for c1 in (
              select job_name, state
                from dba_scheduler_jobs
               where job_name like 'AQ$_PLSQL_NTFN_%')
      loop

                begin
                   if c1.state = 'RUNNING'
                 then dbms_scheduler.stop_job(c1.job_name);
              end if;
                      dbms_scheduler.drop_job(c1.job_name);
            exception
                 when others then null;
                   end;
  end loop;
       end;
/

Feel free to put something more meaningful than a "when others then null" exception handler here. I was lazy!

Then apply the recommended patch and bounce the instance. In a RAC environment the patch can be applied and instance restarts rolled across the nodes, but ideally don't revert job_queue_processes on a node until it has been restarted, else you'll see a new batch of jobs accumulating to be purged.

NOTE: I did try applying the patch in one environment prior to setting the job_queue_process parameter and purging the orphaned records, but the issue persisted. It appears to be sensitive to the order:

  • Set job_queue_processes=0
  • Purge the records
  • Apply the patch
  • Restart the instance and restore the value of job_queue_processes (automatic if you use scope=memory)
This will mean that the job scheduler will not be running for the duration of patching. In a single instance system you have downtime from the restart but in a RAC environment jobs will be suspended only for the duration of the code run, patching, and restart. It is possible to run the PL/SQL above prior to setting the job_queue_processes parameter in order to clear out a backlog of jobs (which can be time consuming if you have thousands or more) but realize that the final flush of records does appear to require that the value is set to zero.

When complete, I'd recommend collecting dictionary statistics, since the large number of jobs deleted could easily skew optimizer operations (and that's how I originally discovered we were victims of the bug).