Tag: database

  • Applying Oracle 19c Apr 2026 CPU Patch (19.31) on a Single Instance CDB

    In this article, we are going to demonstrate the steps to apply the Oracle April 2026 Critical Patch Update (CPU) on a Single Instance Oracle 19c Container Database (CDB). Here we will apply combo patch 39062931 which contains 39034528 (Database Apr 2026 Release Update 19.31.0.0.260421) and 38906621 (OJVM Component Release Update 19.31.0.0.260421). We also cover an important real-world scenario where a previously applied overlay patch 34672698 (ORA-00800 fix) conflicted with the Apr 2026 RU — confirming that the fix is now natively included in the RU itself.

    Here is the high-level plan to apply patch 39062931 (19.31.0.0.260421):

    • Go through the README documents of patch IDs before proceeding with patch apply.
    • Download the required patches from My Oracle Support.
    • Upgrade OPatch utility to the latest version as required by the patch README.
    • Perform pre-checks at CDB and PDB level — invalid objects, patch history, registry components.
    • Take Oracle Home TAR backup.
    • Blackout in OEM and disable crontab jobs before maintenance.
    • Shutdown database and listener.
    • Apply DB Release Update (39034528).
    • Apply OJVM Release Update (38906621).
    • Validate overlay patch (34672698) — check if still required post RU upgrade.
    • Start database and listener.
    • Execute datapatch on the database and validate.
    • Uncomment crontab and remove OEM blackout.
    • Perform post-checks and compare with pre-check data.

    Environment Details:

    COMPONENTS                DETAILS
    Oracle Home               /u01/app/oracle/product/19c/db_1
    DB Version                19.30.0.0.0 (Jan 2026 RU)
    OPatch Version (Before)   12.2.0.1.49
    OPatch Version (After)    12.2.0.1.51
    DB Type                   Single Instance, Multitenant (CDB+PDB)
    CDB Name                  DEVCDB
    PDBs                      DEVPDB1, DEVPDB2
    OS                        Linux x86-64
    Combo Patch               39062931 (19.31.0.0.260421)
    DB RU Patch               39034528
    OJVM Patch                38906621
    Overlay Patch             34672698 (validated — not reapplied)

    1. Pre-Checks

    1.1: Capture invalid objects at CDB level.

    sqlplus / as sysdba
    
    set lines 230
    set pages 100
    col OBJECT_NAME for a40
    col OBJECT_TYPE for a40
    col OWNER for a20
    select OBJECT_NAME, OBJECT_TYPE, OWNER,
           to_char(LAST_DDL_TIME,'DD-MON-YYYY HH24:MI:SS'), status
    from dba_objects
    where status='INVALID'
    order by owner;
    no rows selected

    1.2: Capture patch history from dba_registry_sqlpatch.

    select patch_id, action, status, action_time, description
    from dba_registry_sqlpatch;
    PATCH_ID ACTION STATUS ACTION_TIME DESCRIPTION
    38523609 APPLY SUCCESS 03-FEB-26 OJVM RELEASE UPDATE: 19.30.0.0.260120
    38632161 APPLY SUCCESS 03-FEB-26 Database Release Update: 19.30.0.0.260120

    44 rows selected.

    1.3: Capture DB registry components.

    SELECT COMP_ID, comp_name, version, status FROM dba_registry;
    COMP_ID COMP_NAME VERSION STATUS
    CATALOG Oracle Database Catalog Views 19.30.0.0.0 VALID
    CATPROC Oracle Database Packages and Types 19.30.0.0.0 VALID
    JAVAVM JServer JAVA Virtual Machine 19.30.0.0.0 VALID
    XML Oracle XDK 19.30.0.0.0 VALID
    XDB Oracle XML Database 19.30.0.0.0 VALID
    ORDIM Oracle Multimedia 19.30.0.0.0 VALID
    CONTEXT Oracle Text 19.30.0.0.0 VALID
    … 15 rows total. All components VALID.

    1.4: Connect to each PDB and capture invalid objects as baseline.

    alter session set container=DEVPDB1;
    select OBJECT_NAME, OBJECT_TYPE, OWNER, status
    from dba_objects where status='INVALID'
    order by owner;
    📌 Note: Document all pre-existing invalid objects at PDB level before patching. These serve as a baseline for post-patch comparison. Do not attempt to fix pre-existing invalid objects before patching.
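
    A simple way to keep that baseline is to spool the per-PDB output to a file for later comparison. A minimal sketch (the spool file name is illustrative; repeat for each PDB):

    alter session set container=DEVPDB1;
    spool invalid_objects_DEVPDB1_pre.txt
    select OBJECT_NAME, OBJECT_TYPE, OWNER, status
    from dba_objects where status='INVALID'
    order by owner;
    spool off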

    1.5: Capture current patch inventory from OS.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    34672698 ORA-00800 SOFT EXTERNAL ERROR, ARGUMENTS [SET PRIORITY FAILED], [VKTM], DISM(16)
    38523609 OJVM RELEASE UPDATE: 19.30.0.0.260120 (38523609)
    38632161 Database Release Update: 19.30.0.0.260120 (38632161)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.

    2. Download Patches

    2.1: Create a dedicated directory for April 2026 patches.

    mkdir -p /u01/app/oracle/patches/APR_2026
    cd /u01/app/oracle/patches/APR_2026

    2.2: Download OPatch (patch 6880880).

    ⚠️ Important: Always download the Linux-x86-64 version of OPatch, not the generic LINUX (32-bit) version. The 32-bit LINUX zip will cause oui/lib/linux directory errors during patch apply.
    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p6880880_190000_Linux-x86-64.zip

    2.3: Download Combo Patch and Overlay Patch.

    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p39062931_190000_Linux-x86-64.zip
    
    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p34672698_1930000DBRU_Linux-x86-64.zip

    2.4: Unzip patches and verify structure.

    unzip p39062931_190000_Linux-x86-64.zip
    unzip p34672698_1930000DBRU_Linux-x86-64.zip
    
    ls -l 39062931/
    drwxr-x---. 4 oracle oinstall  6144 38906621    -- OJVM RU
    drwxr-x---. 4 oracle oinstall  6144 39034528    -- DB RU
    -rw-r--r--. 1 oracle oinstall   118 README.html

    3. Upgrade OPatch

    3.1: Check current OPatch version.

    [oracle@dbhost ~]$ $ORACLE_HOME/OPatch/opatch version
    OPatch Version: 12.2.0.1.49
    OPatch succeeded.

    The April 2026 RU requires OPatch 12.2.0.1.51 or higher. We need to upgrade.

    3.2: Backup and replace OPatch.

    mv $ORACLE_HOME/OPatch $ORACLE_HOME/OPatch_old_12.2.0.1.49
    unzip p6880880_190000_Linux-x86-64.zip -d $ORACLE_HOME

    3.3: Verify new OPatch version.

    [oracle@dbhost ~]$ $ORACLE_HOME/OPatch/opatch version
    OPatch Version: 12.2.0.1.51
    OPatch succeeded.
    💡 Tip: If OPatch fails with Invalid maximum heap size: -Xmx5120m, set the following before retrying:

    export OPATCH_JRE_MEMORY_OPTIONS="-Xmx1024m"

    Add this to ~/.bash_profile to make it permanent.


    4. Pre-Patching Activities

    4.1: Comment out crontab entries to prevent jobs from running during maintenance.

    [oracle@dbhost ~]$ (/usr/bin/crontab -l | /bin/sed 's/^/##APR_2026##&/g' | /usr/bin/crontab)

    4.2: Place OEM blackout if OEM is configured for this target.
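
    If the host runs an OEM agent, a node-level blackout can be placed from the agent home. A minimal sketch, assuming the agent-side emctl is in the PATH; the blackout name APR2026_PATCHING is illustrative:

    emctl start blackout APR2026_PATCHING -nodeLevel
    emctl status blackout

    # Stop it during post-patching activities (step 10.2)
    emctl stop blackout APR2026_PATCHING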

    📌 Note: If the environment has application-level database triggers, disable them before shutting down the database. Consult the application team before making any trigger changes. Re-enable them after patching is complete.

    5. Shutdown Database and Listener

    5.1: Shutdown the database and listener, and take an Oracle Home TAR backup.

    sqlplus / as sysdba
    SQL> shutdown immediate;
    Database closed.
    Database dismounted.
    ORACLE instance shut down.
    SQL> exit;
    
    lsnrctl stop LISTENER
    
    # Oracle Home TAR backup (run in background)
    nohup tar -zcvf /u01/app/oracle/patches/oracle_home_APR2026.tar.gz $ORACLE_HOME &

    6. Apply DB Release Update (39034528)

    6.1: Set PATH and run the prerequisite conflict check.

    export PATH=$ORACLE_HOME/OPatch:$PATH
    cd /u01/app/oracle/patches/APR_2026/39062931/39034528
    
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    ZOP-47: Superset Patch 39034528 has Subset Patch 38632161 which has overlay patches
    [34672698] and these overlay patches conflict with Superset Patch.
    OPatch failed with error code 73
    ⚠️ Action Required: The prereq check identified that overlay patch 34672698 conflicts with the new DB RU. Roll it back before applying the DB RU.

    6.2: Rollback overlay patch 34672698.

    opatch rollback -id 34672698
    Patch 34672698 successfully rolled back.
    OPatch succeeded.

    6.3: Re-run prereq check after rollback.

    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    Prereq "checkConflictAgainstOHWithDetail" passed.
    OPatch succeeded.

    6.4: Apply the DB Release Update.

    opatch apply
    Patch 39034528 successfully applied.
    Sub-set patch [38632161] has become inactive due to the application of a super-set patch [39034528].
    OPatch succeeded.

    6.5: Verify the patch is applied.

    opatch lsinventory | grep 39034528
    39034528  Database Release Update : 19.31.0.0.260421

    7. Apply OJVM Release Update (38906621)

    7.1: Run prereq check and apply the OJVM patch.

    export PATH=$ORACLE_HOME/OPatch:$PATH
    cd /u01/app/oracle/patches/APR_2026/39062931/38906621
    
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    Prereq "checkConflictAgainstOHWithDetail" passed.
    OPatch succeeded.
    
    opatch apply
    Patch 38906621 successfully applied.
    Sub-set patch [38523609] has become inactive due to the application of a super-set patch [38906621].
    OPatch succeeded.

    7.2: Verify both patches are listed.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    39034528 Database Release Update: 19.31.0.0.260421 (39034528)
    38906621 OJVM RELEASE UPDATE: 19.31.0.0.260421 (38906621)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.

    8. Overlay Patch (34672698) — Conflict with Apr 2026 RU

    The overlay patch 34672698 was previously applied on top of the Jan 2026 RU to fix ORA-00800: SOFT EXTERNAL ERROR, ARGUMENTS: [SET PRIORITY FAILED], [VKTM], DISM(16). It was rolled back in Step 6. Before reapplying, always run the prereq check.

    📌 Note on oradism permissions: Some environments require a temporary oradism ownership change to oracle before applying the overlay patch. Always check the current permissions first:

    ls -l $ORACLE_HOME/bin/oradism

    If oradism is already oracle-owned (e.g., -rwxr-x---. 1 oracle oinstall), no root permission change is required. Only apply the chown/chmod steps if oradism is root-owned with setuid bit.
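
    For reference, when oradism is root-owned with the setuid bit, the temporary change typically looks like the sketch below; verify against your patch README, run the chown/chmod as root, and record the original ownership and permissions before touching anything:

    # Record the current state first
    ls -l $ORACLE_HOME/bin/oradism

    # As root: hand the binary to oracle for the duration of the patch apply
    chown oracle:oinstall $ORACLE_HOME/bin/oradism

    # ... apply the overlay patch as the oracle user ...

    # As root: restore root ownership and the setuid bit afterwards
    chown root:oinstall $ORACLE_HOME/bin/oradism
    chmod 4750 $ORACLE_HOME/bin/oradism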

    8.1: Run prereq check before reapplying the overlay patch.

    cd /u01/app/oracle/patches/APR_2026/34672698
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    
    ZOP-40: The patch(es) has conflicts with other patches installed in the Oracle Home.
    Following patches have conflicts: 34672698, 39034528
    OPatch failed with error code 73
    ✔ Expected Result: The prereq confirms overlay patch 34672698 conflicts with Apr 2026 DB RU (39034528). Oracle has included the ORA-00800 fix natively in the Apr 2026 RU. The overlay patch is no longer required and must NOT be reapplied.
    ⚠️ Important: This is expected behaviour when upgrading RUs. An overlay patch applied on a previous RU may become redundant when the next RU natively includes the same fix. Always run prereq check before reapplying overlay patches after an RU upgrade.

    8.2: Verify the overlay patch is absent from inventory.

    opatch lsinventory | grep 34672698
    [oracle@dbhost 34672698]$
    ✔ Confirmed: No output from the grep confirms the overlay patch is correctly absent from the Oracle Home.

    9. Start Database and Run Datapatch

    9.1: Start the database and open all PDBs.

    sqlplus / as sysdba
    SQL> startup;
    ORACLE instance started.
    Database mounted.
    Database opened.
    
    SQL> alter pluggable database all open;
    Pluggable database altered.
    
    SQL> exit;
    
    lsnrctl start LISTENER

    9.2: Run datapatch to apply the SQL portion of the patches.

    cd $ORACLE_HOME/OPatch
    ./datapatch -verbose
    SQL Patching tool version 19.31.0.0.0 Production on Tue May 5 13:17:39 2026
    
    Installation queue:
      For CDB$ROOT PDB$SEED DEVPDB1 DEVPDB2:
        Rollback: 38523609 (OJVM 19.30)
        Apply: 39034528 (DB RU 19.31)
        Apply: 38906621 (OJVM 19.31)
    
    Patch 38523609 rollback (CDB$ROOT): SUCCESS
    Patch 39034528 apply (CDB$ROOT): SUCCESS
    Patch 38906621 apply (CDB$ROOT): SUCCESS
    Patch 38523609 rollback (DEVPDB1): SUCCESS
    Patch 39034528 apply (DEVPDB1): SUCCESS
    Patch 38906621 apply (DEVPDB1): SUCCESS
    Patch 38523609 rollback (DEVPDB2): SUCCESS
    Patch 39034528 apply (DEVPDB2): SUCCESS
    Patch 38906621 apply (DEVPDB2): SUCCESS
    
    SQL Patching tool complete on Tue May 5 13:24:18 2026
    📌 Note: Datapatch applies the SQL changes to both CDB and all open PDBs. Wait for it to complete fully before proceeding. Do not interrupt the process.

    10. Post-Patching Activities

    10.1: Uncomment crontab entries.

    [oracle@dbhost ~]$ (/usr/bin/crontab -l | /bin/sed 's/##APR_2026##//g' | /usr/bin/crontab)

    10.2: Remove OEM blackout if applicable.

    📌 Note: If application-level database triggers were disabled before patching, re-enable them now after confirming datapatch completed successfully. Consult the application team before enabling.

    11. Post-Checks

    11.1: Verify patch registry — all entries should show SUCCESS.

    select patch_id, action, status, action_time, description
    from dba_registry_sqlpatch
    order by action_time;
    PATCH_ID ACTION STATUS ACTION_TIME DESCRIPTION
    38523609 ROLLBACK SUCCESS 05-MAY-26 01:20:47 OJVM RELEASE UPDATE: 19.30.0.0.260120
    38906621 APPLY SUCCESS 05-MAY-26 01:21:02 OJVM RELEASE UPDATE: 19.31.0.0.260421
    39034528 APPLY SUCCESS 05-MAY-26 01:21:56 Database Release Update: 19.31.0.0.260421

    11.2: Verify all DB registry components are VALID.

    SELECT COMP_ID, comp_name, version, status FROM dba_registry;
    COMP_ID COMP_NAME VERSION STATUS
    CATALOG Oracle Database Catalog Views 19.31.0.0.0 VALID
    CATPROC Oracle Database Packages and Types 19.31.0.0.0 VALID
    JAVAVM JServer JAVA Virtual Machine 19.31.0.0.0 VALID
    XML Oracle XDK 19.31.0.0.0 VALID
    XDB Oracle XML Database 19.31.0.0.0 VALID
    ORDIM Oracle Multimedia 19.31.0.0.0 VALID
    CONTEXT Oracle Text 19.31.0.0.0 VALID
    … 15 rows total. All components VALID.

    11.3: Confirm final patch inventory.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    39034528 Database Release Update: 19.31.0.0.260421 (39034528)
    38906621 OJVM RELEASE UPDATE: 19.31.0.0.260421 (38906621)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.
    ✔ Confirmed: Overlay patch 34672698 is no longer listed. Its fix is now natively included in the Apr 2026 RU (39034528). This is the expected and correct final state.

    11.4: Compare invalid objects at CDB and PDB level with pre-patch baseline. Compile any new invalid objects introduced by the patch:

    @$ORACLE_HOME/rdbms/admin/utlrp.sql
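
    From the CDB root, the same comparison can be done across all containers in a single query. A minimal sketch using CDB_OBJECTS; counts should return to the pre-patch baseline once utlrp.sql has finished:

    select c.name as container, count(*) as invalid_count
    from cdb_objects o
    join v$containers c on c.con_id = o.con_id
    where o.status = 'INVALID'
    group by c.name
    order by c.name;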

    Key MOS References

    Document Description
    KB869205 Oracle Database 19c Apr 2026 RU Known Issues
    KB137197 OJVM Conditional Rolling Install Details
    244241.1 OPatch Support for RAC Rolling Patches
    293369.1 OPatch Documentation List

    Disclaimer: All server names, hostnames, database names, and environment-specific paths in this post have been anonymized. Steps and outputs are based on real production experience adapted for general use. Always test in a non-production environment before applying patches to production systems.

    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More patching walkthroughs at syedanwarahmedoracle.blog.


    💬 Share Your Experience

    Have you applied Oracle 19c CPU patches and hit a similar overlay patch conflict? Or found that a fix you applied months ago is now natively bundled into the next RU? Your real-world experiences help fellow DBAs tackle the same challenges. Drop a comment below — questions, observations, and feedback are always welcome.

  • P1 Incident: /dbbackup Filesystem 100% Full — How We Traced, Fixed, and Recovered Two Failed RMAN Backups in One Night

    🔴 Incident Overview

    Severity                 P1 — Two production database backups failed
    Environment              Oracle 19c (19.30) on Linux x86-64
    Backup Tool              RMAN with Recovery Catalog
    Backup Volume            /dbbackup — 1TB LVM filesystem
    Databases Affected       DBPRO01 (12:30 failure), DBPRO02 (22:30 failure)
    Total Space Recovered    ~492G (disk went from 100% → 52%)

    1. The Alerts — Two Failures, Same Root Cause

    It started with an RMAN failure at 22:30. The backup script for DBPRO02 fired on schedule and died within 2 minutes. The RMAN log told the story clearly:

    RMAN-03009: failure of backup command on c4 channel at 22:32:13
    ORA-19502: write error on file "/dbbackup/DBPRC02/rman/DiffInc_DBPRC02_4u4mi8bf"
    ORA-27072: File I/O error
    Additional information: 4
    

    Three more channels followed — c1, c2, c3 — all crashing at exactly 22:32:48. When multiple channels fail simultaneously at the same timestamp, it almost always means one thing: the destination filesystem just hit 100%.

    A quick check confirmed it:

    $ df -hP /dbbackup
    Filesystem                        Size  Used Avail Use%
    /dev/mapper/orabkupvg-orabkuplv1 1023G 1020G  3.3G 100%
    

    1TB volume. 3.3G free. Completely full.

    What we did not know yet: backup history would later reveal that DBPRO01 had already failed at 12:30 that same day, 10 hours earlier, for the same reason. Two databases unprotected on the same night.


    2. The Investigation — Folder by Folder

    The first step was understanding what was consuming the disk. One command gave us the top-level picture:

    $ du -sh /dbbackup/*
    744G    DBPRC01
    152G    ColdBackup_11April2026
     41G    DBPRC02
     26G    DBPRC03
     15G    DBPRO01
    7.7G    JAN2026_CPU
    7.7G    OCT2025_CPU
    5.3G    infra_arch
    

    744G inside DBPRC01 alone — 73% of the entire disk. That was our primary suspect.

    Drilling into DBPRC01

    $ du -sh /dbbackup/DBPRC01/rman/* | sort -rh | head -10
    7.5G    DiffInc_DBPRC01_fl4lumij
    7.5G    DiffInc_DBPRC01_eg4lrnh5
    7.5G    DiffInc_DBPRC01_b44lc7uh
    ...
    

    Every single file was a DiffInc_ or ArchivelogAll_ backup piece. No variety. No cleanup. Just backup after backup piling up.

    $ ls /dbbackup/DBPRC01/rman/ | wc -l
    1522
    

    1,522 backup pieces. We checked the oldest and newest:

    Oldest file on disk:  2022-05-07
    Newest file on disk:  2026-04-25
    

    Four years of backup files on disk — or so we thought.


    3. The RMAN Investigation — Where Things Got Interesting

    We connected RMAN to the database and ran the retention check:

    RMAN> SHOW RETENTION POLICY;
    CONFIGURE RETENTION POLICY TO REDUNDANCY 30;
    

    REDUNDANCY 30. This tells RMAN to keep the last 30 complete backup copies of every datafile before considering anything obsolete.

    Next logical step — check what RMAN considers obsolete:

    RMAN> REPORT OBSOLETE;
    no obsolete backups found
    

    Nothing? With 1,522 files on disk?

    We ran CROSSCHECK BACKUP — all 1,693 objects came back AVAILABLE. Then we checked the actual date range RMAN was tracking from the database control file:

    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    OLDEST          NEWEST          TOTAL_PIECES
    03-DEC-2025     25-APR-2026     1541
    

    The control file only tracks pieces from December 2025 onwards — about 5 months. The 2022/2023 files seen on disk were old directories and scripts, not backup pieces. All 1,541 current pieces were legitimate and RMAN considered every one of them necessary under REDUNDANCY 30.

    This was the key insight: RMAN was not broken. The retention policy itself was the problem.


    4. Root Cause — The Architecture Trap

    The deeper investigation revealed something unexpected. Looking at the actual RMAN backup script:

    connect target rman/password@DBPRO01
    ...
    format '/dbbackup/DBPRC01/rman/DiffInc_%d_%u'
    (database);
    ...
    delete obsolete;
    

    DBPRO01 (the production database) was backing up INTO the DBPRC01 directory. The directory names suggested one database but contained another database’s backups entirely. The naming convention was PRO to PRC — production database backups stored in the production-copy directory.

    This pattern existed for all three database pairs on the server. Each production database backed up into its corresponding copy directory.

    The delete obsolete command was in the script — but with REDUNDANCY 30 and weekly Level 0 backups, obsolete only kicks in after 30 complete Level 0 cycles. That is 30 weeks = 7.5 months of retention. Since the current tracking window was only 5 months, delete obsolete ran every night and found absolutely nothing to delete.

    The math:

    Retention policy             REDUNDANCY 30
    Level 0 frequency            Weekly (Sundays)
    Effective retention period   ~30 weeks / 7.5 months
    Backup tracking since        December 2025 (~5 months)
    Result                       delete obsolete finds nothing — ever
    Daily backup size            ~7–7.5G per run
    Total accumulated            744G

    Adding fuel to the fire — the patching activity on April 18 triggered an extra Level 0 backup, followed by the regular Sunday Level 0 on April 19. Two large Level 0 runs (~27G each) within 24 hours wrote the final ~54G that pushed the disk over the edge.


    5. Secondary Findings During Investigation

    OCT2025 CPU Patch Artifacts (7.7G)

    The October 2025 CPU patch files (zip archives + extracted directories) were still sitting in /dbbackup/OCT2025_CPU/. A quick OPatch check confirmed the database had since been patched to 19.30 (January 2026 RU) — the October 2025 patches were fully superseded and rolled back from inventory. Safe to delete immediately.

    $ $ORACLE_HOME/OPatch/opatch lsinventory | grep -E "38291812|38194382"
    # Empty — neither Oct 2025 patch in inventory anymore
    

    5-Year-Old Pre-Migration Export Dumps

    Three directories contained Oracle 11.2.0.4 export dumps from January–March 2021 — taken before the migration to 19c. With the database now running 19.30, these had zero recovery value but occupied ~14G collectively. Flagged for manager approval before deletion.

    Recovery Catalog Version Mismatch

    The original RMAN log flagged this warning:

    PL/SQL package RMAN.DBMS_RCVCAT version 19.11.00.00 in RCVCAT database is not current
    PL/SQL package RMAN.DBMS_RCVMAN version 19.11.00.00 in RCVCAT database is not current
    

    The recovery catalog is running 19.11 packages while the RMAN client is now 19.30. Non-critical tonight but requires UPGRADE CATALOG in the next maintenance window.


    6. The Fix — Emergency Space Recovery

    With management approval obtained, we executed a time-based delete — keeping the last 30 days of backups and removing everything older:

    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    

    This command does three things atomically:

    1. Queries catalog/controlfile for all pieces completed before the cutoff date
    2. Deletes the physical files from disk
    3. Removes the records from RMAN catalog — no orphaned entries, no catalog drift

    The output scrolled for several minutes:

    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/DiffInc1_DBPRC01_6u4jaaih ...
    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/ArchivelogAll_DBPRC01_784jab6o ...
    ...
    Deleted 1072 objects
    

    1,072 backup pieces deleted. Catalog updated. Disk checked:

    BEFORE:  Used 1020G  Avail 3.3G  (100%)
    AFTER:   Used  536G  Avail 488G   (53%)
    

    Then the OCT2025_CPU directory was removed:

    $ rm -rf /dbbackup/OCT2025_CPU/
    $ df -hP /dbbackup
    Used 528G  Avail 495G  (52%)
    

    Final result: 495G free. Disk at 52%.

    Both failed backups were re-submitted immediately and ran successfully in parallel:

    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO01.sh &
    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO02.sh &
    
    $ jobs -l
    [1] Running   nohup sh ...rman_backup_DBPRO01.sh &
    [2] Running   nohup sh ...rman_backup_DBPRO02.sh &
    

    7. Incident Timeline

    12:30 DBPRO01 Level 0 backup fails — ORA-19502/ORA-27072 (disk full)
    22:30 DBPRO02 Level 0 backup fails — same errors, all 4 channels
    23:08 Investigation begins — df -hP /dbbackup confirms 100% full
    23:15 DBPRC01 directory identified as 744G consumer
    23:25 RMAN connected — REDUNDANCY 30 discovered
    23:35 Architecture confirmed — PRO databases backing up into PRC directories
    23:45 Root cause confirmed — 7.5-month retention, delete obsolete finds nothing
    23:50 DELETE BACKUP COMPLETED BEFORE SYSDATE-30 executed
    23:51 1,072 pieces deleted — disk drops to 53%
    23:55 OCT2025_CPU removed — disk at 52%, 495G free
    00:00 Both backup jobs re-submitted and running successfully

    8. Permanent Fix Recommendations

    Fix 1 — Change Retention Policy to RECOVERY WINDOW

    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    

    REDUNDANCY 30 with weekly Level 0s means 7.5 months of retention — far beyond what any production SLA requires. A 14-day recovery window keeps 2 weeks of backups regardless of backup frequency, and delete obsolete will actually find and remove old pieces going forward.

    Fix 2 — Add Pre-Backup Space Check to Cron Script

    #!/bin/bash
    # Pre-backup space check: abort the RMAN run if the backup filesystem is too full.
    # MAILTO is assumed to be set to the DBA alert address in the calling environment.
    BACKUP_FS="/dbbackup"
    THRESHOLD=20        # minimum % free required before starting the backup

    # % free = 100 - Use% (column 5 of the df -hP output)
    AVAIL_PCT=$(df -hP $BACKUP_FS | awk 'NR==2 {gsub(/%/,""); print 100-$5}')

    if [ "$AVAIL_PCT" -lt "$THRESHOLD" ]; then
      echo "ABORT: $BACKUP_FS is ${AVAIL_PCT}% free — below ${THRESHOLD}% threshold" \
        | mailx -s "BACKUP ABORTED: Low space on $BACKUP_FS" $MAILTO
      exit 1
    fi
    

    A failing backup that writes 3G before dying is worse than a backup that never starts — it wastes the last 3G of free space and leaves partial pieces on disk.

    Fix 3 — Upgrade the Recovery Catalog

    RMAN> CONNECT TARGET /
    RMAN> CONNECT CATALOG rman/password@rmancat
    RMAN> UPGRADE CATALOG;
    RMAN> UPGRADE CATALOG;   -- run twice as prompted
    

    The catalog packages (19.11) are well behind the 19.30 RMAN client. Some catalog-dependent operations will start failing if this is left unaddressed.

    Fix 4 — Filesystem Monitoring Alert

    The FRA check scripts already email on FRA usage above 80%. The same pattern should exist for /dbbackup. A simple cron entry checking disk usage every hour with alert at 80% would have caught this days before the disk hit 100%.
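
    A minimal sketch of such a check; the script path, threshold, and mail address are illustrative and should be adapted to local standards:

    #!/bin/bash
    # hourly_dbbackup_check.sh : warn when /dbbackup usage crosses 80%
    FS="/dbbackup"
    WARN=80
    MAILTO="dba-team@example.com"    # illustrative address, replace with the DBA alert list

    USED_PCT=$(df -hP $FS | awk 'NR==2 {gsub(/%/,""); print $5}')

    if [ "$USED_PCT" -ge "$WARN" ]; then
      echo "WARNING: $FS is ${USED_PCT}% used (threshold ${WARN}%)" \
        | mailx -s "Disk space warning on $FS" $MAILTO
    fi

    # Example crontab entry (hourly):
    # 0 * * * * /home/oracle/scripts/hourly_dbbackup_check.sh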


    9. Key Takeaways for Oracle DBAs

    REDUNDANCY N is not always safer than RECOVERY WINDOW. REDUNDANCY 30 with weekly Level 0 backups means 7.5 months of retention — likely far beyond your RPO requirement and a silent space accumulator.

    • Always verify what delete obsolete actually deletes. If it finds nothing to delete every single night, that is a warning sign — not reassurance.
    • Check backup naming conventions carefully. When a directory named DBPRC01 contains DBPRO01 backups, it is easy to investigate the wrong database; the RMAN configuration of the database that actually writes the backups is what controls the cleanup behavior.
    • Patching days generate oversized backups. A Level 0 taken manually on patch day plus the regular Sunday Level 0 the next day equals 2x the normal space consumption in 24 hours. Ensure extra headroom exists going into patch windows.
    • Use DELETE BACKUP COMPLETED BEFORE SYSDATE-N for emergency cleanup — not OS-level rm. RMAN deletes atomically update both the physical files and the catalog, preventing expired/orphaned piece confusion later.
    • Never use rm on RMAN backup pieces directly unless you follow up with CROSSCHECK BACKUP and DELETE EXPIRED BACKUP to sync the catalog.

    10. Commands Reference — Quick Cheat Sheet

    -- Check retention policy
    RMAN> SHOW RETENTION POLICY;
    
    -- Preview what would be deleted (dry run)
    RMAN> REPORT OBSOLETE;
    RMAN> LIST BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Emergency cleanup — delete pieces older than 30 days
    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Standard cleanup based on retention policy
    RMAN> DELETE NOPROMPT OBSOLETE;
    
    -- Sync catalog after any OS-level file operations
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE NOPROMPT EXPIRED BACKUP;
    
    -- Change to time-based retention (recommended)
    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    
    -- Check backup piece date range in control file
    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    -- Check backup history
    SELECT session_key, input_type, status,
           TO_CHAR(start_time,'YYYY-MM-DD HH24:MI:SS') start_time,
           output_bytes_display, time_taken_display
    FROM v$rman_backup_job_details
    ORDER BY start_time DESC;
    

    Conclusion

    What appeared to be a simple disk full incident turned out to involve a multi-database backup architecture, a misconfigured retention policy, and a cleanup mechanism that was technically running correctly but never finding anything to clean. The fix itself — one RMAN command — took under 5 minutes. The real work was the systematic investigation to understand exactly what was safe to delete and why.

    That is Oracle DBA work in a nutshell: the fix is often simple; understanding why it is safe to run is the real job.


    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More incident walkthroughs at syedanwarahmedoracle.blog.

  • Stuck Archiver on Oracle 11g RAC: How a Full FRA Brought Down Archiving — and How We Fixed It Live

    Introduction

    One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.

    This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.

    Environment Overview

    Component                 Details
    Database                  Oracle 11g R2 (11.2.0.4.0)
    Architecture              RAC — 2 Nodes
    Database Role             PRIMARY — READ WRITE, ARCHIVELOG mode
    OS                        Linux (x86_64)
    FRA Location              ASM Diskgroup +ARCHIVELOG
    FRA Size (pre-incident)   199 GB
    FRA Size (post-fix)       400 GB

    The Incident: What We Saw

    The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:

    ORA-00257: archiver error. Connect internal only, until freed.
    ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
    ARCH: Archival stopped, error occurred. Will continue retrying
    ORACLE Instance PRODDB1 - Archival Error

    The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.

    Step 1: Check Redo Log Status

    First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.

    SELECT group#, status, archived, members FROM v$log ORDER BY group#;
    
    GROUP#   STATUS           ARCHIVED   MEMBERS
    ------   ---------------  ---------  -------
    1        ACTIVE           NO         2
    2        CURRENT          NO         2
    3        INACTIVE         YES        2
    4        INACTIVE         YES        2

    Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.

    Step 2: Confirm the Archiver Status

    -- Check both RAC instances
    SELECT inst_id, instance_name, host_name, version,
           database_role, open_mode, log_mode
    FROM gv$instance ORDER BY inst_id;
    
    INST_ID  INSTANCE_N  VERSION       DATABASE_ROLE  OPEN_MODE   LOG_MODE
    -------  ----------  ------------  -------------  ----------  ----------
          1  PRODDB1     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
          2  PRODDB2     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
    
    SELECT instance_name, status, archiver FROM v$instance;
    
    INSTANCE_NAME    STATUS       ARCHIVER
    ---------------- ------------ --------
    PRODDB1          OPEN         STOPPED

    ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.

    Step 3: Check FRA Usage

    SELECT
        space_limit / (1024*1024*1024)       AS limit_gb,
        space_used  / (1024*1024*1024)       AS used_gb,
        space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
        ROUND((space_used / space_limit) * 100, 2) AS pct_used
    FROM v$recovery_file_dest;
    
    LIMIT_GB   USED_GB   RECLAIMABLE_GB   PCT_USED
    --------   -------   --------------   --------
    199        198.99     0.00            100.00

    100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.
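
    One way to run that diskgroup check is a quick query against v$asm_diskgroup (it can be run from the database instance as well as from ASM):

    SELECT name,
           ROUND(total_mb/1024)  AS total_gb,
           ROUND(free_mb/1024)   AS free_gb,
           ROUND((1 - free_mb/total_mb) * 100, 2) AS pct_used
    FROM v$asm_diskgroup
    WHERE name = 'ARCHIVELOG';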

    Step 4: Identify What Was Consuming the FRA

    SELECT file_type,
        ROUND(space_used / (1024*1024*1024), 2)        AS used_gb,
        ROUND(space_reclaimable / (1024*1024*1024), 2) AS reclaimable_gb,
        number_of_files
    FROM v$recovery_area_usage
    ORDER BY space_used DESC;
    
    FILE_TYPE       USED_GB   RECLAIMABLE_GB   NUMBER_OF_FILES
    -----------     -------   --------------   ---------------
    ARCHIVED LOG    38.50     0.05             1,200+
    BACKUP PIECE    10.20     0.07             45
    FLASHBACK LOG    1.30     0.00             12

    Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.

    Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike

    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM v$archived_log
    WHERE dest_id=1
    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
    ORDER BY 1;
    Hour                  Archived Logs   GB Generated
    Daytime (typical)     2–26            < 1.3 GB
    01:00 AM (nightly)    440–446         ~31 GB ⚠
    03:00–04:00 AM        191–411         11–28 GB ⚠
    05:00–06:00 AM        537–796         34–50 GB ⚠

    Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.

    The Resolution — Three Actions

    Action 1: Manually Delete Archive Logs from ASM

    Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:

    su - grid
    asmcmd
    
    ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
    # Recent logs retained to preserve recoverability

    After ASMCMD deletion, the RMAN catalog was resynced:

    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> DELETE EXPIRED ARCHIVELOG ALL;

    Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.

    Action 2: Increase FRA Size Online

    -- Increase FRA from 199GB to 400GB (online, no restart required)
    ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';

    SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:

    ARC1: Archival started
    ARC2: Archival started
    ARC3: Archival started
    ORACLE Instance PRODDB1 - Archival Restarted

    Applications reconnected and transactions resumed within seconds.

    Action 3: Verify Database and Cluster Status

    SELECT instance_name, archiver FROM v$instance;
    -- PRODDB1   STARTED
    
    SELECT round(space_used/1024/1024/1024,2) used_gb,
           round(space_limit/1024/1024/1024,2) limit_gb,
           round((space_used/space_limit)*100,2) used_pct
    FROM v$recovery_file_dest;
    -- 202.80 GB / 400 GB / 50.70%
    
    SELECT group#, status, archived FROM v$log ORDER BY group#;
    -- All groups INACTIVE/YES except CURRENT group

    Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.

    Root Cause Confirmed: Archive Log Backups Were Not Running

    With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.

    Complete Incident Chain:
    ARCH backup silently fails (missing Oracle env vars in cron) → 3,094 archived logs accumulate over multiple nights → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM) → FRA hits 100% (198.99 GB / 199 GB) → ARCn cannot write → Both instances enter INTERMEDIATE state → ORA-00257 → Applications hang → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB → Archiver restarts → Cron fix applied → ARCH backup reruns successfully → FRA space recovered → Cluster fully healthy

    Fix: Correct the Cron Environment

    Broken cron entry:

    05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    Fixed cron entry:

    05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    ARCH Backup Reruns and Completes

    After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:

    . /home/oracle/.bash_profile
    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH

    The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.

    Preventive Measures Put in Place

    1. Fix the cron environment — all RMAN jobs now source ~/.bash_profile before executing.
    2. RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status.
    3. FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted (a minimal sketch follows this list).
    4. FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
    5. RAC: always use SID='*' — and always verify ASM diskgroup free space before increasing the FRA limit.
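
    A minimal sketch of the 70% check referenced in item 3 above; the cron and mail wrapper around it follows the same pattern as the other monitoring scripts and is omitted here:

    SELECT ROUND((space_used / space_limit) * 100, 2) AS pct_used,
           CASE WHEN (space_used / space_limit) * 100 >= 70
                THEN 'ALERT: FRA usage above 70%'
                ELSE 'OK'
           END AS fra_status
    FROM v$recovery_file_dest;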

    Key Takeaways for Oracle DBAs

    • ORA-00257 always points to the FRA being full — check v$recovery_file_dest and v$recovery_area_usage immediately.
    • When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run CROSSCHECK and DELETE EXPIRED to resync the catalog.
    • db_recovery_file_dest_size is dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.
    • Use SID='*' in RAC — applies the parameter change to all instances simultaneously.
    • The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
    • Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
    • Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
    • After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
    • Analyse redo generation patterns — v$archived_log reveals batch spikes that should directly inform FRA sizing decisions.

    Conclusion

    A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.

    The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.

    The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.

    Syed Anwar Ahmed | Oracle Apps DBA

  • Rethinking UNDO Tablespace Monitoring: When 97% Usage Is NOT a Problem

    A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.


    Initial Observations

    Standard checks showed Tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB free space. This strongly suggested the UNDO tablespace was nearly full. However, deeper analysis revealed a different story.


    The Turning Point

    Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:

    SELECT status,
           ROUND(SUM(bytes)/1024/1024,2) MB
    FROM dba_undo_extents
    GROUP BY status;

    Results showed: ACTIVE ~0 MB, UNEXPIRED ~7.3 GB, EXPIRED ~8.6 GB.


    Key Insight

    EXPIRED UNDO extents are fully reusable by Oracle. This means ~55% of the tablespace was immediately reusable and no actual space pressure existed.


    Real Usage Calculation

    SELECT ROUND(
           SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
           / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents;

    Actual usage was only ~45% — not 97%.


    Root Cause of the False Alert

    The monitoring system was using DBA_FREE_SPACE which does not include reusable EXPIRED extents. This is incorrect for UNDO tablespaces and leads to false critical alerts, unnecessary escalations, and wasted DBA effort.


    The Correct Monitoring Approach

    SELECT tablespace_name,
           ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
                 / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents
    GROUP BY tablespace_name;

    Only alert when actual usage exceeds 90% AND long-running transactions exist. The full script is available on GitLab: gitlab.com/sdanwarahmed/oracle-dba-scripts
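
    A minimal sketch of that combined check; the 90% threshold and the one-hour definition of "long-running" are illustrative and should be tuned to your environment:

    SELECT u.actual_used_pct, t.long_running_txns
    FROM (SELECT ROUND(
                   SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
                   / SUM(bytes) * 100, 2) actual_used_pct
          FROM dba_undo_extents) u,
         (SELECT COUNT(*) long_running_txns
          FROM v$transaction
          WHERE start_date < SYSDATE - 1/24) t
    WHERE u.actual_used_pct > 90
      AND t.long_running_txns > 0;
    -- Returning no rows means there is nothing to alert on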


    Key Takeaways

    • UNDO is not a regular tablespace — it has its own lifecycle
    • DBA_FREE_SPACE is misleading for UNDO monitoring
    • Always analyze UNDO extents by status: ACTIVE, UNEXPIRED, EXPIRED
    • Build context-aware monitoring that aligns with Oracle internals
    • Not all “97% full” alerts are real problems — sometimes it’s just the wrong metric

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn  |  GitLab

  • Rethinking UNDO Tablespace Monitoring in Multitenant Environments: A Case Study on APPS_UNDOTS1

    Introduction

    A production alert indicating UNDO tablespace usage at ~99% is typically treated as a critical issue. However, in Oracle Multitenant environments with local UNDO enabled, this metric can be misleading when evaluated using traditional tablespace monitoring techniques.

    This article presents a real-world scenario involving the PDB-level UNDO tablespace APPS_UNDOTS1, where utilization reached ~99%, yet no actual resource contention existed.


    The Incident

    An alert was triggered showing:

    Tablespace              TOTAL_MB     USED_MB     FREE_MB  PCT_USED
    --------------------  ----------  ----------  ----------  --------
    APPS_UNDOTS1           28,863.25   28,681.56      181.69     99.37

    At first glance, this indicated:

    • UNDO nearly full
    • Minimal free space
    • Risk of ORA-30036 / ORA-01555

    Environment Context

    This issue occurred in a Multitenant environment with Local UNDO enabled:

    • UNDO tablespace: APPS_UNDOTS1
    • Scope: PDB-level
    • Each PDB maintains its own UNDO

    UNDO behavior must always be analyzed at the PDB level, not just at the CDB level.
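
    Before drilling into a PDB, it is worth confirming that local UNDO is actually enabled. A quick check:

    SELECT property_name, property_value
    FROM   database_properties
    WHERE  property_name = 'LOCAL_UNDO_ENABLED';
    -- TRUE means each PDB manages its own UNDO tablespace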


    Investigation Approach

    Rather than relying on % used, a state-based analysis was performed.


    Step 1: Active Transactions

    SELECT COUNT(*) FROM v$transaction;

    Observation:

    • Minimal active transactions
    • No ongoing workload pressure

    Step 2: UNDO Extent Analysis

    SELECT status, SUM(bytes)/1024/1024 MB
    FROM dba_undo_extents
    GROUP BY status;

    Key Observation

    STATUS           MB        PCT
    -----------  --------  --------
    ACTIVE          ~3 MB    ~0.01%
    UNEXPIRED      ~51 MB    ~0.18%
    EXPIRED        ~28 GB   ~99.81%

    💡 Critical Insight

    ~99.81% of UNDO (APPS_UNDOTS1) was EXPIRED and fully reusable

    This means:

    • No real space pressure
    • No transaction risk
    • Tablespace is effectively available

    Understanding UNDO Behavior

    UNDO extents follow a lifecycle:

    ACTIVE → UNEXPIRED → EXPIRED → REUSED
    State        Description                    Reusable
    ACTIVE       Used by active transactions    No
    UNEXPIRED    Retained for consistency       ⚠️
    EXPIRED      No longer needed               Yes

    UNDO extents are not physically freed but logically reused by Oracle based on demand.


    Why the Alert Was Misleading

    The alert was based on:

    SELECT * FROM dba_free_space;

    Limitation

    • Shows only physically free space
    • Does not reflect reusable UNDO
    • Ignores Oracle’s internal reuse mechanism

    Root Cause

    A long-running concurrent request:

    • Ran for ~30 hours
    • Performed heavy DELETE operations
    • Generated significant UNDO

    During execution:

    • UNDO reached ~99% → real pressure

    After completion:

    • UNDO became EXPIRED
    • Space became reusable
    • Alert persisted → false positive

    When UNDO Is Actually a Problem

    ACTIVE       > 20–30%
    UNEXPIRED    very high
    EXPIRED      very low

    UNDO becomes a real issue only when both EXPIRED and UNEXPIRED extents are exhausted and no reusable space remains.


    Recommended Monitoring Approach

    WITH undo AS (
      SELECT status, SUM(bytes)/1024/1024 mb
      FROM dba_undo_extents
      GROUP BY status
    ),
    total AS (
      SELECT SUM(mb) total_mb FROM undo
    )
    SELECT u.status,
           ROUND(u.mb, 2)                      AS mb,
           ROUND((u.mb / t.total_mb) * 100, 2) AS pct
    FROM undo u, total t;

    Improved Alert Logic

    • ACTIVE > 20% → Critical
    • UNEXPIRED > 80% → Warning
    • Otherwise → Normal
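
    The same thresholds can be folded into a single classification query. A minimal sketch, assuming the 20% and 80% limits above:

    WITH undo AS (
      SELECT status, SUM(bytes) bytes
      FROM dba_undo_extents
      GROUP BY status
    ),
    pct AS (
      SELECT SUM(CASE WHEN status = 'ACTIVE'    THEN bytes ELSE 0 END) / SUM(bytes) * 100 AS active_pct,
             SUM(CASE WHEN status = 'UNEXPIRED' THEN bytes ELSE 0 END) / SUM(bytes) * 100 AS unexpired_pct
      FROM undo
    )
    SELECT ROUND(active_pct, 2)    AS active_pct,
           ROUND(unexpired_pct, 2) AS unexpired_pct,
           CASE
             WHEN active_pct    > 20 THEN 'CRITICAL'
             WHEN unexpired_pct > 80 THEN 'WARNING'
             ELSE 'NORMAL'
           END AS alert_level
    FROM pct;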

    Key Takeaways

    • UNDO is state-driven, not space-driven
    • EXPIRED undo = reusable capacity
    • % used is misleading
    • Always analyze at PDB level in multitenant setups
    • Monitoring must align with Oracle internals

    Conclusion

    UNDO tablespace monitoring must evolve from:

    ❌ Static space-based metrics
    ➡️
    ✅ Dynamic lifecycle-based analysis

    A tablespace showing 99% utilization can still be completely healthy if most of its extents are reusable.


    Level Insight

    “UNDO is not a storage problem — it is a lifecycle problem. The difference between false alarms and accurate diagnosis lies in understanding how Oracle transitions undo extents.”


    Final Thought

    👉 Don’t ask:
    “How full is UNDO?”

    👉 Ask:
    “How much of it is actually in use?”


    Author

    Syed Anwar Ahmed
    Oracle Apps DBA | Oracle EBS | Performance & Troubleshooting

  • When ADOP Remembers Too Much: Fixing Patch Failures Caused by Stale Metadata in Oracle EBS

    During an Oracle E-Business Suite ADOP patching cycle in a multi-node environment, the apply phase failed on one node while completing successfully on others. Despite retries — including downtime mode — the issue persisted, pointing to a deeper inconsistency within the patching framework.


    Symptoms Observed

    • ADOP session status: FAILED
    • Patch applied successfully on some nodes, failed on admin node
    • Repeated failures even with restart=no, abandon=yes, and downtime mode
    • No immediate actionable error from standard logs

    Timeline of Events

    T0 -- Patch execution initiated (ADOP apply phase)
    T1 -- Failure observed on admin node
    T2 -- Retry using downtime mode -- Failure persists
    T3 -- ADOP session review shows inconsistent state
    T4 -- Internal metadata tables analyzed
    T5 -- Cleanup performed (tables + restart directory)
    T6 -- Patch re-executed -- Success across all nodes

    Investigation

    Step 1: Check ADOP Session State

    Query the ADOP session status to understand the current state across all nodes:

    -- Check current ADOP session status
    SELECT session_id, node_name, phase, status,
           start_date, end_date
    FROM applsys.ad_adop_sessions
    ORDER BY start_date DESC;
    
    -- Check apply phase status per node
    SELECT s.session_id, n.node_name, p.phase_code,
           p.status, p.start_date, p.end_date
    FROM applsys.ad_adop_sessions s,
         applsys.ad_adop_session_phases p,
         applsys.fnd_nodes n
    WHERE s.session_id = p.session_id
    AND p.node_id = n.node_id
    ORDER BY p.start_date DESC;

    The existing session showed status FAILED with the apply phase partially completed — a clear indicator of inconsistent execution state across nodes.

    Step 2: Check adalldefaults.txt

    Reviewed the defaults file for any relevant configuration:

    cat $APPL_TOP/admin/$TWO_TASK/adalldefaults.txt | grep -i missing
    # Key parameter found:
    # MISSING_TRANSLATED_VERSION = No

    Modifying and retrying with this parameter had no impact, confirming the issue was not translation-related.

    Step 3: Check Install Processes Table

    -- Check for stale entries in FND_INSTALL_PROCESSES
    SELECT COUNT(*) FROM applsys.fnd_install_processes;
    
    -- View stale entries in detail
    SELECT process_status, process_name, last_update_date
    FROM applsys.fnd_install_processes
    ORDER BY last_update_date DESC;
    
    -- Check AD_DEFERRED_JOBS
    SELECT COUNT(*) FROM applsys.ad_deferred_jobs;
    SELECT * FROM applsys.ad_deferred_jobs;

    Observation: FND_INSTALL_PROCESSES contained stale entries from the failed session. AD_DEFERRED_JOBS was empty.


    Root Cause

    The failure was caused by stale and inconsistent ADOP metadata tables — specifically APPLSYS.FND_INSTALL_PROCESSES and APPLSYS.AD_DEFERRED_JOBS. ADOP internally relies on these tables to track patch progress checkpoints, deferred job execution, and restart state management. When these tables retain entries from failed or incomplete sessions, ADOP assumes an incorrect execution state, leading to patch reconciliation failure, apply phase breakdown, and node-level inconsistencies.


    Resolution Steps

    Step 1: Backup Critical Tables

    -- Always backup before any cleanup
    CREATE TABLE applsys.fnd_install_processes_bak AS
    SELECT * FROM applsys.fnd_install_processes;
    
    CREATE TABLE applsys.ad_deferred_jobs_bak AS
    SELECT * FROM applsys.ad_deferred_jobs;
    
    -- Verify backups
    SELECT COUNT(*) FROM applsys.fnd_install_processes_bak;
    SELECT COUNT(*) FROM applsys.ad_deferred_jobs_bak;

    Step 2: Drop Stale Metadata Tables

    Dropping these tables forces ADOP to rebuild clean metadata during the next run:

    DROP TABLE applsys.fnd_install_processes;
    DROP TABLE applsys.ad_deferred_jobs;

    Step 3: Reset the Restart Directory

    The restart directory can silently preserve failure states. Back it up and create a fresh one:

    cd $APPL_TOP/admin/$TWO_TASK

    # Backup existing restart directory
    mv restart restart_bkp_$(date +%Y%m%d)

    # Create fresh restart directory
    mkdir restart

    # Verify
    ls -la | grep restart

    Step 4: Re-run the Patch

    adop phase=apply \
         patches=<patch_id> \
         restart=no \
         abandon=yes \
         apply_mode=downtime

    The patch completed successfully across all nodes after the metadata cleanup.
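
    To confirm the clean state, adop -status gives the OS-level summary, and re-running the session query from Step 1 of the investigation shows the final status from the database side:

    -- Latest session should now show a successful apply on every node
    SELECT session_id, node_name, phase, status, start_date, end_date
    FROM applsys.ad_adop_sessions
    ORDER BY start_date DESC;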


    Before vs After

    Component           Before Fix     After Fix
    ADOP Session        Failed         Successful
    Node Consistency    Partial        Full
    Restart Behavior    Stuck          Clean
    Patch Execution     Incomplete     Completed

    Key Takeaways

    • ADOP is state-driven — even when logs appear clean, internal metadata drives execution decisions
    • Partial success is a clue — if some nodes succeed and one fails, focus on local metadata, not the patch itself
    • The restart directory matters — it can silently preserve failure states and must be validated before retrying
    • Downtime mode is not a fix-all — even in downtime, ADOP still reads metadata tables; corruption persists unless cleaned
    • Always backup before cleanup — never drop tables without creating a backup first

    When NOT to Use This Approach

    Avoid applying this fix if the issue is caused by missing database patches (ETCC warnings), file system or permission issues, incorrect patch sequencing, or environment misconfiguration. Always validate the root cause before performing any metadata cleanup.


    This scenario highlights a subtle but critical behavior in ADOP — sometimes patch failures are not caused by the patch itself, but by what the system remembers about past attempts. By resetting stale metadata, we allow ADOP to re-evaluate the environment cleanly, leading to successful execution.

    Have questions or faced a similar issue? Reach out at sdanwarahmed@gmail.com.

  • Oracle EBS Performance Issue: Inactive Forms Session Holding TX Locks (RCA and Resolution)

    In Oracle E-Business Suite (EBS) environments, performance issues are often attributed to high workload or system resource constraints. However, some of the most critical slowdowns originate from less obvious sources — inactive sessions holding uncommitted transactions. This post walks through a real-world production incident where an inactive Oracle Forms session caused cascading blocking across multiple users due to TX row-level locks.


    Observed Symptoms

    • Oracle Forms screens becoming unresponsive in Order Management and Shipping modules
    • Concurrent programs stuck in running state
    • Increased database wait event: enq: TX - row lock contention

    Investigation

    Analysis of v$session, v$transaction, and dba_wait_chains revealed a single inactive Oracle Forms session (frmweb) holding an active transaction with multiple downstream sessions waiting on TX row lock contention.

    -- Identify blocking sessions
    SELECT blocking_session, sid, wait_class, event
    FROM v$session
    WHERE blocking_session IS NOT NULL;
    
    -- Detect inactive sessions with active transactions
    SELECT s.sid, s.serial#, s.status, s.program,
           s.username, t.start_time,
           ROUND(s.last_call_et/3600,2) hrs_inactive
    FROM v$session s, v$transaction t
    WHERE s.saddr = t.ses_addr
    AND s.status = 'INACTIVE'
    ORDER BY hrs_inactive DESC;
    
    -- Analyze full wait chain
    SELECT * FROM dba_wait_chains;

    The root blocking session showed STATUS = INACTIVE and EVENT = 'SQL*Net message from client' but had an active transaction in v$transaction — confirming it was idle at the application level but actively holding locks at the database level.


    Root Cause

    An Oracle Forms session executed a SELECT ... FOR UPDATE NOWAIT on WSH_DELIVERY_DETAILS, then became idle without committing or rolling back. This held exclusive row locks that blocked other sessions attempting to access the same rows, creating a cascading blocking chain.

    -- The problematic SQL pattern
    SELECT *
    FROM WSH_DELIVERY_DETAILS
    WHERE ROWID = :B1
    FOR UPDATE NOWAIT;

    Resolution

    The root blocking session was identified and verified to have no active business transaction in progress; after approvals were obtained, the session was terminated:

    -- Kill blocking session (only after full validation and approval)
    ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;

    Locks were released immediately, the blocking chain resolved, and application responsiveness was restored.


    Preventive Measures

    • Implement idle session timeout policies (a profile-based sketch follows this list)
    • Educate users on proper transaction handling in Oracle Forms
    • Review custom code using FOR UPDATE — keep transactions short and commit promptly
    • Monitor long-running and idle transactions proactively
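
    As an illustration of the first point, an idle timeout can be enforced at the database level through a profile. The sketch below uses placeholder names and values; the IDLE_TIME limit must be agreed with the application team, since Forms sessions exceeding it will be sniped.

    -- Sketch only: enforce an idle timeout via a profile (names/values are placeholders)
    ALTER SYSTEM SET resource_limit = TRUE SCOPE=BOTH;    -- required for kernel resource limits to apply
    CREATE PROFILE forms_idle_limit LIMIT IDLE_TIME 60;   -- minutes of allowed inactivity
    ALTER USER <application_user> PROFILE forms_idle_limit;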

    Key Takeaways

    • An inactive session can still hold active transactions and critical locks
    • Always identify the root blocker — intermediate sessions are symptoms, not the cause
    • Application-level inactivity does not mean database-level inactivity
    • In Oracle EBS, the most disruptive issues are often caused by inactive sessions holding uncommitted transactions

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Database Session Spike Mystery: When Connection Pools Collide

    In a production Oracle database environment, a sudden spike pushed the session count past 1,000, triggering alerts and concern. Interestingly, the system recovered automatically without any database-level intervention. At first glance this appeared to be a database issue — but deeper analysis revealed a different story.


    The Incident

    An automated alert reported session count exceeding threshold (1000+), with the majority in INACTIVE state from middleware connection pool accounts. Despite the spike there were no blocking sessions, no performance degradation, and no database errors.

    -- Quick session count check
    SELECT COUNT(*) FROM gv$session;
    -- Result: 216 (already returning to normal)
    
    -- Session breakdown by status
    SELECT status, COUNT(*) cnt
    FROM gv$session
    GROUP BY status
    ORDER BY cnt DESC;

    Root Cause

    Multiple production mid-tier servers recreated their connection pools within the same time window. New pools created new database sessions while existing pools kept their sessions alive (INACTIVE) pending graceful termination — resulting in a temporary overlap:

    Old Sessions (Inactive) + New Sessions (Active) = Session Surge

    As older pools were cleaned up, inactive sessions terminated automatically and the count returned to baseline. This was not a database problem — it was connection pool lifecycle behavior in the mid-tier layer.
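
    A breakdown of sessions by mid-tier host, program, and logon hour makes this overlap visible. The query below is a read-only sketch against gv$session; in practice it would be filtered to the connection pool accounts in use.

    -- Sketch: spot old vs. new pool sessions by mid-tier host and logon hour
    SELECT machine, program,
           TO_CHAR(TRUNC(logon_time, 'HH24'), 'YYYY-MM-DD HH24') AS logon_hour,
           status, COUNT(*) AS sessions
    FROM   gv$session
    WHERE  type = 'USER'
    GROUP  BY machine, program, TRUNC(logon_time, 'HH24'), status
    ORDER  BY logon_hour, machine;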


    Recommendations

    • Stagger connection pool refresh across mid-tier servers to avoid simultaneous spikes
    • Monitor inactive session trends to detect abnormal accumulation early
    • Configure appropriate idle timeout, maximum pool size, and session reuse settings

    Key Takeaways

    • Not all session spikes are database problems — check middleware behavior first
    • High session count does not necessarily indicate database stress
    • Transient issues still require analysis as they reveal architectural inefficiencies
    • Database alerts can originate from upstream connection management behavior

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes

    Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.


    Observed Symptoms

    ORA-00031: session marked for kill
    Thread 1 cannot allocate new log
    Private strand flush not complete

    Phase 1: Interpreting ORA-00031 Correctly

    ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.
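
    If confirmation is needed, sessions in this state remain visible until the background cleanup completes. A read-only check such as the following sketch lists them:

    -- Sessions already marked for kill; background cleanup is still pending
    SELECT inst_id, sid, serial#, status, program, event
    FROM   gv$session
    WHERE  status = 'KILLED';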


    Phase 2: Identifying the True Performance Signal

    The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.


    Phase 3: Evidence-Based Analysis (Read-Only)

    Redo switch frequency was analyzed to validate system behavior:

    SELECT
        TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
        COUNT(*) AS switches
    FROM v$log_history
    WHERE first_time > SYSDATE - 1
    GROUP BY TRUNC(first_time, 'HH24')
    ORDER BY 1;

    Findings

    Metric                Observation
    Average Switch Rate   5-7 per hour
    Peak Rate             8-10 per hour during business hours
    Off-Peak Rate         1-3 per hour

    A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.
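
    To support that correlation, redo volume per hour can be sampled alongside the switch counts. The sketch below uses the 'Redo Generated Per Sec' metric from v$sysmetric_history; note that this view retains only a limited window, so AWR history may be needed for longer periods.

    -- Sketch: average redo generation rate per hour (read-only)
    SELECT TO_CHAR(TRUNC(begin_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS sample_hour,
           ROUND(AVG(value)/1024/1024, 2) AS avg_redo_mb_per_sec
    FROM   v$sysmetric_history
    WHERE  metric_name = 'Redo Generated Per Sec'
    GROUP  BY TRUNC(begin_time, 'HH24')
    ORDER  BY 1;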


    Why No Changes Were Made

    In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.


    Recommendations

    • Continuously monitor redo switch frequency during peak windows
    • Use collected data to justify future redo log sizing via change management (current log group sizes can be listed as shown below)
    • Avoid unnecessary intervention when behavior is transient and non-impacting
    • Distinguish informational alert log messages from actionable errors
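
    For the sizing discussion, the current redo log layout is a useful data point. This read-only query lists group sizes and member counts:

    -- Current redo log groups: size, members, status (read-only)
    SELECT group#, thread#, bytes/1024/1024 AS size_mb, members, status
    FROM   v$log
    ORDER  BY thread#, group#;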

    Key Takeaways

    • ORA-00031 is expected and harmless — it confirms session termination
    • Redo allocation waits are transient under sustained load
    • Proper analysis prevents unnecessary production intervention
    • Not all alert log warnings indicate failure — some are early signals of workload growth
    • The goal is not to eliminate every alert, but to understand which ones matter

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle EBS Login Issue: Real-Time Production Incident and Fix (OACORE + WebLogic Filter)


    📌 Introduction

    In Oracle E-Business Suite (EBS) environments, login failures are often perceived as simple application issues. However, in complex architectures, they can originate from multiple interacting layers across the application and middleware stack.

    In this blog, I’ll walk through a real-world production incident where an OACORE JVM issue combined with WebLogic security filtering resulted in complete login inaccessibility.

    This case highlights the importance of analyzing both performance and security layers together when troubleshooting critical application outages.


    ⚠️ Issue Summary

    • Users were unable to access the EBS login page
    • Pages were hanging or not loading
    • WebLogic console reported:
    Connection rejected, filter blocked Socket

    🔍 Initial Observation

    From the application server:

    • Load Average: ~10+

    👉 This indicated:

    • High CPU utilization
    • System under heavy stress
    • Potential JVM thread contention

    🔬 Detailed Analysis

    • One of the OACORE managed server JVMs became unresponsive
    • Long-running threads caused thread pool exhaustion
    • Incoming user requests began queueing

    At the same time:

    • WebLogic connection filter was actively enforcing access rules
    • Legitimate requests were being rejected under stressed conditions

    🧠 Understanding the Components

    OACORE (Application Layer)

    Handles:

    • Login requests
    • Forms processing
    • Core application logic

    If JVM threads are exhausted:
    👉 Requests queue → login hangs


    Oracle WebLogic Server Connection Filter

    EBS environments may use:

    oracle.apps.ad.tools.configuration.wls.filter.EBSConnectionFilterImpl

    This filter:

    • Enforces IP-based access control
    • Overrides default allow rules

    If misconfigured or stressed:
    👉 Legitimate traffic may be blocked
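
    For reference, connection filter rules are maintained in the WebLogic console (Domain > Security > Filter). The rules below are purely illustrative; the general format is target address, local address, local port, action, and an optional protocol list, and the actual addresses must match the environment.

    # Illustrative rules only (addresses are examples, not recommendations)
    192.0.2.0/24  *  *  allow  t3 t3s http https
    0.0.0.0/0     *  *  deny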


    🎯 Root Cause Analysis (RCA)

    The login issue was not caused by a single failure point, but by a combination of application tier resource exhaustion and restrictive middleware-level access control.

    • High CPU utilization and long-running threads caused one OACORE JVM to become unresponsive
    • Thread pool exhaustion led to request queuing, preventing new login requests from being processed

    Simultaneously:

    • The WebLogic connection filter (EBSConnectionFilterImpl) enforced strict access control policies
    • Under high load conditions, legitimate client requests were rejected with “filter blocked Socket”

    This interaction between performance degradation and security enforcement amplified the impact, resulting in complete login inaccessibility despite partial system availability.


    🛠️ Resolution Approach (Controlled & Safe)

    The resolution approach focused on stabilizing the JVM layer while validating and correcting middleware-level access controls in a controlled manner.


    🔹 Step 1: Identify Unresponsive JVM

    ps -ef | grep oacore

    ✔ Identify JVM with abnormal CPU or stuck behavior
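
    A lightweight way to do this (assuming JDK tools are available on the node) is to correlate the high-CPU PID with a thread dump before taking any action. The commands below are a sketch and the output path is illustrative.

    # Spot the Java process consuming CPU (batch mode, single sample)
    top -b -n 1 | grep -i java

    # Capture a thread dump of the suspect OACORE JVM for stuck-thread analysis
    jstack <PID> > /tmp/oacore_threads.txt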


    🔹 Step 2: Handle Stuck JVM (Controlled Action)

    ⚠️ Important Note:

    Forcefully terminating JVM processes should NOT be performed without validation.

    ✔ Recommended Approach:

    • Confirm the process is unresponsive
    • Ensure no critical transactions are running
    • Prefer controlled shutdown where possible

    ✔ Example (Only if fully unresponsive and approved):

    kill -9 <PID>

    👉 Node Manager can restart the JVM automatically after termination


    🔹 Step 3: Rolling Restart of OACORE

    admanagedsrvctl.sh stop oacore_server1
    admanagedsrvctl.sh start oacore_server1

    ✔ Ensures clean JVM state
    ✔ Restores thread pool balance


    🔹 Step 4: Validate WebLogic Connection Filter

    ⚠️ Important Note:

    The WebLogic connection filter is a security control and should NOT be disabled permanently.

    ✔ In this case:

    • Filter behavior was validated as part of troubleshooting
    • Temporary relaxation was used to confirm impact

    ✔ Recommended Approach:

    • Review allowed IP ranges
    • Validate filter configuration
    • Re-enable filter after correction

    The filter was re-enabled after validation and correction of configuration.


    🔹 Step 5: Restart Admin Server (If Required)

    adadminsrvctl.sh stop
    adadminsrvctl.sh start

    ✔ Ensures configuration changes are applied


    ✅ Final Outcome

    • System load reduced from ~10 → ~1.5
    • OACORE JVMs stabilized
    • Login page restored
    • Users successfully accessed the application

    📊 Key Learnings

    1. OACORE JVM issues can manifest as complete application outages due to thread exhaustion and request queuing.
    2. WebLogic connection filters can unintentionally block legitimate traffic, especially under high load conditions.
    3. Multi-layer failures (application + middleware + security) significantly amplify incident impact.
    4. System load metrics (CPU, load average) provide early indicators of JVM stress and should be monitored proactively.

    🧩 Preventive Measures

    • Monitor JVM thread utilization proactively
    • Review connection filter configurations periodically
    • Plan for peak concurrency to avoid sudden load spikes
    • Implement alerting for high CPU / load conditions
    • Validate session and request handling patterns

    🔐 Governance Considerations

    • Avoid forceful JVM termination without proper validation
    • Do not disable security controls without understanding impact
    • Follow change management procedures in production
    • Coordinate with application and security teams before changes

    🏁 Conclusion

    This incident demonstrates that login failures in Oracle EBS are not always isolated to a single component but can result from complex interactions across application performance and middleware security layers.

    The combination of JVM resource exhaustion and connection filtering behavior created a compounded failure scenario, leading to complete login disruption.

    A structured, multi-layer troubleshooting approach—focused on performance, configuration, and governance—enabled effective resolution while minimizing risk.

    This reinforces the importance of analyzing both system behavior and security controls together when addressing critical production incidents.


    💡 Pro Tip

    When troubleshooting Oracle EBS login issues, always validate both:

    • JVM health (thread utilization, CPU load)
    • Middleware controls (connection filters, access rules)

    Ignoring either layer can lead to incomplete or misleading diagnosis.