Tag: oracle

  • Applying Oracle 19c Apr 2026 CPU Patch (19.31) on a Single Instance CDB

    In this article, we demonstrate the steps to apply the Oracle April 2026 Critical Patch Update (CPU) on a Single Instance Oracle 19c Container Database (CDB). We will apply combo patch 39062931, which contains 39034528 (Database Apr 2026 Release Update 19.31.0.0.260421) and 38906621 (OJVM Component Release Update 19.31.0.0.260421). We also cover an important real-world scenario where a previously applied overlay patch 34672698 (ORA-00800 fix) conflicted with the Apr 2026 RU, confirming that the fix is now natively included in the RU itself.

    Follow the high-level plan to apply patch 39062931 (19.31.0.0.260421).

    • Go through the README documents of patch IDs before proceeding with patch apply.
    • Download the required patches from My Oracle Support.
    • Upgrade OPatch utility to the latest version as required by the patch README.
    • Perform pre-checks at CDB and PDB level — invalid objects, patch history, registry components.
    • Take Oracle Home TAR backup.
    • Blackout in OEM and disable crontab jobs before maintenance.
    • Shutdown database and listener.
    • Apply DB Release Update (39034528).
    • Apply OJVM Release Update (38906621).
    • Validate overlay patch (34672698) — check if still required post RU upgrade.
    • Start database and listener.
    • Execute datapatch on the database and validate.
    • Uncomment crontab and remove OEM blackout.
    • Perform post-checks and compare with pre-check data.
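    The pre-check outputs in the plan above are easiest to compare after patching if every run is spooled into a timestamped baseline directory. A minimal sketch (the /tmp path and the commented-out commands are illustrative, not from the original runbook):

    ```shell
    # Create a timestamped baseline directory for pre-check outputs
    STAMP=$(date +%Y%m%d_%H%M%S)
    BASE="/tmp/precheck_${STAMP}"      # illustrative location
    mkdir -p "$BASE"
    # A real run would redirect each pre-check here, for example:
    #   $ORACLE_HOME/OPatch/opatch lspatches > "$BASE/lspatches.txt"
    #   sqlplus -s "/ as sysdba" @invalid_objects.sql > "$BASE/invalids.txt"
    [ -d "$BASE" ] && echo "baseline ready"
    ```

    The same directory then serves as the comparison source for the post-checks in Section 11.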

    Environment Details:

    COMPONENTS DETAILS
    Oracle Home /u01/app/oracle/product/19c/db_1
    DB Version 19.30.0.0.0 (Jan 2026 RU)
    OPatch Version (Before) 12.2.0.1.49
    OPatch Version (After) 12.2.0.1.51
    DB Type Single Instance, Multitenant (CDB+PDB)
    CDB Name DEVCDB
    PDBs DEVPDB1, DEVPDB2
    OS Linux x86-64
    Combo Patch 39062931 (19.31.0.0.260421)
    DB RU Patch 39034528
    OJVM Patch 38906621
    Overlay Patch 34672698 (validated — not reapplied)

    1. Pre-Checks

    1.1: Capture invalid objects at CDB level.

    sqlplus / as sysdba
    
    set lines 230
    set pages 100
    col OBJECT_NAME for a40
    col OBJECT_TYPE for a40
    col OWNER for a20
    select OBJECT_NAME, OBJECT_TYPE, OWNER,
           to_char(LAST_DDL_TIME,'DD-MON-YYYY HH24:MI:SS'), status
    from dba_objects
    where status='INVALID'
    order by owner;
    no rows selected

    1.2: Capture patch history from dba_registry_sqlpatch.

    select patch_id, action, status, action_time, description
    from dba_registry_sqlpatch;
    PATCH_ID ACTION STATUS ACTION_TIME DESCRIPTION
    38523609 APPLY SUCCESS 03-FEB-26 OJVM RELEASE UPDATE: 19.30.0.0.260120
    38632161 APPLY SUCCESS 03-FEB-26 Database Release Update: 19.30.0.0.260120

    44 rows selected.

    1.3: Capture DB registry components.

    SELECT COMP_ID, comp_name, version, status FROM dba_registry;
    COMP_ID COMP_NAME VERSION STATUS
    CATALOG Oracle Database Catalog Views 19.30.0.0.0 VALID
    CATPROC Oracle Database Packages and Types 19.30.0.0.0 VALID
    JAVAVM JServer JAVA Virtual Machine 19.30.0.0.0 VALID
    XML Oracle XDK 19.30.0.0.0 VALID
    XDB Oracle XML Database 19.30.0.0.0 VALID
    ORDIM Oracle Multimedia 19.30.0.0.0 VALID
    CONTEXT Oracle Text 19.30.0.0.0 VALID
    … 15 rows total. All components VALID.

    1.4: Connect to each PDB and capture invalid objects as baseline.

    alter session set container=DEVPDB1;
    select OBJECT_NAME, OBJECT_TYPE, OWNER, status
    from dba_objects where status='INVALID'
    order by owner;
    📌 Note: Document all pre-existing invalid objects at PDB level before patching. These serve as a baseline for post-patch comparison. Do not attempt to fix pre-existing invalid objects before patching.

    1.5: Capture current patch inventory from OS.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    34672698 ORA-00800 SOFT EXTERNAL ERROR, ARGUMENTS [SET PRIORITY FAILED], [VKTM], DISM(16)
    38523609 OJVM RELEASE UPDATE: 19.30.0.0.260120 (38523609)
    38632161 Database Release Update: 19.30.0.0.260120 (38632161)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.

    2. Download Patches

    2.1: Create a dedicated directory for April 2026 patches.

    mkdir -p /u01/app/oracle/patches/APR_2026
    cd /u01/app/oracle/patches/APR_2026

    2.2: Download OPatch (patch 6880880).

    ⚠️ Important: Always download the Linux-x86-64 version of OPatch, not the generic LINUX (32-bit) version. The 32-bit LINUX zip will cause oui/lib/linux directory errors during patch apply.
    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p6880880_190000_Linux-x86-64.zip

    2.3: Download Combo Patch and Overlay Patch.

    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p39062931_190000_Linux-x86-64.zip
    
    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p34672698_1930000DBRU_Linux-x86-64.zip

    2.4: Unzip patches and verify structure.

    unzip p39062931_190000_Linux-x86-64.zip
    unzip p34672698_1930000DBRU_Linux-x86-64.zip
    
    ls -l 39062931/
    drwxr-x---. 4 oracle oinstall  6144 38906621    -- OJVM RU
    drwxr-x---. 4 oracle oinstall  6144 39034528    -- DB RU
    -rw-r--r--. 1 oracle oinstall   118 README.html

    3. Upgrade OPatch

    3.1: Check current OPatch version.

    [oracle@dbhost ~]$ $ORACLE_HOME/OPatch/opatch version
    OPatch Version: 12.2.0.1.49
    OPatch succeeded.

    The April 2026 RU requires OPatch 12.2.0.1.51 or higher. We need to upgrade.

    3.2: Backup and replace OPatch.

    mv $ORACLE_HOME/OPatch $ORACLE_HOME/OPatch_old_12.2.0.1.49
    unzip p6880880_190000_Linux-x86-64.zip -d $ORACLE_HOME

    3.3: Verify new OPatch version.

    [oracle@dbhost ~]$ $ORACLE_HOME/OPatch/opatch version
    OPatch Version: 12.2.0.1.51
    OPatch succeeded.
    💡 Tip: If OPatch fails with Invalid maximum heap size: -Xmx5120m, set the following before retrying:

    export OPATCH_JRE_MEMORY_OPTIONS="-Xmx1024m"

    Add this to ~/.bash_profile to make it permanent.


    4. Pre-Patching Activities

    4.1: Comment out crontab entries to prevent jobs from running during maintenance.

    [oracle@dbhost ~]$ (/usr/bin/crontab -l | /bin/sed 's/^/##APR_2026##&/g' | /usr/bin/crontab)
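    The sed round trip above is easy to verify on a scratch file before touching the live crontab. A sketch using the same tag:

    ```shell
    # Comment/uncomment round-trip demo on a scratch file (not the real crontab)
    TAG='##APR_2026##'
    printf '0 2 * * * /opt/scripts/backup.sh\n' > /tmp/cron_demo.txt
    # Comment every line by prefixing the tag
    sed "s/^/${TAG}&/g" /tmp/cron_demo.txt > /tmp/cron_commented.txt
    grep -q "^${TAG}" /tmp/cron_commented.txt && echo "commented"
    # Uncomment by stripping the tag; the result must match the original
    sed "s/${TAG}//g" /tmp/cron_commented.txt > /tmp/cron_restored.txt
    diff -q /tmp/cron_demo.txt /tmp/cron_restored.txt > /dev/null && echo "restored"
    ```

    Because the tag is unique, the uncomment step cannot disturb entries that were already commented out by hand.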

    4.2: Place OEM blackout if OEM is configured for this target.

    📌 Note: If the environment has application-level database triggers, disable them before shutting down the database. Consult the application team before making any trigger changes. Re-enable them after patching is complete.

    5. Shutdown Database and Listener

    5.1: Shutdown the database and listener, and take an Oracle Home TAR backup.

    sqlplus / as sysdba
    SQL> shutdown immediate;
    Database closed.
    Database dismounted.
    ORACLE instance shut down.
    SQL> exit;
    
    lsnrctl stop LISTENER
    
    # Oracle Home TAR backup (run in background)
    nohup tar -zcvf /u01/app/oracle/patches/oracle_home_APR2026.tar.gz $ORACLE_HOME &
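    Before relying on the TAR backup for rollback, confirm the archive is readable. A sketch, demonstrated on a scratch directory rather than the real Oracle Home:

    ```shell
    # Build a small scratch tree and tarball (stand-in for the Oracle Home backup)
    mkdir -p /tmp/oh_demo && echo ok > /tmp/oh_demo/inventory.xml
    tar -zcf /tmp/oh_demo.tar.gz -C /tmp oh_demo
    # Listing the archive end to end proves it is not truncated or corrupt
    tar -tzf /tmp/oh_demo.tar.gz > /dev/null && echo "tar backup verified"
    ```

    For the real backup, the same `tar -tzf` check against oracle_home_APR2026.tar.gz should complete without errors before the patch apply begins.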

    6. Apply DB Release Update (39034528)

    6.1: Set PATH and run the prerequisite conflict check.

    export PATH=$ORACLE_HOME/OPatch:$PATH
    cd /u01/app/oracle/patches/APR_2026/39062931/39034528
    
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    ZOP-47: Superset Patch 39034528 has Subset Patch 38632161 which has overlay patches
    [34672698] and these overlay patches conflict with Superset Patch.
    OPatch failed with error code 73
    ⚠️ Action Required: The prereq check identified that overlay patch 34672698 conflicts with the new DB RU. Roll it back before applying the DB RU.

    6.2: Rollback overlay patch 34672698.

    opatch rollback -id 34672698
    Patch 34672698 successfully rolled back.
    OPatch succeeded.

    6.3: Re-run prereq check after rollback.

    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    Prereq "checkConflictAgainstOHWithDetail" passed.
    OPatch succeeded.

    6.4: Apply the DB Release Update.

    opatch apply
    Patch 39034528 successfully applied.
    Sub-set patch [38632161] has become inactive due to the application of a super-set patch [39034528].
    OPatch succeeded.

    6.5: Verify the patch is applied.

    opatch lsinventory | grep 39034528
    39034528  Database Release Update : 19.31.0.0.260421

    7. Apply OJVM Release Update (38906621)

    7.1: Run prereq check and apply the OJVM patch.

    export PATH=$ORACLE_HOME/OPatch:$PATH
    cd /u01/app/oracle/patches/APR_2026/39062931/38906621
    
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    Prereq "checkConflictAgainstOHWithDetail" passed.
    OPatch succeeded.
    
    opatch apply
    Patch 38906621 successfully applied.
    Sub-set patch [38523609] has become inactive due to the application of a super-set patch [38906621].
    OPatch succeeded.

    7.2: Verify both patches are listed.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    39034528 Database Release Update: 19.31.0.0.260421 (39034528)
    38906621 OJVM RELEASE UPDATE: 19.31.0.0.260421 (38906621)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.

    8. Overlay Patch (34672698) — Conflict with Apr 2026 RU

    The overlay patch 34672698 was previously applied on top of the Jan 2026 RU to fix ORA-00800: SOFT EXTERNAL ERROR, ARGUMENTS: [SET PRIORITY FAILED], [VKTM], DISM(16). It was rolled back in Step 6. Before reapplying, always run the prereq check.

    📌 Note on oradism permissions: Some environments require a temporary oradism ownership change to oracle before applying the overlay patch. Always check the current permissions first:

    ls -l $ORACLE_HOME/bin/oradism

    If oradism is already oracle-owned (e.g., -rwxr-x---. 1 oracle oinstall), no root permission change is required. Only apply the chown/chmod steps if oradism is root-owned with setuid bit.
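    That decision can be scripted. A hedged sketch, demonstrated against a scratch file; a real run would pass $ORACLE_HOME/bin/oradism:

    ```shell
    # Decide whether oradism needs a root-side permission change:
    # only flag it when the file is root-owned AND has the setuid bit
    check_oradism() {
      local f="$1"
      if [ -u "$f" ] && [ "$(stat -c %U "$f")" = "root" ]; then
        echo "root-owned setuid: coordinate chown/chmod with sysadmin"
      else
        echo "no permission change needed"
      fi
    }
    # Demo on a scratch file (owned by the current user, no setuid)
    touch /tmp/oradism_demo
    RESULT=$(check_oradism /tmp/oradism_demo)
    echo "$RESULT"
    ```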

    8.1: Run prereq check before reapplying the overlay patch.

    cd /u01/app/oracle/patches/APR_2026/34672698
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    
    ZOP-40: The patch(es) has conflicts with other patches installed in the Oracle Home.
    Following patches have conflicts: 34672698, 39034528
    OPatch failed with error code 73
    ✔ Expected Result: The prereq confirms overlay patch 34672698 conflicts with Apr 2026 DB RU (39034528). Oracle has included the ORA-00800 fix natively in the Apr 2026 RU. The overlay patch is no longer required and must NOT be reapplied.
    ⚠️ Important: This is expected behaviour when upgrading RUs. An overlay patch applied on a previous RU may become redundant when the next RU natively includes the same fix. Always run prereq check before reapplying overlay patches after an RU upgrade.

    8.2: Verify the overlay patch is absent from inventory.

    opatch lsinventory | grep 34672698
    [oracle@dbhost 34672698]$
    ✔ Confirmed: No output from the grep confirms the overlay patch is correctly absent from the Oracle Home.

    9. Start Database and Run Datapatch

    9.1: Start the database and open all PDBs.

    sqlplus / as sysdba
    SQL> startup;
    ORACLE instance started.
    Database mounted.
    Database opened.
    
    SQL> alter pluggable database all open;
    Pluggable database altered.
    
    SQL> exit;
    
    lsnrctl start LISTENER

    9.2: Run datapatch to apply the SQL portion of the patches.

    cd $ORACLE_HOME/OPatch
    ./datapatch -verbose
    SQL Patching tool version 19.31.0.0.0 Production on Tue May 5 13:17:39 2026
    
    Installation queue:
      For CDB$ROOT PDB$SEED DEVPDB1 DEVPDB2:
        Rollback: 38523609 (OJVM 19.30)
        Apply: 39034528 (DB RU 19.31)
        Apply: 38906621 (OJVM 19.31)
    
    Patch 38523609 rollback (CDB$ROOT): SUCCESS
    Patch 39034528 apply (CDB$ROOT): SUCCESS
    Patch 38906621 apply (CDB$ROOT): SUCCESS
    Patch 38523609 rollback (DEVPDB1): SUCCESS
    Patch 39034528 apply (DEVPDB1): SUCCESS
    Patch 38906621 apply (DEVPDB1): SUCCESS
    Patch 38523609 rollback (DEVPDB2): SUCCESS
    Patch 39034528 apply (DEVPDB2): SUCCESS
    Patch 38906621 apply (DEVPDB2): SUCCESS
    
    SQL Patching tool complete on Tue May 5 13:24:18 2026
    📌 Note: Datapatch applies the SQL changes to both CDB and all open PDBs. Wait for it to complete fully before proceeding. Do not interrupt the process.

    10. Post-Patching Activities

    10.1: Uncomment crontab entries.

    [oracle@dbhost ~]$ (/usr/bin/crontab -l | /bin/sed 's/##APR_2026##//g' | /usr/bin/crontab)

    10.2: Remove OEM blackout if applicable.

    📌 Note: If application-level database triggers were disabled before patching, re-enable them now after confirming datapatch completed successfully. Consult the application team before enabling.

    11. Post-Checks

    11.1: Verify patch registry — all entries should show SUCCESS.

    select patch_id, action, status, action_time, description
    from dba_registry_sqlpatch
    order by action_time;
    PATCH_ID ACTION STATUS ACTION_TIME DESCRIPTION
    38523609 ROLLBACK SUCCESS 05-MAY-26 01:20:47 OJVM RELEASE UPDATE: 19.30.0.0.260120
    38906621 APPLY SUCCESS 05-MAY-26 01:21:02 OJVM RELEASE UPDATE: 19.31.0.0.260421
    39034528 APPLY SUCCESS 05-MAY-26 01:21:56 Database Release Update: 19.31.0.0.260421

    11.2: Verify all DB registry components are VALID.

    SELECT COMP_ID, comp_name, version, status FROM dba_registry;
    COMP_ID COMP_NAME VERSION STATUS
    CATALOG Oracle Database Catalog Views 19.31.0.0.0 VALID
    CATPROC Oracle Database Packages and Types 19.31.0.0.0 VALID
    JAVAVM JServer JAVA Virtual Machine 19.31.0.0.0 VALID
    XML Oracle XDK 19.31.0.0.0 VALID
    XDB Oracle XML Database 19.31.0.0.0 VALID
    ORDIM Oracle Multimedia 19.31.0.0.0 VALID
    CONTEXT Oracle Text 19.31.0.0.0 VALID
    … 15 rows total. All components VALID.

    11.3: Confirm final patch inventory.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    39034528 Database Release Update: 19.31.0.0.260421 (39034528)
    38906621 OJVM RELEASE UPDATE: 19.31.0.0.260421 (38906621)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.
    ✔ Confirmed: Overlay patch 34672698 is no longer listed. Its fix is now natively included in the Apr 2026 RU (39034528). This is the expected and correct final state.

    11.4: Compare invalid objects at CDB and PDB level with pre-patch baseline. Compile any new invalid objects introduced by the patch:

    @$ORACLE_HOME/rdbms/admin/utlrp.sql
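    Spooling the invalid-object queries to sorted OWNER/OBJECT_NAME files makes the pre/post comparison mechanical. A sketch with illustrative file contents:

    ```shell
    # Pre- and post-patch baselines as sorted "OWNER OBJECT_NAME" lines
    # (illustrative contents; real files would be spooled from the queries above)
    printf 'APP PKG_A\nAPP PKG_B\n'             | sort > /tmp/invalid_pre.txt
    printf 'APP PKG_A\nAPP PKG_B\nAPP VW_NEW\n' | sort > /tmp/invalid_post.txt
    # Lines only in the post-patch file are new invalids introduced by patching
    NEW_INVALIDS=$(comm -13 /tmp/invalid_pre.txt /tmp/invalid_post.txt)
    echo "$NEW_INVALIDS"
    ```

    Only the objects reported here need compiling or investigating; pre-existing invalids from the baseline are left alone, as noted in Step 1.4.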

    Key MOS References

    Document Description
    KB869205 Oracle Database 19c Apr 2026 RU Known Issues
    KB137197 OJVM Conditional Rolling Install Details
    244241.1 OPatch Support for RAC Rolling Patches
    293369.1 OPatch Documentation List

    Disclaimer: All server names, hostnames, database names, and environment-specific paths in this post have been anonymized. Steps and outputs are based on real production experience adapted for general use. Always test in a non-production environment before applying patches to production systems.

    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More patching walkthroughs at syedanwarahmedoracle.blog.


    💬 Share Your Experience

    Have you applied Oracle 19c CPU patches and hit a similar overlay patch conflict? Or found that a fix you applied months ago is now natively bundled into the next RU? Your real-world experiences help fellow DBAs tackle the same challenges. Drop a comment below — questions, observations, and feedback are always welcome.

  • P1 Incident: /dbbackup Filesystem 100% Full — How We Traced, Fixed, and Recovered Two Failed RMAN Backups in One Night

    🔴 Incident Overview

    Severity P1 — Two production database backups failed
    Environment Oracle 19c (19.30) on Linux x86-64
    Backup Tool RMAN with Recovery Catalog
    Backup Volume /dbbackup — 1TB LVM filesystem
    Databases Affected DBPRO01 (12:30 failure), DBPRO02 (22:30 failure)
    Total Space Recovered ~492G (disk went from 100% → 52%)

    1. The Alerts — Two Failures, Same Root Cause

    It started with an RMAN failure at 22:30. The backup script for DBPRO02 fired on schedule and died within 2 minutes. The RMAN log told the story clearly:

    RMAN-03009: failure of backup command on c4 channel at 22:32:13
    ORA-19502: write error on file "/dbbackup/DBPRC02/rman/DiffInc_DBPRC02_4u4mi8bf"
    ORA-27072: File I/O error
    Additional information: 4
    

    Three more channels followed — c1, c2, c3 — all crashing at exactly 22:32:48. When multiple channels fail simultaneously at the same timestamp, it almost always means one thing: the destination filesystem just hit 100%.

    A quick check confirmed it:

    $ df -hP /dbbackup
    Filesystem                        Size  Used Avail Use%
    /dev/mapper/orabkupvg-orabkuplv1 1023G 1020G  3.3G 100%
    

    1TB volume. 3.3G free. Completely full.

    What we did not know yet: digging into the backup history would reveal that DBPRO01 had already failed at 12:30 that same day, 10 hours earlier, for the same reason. Two databases unprotected on the same night.


    2. The Investigation — Folder by Folder

    The first step was understanding what was consuming the disk. One command gave us the top-level picture:

    $ du -sh /dbbackup/*
    744G    DBPRC01
    152G    ColdBackup_11April2026
     41G    DBPRC02
     26G    DBPRC03
     15G    DBPRO01
    7.7G    JAN2026_CPU
    7.7G    OCT2025_CPU
    5.3G    infra_arch
    

    744G inside DBPRC01 alone — 73% of the entire disk. That was our primary suspect.

    Drilling into DBPRC01

    $ du -sh /dbbackup/DBPRC01/rman/* | sort -rh | head -10
    7.5G    DiffInc_DBPRC01_fl4lumij
    7.5G    DiffInc_DBPRC01_eg4lrnh5
    7.5G    DiffInc_DBPRC01_b44lc7uh
    ...
    

    Every single file was a DiffInc_ or ArchivelogAll_ backup piece. No variety. No cleanup. Just backup after backup piling up.

    $ ls /dbbackup/DBPRC01/rman/ | wc -l
    1522
    

    1,522 backup pieces. We checked the oldest and newest:

    Oldest file on disk:  2022-05-07
    Newest file on disk:  2026-04-25
    

    Four years of backup files on disk — or so we thought.


    3. The RMAN Investigation — Where Things Got Interesting

    We connected RMAN to the database and ran the retention check:

    RMAN> SHOW RETENTION POLICY;
    CONFIGURE RETENTION POLICY TO REDUNDANCY 30;
    

    REDUNDANCY 30. This tells RMAN to keep the last 30 complete backup copies of every datafile before considering anything obsolete.

    Next logical step — check what RMAN considers obsolete:

    RMAN> REPORT OBSOLETE;
    no obsolete backups found
    

    Nothing? With 1,522 files on disk?

    We ran CROSSCHECK BACKUP — all 1,693 objects came back AVAILABLE. Then we checked the actual date range RMAN was tracking from the database control file:

    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    OLDEST          NEWEST          TOTAL_PIECES
    03-DEC-2025     25-APR-2026     1541
    

    The control file only tracks pieces from December 2025 onwards — about 5 months. The 2022/2023 files seen on disk were old directories and scripts, not backup pieces. All 1,541 current pieces were legitimate and RMAN considered every one of them necessary under REDUNDANCY 30.

    This was the key insight: RMAN was not broken. The retention policy itself was the problem.


    4. Root Cause — The Architecture Trap

    The deeper investigation revealed something unexpected. Looking at the actual RMAN backup script:

    connect target rman/password@DBPRO01
    ...
    format '/dbbackup/DBPRC01/rman/DiffInc_%d_%u'
    (database);
    ...
    delete obsolete;
    

    DBPRO01 (the production database) was backing up INTO the DBPRC01 directory. The directory names suggested one database but contained another database’s backups entirely. The naming convention was PRO to PRC — production database backups stored in the production-copy directory.

    This pattern existed for all three database pairs on the server. Each production database backed up into its corresponding copy directory.

    The delete obsolete command was in the script — but with REDUNDANCY 30 and weekly Level 0 backups, obsolete only kicks in after 30 complete Level 0 cycles. That is 30 weeks = 7.5 months of retention. Since the current tracking window was only 5 months, delete obsolete ran every night and found absolutely nothing to delete.

    The math:

    Retention policy REDUNDANCY 30
    Level 0 frequency Weekly (Sundays)
    Effective retention period ~30 weeks / 7.5 months
    Backup tracking since December 2025 (~5 months)
    Result delete obsolete finds nothing — ever
    Daily backup size ~7–7.5G per run
    Total accumulated 744G
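    The arithmetic in the table can be sketched directly, using the same 4-weeks-per-month approximation the article uses:

    ```shell
    # Effective retention under REDUNDANCY 30 with weekly Level 0 backups
    REDUNDANCY=30        # keep 30 complete Level 0 copies
    LEVEL0_PER_WEEK=1    # one Sunday Level 0 per week
    WEEKS=$(( REDUNDANCY / LEVEL0_PER_WEEK ))
    MONTHS=$(awk -v w="$WEEKS" 'BEGIN { printf "%.1f", w / 4 }')
    echo "effective retention: ${WEEKS} weeks (~${MONTHS} months)"
    ```

    With only ~5 months of pieces in the tracking window, nothing ever ages past that 30-week horizon, which is exactly why `delete obsolete` never found work to do.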

    Adding fuel to the fire — the patching activity on April 18 triggered an extra Level 0 backup, followed by the regular Sunday Level 0 on April 19. Two large Level 0 runs (~27G each) within 24 hours wrote the final ~54G that pushed the disk over the edge.


    5. Secondary Findings During Investigation

    OCT2025 CPU Patch Artifacts (7.7G)

    The October 2025 CPU patch files (zip archives + extracted directories) were still sitting in /dbbackup/OCT2025_CPU/. A quick OPatch check confirmed the database had since been patched to 19.30 (January 2026 RU) — the October 2025 patches were fully superseded and rolled back from inventory. Safe to delete immediately.

    $ $ORACLE_HOME/OPatch/opatch lsinventory | grep -E "38291812|38194382"
    # Empty — neither Oct 2025 patch in inventory anymore
    

    5-Year-Old Pre-Migration Export Dumps

    Three directories contained Oracle 11.2.0.4 export dumps from January–March 2021 — taken before the migration to 19c. With the database now running 19.30, these had zero recovery value but occupied ~14G collectively. Flagged for manager approval before deletion.

    Recovery Catalog Version Mismatch

    The original RMAN log flagged this warning:

    PL/SQL package RMAN.DBMS_RCVCAT version 19.11.00.00 in RCVCAT database is not current
    PL/SQL package RMAN.DBMS_RCVMAN version 19.11.00.00 in RCVCAT database is not current
    

    The recovery catalog is running 19.11 packages while the RMAN client is now 19.30. Non-critical tonight but requires UPGRADE CATALOG in the next maintenance window.


    6. The Fix — Emergency Space Recovery

    With management approval obtained, we executed a time-based delete — keeping the last 30 days of backups and removing everything older:

    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    

    This command does three things in a single operation:

    1. Queries catalog/controlfile for all pieces completed before the cutoff date
    2. Deletes the physical files from disk
    3. Removes the records from RMAN catalog — no orphaned entries, no catalog drift

    The output scrolled for several minutes:

    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/DiffInc1_DBPRC01_6u4jaaih ...
    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/ArchivelogAll_DBPRC01_784jab6o ...
    ...
    Deleted 1072 objects
    

    1,072 backup pieces deleted. Catalog updated. Disk checked:

    BEFORE:  Used 1020G  Avail 3.3G  (100%)
    AFTER:   Used  536G  Avail 488G   (53%)
    

    Then the OCT2025_CPU directory was removed:

    $ rm -rf /dbbackup/OCT2025_CPU/
    $ df -hP /dbbackup
    Used 528G  Avail 495G  (52%)
    

    Final result: 495G free. Disk at 52%.
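    The recovered-space figure follows directly from the two df snapshots:

    ```shell
    # Space-recovery arithmetic from the df outputs above
    BEFORE_USED=1020   # GB used when the filesystem hit 100%
    AFTER_USED=528     # GB used after the RMAN delete + OCT2025_CPU removal
    RECOVERED=$(( BEFORE_USED - AFTER_USED ))
    echo "recovered: ${RECOVERED}G"
    ```

    That matches the ~492G figure in the incident overview.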

    Both failed backups were re-submitted immediately and ran successfully in parallel:

    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO01.sh &
    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO02.sh &
    
    $ jobs -l
    [1] Running   nohup sh ...rman_backup_DBPRO01.sh &
    [2] Running   nohup sh ...rman_backup_DBPRO02.sh &
    

    7. Incident Timeline

    12:30 DBPRO01 Level 0 backup fails — ORA-19502/ORA-27072 (disk full)
    22:30 DBPRO02 Level 0 backup fails — same errors, all 4 channels
    23:08 Investigation begins — df -hP /dbbackup confirms 100% full
    23:15 DBPRC01 directory identified as 744G consumer
    23:25 RMAN connected — REDUNDANCY 30 discovered
    23:35 Architecture confirmed — PRO databases backing up into PRC directories
    23:45 Root cause confirmed — 7.5-month retention, delete obsolete finds nothing
    23:50 DELETE BACKUP COMPLETED BEFORE SYSDATE-30 executed
    23:51 1,072 pieces deleted — disk drops to 53%
    23:55 OCT2025_CPU removed — disk at 52%, 495G free
    00:00 Both backup jobs re-submitted and running successfully

    8. Permanent Fix Recommendations

    Fix 1 — Change Retention Policy to RECOVERY WINDOW

    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    

    REDUNDANCY 30 with weekly Level 0s means 7.5 months of retention — far beyond what any production SLA requires. A 14-day recovery window keeps 2 weeks of backups regardless of backup frequency, and delete obsolete will actually find and remove old pieces going forward.

    Fix 2 — Add Pre-Backup Space Check to Cron Script

    #!/bin/bash
    BACKUP_FS="/dbbackup"
    THRESHOLD=20
    MAILTO="<DBA_TEAM_EMAIL>"   # alert distribution list
    
    # Free percentage = 100 minus the Use% column from df
    AVAIL_PCT=$(df -hP $BACKUP_FS | awk 'NR==2 {gsub(/%/,""); print 100-$5}')
    
    if [ "$AVAIL_PCT" -lt "$THRESHOLD" ]; then
      echo "ABORT: $BACKUP_FS is ${AVAIL_PCT}% free — below ${THRESHOLD}% threshold" \
        | mailx -s "BACKUP ABORTED: Low space on $BACKUP_FS" "$MAILTO"
      exit 1
    fi
    

    A failing backup that writes 3G before dying is worse than a backup that never starts — it wastes the last 3G of free space and leaves partial pieces on disk.
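    The awk one-liner's logic is easy to sanity-check against canned df output, here the 100%-full line from the night of the incident:

    ```shell
    # Run the same awk expression against a canned two-line df output:
    # a Use% of 100 must yield 0% free
    DF_OUT='Filesystem Size Used Avail Use% Mounted
    /dev/mapper/orabkupvg-orabkuplv1 1023G 1020G 3.3G 100% /dbbackup'
    AVAIL_PCT=$(printf '%s\n' "$DF_OUT" | awk 'NR==2 {gsub(/%/,""); print 100-$5}')
    echo "free: ${AVAIL_PCT}%"
    ```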

    Fix 3 — Upgrade the Recovery Catalog

    RMAN> CONNECT TARGET /
    RMAN> CONNECT CATALOG rman/password@rmancat
    RMAN> UPGRADE CATALOG;
    RMAN> UPGRADE CATALOG;   -- run twice as prompted
    

    The catalog packages are several Release Updates behind the 19.30 RMAN client. Some catalog-dependent operations will start failing if this is left unaddressed.

    Fix 4 — Filesystem Monitoring Alert

    The FRA check scripts already email on FRA usage above 80%. The same pattern should exist for /dbbackup. A simple cron entry checking disk usage every hour with alert at 80% would have caught this days before the disk hit 100%.
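    Such a check could look like the following sketch (the function name and wiring are illustrative; alert delivery is omitted, and the demo call uses an unreachable threshold so it always reports OK):

    ```shell
    # Hourly filesystem-usage check (sketch; cron would call this with
    # the real mount point and an 80% threshold)
    check_fs_usage() {
      local fs="$1" threshold="$2" used
      used=$(df -hP "$fs" | awk 'NR==2 {gsub(/%/,""); print $5}')
      if [ "$used" -ge "$threshold" ]; then
        echo "ALERT: ${fs} at ${used}% (>= ${threshold}%)"
      else
        echo "OK: ${fs} at ${used}%"
      fi
    }
    # Demo against / with threshold 101 so the demo never alerts
    RESULT=$(check_fs_usage / 101)
    echo "$RESULT"
    ```

    Wired to cron with a real 80% threshold and a mailx alert, this would have flagged /dbbackup days before it reached 100%.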


    9. Key Takeaways for Oracle DBAs

    REDUNDANCY N is not always safer than RECOVERY WINDOW. REDUNDANCY 30 with weekly Level 0 backups means 7.5 months of retention — likely far beyond your RPO requirement and a silent space accumulator.

    • Always verify what delete obsolete actually deletes. If it finds nothing to delete every single night, that is a warning sign — not reassurance.
    • Check backup naming conventions carefully. When a directory named DBPRC01 contains DBPRO01 backups, it is easy to audit the wrong database; cleanup behavior is controlled by the RMAN configuration of the database that writes the backups, not by the directory name.
    • Patching days generate oversized backups. A Level 0 taken manually on patch day plus the regular Sunday Level 0 the next day equals 2x the normal space consumption in 24 hours. Ensure extra headroom exists going into patch windows.
    • Use DELETE BACKUP COMPLETED BEFORE SYSDATE-N for emergency cleanup — not OS-level rm. RMAN deletes atomically update both the physical files and the catalog, preventing expired/orphaned piece confusion later.
    • Never use rm on RMAN backup pieces directly unless you follow up with CROSSCHECK BACKUP and DELETE EXPIRED BACKUP to sync the catalog.

    10. Commands Reference — Quick Cheat Sheet

    -- Check retention policy
    RMAN> SHOW RETENTION POLICY;
    
    -- Preview what would be deleted (dry run)
    RMAN> REPORT OBSOLETE;
    RMAN> LIST BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Emergency cleanup — delete pieces older than 30 days
    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Standard cleanup based on retention policy
    RMAN> DELETE NOPROMPT OBSOLETE;
    
    -- Sync catalog after any OS-level file operations
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE NOPROMPT EXPIRED BACKUP;
    
    -- Change to time-based retention (recommended)
    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    
    -- Check backup piece date range in control file
    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    -- Check backup history
    SELECT session_key, input_type, status,
           TO_CHAR(start_time,'YYYY-MM-DD HH24:MI:SS') start_time,
           output_bytes_display, time_taken_display
    FROM v$rman_backup_job_details
    ORDER BY start_time DESC;
    

    Conclusion

    What appeared to be a simple disk full incident turned out to involve a multi-database backup architecture, a misconfigured retention policy, and a cleanup mechanism that was technically running correctly but never finding anything to clean. The fix itself — one RMAN command — took under 5 minutes. The real work was the systematic investigation to understand exactly what was safe to delete and why.

    That is Oracle DBA work in a nutshell: the fix is often simple; understanding why it is safe to run is the real job.


    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More incident walkthroughs at syedanwarahmedoracle.blog.

  • Stuck Archiver on Oracle 11g RAC: How a Full FRA Brought Down Archiving — and How We Fixed It Live

    Introduction

    One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.

    This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.

    Environment Overview

    Component Details
    Database Oracle 11g R2 (11.2.0.4.0)
    Architecture RAC — 2 Nodes
    Database Role PRIMARY — READ WRITE, ARCHIVELOG mode
    OS Linux (x86_64)
    FRA Location ASM Diskgroup +ARCHIVELOG
    FRA Size (pre-incident) 199 GB
    FRA Size (post-fix) 400 GB

    The Incident: What We Saw

    The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:

    ORA-00257: archiver error. Connect internal only, until freed.
    ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
    ARCH: Archival stopped, error occurred. Will continue retrying
    ORACLE Instance PRODDB1 - Archival Error

    The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.

    Step 1: Check Redo Log Status

    First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.

    SELECT group#, status, archived, members FROM v$log ORDER BY group#;
    
    GROUP#   STATUS           ARCHIVED   MEMBERS
    ------   ---------------  ---------  -------
    1        ACTIVE           NO         2
    2        CURRENT          NO         2
    3        INACTIVE         YES        2
    4        INACTIVE         YES        2

    Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.

    Step 2: Confirm the Archiver Status

    -- Check both RAC instances
    SELECT inst_id, instance_name, host_name, version,
           database_role, open_mode, log_mode
    FROM gv$instance ORDER BY inst_id;
    
    INST_ID  INSTANCE_N  VERSION       DATABASE_ROLE  OPEN_MODE   LOG_MODE
    -------  ----------  ------------  -------------  ----------  ----------
          1  PRODDB1     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
          2  PRODDB2     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
    
    SELECT instance_name, status, archiver FROM v$instance;
    
    INSTANCE_NAME    STATUS       ARCHIVER
    ---------------- ------------ --------
    PRODDB1          OPEN         STOPPED

    ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.

    Step 3: Check FRA Usage

    SELECT
        space_limit / (1024*1024*1024)       AS limit_gb,
        space_used  / (1024*1024*1024)       AS used_gb,
        space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
        ROUND((space_used / space_limit) * 100, 2) AS pct_used
    FROM v$recovery_file_dest;
    
    LIMIT_GB   USED_GB   RECLAIMABLE_GB   PCT_USED
    --------   -------   --------------   --------
    199        198.99     0.00            100.00

    100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.

    Step 4: Identify What Was Consuming the FRA

    SELECT file_type,
        ROUND(space_used / (1024*1024*1024), 2)        AS used_gb,
        ROUND(space_reclaimable / (1024*1024*1024), 2) AS reclaimable_gb,
        number_of_files
    FROM v$recovery_area_usage
    ORDER BY space_used DESC;
    
    FILE_TYPE       USED_GB   RECLAIMABLE_GB   NUMBER_OF_FILES
    -----------     -------   --------------   ---------------
    ARCHIVED LOG    38.50     0.05             1,200+
    BACKUP PIECE    10.20     0.07             45
    FLASHBACK LOG    1.30     0.00             12

    Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.

    Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike

    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM v$archived_log
    WHERE dest_id=1
    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
    ORDER BY 1;
    
    Hour                 Archived Logs   GB Generated
    ------------------   -------------   ------------
    Daytime (typical)    2–26            < 1.3 GB
    01:00 AM (nightly)   440–446         ~31 GB ⚠
    03:00–04:00 AM       191–411         11–28 GB ⚠
    05:00–06:00 AM       537–796         34–50 GB ⚠

    Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.
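
    The FRA sizing that follows from this analysis can be sanity-checked with simple arithmetic. The sketch below uses the peak figures quoted above; the retention and headroom inputs are illustrative assumptions, not values taken from the incident.

```shell
# Sketch: back-of-envelope FRA sizing from measured redo rates.
# peak_gb_per_hour and batch_hours come from the v$archived_log analysis
# above; retention_nights and headroom_pct are assumed policy inputs.

peak_gb_per_hour=50      # worst hourly redo during the nightly batch
batch_hours=5            # 01:00-06:00 window
retention_nights=1       # nights of archived logs to survive a failed backup
headroom_pct=30          # margin for backup pieces, flashback logs, growth

batch_gb=$((peak_gb_per_hour * batch_hours))
retained_gb=$((batch_gb * retention_nights))
sized_gb=$((retained_gb * (100 + headroom_pct) / 100))

echo "Per-batch redo:  ${batch_gb} GB"
echo "Retained redo:   ${retained_gb} GB"
echo "Suggested FRA:   ${sized_gb} GB"
```

    With these inputs the estimate lands around 325 GB; the 400 GB chosen in Action 2 below adds margin for more than one night of failed backups.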

    The Resolution — Three Actions

    Action 1: Manually Delete Archive Logs from ASM

    Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:

    su - grid
    asmcmd
    
    ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
    # Recent logs retained to preserve recoverability

    After ASMCMD deletion, the RMAN catalog was resynced:

    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> DELETE EXPIRED ARCHIVELOG ALL;

    Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.

    Action 2: Increase FRA Size Online

    -- Increase FRA from 199GB to 400GB (online, no restart required)
    ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';

    SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:

    ARC1: Archival started
    ARC2: Archival started
    ARC3: Archival started
    ORACLE Instance PRODDB1 - Archival Restarted

    Applications reconnected and transactions resumed within seconds.

    Action 3: Verify Database and Cluster Status

    SELECT instance_name, archiver FROM v$instance;
    -- PRODDB1   STARTED
    
    SELECT round(space_used/1024/1024/1024,2) used_gb,
           round(space_limit/1024/1024/1024,2) limit_gb,
           round((space_used/space_limit)*100,2) used_pct
    FROM v$recovery_file_dest;
    -- 202.80 GB / 400 GB / 50.70%
    
    SELECT group#, status, archived FROM v$log ORDER BY group#;
    -- All groups INACTIVE/YES except CURRENT group

    Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.

    Root Cause Confirmed: Archive Log Backups Were Not Running

    With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.

    Complete Incident Chain:
    ARCH backup silently fails (missing Oracle env vars in cron) → 3,094 archived logs accumulate over multiple nights → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM) → FRA hits 100% (198.99 GB / 199 GB) → ARCn cannot write → Both instances enter INTERMEDIATE state → ORA-00257 → Applications hang → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB → Archiver restarts → Cron fix applied → ARCH backup reruns successfully → FRA space recovered → Cluster fully healthy

    Fix: Correct the Cron Environment

    Broken cron entry:

    05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    Fixed cron entry:

    05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1
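
    Sourcing the profile fixes the immediate gap, but the script can also defend itself. The sketch below is a hypothetical pre-flight check; the function name, mail address, and wrapper pattern are illustrative and not part of the original RMAN_Hotbackup.ksh:

```shell
# Hypothetical pre-flight check for cron-launched RMAN scripts: fail loudly
# instead of exiting silently when the Oracle environment is incomplete.

check_oracle_env() {
    # Report and return non-zero if any required variable is unset.
    for var in ORACLE_SID ORACLE_HOME; do
        eval val=\$$var
        if [ -z "$val" ]; then
            echo "FATAL: $var is not set in the cron environment" >&2
            return 1
        fi
    done
    return 0
}

# Intended use at the top of the backup script (address is a placeholder):
#   . /home/oracle/.bash_profile
#   check_oracle_env || { mailx -s "RMAN ARCH pre-flight failed" dba@example.com </dev/null; exit 1; }
```

    The job still runs from cron, but a missing environment now produces an explicit failure instead of an empty log.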

    ARCH Backup Reruns and Completes

    After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:

    . /home/oracle/.bash_profile
    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH

    The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.

    Preventive Measures Put in Place

    1. Fix the cron environment — all RMAN jobs now source ~/.bash_profile before executing.
    2. RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status.
    3. FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted.
    4. FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
    5. RAC: always use SID='*' — and always verify ASM diskgroup free space before increasing the FRA limit.
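
    The 70% FRA check in measure 3 splits naturally into a query and a decision. A minimal sketch follows, with the sqlplus call shown only as a comment so the threshold logic stands alone; the 90% critical tier is an added assumption on top of the article's 70% threshold:

```shell
# Sketch of the FRA threshold monitor (preventive measure 3). In the real
# job the percentage would come from the database, e.g.:
#   pct=$(sqlplus -s / as sysdba <<'EOF'
#   set heading off feedback off
#   SELECT ROUND(space_used/space_limit*100) FROM v$recovery_file_dest;
#   EOF
#   )

fra_alert_level() {
    pct=$1
    if [ "$pct" -ge 90 ]; then
        echo "CRITICAL"
    elif [ "$pct" -ge 70 ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

fra_alert_level 50    # OK
fra_alert_level 72    # WARNING: still enough runway to act
fra_alert_level 100   # CRITICAL: archiver at risk
```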

    Key Takeaways for Oracle DBAs

    • ORA-00257 always points to the FRA being full — check v$recovery_file_dest and v$recovery_area_usage immediately.
    • When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run CROSSCHECK and DELETE EXPIRED to resync the catalog.
    • db_recovery_file_dest_size is dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.
    • Use SID='*' in RAC — applies the parameter change to all instances simultaneously.
    • The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
    • Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
    • Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
    • After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
    • Analyse redo generation patterns — v$archived_log reveals batch spikes that should directly inform FRA sizing decisions.

    Conclusion

    A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.

    The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.

    The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.

    Syed Anwar Ahmed | Oracle Apps DBA

  • Rethinking UNDO Tablespace Monitoring: When 97% Usage Is NOT a Problem

    A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.


    Initial Observations

    Standard checks showed tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB of free space — on the face of it, nearly full. The UNDO internals told a different story.


    The Turning Point

    Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:

    SELECT status,
           ROUND(SUM(bytes)/1024/1024,2) MB
    FROM dba_undo_extents
    GROUP BY status;

    Results showed: ACTIVE ~0 MB, UNEXPIRED ~7.3 GB, EXPIRED ~8.6 GB.


    Key Insight

    EXPIRED UNDO extents are fully reusable by Oracle. This means ~55% of the tablespace was immediately reusable and no actual space pressure existed.


    Real Usage Calculation

    SELECT ROUND(
           SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
           / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents;

    Actual usage was only ~45% — not 97%.


    Root Cause of the False Alert

    The monitoring system was using DBA_FREE_SPACE which does not include reusable EXPIRED extents. This is incorrect for UNDO tablespaces and leads to false critical alerts, unnecessary escalations, and wasted DBA effort.


    The Correct Monitoring Approach

    SELECT tablespace_name,
           ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
    / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents
    GROUP BY tablespace_name;

    Only alert when actual usage exceeds 90% AND long-running transactions exist. The full script is available on GitLab: gitlab.com/sdanwarahmed/oracle-dba-scripts
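
    As a minimal sketch of that alert rule (not the full GitLab script), the decision can be isolated from the queries that feed it; both inputs below would come from dba_undo_extents and v$transaction:

```shell
# Sketch of the combined alert condition: only page when actual usage
# (ACTIVE + UNEXPIRED) exceeds 90% AND long-running transactions exist.
# Inputs are plain parameters here so the rule is testable without a DB.

undo_alert() {
    actual_used_pct=$1    # ACTIVE+UNEXPIRED as % of all undo extents
    long_running_txns=$2  # transactions open longer than your threshold
    if [ "$actual_used_pct" -gt 90 ] && [ "$long_running_txns" -gt 0 ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

undo_alert 45 0   # this incident: actual usage ~45%, no alert
undo_alert 93 2   # genuine pressure: ALERT
```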


    Key Takeaways

    • UNDO is not a regular tablespace — it has its own lifecycle
    • DBA_FREE_SPACE is misleading for UNDO monitoring
    • Always analyze UNDO extents by status: ACTIVE, UNEXPIRED, EXPIRED
    • Build context-aware monitoring that aligns with Oracle internals
    • Not all “97% full” alerts are real problems — sometimes it’s just the wrong metric

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn  |  GitLab

  • Rethinking UNDO Tablespace Monitoring in Multitenant Environments: A Case Study on APPS_UNDOTS1

    Introduction

    A production alert indicating UNDO tablespace usage at ~99% is typically treated as a critical issue. However, in Oracle Multitenant environments with local UNDO enabled, this metric can be misleading when evaluated using traditional tablespace monitoring techniques.

    This article presents a real-world scenario involving the PDB-level UNDO tablespace APPS_UNDOTS1, where utilization reached ~99%, yet no actual resource contention existed.


    The Incident

    An alert was triggered showing:

    Tablespace              TOTAL_MB     USED_MB     FREE_MB  PCT_USED
    --------------------  ----------  ----------  ----------  --------
    APPS_UNDOTS1           28,863.25   28,681.56      181.69     99.37

    At first glance, this indicated:

    • UNDO nearly full
    • Minimal free space
    • Risk of ORA-30036 / ORA-01555

    Environment Context

    This issue occurred in a Multitenant environment with Local UNDO enabled:

    • UNDO tablespace: APPS_UNDOTS1
    • Scope: PDB-level
    • Each PDB maintains its own UNDO

    UNDO behavior must always be analyzed at the PDB level, not just at the CDB level.


    Investigation Approach

    Rather than relying on % used, a state-based analysis was performed.


    Step 1: Active Transactions

    SELECT COUNT(*) FROM v$transaction;

    Observation:

    • Minimal active transactions
    • No ongoing workload pressure

    Step 2: UNDO Extent Analysis

    SELECT status, SUM(bytes)/1024/1024 MB
    FROM dba_undo_extents
    GROUP BY status;

    Key Observation

    STATUS           MB       PCT
    ----------  -------  --------
    ACTIVE        ~3 MB    ~0.01%
    UNEXPIRED    ~51 MB    ~0.18%
    EXPIRED      ~28 GB   ~99.81%

    💡 Critical Insight

    ~99.81% of UNDO (APPS_UNDOTS1) was EXPIRED and fully reusable

    This means:

    • No real space pressure
    • No transaction risk
    • Tablespace is effectively available

    Understanding UNDO Behavior

    UNDO extents follow a lifecycle:

    ACTIVE → UNEXPIRED → EXPIRED → REUSED
    State       Description                    Reusable
    ---------   ----------------------------   --------
    ACTIVE      Used by active transactions    ❌
    UNEXPIRED   Retained for consistency       ⚠️
    EXPIRED     No longer needed               ✅

    UNDO extents are not physically freed but logically reused by Oracle based on demand.


    Why the Alert Was Misleading

    The alert was based on:

    SELECT * FROM dba_free_space;

    Limitation

    • Shows only physically free space
    • Does not reflect reusable UNDO
    • Ignores Oracle’s internal reuse mechanism

    Root Cause

    A long-running concurrent request:

    • Ran for ~30 hours
    • Performed heavy DELETE operations
    • Generated significant UNDO

    During execution:

    • UNDO reached ~99% → real pressure

    After completion:

    • UNDO became EXPIRED
    • Space became reusable
    • Alert persisted → false positive

    When UNDO Is Actually a Problem

    ACTIVE       > 20–30%
    UNEXPIRED    very high
    EXPIRED      very low

    UNDO becomes a real issue only when both EXPIRED and UNEXPIRED extents are exhausted and no reusable space remains.


    Recommended Monitoring Approach

    WITH undo AS (
        SELECT status, SUM(bytes)/1024/1024 mb
        FROM dba_undo_extents
        GROUP BY status
    ),
    total AS (
        SELECT SUM(mb) total_mb FROM undo
    )
    SELECT u.status,
           ROUND(u.mb,2) AS mb,
           ROUND((u.mb / t.total_mb) * 100, 2) AS pct
    FROM undo u, total t;

    Improved Alert Logic

    • ACTIVE > 20% → Critical
    • UNEXPIRED > 80% → Warning
    • Otherwise → Normal
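
    The improved logic above can be expressed as a small classification function; the inputs are the ACTIVE and UNEXPIRED percentages from the state query, and the thresholds are the ones just listed:

```shell
# Sketch of the state-based alert logic: classify on extent states,
# not raw free space.

undo_state_alert() {
    active_pct=$1      # ACTIVE share of undo, as an integer percentage
    unexpired_pct=$2   # UNEXPIRED share, as an integer percentage
    if [ "$active_pct" -gt 20 ]; then
        echo "CRITICAL"
    elif [ "$unexpired_pct" -gt 80 ]; then
        echo "WARNING"
    else
        echo "NORMAL"
    fi
}

undo_state_alert 0 0     # APPS_UNDOTS1 during this incident: NORMAL
undo_state_alert 25 60   # heavy active undo: CRITICAL
undo_state_alert 5 85    # retention pressure: WARNING
```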

    Key Takeaways

    • UNDO is state-driven, not space-driven
    • EXPIRED undo = reusable capacity
    • % used is misleading
    • Always analyze at PDB level in multitenant setups
    • Monitoring must align with Oracle internals

    Conclusion

    UNDO tablespace monitoring must evolve from:

    ❌ Static space-based metrics
    ➡️
    ✅ Dynamic lifecycle-based analysis

    A tablespace showing 99% utilization can still be completely healthy if most of its extents are reusable.


    Senior-Level Insight

    “UNDO is not a storage problem — it is a lifecycle problem. The difference between false alarms and accurate diagnosis lies in understanding how Oracle transitions undo extents.”


    Final Thought

    👉 Don’t ask:
    “How full is UNDO?”

    👉 Ask:
    “How much of it is actually in use?”


    Author

    Syed Anwar Ahmed
    Oracle Apps DBA | Oracle EBS | Performance & Troubleshooting

  • When ADOP Remembers Too Much: Fixing Patch Failures Caused by Stale Metadata in Oracle EBS

    During an Oracle E-Business Suite ADOP patching cycle in a multi-node environment, the apply phase failed on one node while completing successfully on others. Despite retries — including downtime mode — the issue persisted, pointing to a deeper inconsistency within the patching framework.


    Symptoms Observed

    • ADOP session status: FAILED
    • Patch applied successfully on some nodes, failed on admin node
    • Repeated failures even with restart=no, abandon=yes, and downtime mode
    • No immediate actionable error from standard logs

    Timeline of Events

    T0 -- Patch execution initiated (ADOP apply phase)
    T1 -- Failure observed on admin node
    T2 -- Retry using downtime mode -- Failure persists
    T3 -- ADOP session review shows inconsistent state
    T4 -- Internal metadata tables analyzed
    T5 -- Cleanup performed (tables + restart directory)
    T6 -- Patch re-executed -- Success across all nodes

    Investigation

    Step 1: Check ADOP Session State

    Query the ADOP session status to understand the current state across all nodes:

    -- Check current ADOP session status
    SELECT session_id, node_name, phase, status,
           start_date, end_date
    FROM applsys.ad_adop_sessions
    ORDER BY start_date DESC;
    
    -- Check apply phase status per node
    SELECT s.session_id, n.node_name, p.phase_code,
           p.status, p.start_date, p.end_date
    FROM applsys.ad_adop_sessions s,
         applsys.ad_adop_session_phases p,
         applsys.fnd_nodes n
    WHERE s.session_id = p.session_id
    AND p.node_id = n.node_id
    ORDER BY p.start_date DESC;

    The existing session showed status FAILED with the apply phase partially completed — a clear indicator of inconsistent execution state across nodes.

    Step 2: Check adalldefaults.txt

    Reviewed the defaults file for any relevant configuration:

    cat $APPL_TOP/admin/$TWO_TASK/adalldefaults.txt | grep -i missing
    # Key parameter found:
    # MISSING_TRANSLATED_VERSION = No

    Modifying and retrying with this parameter had no impact, confirming the issue was not translation-related.

    Step 3: Check Install Processes Table

    -- Check for stale entries in FND_INSTALL_PROCESSES
    SELECT COUNT(*) FROM applsys.fnd_install_processes;
    
    -- View stale entries in detail
    SELECT process_status, process_name, last_update_date
    FROM applsys.fnd_install_processes
    ORDER BY last_update_date DESC;
    
    -- Check AD_DEFERRED_JOBS
    SELECT COUNT(*) FROM applsys.ad_deferred_jobs;
    SELECT * FROM applsys.ad_deferred_jobs;

    Observation: FND_INSTALL_PROCESSES contained stale entries from the failed session. AD_DEFERRED_JOBS was empty.


    Root Cause

    The failure was caused by stale and inconsistent ADOP metadata tables — specifically APPLSYS.FND_INSTALL_PROCESSES and APPLSYS.AD_DEFERRED_JOBS. ADOP internally relies on these tables to track patch progress checkpoints, deferred job execution, and restart state management. When these tables retain entries from failed or incomplete sessions, ADOP assumes an incorrect execution state, leading to patch reconciliation failure, apply phase breakdown, and node-level inconsistencies.


    Resolution Steps

    Step 1: Backup Critical Tables

    -- Always backup before any cleanup
    CREATE TABLE applsys.fnd_install_processes_bak AS
    SELECT * FROM applsys.fnd_install_processes;
    
    CREATE TABLE applsys.ad_deferred_jobs_bak AS
    SELECT * FROM applsys.ad_deferred_jobs;
    
    -- Verify backups
    SELECT COUNT(*) FROM applsys.fnd_install_processes_bak;
    SELECT COUNT(*) FROM applsys.ad_deferred_jobs_bak;

    Step 2: Drop Stale Metadata Tables

    Dropping these tables forces ADOP to rebuild clean metadata during the next run:

    DROP TABLE applsys.fnd_install_processes;
    DROP TABLE applsys.ad_deferred_jobs;

    Step 3: Reset the Restart Directory

    The restart directory can silently preserve failure states. Back it up and create a fresh one:

    cd $APPL_TOP/admin/$TWO_TASK
    
    # Backup existing restart directory
    mv restart restart_bkp_$(date +%Y%m%d)
    
    # Create fresh restart directory
    mkdir restart
    
    # Verify
    ls -la | grep restart

    Step 4: Re-run the Patch

    adop phase=apply \
         patches=<patch_id> \
         restart=no \
         abandon=yes \
         apply_mode=downtime

    The patch completed successfully across all nodes after the metadata cleanup.


    Before vs After

    Component          Before Fix   After Fix
    ----------------   ----------   ----------
    ADOP Session       Failed       Successful
    Node Consistency   Partial      Full
    Restart Behavior   Stuck        Clean
    Patch Execution    Incomplete   Completed

    Key Takeaways

    • ADOP is state-driven — even when logs appear clean, internal metadata drives execution decisions
    • Partial success is a clue — if some nodes succeed and one fails, focus on local metadata, not the patch itself
    • The restart directory matters — it can silently preserve failure states and must be validated before retrying
    • Downtime mode is not a fix-all — even in downtime, ADOP still reads metadata tables; corruption persists unless cleaned
    • Always backup before cleanup — never drop tables without creating a backup first

    When NOT to Use This Approach

    Avoid applying this fix if the issue is caused by missing database patches (ETCC warnings), file system or permission issues, incorrect patch sequencing, or environment misconfiguration. Always validate the root cause before performing any metadata cleanup.


    This scenario highlights a subtle but critical behavior in ADOP — sometimes patch failures are not caused by the patch itself, but by what the system remembers about past attempts. By resetting stale metadata, we allow ADOP to re-evaluate the environment cleanly, leading to successful execution.

    Have questions or faced a similar issue? Reach out at sdanwarahmed@gmail.com.

  • Oracle EBS Performance Issue: Inactive Forms Session Holding TX Locks (RCA and Resolution)

    In Oracle E-Business Suite (EBS) environments, performance issues are often attributed to high workload or system resource constraints. However, some of the most critical slowdowns originate from less obvious sources — inactive sessions holding uncommitted transactions. This post walks through a real-world production incident where an inactive Oracle Forms session caused cascading blocking across multiple users due to TX row-level locks.


    Observed Symptoms

    • Oracle Forms screens becoming unresponsive in Order Management and Shipping modules
    • Concurrent programs stuck in running state
    • Increased database wait event: enq: TX - row lock contention

    Investigation

    Analysis of v$session, v$transaction, and dba_wait_chains revealed a single inactive Oracle Forms session (frmweb) holding an active transaction with multiple downstream sessions waiting on TX row lock contention.

    -- Identify blocking sessions
    SELECT blocking_session, sid, wait_class, event
    FROM v$session
    WHERE blocking_session IS NOT NULL;
    
    -- Detect inactive sessions with active transactions
    SELECT s.sid, s.serial#, s.status, s.program,
           s.username, t.start_time,
           ROUND(s.last_call_et/3600,2) hrs_inactive
    FROM v$session s, v$transaction t
    WHERE s.saddr = t.ses_addr
    AND s.status = 'INACTIVE'
    ORDER BY hrs_inactive DESC;
    
    -- Analyze full wait chain
    SELECT * FROM dba_wait_chains;

    The root blocking session showed STATUS = INACTIVE and EVENT = 'SQL*Net message from client' but had an active transaction in v$transaction — confirming it was idle at the application level but actively holding locks at the database level.


    Root Cause

    An Oracle Forms session executed a SELECT ... FOR UPDATE NOWAIT on WSH_DELIVERY_DETAILS, then became idle without committing or rolling back. This held exclusive row locks that blocked other sessions attempting to access the same rows, creating a cascading blocking chain.

    -- The problematic SQL pattern
    SELECT *
    FROM WSH_DELIVERY_DETAILS
    WHERE ROWID = :B1
    FOR UPDATE NOWAIT;

    Resolution

    The root blocking session was identified, verified to have no active business transactions, approvals were obtained, and the session was terminated:

    -- Kill blocking session (only after full validation and approval)
    ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;

    Locks were released immediately, the blocking chain resolved, and application responsiveness was restored.


    Preventive Measures

    • Implement idle session timeout policies
    • Educate users on proper transaction handling in Oracle Forms
    • Review custom code using FOR UPDATE — keep transactions short and commit promptly
    • Monitor long-running and idle transactions proactively
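
    The last measure reduces to a simple escalation rule. This is a sketch under stated assumptions: the 2-hour limit is an example policy, not from the incident, and hrs_inactive corresponds to last_call_et/3600 for sessions that also appear in v$transaction:

```shell
# Sketch of proactive idle-transaction monitoring: escalate when a session
# that still holds an open transaction has been idle past a policy limit.

idle_txn_breach() {
    hrs_inactive=$1   # rounded hours idle (v$session.last_call_et/3600)
    limit_hrs=$2      # site policy, e.g. 2 hours
    if [ "$hrs_inactive" -ge "$limit_hrs" ]; then
        echo "ESCALATE"
    else
        echo "OK"
    fi
}

idle_txn_breach 5 2   # idle 5h with an open transaction: ESCALATE
idle_txn_breach 1 2   # within policy: OK
```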

    Key Takeaways

    • An inactive session can still hold active transactions and critical locks
    • Always identify the root blocker — intermediate sessions are symptoms, not the cause
    • Application-level inactivity does not mean database-level inactivity
    • In Oracle EBS, the most disruptive issues are often caused by inactive sessions holding uncommitted transactions

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Database Session Spike Mystery: When Connection Pools Collide

    In a production Oracle database environment, a sudden spike in session count exceeding 1000+ sessions triggered alerts and concern. Interestingly, the system recovered automatically without any database-level intervention. At first glance this appeared to be a database issue — but deeper analysis revealed a different story.


    The Incident

    An automated alert reported session count exceeding threshold (1000+), with the majority in INACTIVE state from middleware connection pool accounts. Despite the spike there were no blocking sessions, no performance degradation, and no database errors.

    -- Quick session count check
    SELECT COUNT(*) FROM gv$session;
    -- Result: 216 (already returning to normal)
    
    -- Session breakdown by status
    SELECT status, COUNT(*) cnt
    FROM gv$session
    GROUP BY status
    ORDER BY cnt DESC;

    Root Cause

    Multiple production mid-tier servers created new connection pools within the same time window. The new pools opened fresh database sessions while the existing pools kept their sessions alive (INACTIVE) pending graceful termination — resulting in a temporary overlap:

    Old Sessions (Inactive) + New Sessions (Active) = Session Surge

    As older pools were cleaned up, inactive sessions terminated automatically and the count returned to baseline. This was not a database problem — it was connection pool lifecycle behavior in the mid-tier layer.


    Recommendations

    • Stagger connection pool refresh across mid-tier servers to avoid simultaneous spikes
    • Monitor inactive session trends to detect abnormal accumulation early
    • Configure appropriate idle timeout, maximum pool size, and session reuse settings

    Key Takeaways

    • Not all session spikes are database problems — check middleware behavior first
    • High session count does not necessarily indicate database stress
    • Transient issues still require analysis as they reveal architectural inefficiencies
    • Database alerts can originate from upstream connection management behavior

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes

    Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.


    Observed Symptoms

    ORA-00031: session marked for kill
    Thread 1 cannot allocate new log
    Private strand flush not complete

    Phase 1: Interpreting ORA-00031 Correctly

    ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.


    Phase 2: Identifying the True Performance Signal

    The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.


    Phase 3: Evidence-Based Analysis (Read-Only)

    Redo switch frequency was analyzed to validate system behavior:

    SELECT
        TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
        COUNT(*) AS switches
    FROM v$log_history
    WHERE first_time > SYSDATE - 1
    GROUP BY TRUNC(first_time, 'HH24')
    ORDER BY 1;

    Findings

    Metric                Observation
    -------------------   -----------------------------------
    Average Switch Rate   5–7 per hour
    Peak Rate             8–10 per hour during business hours
    Off-Peak Rate         1–3 per hour

    A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.


    Why No Changes Were Made

    In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.


    Recommendations

    • Continuously monitor redo switch frequency during peak windows
    • Use collected data to justify future redo log sizing via change management
    • Avoid unnecessary intervention when behavior is transient and non-impacting
    • Distinguish informational alert log messages from actionable errors
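
    The first recommendation can be automated around the v$log_history query shown earlier. The threshold below is an illustrative assumption calibrated to this system's measured baseline (5–7 switches per hour), not an Oracle-documented limit:

```shell
# Sketch of a switches-per-hour check: flag sustained rates well above
# the measured baseline for investigation, not for intervention.

switch_rate_alert() {
    switches_per_hour=$1
    if [ "$switches_per_hour" -gt 12 ]; then
        echo "INVESTIGATE"
    else
        echo "NORMAL"
    fi
}

switch_rate_alert 6    # typical daytime rate here: NORMAL
switch_rate_alert 20   # sustained high rate: INVESTIGATE
```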

    Key Takeaways

    • ORA-00031 is expected and harmless — it confirms session termination
    • Redo allocation waits are transient under sustained load
    • Proper analysis prevents unnecessary production intervention
    • Not all alert log warnings indicate failure — some are early signals of workload growth
    • The goal is not to eliminate every alert, but to understand which ones matter

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Resolving Critical FRA Space Issue Using RMAN Optimization (Real-World DBA Scenario)

    During routine monitoring, a critical issue was identified where the Fast Recovery Area (FRA) utilization reached 95%, putting the database at risk of ORA-19809, archiver process failure, and potential database hang. Despite having an RMAN retention policy configured, the FRA continued to grow uncontrollably.


    Investigation

    -- Check FRA usage
    SELECT name, space_limit/1024/1024/1024 limit_gb,
           space_used/1024/1024/1024 used_gb,
           ROUND(space_used/space_limit*100,2) pct_used
    FROM v$recovery_file_dest;
    
    -- Check what is consuming space
    -- (v$recovery_area_usage; the pre-11g name v$flash_recovery_area_usage
    --  still works as a synonym)
    SELECT file_type, percent_space_used, number_of_files
    FROM v$recovery_area_usage
    ORDER BY percent_space_used DESC;
    
    -- Review RMAN configuration (run from the RMAN prompt, not SQL*Plus)
    SHOW ALL;

    Findings: backup sets were consuming most of the FRA space. RMAN had a 7-day retention policy configured, but backup optimization was disabled and no archivelog deletion policy was set.


    Root Cause

    The retention policy was defined, but obsolete backups were never actually removed: nothing scheduled DELETE OBSOLETE, backup optimization was disabled (so unchanged files were backed up repeatedly), and the archivelog lifecycle was uncontrolled. The result was continuous accumulation of backup sets and archivelogs.


    Immediate Fix

    -- Run RMAN cleanup
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE EXPIRED BACKUP;
    RMAN> DELETE OBSOLETE;
    
    -- Verify FRA usage after cleanup
    SELECT ROUND(space_used/space_limit*100,2) pct_used
    FROM v$recovery_file_dest;

    FRA utilization reduced from 95% to 83% after cleanup.


    Configuration Fixes

    -- Enable backup optimization
    CONFIGURE BACKUP OPTIMIZATION ON;
    
    -- Set archivelog deletion policy
    CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO DISK;
    
    -- Enable compressed backups
    CONFIGURE DEVICE TYPE DISK BACKUP TYPE TO COMPRESSED BACKUPSET;
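    The new settings can be confirmed immediately from the RMAN prompt before relying on them:

    RMAN> SHOW BACKUP OPTIMIZATION;
    RMAN> SHOW ARCHIVELOG DELETION POLICY;
    RMAN> SHOW DEVICE TYPE;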

    Automation

    Scheduled daily execution of DELETE OBSOLETE via cron:

    -- Add to RMAN backup script
    DELETE NOPROMPT OBSOLETE;
    
    # Crontab entry for daily cleanup
    0 2 * * * /path/to/rman_cleanup.sh >> /tmp/rman_cleanup.log 2>&1
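    One possible shape of the cleanup script invoked by cron is sketched below. The script name comes from the crontab entry above, but everything else (ORACLE_HOME, ORACLE_SID, the exact RMAN commands) is an assumption to adapt to your environment; CROSSCHECK ARCHIVELOG is included only as an optional extra step:

    #!/bin/sh
    # rman_cleanup.sh -- hedged sketch; set ORACLE_HOME/ORACLE_SID for your host
    export ORACLE_HOME=/u01/app/oracle/product/19c/db_1   # assumed path
    export ORACLE_SID=DEVCDB                              # assumed SID
    export PATH=$ORACLE_HOME/bin:$PATH

    rman target / <<EOF
    CROSSCHECK BACKUP;
    CROSSCHECK ARCHIVELOG ALL;
    DELETE NOPROMPT EXPIRED BACKUP;
    DELETE NOPROMPT OBSOLETE;
    EXIT;
    EOF

    NOPROMPT is essential here: without it, RMAN waits for interactive confirmation and the cron job hangs.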

    Key Takeaways

    • RMAN retention policy does not enforce deletion automatically — you must schedule DELETE OBSOLETE
    • Backup lifecycle management must include automation
    • Monitor FRA usage proactively to prevent saturation
    • Enable backup optimization to avoid redundant data
    • Set archivelog deletion policy to control archivelog lifecycle

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn