Tag: database

  • Applying Oracle 19c Apr 2026 CPU Patch (19.31) on a Single Instance CDB

    In this article, we are going to demonstrate the steps to apply the Oracle April 2026 Critical Patch Update (CPU) on a Single Instance Oracle 19c Container Database (CDB). Here we will apply combo patch 39062931 which contains 39034528 (Database Apr 2026 Release Update 19.31.0.0.260421) and 38906621 (OJVM Component Release Update 19.31.0.0.260421). We also cover an important real-world scenario where a previously applied overlay patch 34672698 (ORA-00800 fix) conflicted with the Apr 2026 RU — confirming that the fix is now natively included in the RU itself.

    Here is the high-level plan to apply patch 39062931 (19.31.0.0.260421):

    • Go through the README documents of patch IDs before proceeding with patch apply.
    • Download the required patches from My Oracle Support.
    • Upgrade OPatch utility to the latest version as required by the patch README.
    • Perform pre-checks at CDB and PDB level — invalid objects, patch history, registry components.
    • Take Oracle Home TAR backup.
    • Blackout in OEM and disable crontab jobs before maintenance.
    • Shutdown database and listener.
    • Apply DB Release Update (39034528).
    • Apply OJVM Release Update (38906621).
    • Validate overlay patch (34672698) — check if still required post RU upgrade.
    • Start database and listener.
    • Execute datapatch on the database and validate.
    • Uncomment crontab and remove OEM blackout.
    • Perform post-checks and compare with pre-check data.

    Environment Details:

    COMPONENTS                DETAILS
    Oracle Home               /u01/app/oracle/product/19c/db_1
    DB Version                19.30.0.0.0 (Jan 2026 RU)
    OPatch Version (Before)   12.2.0.1.49
    OPatch Version (After)    12.2.0.1.51
    DB Type                   Single Instance, Multitenant (CDB+PDB)
    CDB Name                  DEVCDB
    PDBs                      DEVPDB1, DEVPDB2
    OS                        Linux x86-64
    Combo Patch               39062931 (19.31.0.0.260421)
    DB RU Patch               39034528
    OJVM Patch                38906621
    Overlay Patch             34672698 (validated — not reapplied)

    1. Pre-Checks

    1.1: Capture invalid objects at CDB level.

    sqlplus / as sysdba
    
    set lines 230
    set pages 100
    col OBJECT_NAME for a40
    col OBJECT_TYPE for a40
    col OWNER for a20
    select OBJECT_NAME, OBJECT_TYPE, OWNER,
           to_char(LAST_DDL_TIME,'DD-MON-YYYY HH24:MI:SS'), status
    from dba_objects
    where status='INVALID'
    order by owner;
    no rows selected

    1.2: Capture patch history from dba_registry_sqlpatch.

    select patch_id, action, status, action_time, description
    from dba_registry_sqlpatch;
    PATCH_ID ACTION STATUS ACTION_TIME DESCRIPTION
    38523609 APPLY SUCCESS 03-FEB-26 OJVM RELEASE UPDATE: 19.30.0.0.260120
    38632161 APPLY SUCCESS 03-FEB-26 Database Release Update: 19.30.0.0.260120

    44 rows selected.

    1.3: Capture DB registry components.

    SELECT COMP_ID, comp_name, version, status FROM dba_registry;
    COMP_ID COMP_NAME VERSION STATUS
    CATALOG Oracle Database Catalog Views 19.30.0.0.0 VALID
    CATPROC Oracle Database Packages and Types 19.30.0.0.0 VALID
    JAVAVM JServer JAVA Virtual Machine 19.30.0.0.0 VALID
    XML Oracle XDK 19.30.0.0.0 VALID
    XDB Oracle XML Database 19.30.0.0.0 VALID
    ORDIM Oracle Multimedia 19.30.0.0.0 VALID
    CONTEXT Oracle Text 19.30.0.0.0 VALID
    … 15 rows total. All components VALID.

    1.4: Connect to each PDB and capture invalid objects as baseline.

    alter session set container=DEVPDB1;
    select OBJECT_NAME, OBJECT_TYPE, OWNER, status
    from dba_objects where status='INVALID'
    order by owner;
    📌 Note: Document all pre-existing invalid objects at PDB level before patching. These serve as a baseline for post-patch comparison. Do not attempt to fix pre-existing invalid objects before patching.
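
    A simple way to keep that baseline is to spool the per-PDB output to a file for later comparison. A minimal sketch (the spool file name is illustrative; repeat for each PDB):

    alter session set container=DEVPDB1;
    spool invalid_objects_DEVPDB1_pre.txt
    select OBJECT_NAME, OBJECT_TYPE, OWNER, status
    from dba_objects where status='INVALID'
    order by owner;
    spool off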

    1.5: Capture current patch inventory from OS.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    34672698 ORA-00800 SOFT EXTERNAL ERROR, ARGUMENTS [SET PRIORITY FAILED], [VKTM], DISM(16)
    38523609 OJVM RELEASE UPDATE: 19.30.0.0.260120 (38523609)
    38632161 Database Release Update: 19.30.0.0.260120 (38632161)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.

    2. Download Patches

    2.1: Create a dedicated directory for April 2026 patches.

    mkdir -p /u01/app/oracle/patches/APR_2026
    cd /u01/app/oracle/patches/APR_2026

    2.2: Download OPatch (patch 6880880).

    ⚠️ Important: Always download the Linux-x86-64 version of OPatch, not the generic LINUX (32-bit) version. The 32-bit LINUX zip will cause oui/lib/linux directory errors during patch apply.
    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p6880880_190000_Linux-x86-64.zip

    2.3: Download Combo Patch and Overlay Patch.

    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p39062931_190000_Linux-x86-64.zip
    
    wget --user=<MOS_EMAIL> --password='<MOS_PASSWORD>' --no-check-certificate -c \
         "<DOWNLOAD_URL_FROM_MOS>" -O p34672698_1930000DBRU_Linux-x86-64.zip

    2.4: Unzip patches and verify structure.

    unzip p39062931_190000_Linux-x86-64.zip
    unzip p34672698_1930000DBRU_Linux-x86-64.zip
    
    ls -l 39062931/
    drwxr-x---. 4 oracle oinstall  6144 38906621    -- OJVM RU
    drwxr-x---. 4 oracle oinstall  6144 39034528    -- DB RU
    -rw-r--r--. 1 oracle oinstall   118 README.html

    3. Upgrade OPatch

    3.1: Check current OPatch version.

    [oracle@dbhost ~]$ $ORACLE_HOME/OPatch/opatch version
    OPatch Version: 12.2.0.1.49
    OPatch succeeded.

    The April 2026 RU requires OPatch 12.2.0.1.51 or higher. We need to upgrade.

    3.2: Backup and replace OPatch.

    mv $ORACLE_HOME/OPatch $ORACLE_HOME/OPatch_old_12.2.0.1.49
    unzip p6880880_190000_Linux-x86-64.zip -d $ORACLE_HOME

    3.3: Verify new OPatch version.

    [oracle@dbhost ~]$ $ORACLE_HOME/OPatch/opatch version
    OPatch Version: 12.2.0.1.51
    OPatch succeeded.
    💡 Tip: If OPatch fails with Invalid maximum heap size: -Xmx5120m, set the following before retrying:

    export OPATCH_JRE_MEMORY_OPTIONS="-Xmx1024m"

    Add this to ~/.bash_profile to make it permanent.


    4. Pre-Patching Activities

    4.1: Comment out crontab entries to prevent jobs from running during maintenance.

    [oracle@dbhost ~]$ (/usr/bin/crontab -l | /bin/sed 's/^/##APR_2026##&/g' | /usr/bin/crontab)

    4.2: Place OEM blackout if OEM is configured for this target.
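
    If the host runs an OEM agent, a node-level blackout can be placed from the agent home. A minimal sketch, assuming the agent-side emctl is in the PATH; the blackout name APR2026_PATCHING is illustrative:

    emctl start blackout APR2026_PATCHING -nodeLevel
    emctl status blackout

    # Stop it during post-patching activities (step 10.2)
    emctl stop blackout APR2026_PATCHING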

    📌 Note: If the environment has application-level database triggers, disable them before shutting down the database. Consult the application team before making any trigger changes. Re-enable them after patching is complete.

    5. Shutdown Database and Listener

    5.1: Shutdown the database and listener, and take an Oracle Home TAR backup.

    sqlplus / as sysdba
    SQL> shutdown immediate;
    Database closed.
    Database dismounted.
    ORACLE instance shut down.
    SQL> exit;
    
    lsnrctl stop LISTENER
    
    # Oracle Home TAR backup (run in background)
    nohup tar -zcvf /u01/app/oracle/patches/oracle_home_APR2026.tar.gz $ORACLE_HOME &

    6. Apply DB Release Update (39034528)

    6.1: Set PATH and run the prerequisite conflict check.

    export PATH=$ORACLE_HOME/OPatch:$PATH
    cd /u01/app/oracle/patches/APR_2026/39062931/39034528
    
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    ZOP-47: Superset Patch 39034528 has Subset Patch 38632161 which has overlay patches
    [34672698] and these overlay patches conflict with Superset Patch.
    OPatch failed with error code 73
    ⚠️ Action Required: The prereq check identified that overlay patch 34672698 conflicts with the new DB RU. Roll it back before applying the DB RU.

    6.2: Rollback overlay patch 34672698.

    opatch rollback -id 34672698
    Patch 34672698 successfully rolled back.
    OPatch succeeded.

    6.3: Re-run prereq check after rollback.

    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    Prereq "checkConflictAgainstOHWithDetail" passed.
    OPatch succeeded.

    6.4: Apply the DB Release Update.

    opatch apply
    Patch 39034528 successfully applied.
    Sub-set patch [38632161] has become inactive due to the application of a super-set patch [39034528].
    OPatch succeeded.

    6.5: Verify the patch is applied.

    opatch lsinventory | grep 39034528
    39034528  Database Release Update : 19.31.0.0.260421

    7. Apply OJVM Release Update (38906621)

    7.1: Run prereq check and apply the OJVM patch.

    export PATH=$ORACLE_HOME/OPatch:$PATH
    cd /u01/app/oracle/patches/APR_2026/39062931/38906621
    
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    Prereq "checkConflictAgainstOHWithDetail" passed.
    OPatch succeeded.
    
    opatch apply
    Patch 38906621 successfully applied.
    Sub-set patch [38523609] has become inactive due to the application of a super-set patch [38906621].
    OPatch succeeded.

    7.2: Verify both patches are listed.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    39034528 Database Release Update: 19.31.0.0.260421 (39034528)
    38906621 OJVM RELEASE UPDATE: 19.31.0.0.260421 (38906621)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.

    8. Overlay Patch (34672698) — Conflict with Apr 2026 RU

    The overlay patch 34672698 was previously applied on top of the Jan 2026 RU to fix ORA-00800: SOFT EXTERNAL ERROR, ARGUMENTS: [SET PRIORITY FAILED], [VKTM], DISM(16). It was rolled back in Step 6. Before reapplying, always run the prereq check.

    📌 Note on oradism permissions: Some environments require a temporary oradism ownership change to oracle before applying the overlay patch. Always check the current permissions first:

    ls -l $ORACLE_HOME/bin/oradism

    If oradism is already oracle-owned (e.g., -rwxr-x---. 1 oracle oinstall), no root permission change is required. Only apply the chown/chmod steps if oradism is root-owned with setuid bit.
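
    For reference, when oradism is root-owned with the setuid bit, the temporary change typically looks like the sketch below; verify against your patch README, run the chown/chmod as root, and record the original ownership and permissions before touching anything:

    # Record the current state first
    ls -l $ORACLE_HOME/bin/oradism

    # As root: hand the binary to oracle for the duration of the patch apply
    chown oracle:oinstall $ORACLE_HOME/bin/oradism

    # ... apply the overlay patch as the oracle user ...

    # As root: restore root ownership and the setuid bit afterwards
    chown root:oinstall $ORACLE_HOME/bin/oradism
    chmod 4750 $ORACLE_HOME/bin/oradism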

    8.1: Run prereq check before reapplying the overlay patch.

    cd /u01/app/oracle/patches/APR_2026/34672698
    opatch prereq CheckConflictAgainstOHWithDetail -ph ./
    
    ZOP-40: The patch(es) has conflicts with other patches installed in the Oracle Home.
    Following patches have conflicts: 34672698, 39034528
    OPatch failed with error code 73
    ✔ Expected Result: The prereq confirms overlay patch 34672698 conflicts with Apr 2026 DB RU (39034528). Oracle has included the ORA-00800 fix natively in the Apr 2026 RU. The overlay patch is no longer required and must NOT be reapplied.
    ⚠️ Important: This is expected behaviour when upgrading RUs. An overlay patch applied on a previous RU may become redundant when the next RU natively includes the same fix. Always run prereq check before reapplying overlay patches after an RU upgrade.

    8.2: Verify the overlay patch is absent from inventory.

    opatch lsinventory | grep 34672698
    [oracle@dbhost 34672698]$
    ✔ Confirmed: No output from the grep confirms the overlay patch is correctly absent from the Oracle Home.

    9. Start Database and Run Datapatch

    9.1: Start the database and open all PDBs.

    sqlplus / as sysdba
    SQL> startup;
    ORACLE instance started.
    Database mounted.
    Database opened.
    
    SQL> alter pluggable database all open;
    Pluggable database altered.
    
    SQL> exit;
    
    lsnrctl start LISTENER

    9.2: Run datapatch to apply the SQL portion of the patches.

    cd $ORACLE_HOME/OPatch
    ./datapatch -verbose
    SQL Patching tool version 19.31.0.0.0 Production on Tue May 5 13:17:39 2026
    
    Installation queue:
      For CDB$ROOT PDB$SEED DEVPDB1 DEVPDB2:
        Rollback: 38523609 (OJVM 19.30)
        Apply: 39034528 (DB RU 19.31)
        Apply: 38906621 (OJVM 19.31)
    
    Patch 38523609 rollback (CDB$ROOT): SUCCESS
    Patch 39034528 apply (CDB$ROOT): SUCCESS
    Patch 38906621 apply (CDB$ROOT): SUCCESS
    Patch 38523609 rollback (DEVPDB1): SUCCESS
    Patch 39034528 apply (DEVPDB1): SUCCESS
    Patch 38906621 apply (DEVPDB1): SUCCESS
    Patch 38523609 rollback (DEVPDB2): SUCCESS
    Patch 39034528 apply (DEVPDB2): SUCCESS
    Patch 38906621 apply (DEVPDB2): SUCCESS
    
    SQL Patching tool complete on Tue May 5 13:24:18 2026
    📌 Note: Datapatch applies the SQL changes to both CDB and all open PDBs. Wait for it to complete fully before proceeding. Do not interrupt the process.

    10. Post-Patching Activities

    10.1: Uncomment crontab entries.

    [oracle@dbhost ~]$ (/usr/bin/crontab -l | /bin/sed 's/##APR_2026##//g' | /usr/bin/crontab)

    10.2: Remove OEM blackout if applicable.

    📌 Note: If application-level database triggers were disabled before patching, re-enable them now after confirming datapatch completed successfully. Consult the application team before enabling.

    11. Post-Checks

    11.1: Verify patch registry — all entries should show SUCCESS.

    select patch_id, action, status, action_time, description
    from dba_registry_sqlpatch
    order by action_time;
    PATCH_ID ACTION STATUS ACTION_TIME DESCRIPTION
    38523609 ROLLBACK SUCCESS 05-MAY-26 01:20:47 OJVM RELEASE UPDATE: 19.30.0.0.260120
    38906621 APPLY SUCCESS 05-MAY-26 01:21:02 OJVM RELEASE UPDATE: 19.31.0.0.260421
    39034528 APPLY SUCCESS 05-MAY-26 01:21:56 Database Release Update: 19.31.0.0.260421

    11.2: Verify all DB registry components are VALID.

    SELECT COMP_ID, comp_name, version, status FROM dba_registry;
    COMP_ID COMP_NAME VERSION STATUS
    CATALOG Oracle Database Catalog Views 19.31.0.0.0 VALID
    CATPROC Oracle Database Packages and Types 19.31.0.0.0 VALID
    JAVAVM JServer JAVA Virtual Machine 19.31.0.0.0 VALID
    XML Oracle XDK 19.31.0.0.0 VALID
    XDB Oracle XML Database 19.31.0.0.0 VALID
    ORDIM Oracle Multimedia 19.31.0.0.0 VALID
    CONTEXT Oracle Text 19.31.0.0.0 VALID
    … 15 rows total. All components VALID.

    11.3: Confirm final patch inventory.

    [oracle@dbhost ~]$ opatch lspatches
    PATCH_ID DESCRIPTION
    39034528 Database Release Update: 19.31.0.0.260421 (39034528)
    38906621 OJVM RELEASE UPDATE: 19.31.0.0.260421 (38906621)
    29585399 OCW RELEASE UPDATE 19.3.0.0.0 (29585399)
    OPatch succeeded.
    ✔ Confirmed: Overlay patch 34672698 is no longer listed. Its fix is now natively included in the Apr 2026 RU (39034528). This is the expected and correct final state.

    11.4: Compare invalid objects at CDB and PDB level with pre-patch baseline. Compile any new invalid objects introduced by the patch:

    @$ORACLE_HOME/rdbms/admin/utlrp.sql
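
    From the CDB root, the same comparison can be done across all containers in a single query. A minimal sketch using CDB_OBJECTS; counts should return to the pre-patch baseline once utlrp.sql has finished:

    select c.name as container, count(*) as invalid_count
    from cdb_objects o
    join v$containers c on c.con_id = o.con_id
    where o.status = 'INVALID'
    group by c.name
    order by c.name;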

    Key MOS References

    Document Description
    KB869205 Oracle Database 19c Apr 2026 RU Known Issues
    KB137197 OJVM Conditional Rolling Install Details
    244241.1 OPatch Support for RAC Rolling Patches
    293369.1 OPatch Documentation List

    Disclaimer: All server names, hostnames, database names, and environment-specific paths in this post have been anonymized. Steps and outputs are based on real production experience adapted for general use. Always test in a non-production environment before applying patches to production systems.

    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More patching walkthroughs at syedanwarahmedoracle.blog.


    💬 Share Your Experience

    Have you applied Oracle 19c CPU patches and hit a similar overlay patch conflict? Or found that a fix you applied months ago is now natively bundled into the next RU? Your real-world experiences help fellow DBAs tackle the same challenges. Drop a comment below — questions, observations, and feedback are always welcome.

  • P1 Incident: /dbbackup Filesystem 100% Full — How We Traced, Fixed, and Recovered Two Failed RMAN Backups in One Night

    🔴 Incident Overview

    Severity                 P1 — Two production database backups failed
    Environment              Oracle 19c (19.30) on Linux x86-64
    Backup Tool              RMAN with Recovery Catalog
    Backup Volume            /dbbackup — 1TB LVM filesystem
    Databases Affected       DBPRO01 (12:30 failure), DBPRO02 (22:30 failure)
    Total Space Recovered    ~492G (disk went from 100% → 52%)

    1. The Alerts — Two Failures, Same Root Cause

    It started with an RMAN failure at 22:30. The backup script for DBPRO02 fired on schedule and died within 2 minutes. The RMAN log told the story clearly:

    RMAN-03009: failure of backup command on c4 channel at 22:32:13
    ORA-19502: write error on file "/dbbackup/DBPRC02/rman/DiffInc_DBPRC02_4u4mi8bf"
    ORA-27072: File I/O error
    Additional information: 4
    

    Three more channels followed — c1, c2, c3 — all crashing at exactly 22:32:48. When multiple channels fail simultaneously at the same timestamp, it almost always means one thing: the destination filesystem just hit 100%.

    A quick check confirmed it:

    $ df -hP /dbbackup
    Filesystem                        Size  Used Avail Use%
    /dev/mapper/orabkupvg-orabkuplv1 1023G 1020G  3.3G 100%
    

    1TB volume. 3.3G free. Completely full.

    What we did not know yet: backup history would later reveal that DBPRO01 had already failed at 12:30 that same day, 10 hours earlier, for the same reason. Two databases unprotected on the same night.


    2. The Investigation — Folder by Folder

    The first step was understanding what was consuming the disk. One command gave us the top-level picture:

    $ du -sh /dbbackup/*
    744G    DBPRC01
    152G    ColdBackup_11April2026
     41G    DBPRC02
     26G    DBPRC03
     15G    DBPRO01
    7.7G    JAN2026_CPU
    7.7G    OCT2025_CPU
    5.3G    infra_arch
    

    744G inside DBPRC01 alone — 73% of the entire disk. That was our primary suspect.

    Drilling into DBPRC01

    $ du -sh /dbbackup/DBPRC01/rman/* | sort -rh | head -10
    7.5G    DiffInc_DBPRC01_fl4lumij
    7.5G    DiffInc_DBPRC01_eg4lrnh5
    7.5G    DiffInc_DBPRC01_b44lc7uh
    ...
    

    Every single file was a DiffInc_ or ArchivelogAll_ backup piece. No variety. No cleanup. Just backup after backup piling up.

    $ ls /dbbackup/DBPRC01/rman/ | wc -l
    1522
    

    1,522 backup pieces. We checked the oldest and newest:

    Oldest file on disk:  2022-05-07
    Newest file on disk:  2026-04-25
    

    Four years of backup files on disk — or so we thought.


    3. The RMAN Investigation — Where Things Got Interesting

    We connected RMAN to the database and ran the retention check:

    RMAN> SHOW RETENTION POLICY;
    CONFIGURE RETENTION POLICY TO REDUNDANCY 30;
    

    REDUNDANCY 30. This tells RMAN to keep the last 30 complete backup copies of every datafile before considering anything obsolete.

    Next logical step — check what RMAN considers obsolete:

    RMAN> REPORT OBSOLETE;
    no obsolete backups found
    

    Nothing? With 1,522 files on disk?

    We ran CROSSCHECK BACKUP — all 1,693 objects came back AVAILABLE. Then we checked the actual date range RMAN was tracking from the database control file:

    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    OLDEST          NEWEST          TOTAL_PIECES
    03-DEC-2025     25-APR-2026     1541
    

    The control file only tracks pieces from December 2025 onwards — about 5 months. The 2022/2023 files seen on disk were old directories and scripts, not backup pieces. All 1,541 current pieces were legitimate and RMAN considered every one of them necessary under REDUNDANCY 30.

    This was the key insight: RMAN was not broken. The retention policy itself was the problem.


    4. Root Cause — The Architecture Trap

    The deeper investigation revealed something unexpected. Looking at the actual RMAN backup script:

    connect target rman/password@DBPRO01
    ...
    format '/dbbackup/DBPRC01/rman/DiffInc_%d_%u'
    (database);
    ...
    delete obsolete;
    

    DBPRO01 (the production database) was backing up INTO the DBPRC01 directory. The directory names suggested one database but contained another database’s backups entirely. The naming convention was PRO to PRC — production database backups stored in the production-copy directory.

    This pattern existed for all three database pairs on the server. Each production database backed up into its corresponding copy directory.

    The delete obsolete command was in the script — but with REDUNDANCY 30 and weekly Level 0 backups, obsolete only kicks in after 30 complete Level 0 cycles. That is 30 weeks = 7.5 months of retention. Since the current tracking window was only 5 months, delete obsolete ran every night and found absolutely nothing to delete.

    The math:

    Retention policy             REDUNDANCY 30
    Level 0 frequency            Weekly (Sundays)
    Effective retention period   ~30 weeks / 7.5 months
    Backup tracking since        December 2025 (~5 months)
    Result                       delete obsolete finds nothing — ever
    Daily backup size            ~7–7.5G per run
    Total accumulated            744G

    Adding fuel to the fire — the patching activity on April 18 triggered an extra Level 0 backup, followed by the regular Sunday Level 0 on April 19. Two large Level 0 runs (~27G each) within 24 hours wrote the final ~54G that pushed the disk over the edge.


    5. Secondary Findings During Investigation

    OCT2025 CPU Patch Artifacts (7.7G)

    The October 2025 CPU patch files (zip archives + extracted directories) were still sitting in /dbbackup/OCT2025_CPU/. A quick OPatch check confirmed the database had since been patched to 19.30 (January 2026 RU) — the October 2025 patches were fully superseded and rolled back from inventory. Safe to delete immediately.

    $ $ORACLE_HOME/OPatch/opatch lsinventory | grep -E "38291812|38194382"
    # Empty — neither Oct 2025 patch in inventory anymore
    

    5-Year-Old Pre-Migration Export Dumps

    Three directories contained Oracle 11.2.0.4 export dumps from January–March 2021 — taken before the migration to 19c. With the database now running 19.30, these had zero recovery value but occupied ~14G collectively. Flagged for manager approval before deletion.

    Recovery Catalog Version Mismatch

    The original RMAN log flagged this warning:

    PL/SQL package RMAN.DBMS_RCVCAT version 19.11.00.00 in RCVCAT database is not current
    PL/SQL package RMAN.DBMS_RCVMAN version 19.11.00.00 in RCVCAT database is not current
    

    The recovery catalog is running 19.11 packages while the RMAN client is now 19.30. Non-critical tonight but requires UPGRADE CATALOG in the next maintenance window.


    6. The Fix — Emergency Space Recovery

    With management approval obtained, we executed a time-based delete — keeping the last 30 days of backups and removing everything older:

    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    

    This command does three things atomically:

    1. Queries catalog/controlfile for all pieces completed before the cutoff date
    2. Deletes the physical files from disk
    3. Removes the records from RMAN catalog — no orphaned entries, no catalog drift

    The output scrolled for several minutes:

    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/DiffInc1_DBPRC01_6u4jaaih ...
    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/ArchivelogAll_DBPRC01_784jab6o ...
    ...
    Deleted 1072 objects
    

    1,072 backup pieces deleted. Catalog updated. Disk checked:

    BEFORE:  Used 1020G  Avail 3.3G  (100%)
    AFTER:   Used  536G  Avail 488G   (53%)
    

    Then the OCT2025_CPU directory was removed:

    $ rm -rf /dbbackup/OCT2025_CPU/
    $ df -hP /dbbackup
    Used 528G  Avail 495G  (52%)
    

    Final result: 495G free. Disk at 52%.

    Both failed backups were re-submitted immediately and ran successfully in parallel:

    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO01.sh &
    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO02.sh &
    
    $ jobs -l
    [1] Running   nohup sh ...rman_backup_DBPRO01.sh &
    [2] Running   nohup sh ...rman_backup_DBPRO02.sh &
    

    7. Incident Timeline

    12:30 DBPRO01 Level 0 backup fails — ORA-19502/ORA-27072 (disk full)
    22:30 DBPRO02 Level 0 backup fails — same errors, all 4 channels
    23:08 Investigation begins — df -hP /dbbackup confirms 100% full
    23:15 DBPRC01 directory identified as 744G consumer
    23:25 RMAN connected — REDUNDANCY 30 discovered
    23:35 Architecture confirmed — PRO databases backing up into PRC directories
    23:45 Root cause confirmed — 7.5-month retention, delete obsolete finds nothing
    23:50 DELETE BACKUP COMPLETED BEFORE SYSDATE-30 executed
    23:51 1,072 pieces deleted — disk drops to 53%
    23:55 OCT2025_CPU removed — disk at 52%, 495G free
    00:00 Both backup jobs re-submitted and running successfully

    8. Permanent Fix Recommendations

    Fix 1 — Change Retention Policy to RECOVERY WINDOW

    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    

    REDUNDANCY 30 with weekly Level 0s means 7.5 months of retention — far beyond what any production SLA requires. A 14-day recovery window keeps 2 weeks of backups regardless of backup frequency, and delete obsolete will actually find and remove old pieces going forward.

    Fix 2 — Add Pre-Backup Space Check to Cron Script

    #!/bin/bash
    # Pre-backup space check: abort the RMAN run if the backup filesystem is too full.
    # MAILTO is assumed to be set to the DBA alert address in the calling environment.
    BACKUP_FS="/dbbackup"
    THRESHOLD=20        # minimum % free required before starting the backup

    # % free = 100 - Use% (column 5 of the df -hP output)
    AVAIL_PCT=$(df -hP $BACKUP_FS | awk 'NR==2 {gsub(/%/,""); print 100-$5}')

    if [ "$AVAIL_PCT" -lt "$THRESHOLD" ]; then
      echo "ABORT: $BACKUP_FS is ${AVAIL_PCT}% free — below ${THRESHOLD}% threshold" \
        | mailx -s "BACKUP ABORTED: Low space on $BACKUP_FS" $MAILTO
      exit 1
    fi
    

    A failing backup that writes 3G before dying is worse than a backup that never starts — it wastes the last 3G of free space and leaves partial pieces on disk.

    Fix 3 — Upgrade the Recovery Catalog

    RMAN> CONNECT TARGET /
    RMAN> CONNECT CATALOG rman/password@rmancat
    RMAN> UPGRADE CATALOG;
    RMAN> UPGRADE CATALOG;   -- run twice as prompted
    

    The catalog packages (19.11) are well behind the 19.30 RMAN client. Some catalog-dependent operations will start failing if this is left unaddressed.

    Fix 4 — Filesystem Monitoring Alert

    The FRA check scripts already email on FRA usage above 80%. The same pattern should exist for /dbbackup. A simple cron entry checking disk usage every hour with alert at 80% would have caught this days before the disk hit 100%.
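
    A minimal sketch of such a check; the script path, threshold, and mail address are illustrative and should be adapted to local standards:

    #!/bin/bash
    # hourly_dbbackup_check.sh : warn when /dbbackup usage crosses 80%
    FS="/dbbackup"
    WARN=80
    MAILTO="dba-team@example.com"    # illustrative address, replace with the DBA alert list

    USED_PCT=$(df -hP $FS | awk 'NR==2 {gsub(/%/,""); print $5}')

    if [ "$USED_PCT" -ge "$WARN" ]; then
      echo "WARNING: $FS is ${USED_PCT}% used (threshold ${WARN}%)" \
        | mailx -s "Disk space warning on $FS" $MAILTO
    fi

    # Example crontab entry (hourly):
    # 0 * * * * /home/oracle/scripts/hourly_dbbackup_check.sh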


    9. Key Takeaways for Oracle DBAs

    REDUNDANCY N is not always safer than RECOVERY WINDOW. REDUNDANCY 30 with weekly Level 0 backups means 7.5 months of retention — likely far beyond your RPO requirement and a silent space accumulator.

    • Always verify what delete obsolete actually deletes. If it finds nothing to delete every single night, that is a warning sign — not reassurance.
    • Check backup naming conventions carefully. When a directory named DBPRC01 contains DBPRO01 backups, it is easy to investigate the wrong database; the RMAN configuration of the database that actually writes the backups is what controls the cleanup behavior.
    • Patching days generate oversized backups. A Level 0 taken manually on patch day plus the regular Sunday Level 0 the next day equals 2x the normal space consumption in 24 hours. Ensure extra headroom exists going into patch windows.
    • Use DELETE BACKUP COMPLETED BEFORE SYSDATE-N for emergency cleanup — not OS-level rm. RMAN deletes atomically update both the physical files and the catalog, preventing expired/orphaned piece confusion later.
    • Never use rm on RMAN backup pieces directly unless you follow up with CROSSCHECK BACKUP and DELETE EXPIRED BACKUP to sync the catalog.

    10. Commands Reference — Quick Cheat Sheet

    -- Check retention policy
    RMAN> SHOW RETENTION POLICY;
    
    -- Preview what would be deleted (dry run)
    RMAN> REPORT OBSOLETE;
    RMAN> LIST BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Emergency cleanup — delete pieces older than 30 days
    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Standard cleanup based on retention policy
    RMAN> DELETE NOPROMPT OBSOLETE;
    
    -- Sync catalog after any OS-level file operations
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE NOPROMPT EXPIRED BACKUP;
    
    -- Change to time-based retention (recommended)
    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    
    -- Check backup piece date range in control file
    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    -- Check backup history
    SELECT session_key, input_type, status,
           TO_CHAR(start_time,'YYYY-MM-DD HH24:MI:SS') start_time,
           output_bytes_display, time_taken_display
    FROM v$rman_backup_job_details
    ORDER BY start_time DESC;
    

    Conclusion

    What appeared to be a simple disk full incident turned out to involve a multi-database backup architecture, a misconfigured retention policy, and a cleanup mechanism that was technically running correctly but never finding anything to clean. The fix itself — one RMAN command — took under 5 minutes. The real work was the systematic investigation to understand exactly what was safe to delete and why.

    That is Oracle DBA work in a nutshell: the fix is often simple; understanding why it is safe to run is the real job.


    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More incident walkthroughs at syedanwarahmedoracle.blog.

  • Stuck Archiver on Oracle 11g RAC: How a Full FRA Brought Down Archiving — and How We Fixed It Live

    Introduction

    One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.

    This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.

    Environment Overview

    Component                 Details
    Database                  Oracle 11g R2 (11.2.0.4.0)
    Architecture              RAC — 2 Nodes
    Database Role             PRIMARY — READ WRITE, ARCHIVELOG mode
    OS                        Linux (x86_64)
    FRA Location              ASM Diskgroup +ARCHIVELOG
    FRA Size (pre-incident)   199 GB
    FRA Size (post-fix)       400 GB

    The Incident: What We Saw

    The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:

    ORA-00257: archiver error. Connect internal only, until freed.
    ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
    ARCH: Archival stopped, error occurred. Will continue retrying
    ORACLE Instance PRODDB1 - Archival Error

    The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.

    Step 1: Check Redo Log Status

    First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.

    SELECT group#, status, archived, members FROM v$log ORDER BY group#;
    
    GROUP#   STATUS           ARCHIVED   MEMBERS
    ------   ---------------  ---------  -------
    1        ACTIVE           NO         2
    2        CURRENT          NO         2
    3        INACTIVE         YES        2
    4        INACTIVE         YES        2

    Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.

    Step 2: Confirm the Archiver Status

    -- Check both RAC instances
    SELECT inst_id, instance_name, host_name, version,
           database_role, open_mode, log_mode
    FROM gv$instance ORDER BY inst_id;
    
    INST_ID  INSTANCE_N  VERSION       DATABASE_ROLE  OPEN_MODE   LOG_MODE
    -------  ----------  ------------  -------------  ----------  ----------
          1  PRODDB1     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
          2  PRODDB2     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
    
    SELECT instance_name, status, archiver FROM v$instance;
    
    INSTANCE_NAME    STATUS       ARCHIVER
    ---------------- ------------ --------
    PRODDB1          OPEN         STOPPED

    ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.

    Step 3: Check FRA Usage

    SELECT
        space_limit / (1024*1024*1024)       AS limit_gb,
        space_used  / (1024*1024*1024)       AS used_gb,
        space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
        ROUND((space_used / space_limit) * 100, 2) AS pct_used
    FROM v$recovery_file_dest;
    
    LIMIT_GB   USED_GB   RECLAIMABLE_GB   PCT_USED
    --------   -------   --------------   --------
    199        198.99     0.00            100.00

    100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.
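
    One way to run that diskgroup check is a quick query against v$asm_diskgroup (it can be run from the database instance as well as from ASM):

    SELECT name,
           ROUND(total_mb/1024)  AS total_gb,
           ROUND(free_mb/1024)   AS free_gb,
           ROUND((1 - free_mb/total_mb) * 100, 2) AS pct_used
    FROM v$asm_diskgroup
    WHERE name = 'ARCHIVELOG';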

    Step 4: Identify What Was Consuming the FRA

    SELECT file_type,
        ROUND(space_used / (1024*1024*1024), 2)        AS used_gb,
        ROUND(space_reclaimable / (1024*1024*1024), 2) AS reclaimable_gb,
        number_of_files
    FROM v$recovery_area_usage
    ORDER BY space_used DESC;
    
    FILE_TYPE       USED_GB   RECLAIMABLE_GB   NUMBER_OF_FILES
    -----------     -------   --------------   ---------------
    ARCHIVED LOG    38.50     0.05             1,200+
    BACKUP PIECE    10.20     0.07             45
    FLASHBACK LOG    1.30     0.00             12

    Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.

    Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike

    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM v$archived_log
    WHERE dest_id=1
    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
    ORDER BY 1;
    Hour                  Archived Logs   GB Generated
    Daytime (typical)     2–26            < 1.3 GB
    01:00 AM (nightly)    440–446         ~31 GB ⚠
    03:00–04:00 AM        191–411         11–28 GB ⚠
    05:00–06:00 AM        537–796         34–50 GB ⚠

    Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.

    The Resolution — Three Actions

    Action 1: Manually Delete Archive Logs from ASM

    Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:

    su - grid
    asmcmd
    
    ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
    # Recent logs retained to preserve recoverability

    After ASMCMD deletion, the RMAN catalog was resynced:

    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> DELETE EXPIRED ARCHIVELOG ALL;

    Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.

    Action 2: Increase FRA Size Online

    -- Increase FRA from 199GB to 400GB (online, no restart required)
    ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';

    SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:

    ARC1: Archival started
    ARC2: Archival started
    ARC3: Archival started
    ORACLE Instance PRODDB1 - Archival Restarted

    Applications reconnected and transactions resumed within seconds.

    Action 3: Verify Database and Cluster Status

    SELECT instance_name, archiver FROM v$instance;
    -- PRODDB1   STARTED
    
    SELECT round(space_used/1024/1024/1024,2) used_gb,
           round(space_limit/1024/1024/1024,2) limit_gb,
           round((space_used/space_limit)*100,2) used_pct
    FROM v$recovery_file_dest;
    -- 202.80 GB / 400 GB / 50.70%
    
    SELECT group#, status, archived FROM v$log ORDER BY group#;
    -- All groups INACTIVE/YES except CURRENT group

    Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.

    Root Cause Confirmed: Archive Log Backups Were Not Running

    With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.

    Complete Incident Chain:
    ARCH backup silently fails (missing Oracle env vars in cron) → 3,094 archived logs accumulate over multiple nights → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM) → FRA hits 100% (198.99 GB / 199 GB) → ARCn cannot write → Both instances enter INTERMEDIATE state → ORA-00257 → Applications hang → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB → Archiver restarts → Cron fix applied → ARCH backup reruns successfully → FRA space recovered → Cluster fully healthy

    Fix: Correct the Cron Environment

    Broken cron entry:

    05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    Fixed cron entry:

    05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    ARCH Backup Reruns and Completes

    After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:

    . /home/oracle/.bash_profile
    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH

    The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.

    Preventive Measures Put in Place

    1. Fix the cron environment — all RMAN jobs now source ~/.bash_profile before executing.
    2. RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status.
    3. FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted (a minimal sketch follows this list).
    4. FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
    5. RAC: always use SID='*' — and always verify ASM diskgroup free space before increasing the FRA limit.
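
    A minimal sketch of the 70% check referenced in item 3 above; the cron and mail wrapper around it follows the same pattern as the other monitoring scripts and is omitted here:

    SELECT ROUND((space_used / space_limit) * 100, 2) AS pct_used,
           CASE WHEN (space_used / space_limit) * 100 >= 70
                THEN 'ALERT: FRA usage above 70%'
                ELSE 'OK'
           END AS fra_status
    FROM v$recovery_file_dest;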

    Key Takeaways for Oracle DBAs

    • ORA-00257 always points to the FRA being full — check v$recovery_file_dest and v$recovery_area_usage immediately.
    • When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run CROSSCHECK and DELETE EXPIRED to resync the catalog.
    • db_recovery_file_dest_size is dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.
    • Use SID='*' in RAC — applies the parameter change to all instances simultaneously.
    • The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
    • Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
    • Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
    • After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
    • Analyse redo generation patterns — v$archived_log reveals batch spikes that should directly inform FRA sizing decisions.

    Conclusion

    A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.

    The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.

    The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.

    Syed Anwar Ahmed | Oracle Apps DBA

  • Rethinking UNDO Tablespace Monitoring: When 97% Usage Is NOT a Problem

    A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.


    Initial Observations

    Standard checks showed Tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB free space. This strongly suggested the UNDO tablespace was nearly full. However, deeper analysis revealed a different story.


    The Turning Point

    Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:

    SELECT status,
           ROUND(SUM(bytes)/1024/1024,2) MB
    FROM dba_undo_extents
    GROUP BY status;

    Results showed: ACTIVE ~0 MB, UNEXPIRED ~7.3 GB, EXPIRED ~8.6 GB.


    Key Insight

    EXPIRED UNDO extents are fully reusable by Oracle. This means ~55% of the tablespace was immediately reusable and no actual space pressure existed.


    Real Usage Calculation

    SELECT ROUND(
           SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
           / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents;

    Actual usage was only ~45% — not 97%.


    Root Cause of the False Alert

    The monitoring system was using DBA_FREE_SPACE which does not include reusable EXPIRED extents. This is incorrect for UNDO tablespaces and leads to false critical alerts, unnecessary escalations, and wasted DBA effort.


    The Correct Monitoring Approach

    SELECT tablespace_name,
           ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
                 / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents
    GROUP BY tablespace_name;

    Only alert when actual usage exceeds 90% AND long-running transactions exist. The full script is available on GitLab: gitlab.com/sdanwarahmed/oracle-dba-scripts
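
    A minimal sketch of that combined check; the 90% threshold and the one-hour definition of "long-running" are illustrative and should be tuned to your environment:

    SELECT u.actual_used_pct, t.long_running_txns
    FROM (SELECT ROUND(
                   SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
                   / SUM(bytes) * 100, 2) actual_used_pct
          FROM dba_undo_extents) u,
         (SELECT COUNT(*) long_running_txns
          FROM v$transaction
          WHERE start_date < SYSDATE - 1/24) t
    WHERE u.actual_used_pct > 90
      AND t.long_running_txns > 0;
    -- Returning no rows means there is nothing to alert on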


    Key Takeaways

    • UNDO is not a regular tablespace — it has its own lifecycle
    • DBA_FREE_SPACE is misleading for UNDO monitoring
    • Always analyze UNDO extents by status: ACTIVE, UNEXPIRED, EXPIRED
    • Build context-aware monitoring that aligns with Oracle internals
    • Not all “97% full” alerts are real problems — sometimes it’s just the wrong metric

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn  |  GitLab

  • Rethinking UNDO Tablespace Monitoring in Multitenant Environments: A Case Study on APPS_UNDOTS1

    Introduction

    A production alert indicating UNDO tablespace usage at ~99% is typically treated as a critical issue. However, in Oracle Multitenant environments with local UNDO enabled, this metric can be misleading when evaluated using traditional tablespace monitoring techniques.

    This article presents a real-world scenario involving the PDB-level UNDO tablespace APPS_UNDOTS1, where utilization reached ~99%, yet no actual resource contention existed.


    The Incident

    An alert was triggered showing:

    Tablespace              TOTAL_MB     USED_MB     FREE_MB  PCT_USED
    --------------------  ----------  ----------  ----------  --------
    APPS_UNDOTS1           28,863.25   28,681.56      181.69     99.37

    At first glance, this indicated:

    • UNDO nearly full
    • Minimal free space
    • Risk of ORA-30036 / ORA-01555

    Environment Context

    This issue occurred in a Multitenant environment with Local UNDO enabled:

    • UNDO tablespace: APPS_UNDOTS1
    • Scope: PDB-level
    • Each PDB maintains its own UNDO

    UNDO behavior must always be analyzed at the PDB level, not just at the CDB level.
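
    Before drilling into a PDB, it is worth confirming that local UNDO is actually enabled. A quick check:

    SELECT property_name, property_value
    FROM   database_properties
    WHERE  property_name = 'LOCAL_UNDO_ENABLED';
    -- TRUE means each PDB manages its own UNDO tablespace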


    Investigation Approach

    Rather than relying on % used, a state-based analysis was performed.


    Step 1: Active Transactions

    SELECT COUNT(*) FROM v$transaction;

    Observation:

    • Minimal active transactions
    • No ongoing workload pressure

    Step 2: UNDO Extent Analysis

    SELECT status, SUM(bytes)/1024/1024 MB
    FROM dba_undo_extents
    GROUP BY status;

    Key Observation

    STATUS           MB        PCT
    -----------  --------  --------
    ACTIVE          ~3 MB    ~0.01%
    UNEXPIRED      ~51 MB    ~0.18%
    EXPIRED        ~28 GB   ~99.81%

    💡 Critical Insight

    ~99.81% of UNDO (APPS_UNDOTS1) was EXPIRED and fully reusable

    This means:

    • No real space pressure
    • No transaction risk
    • Tablespace is effectively available

    Understanding UNDO Behavior

    UNDO extents follow a lifecycle:

    ACTIVE → UNEXPIRED → EXPIRED → REUSED
    State        Description                    Reusable
    ACTIVE       Used by active transactions    No
    UNEXPIRED    Retained for consistency       ⚠️
    EXPIRED      No longer needed               Yes

    UNDO extents are not physically freed but logically reused by Oracle based on demand.


    Why the Alert Was Misleading

    The alert was based on:

    SELECT * FROM dba_free_space;

    Limitation

    • Shows only physically free space
    • Does not reflect reusable UNDO
    • Ignores Oracle’s internal reuse mechanism

    Root Cause

    A long-running concurrent request:

    • Ran for ~30 hours
    • Performed heavy DELETE operations
    • Generated significant UNDO

    During execution:

    • UNDO reached ~99% → real pressure

    After completion:

    • UNDO became EXPIRED
    • Space became reusable
    • Alert persisted → false positive

    When UNDO Is Actually a Problem

    ACTIVE       > 20–30%
    UNEXPIRED    very high
    EXPIRED      very low

    UNDO becomes a real issue only when both EXPIRED and UNEXPIRED extents are exhausted and no reusable space remains.


    Recommended Monitoring Approach

    WITH undo AS (
      SELECT status, SUM(bytes)/1024/1024 mb
      FROM dba_undo_extents
      GROUP BY status
    ),
    total AS (
      SELECT SUM(mb) total_mb FROM undo
    )
    SELECT u.status,
           ROUND(u.mb, 2)                      AS mb,
           ROUND((u.mb / t.total_mb) * 100, 2) AS pct
    FROM undo u, total t;

    Improved Alert Logic

    • ACTIVE > 20% → Critical
    • UNEXPIRED > 80% → Warning
    • Otherwise → Normal
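
    The same thresholds can be folded into a single classification query. A minimal sketch, assuming the 20% and 80% limits above:

    WITH undo AS (
      SELECT status, SUM(bytes) bytes
      FROM dba_undo_extents
      GROUP BY status
    ),
    pct AS (
      SELECT SUM(CASE WHEN status = 'ACTIVE'    THEN bytes ELSE 0 END) / SUM(bytes) * 100 AS active_pct,
             SUM(CASE WHEN status = 'UNEXPIRED' THEN bytes ELSE 0 END) / SUM(bytes) * 100 AS unexpired_pct
      FROM undo
    )
    SELECT ROUND(active_pct, 2)    AS active_pct,
           ROUND(unexpired_pct, 2) AS unexpired_pct,
           CASE
             WHEN active_pct    > 20 THEN 'CRITICAL'
             WHEN unexpired_pct > 80 THEN 'WARNING'
             ELSE 'NORMAL'
           END AS alert_level
    FROM pct;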

    Key Takeaways

    • UNDO is state-driven, not space-driven
    • EXPIRED undo = reusable capacity
    • % used is misleading
    • Always analyze at PDB level in multitenant setups
    • Monitoring must align with Oracle internals

    Conclusion

    UNDO tablespace monitoring must evolve from:

    ❌ Static space-based metrics
    ➡️
    ✅ Dynamic lifecycle-based analysis

    A tablespace showing 99% utilization can still be completely healthy if most of its extents are reusable.


    Level Insight

    “UNDO is not a storage problem — it is a lifecycle problem. The difference between false alarms and accurate diagnosis lies in understanding how Oracle transitions undo extents.”


    Final Thought

    👉 Don’t ask:
    “How full is UNDO?”

    👉 Ask:
    “How much of it is actually in use?”


    Author

    Syed Anwar Ahmed
    Oracle Apps DBA | Oracle EBS | Performance & Troubleshooting

  • When ADOP Remembers Too Much: Fixing Patch Failures Caused by Stale Metadata in Oracle EBS

    During an Oracle E-Business Suite ADOP patching cycle in a multi-node environment, the apply phase failed on one node while completing successfully on others. Despite retries — including downtime mode — the issue persisted, pointing to a deeper inconsistency within the patching framework.


    Symptoms Observed

    • ADOP session status: FAILED
    • Patch applied successfully on some nodes, failed on admin node
    • Repeated failures even with restart=no, abandon=yes, and downtime mode
    • No immediate actionable error from standard logs

    Timeline of Events

    T0 -- Patch execution initiated (ADOP apply phase)
    T1 -- Failure observed on admin node
    T2 -- Retry using downtime mode -- Failure persists
    T3 -- ADOP session review shows inconsistent state
    T4 -- Internal metadata tables analyzed
    T5 -- Cleanup performed (tables + restart directory)
    T6 -- Patch re-executed -- Success across all nodes

    Investigation

    Step 1: Check ADOP Session State

    Query the ADOP session status to understand the current state across all nodes:

    -- Check current ADOP session status
    SELECT session_id, node_name, phase, status,
           start_date, end_date
    FROM applsys.ad_adop_sessions
    ORDER BY start_date DESC;
    
    -- Check apply phase status per node
    SELECT s.session_id, n.node_name, p.phase_code,
           p.status, p.start_date, p.end_date
    FROM applsys.ad_adop_sessions s,
         applsys.ad_adop_session_phases p,
         applsys.fnd_nodes n
    WHERE s.session_id = p.session_id
    AND p.node_id = n.node_id
    ORDER BY p.start_date DESC;

    The existing session showed status FAILED with the apply phase partially completed — a clear indicator of inconsistent execution state across nodes.

    Step 2: Check adalldefaults.txt

    Reviewed the defaults file for any relevant configuration:

    cat $APPL_TOP/admin/$TWO_TASK/adalldefaults.txt | grep -i missing
    # Key parameter found:
    # MISSING_TRANSLATED_VERSION = No

    Modifying and retrying with this parameter had no impact, confirming the issue was not translation-related.

    Step 3: Check Install Processes Table

    -- Check for stale entries in FND_INSTALL_PROCESSES
    SELECT COUNT(*) FROM applsys.fnd_install_processes;
    
    -- View stale entries in detail
    SELECT process_status, process_name, last_update_date
    FROM applsys.fnd_install_processes
    ORDER BY last_update_date DESC;
    
    -- Check AD_DEFERRED_JOBS
    SELECT COUNT(*) FROM applsys.ad_deferred_jobs;
    SELECT * FROM applsys.ad_deferred_jobs;

    Observation: FND_INSTALL_PROCESSES contained stale entries from the failed session. AD_DEFERRED_JOBS was empty.


    Root Cause

    The failure was caused by stale and inconsistent ADOP metadata tables — specifically APPLSYS.FND_INSTALL_PROCESSES and APPLSYS.AD_DEFERRED_JOBS. ADOP internally relies on these tables to track patch progress checkpoints, deferred job execution, and restart state management. When these tables retain entries from failed or incomplete sessions, ADOP assumes an incorrect execution state, leading to patch reconciliation failure, apply phase breakdown, and node-level inconsistencies.


    Resolution Steps

    Step 1: Backup Critical Tables

    -- Always backup before any cleanup
    CREATE TABLE applsys.fnd_install_processes_bak AS
    SELECT * FROM applsys.fnd_install_processes;
    
    CREATE TABLE applsys.ad_deferred_jobs_bak AS
    SELECT * FROM applsys.ad_deferred_jobs;
    
    -- Verify backups
    SELECT COUNT(*) FROM applsys.fnd_install_processes_bak;
    SELECT COUNT(*) FROM applsys.ad_deferred_jobs_bak;

    Step 2: Drop Stale Metadata Tables

    Dropping these tables forces ADOP to rebuild clean metadata during the next run:

    DROP TABLE applsys.fnd_install_processes;
    DROP TABLE applsys.ad_deferred_jobs;

    Step 3: Reset the Restart Directory

    The restart directory can silently preserve failure states. Back it up and create a fresh one:

    cd $APPL_TOP/admin/$TWO_TASK

    # Backup existing restart directory
    mv restart restart_bkp_$(date +%Y%m%d)

    # Create fresh restart directory
    mkdir restart

    # Verify
    ls -la | grep restart

    Step 4: Re-run the Patch

    adop phase=apply \
         patches=<patch_id> \
         restart=no \
         abandon=yes \
         apply_mode=downtime

    The patch completed successfully across all nodes after the metadata cleanup.
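
    To confirm the clean state, adop -status gives the OS-level summary, and re-running the session query from Step 1 of the investigation shows the final status from the database side:

    -- Latest session should now show a successful apply on every node
    SELECT session_id, node_name, phase, status, start_date, end_date
    FROM applsys.ad_adop_sessions
    ORDER BY start_date DESC;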


    Before vs After

    Component           Before Fix     After Fix
    ADOP Session        Failed         Successful
    Node Consistency    Partial        Full
    Restart Behavior    Stuck          Clean
    Patch Execution     Incomplete     Completed

    Key Takeaways

    • ADOP is state-driven — even when logs appear clean, internal metadata drives execution decisions
    • Partial success is a clue — if some nodes succeed and one fails, focus on local metadata, not the patch itself
    • The restart directory matters — it can silently preserve failure states and must be validated before retrying
    • Downtime mode is not a fix-all — even in downtime, ADOP still reads metadata tables; corruption persists unless cleaned
    • Always backup before cleanup — never drop tables without creating a backup first

    When NOT to Use This Approach

    Avoid applying this fix if the issue is caused by missing database patches (ETCC warnings), file system or permission issues, incorrect patch sequencing, or environment misconfiguration. Always validate the root cause before performing any metadata cleanup.


    This scenario highlights a subtle but critical behavior in ADOP — sometimes patch failures are not caused by the patch itself, but by what the system remembers about past attempts. By resetting stale metadata, we allow ADOP to re-evaluate the environment cleanly, leading to successful execution.

    Have questions or faced a similar issue? Reach out at sdanwarahmed@gmail.com.

  • Oracle EBS Performance Issue: Inactive Forms Session Holding TX Locks (RCA and Resolution)

    In Oracle E-Business Suite (EBS) environments, performance issues are often attributed to high workload or system resource constraints. However, some of the most critical slowdowns originate from less obvious sources — inactive sessions holding uncommitted transactions. This post walks through a real-world production incident where an inactive Oracle Forms session caused cascading blocking across multiple users due to TX row-level locks.


    Observed Symptoms

    • Oracle Forms screens becoming unresponsive in Order Management and Shipping modules
    • Concurrent programs stuck in running state
    • Increased database wait event: enq: TX - row lock contention

    Investigation

    Analysis of v$session, v$transaction, and dba_wait_chains revealed a single inactive Oracle Forms session (frmweb) holding an active transaction with multiple downstream sessions waiting on TX row lock contention.

    -- Identify blocking sessions
    SELECT blocking_session, sid, wait_class, event
    FROM v$session
    WHERE blocking_session IS NOT NULL;
    
    -- Detect inactive sessions with active transactions
    SELECT s.sid, s.serial#, s.status, s.program,
           s.username, t.start_time,
           ROUND(s.last_call_et/3600,2) hrs_inactive
    FROM v$session s, v$transaction t
    WHERE s.saddr = t.ses_addr
    AND s.status = 'INACTIVE'
    ORDER BY hrs_inactive DESC;
    
    -- Analyze full wait chain
    SELECT * FROM dba_wait_chains;

    The root blocking session showed STATUS = INACTIVE and EVENT = 'SQL*Net message from client' but had an active transaction in v$transaction — confirming it was idle at the application level but actively holding locks at the database level.


    Root Cause

    An Oracle Forms session executed a SELECT ... FOR UPDATE NOWAIT on WSH_DELIVERY_DETAILS, then became idle without committing or rolling back. This held exclusive row locks that blocked other sessions attempting to access the same rows, creating a cascading blocking chain.

    -- The problematic SQL pattern
    SELECT *
    FROM WSH_DELIVERY_DETAILS
    WHERE ROWID = :B1
    FOR UPDATE NOWAIT;

    Resolution

    The root blocking session was identified and verified to have no active business transaction in progress; after approvals were obtained, the session was terminated:

    -- Kill blocking session (only after full validation and approval)
    ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;

    Locks were released immediately, the blocking chain resolved, and application responsiveness was restored.


    Preventive Measures

    • Implement idle session timeout policies (a profile-based sketch follows this list)
    • Educate users on proper transaction handling in Oracle Forms
    • Review custom code using FOR UPDATE — keep transactions short and commit promptly
    • Monitor long-running and idle transactions proactively
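
    As an illustration of the first point, an idle timeout can be enforced at the database level through a profile. The sketch below uses placeholder names and values; the IDLE_TIME limit must be agreed with the application team, since Forms sessions exceeding it will be sniped.

    -- Sketch only: enforce an idle timeout via a profile (names/values are placeholders)
    ALTER SYSTEM SET resource_limit = TRUE SCOPE=BOTH;    -- required for kernel resource limits to apply
    CREATE PROFILE forms_idle_limit LIMIT IDLE_TIME 60;   -- minutes of allowed inactivity
    ALTER USER <application_user> PROFILE forms_idle_limit;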

    Key Takeaways

    • An inactive session can still hold active transactions and critical locks
    • Always identify the root blocker — intermediate sessions are symptoms, not the cause
    • Application-level inactivity does not mean database-level inactivity
    • In Oracle EBS, the most disruptive issues are often caused by inactive sessions holding uncommitted transactions

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Database Session Spike Mystery: When Connection Pools Collide

    In a production Oracle database environment, a sudden spike pushed the session count past 1,000, triggering alerts and concern. Interestingly, the system recovered automatically without any database-level intervention. At first glance this appeared to be a database issue — but deeper analysis revealed a different story.


    The Incident

    An automated alert reported session count exceeding threshold (1000+), with the majority in INACTIVE state from middleware connection pool accounts. Despite the spike there were no blocking sessions, no performance degradation, and no database errors.

    -- Quick session count check
    SELECT COUNT(*) FROM gv$session;
    -- Result: 216 (already returning to normal)
    
    -- Session breakdown by status
    SELECT status, COUNT(*) cnt
    FROM gv$session
    GROUP BY status
    ORDER BY cnt DESC;

    Root Cause

    Multiple production mid-tier servers recreated their connection pools within the same time window. New pools created new database sessions while existing pools kept their sessions alive (INACTIVE) pending graceful termination — resulting in a temporary overlap:

    Old Sessions (Inactive) + New Sessions (Active) = Session Surge

    As older pools were cleaned up, inactive sessions terminated automatically and the count returned to baseline. This was not a database problem — it was connection pool lifecycle behavior in the mid-tier layer.
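
    A breakdown of sessions by mid-tier host, program, and logon hour makes this overlap visible. The query below is a read-only sketch against gv$session; in practice it would be filtered to the connection pool accounts in use.

    -- Sketch: spot old vs. new pool sessions by mid-tier host and logon hour
    SELECT machine, program,
           TO_CHAR(TRUNC(logon_time, 'HH24'), 'YYYY-MM-DD HH24') AS logon_hour,
           status, COUNT(*) AS sessions
    FROM   gv$session
    WHERE  type = 'USER'
    GROUP  BY machine, program, TRUNC(logon_time, 'HH24'), status
    ORDER  BY logon_hour, machine;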


    Recommendations

    • Stagger connection pool refresh across mid-tier servers to avoid simultaneous spikes
    • Monitor inactive session trends to detect abnormal accumulation early
    • Configure appropriate idle timeout, maximum pool size, and session reuse settings

    Key Takeaways

    • Not all session spikes are database problems — check middleware behavior first
    • High session count does not necessarily indicate database stress
    • Transient issues still require analysis as they reveal architectural inefficiencies
    • Database alerts can originate from upstream connection management behavior

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes

    Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.


    Observed Symptoms

    ORA-00031: session marked for kill
    Thread 1 cannot allocate new log
    Private strand flush not complete

    Phase 1: Interpreting ORA-00031 Correctly

    ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.
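
    If confirmation is needed, sessions in this state remain visible until the background cleanup completes. A read-only check such as the following sketch lists them:

    -- Sessions already marked for kill; background cleanup is still pending
    SELECT inst_id, sid, serial#, status, program, event
    FROM   gv$session
    WHERE  status = 'KILLED';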


    Phase 2: Identifying the True Performance Signal

    The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.


    Phase 3: Evidence-Based Analysis (Read-Only)

    Redo switch frequency was analyzed to validate system behavior:

    SELECT
        TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
        COUNT(*) AS switches
    FROM v$log_history
    WHERE first_time > SYSDATE - 1
    GROUP BY TRUNC(first_time, 'HH24')
    ORDER BY 1;

    Findings

    Metric                Observation
    Average Switch Rate   5-7 per hour
    Peak Rate             8-10 per hour during business hours
    Off-Peak Rate         1-3 per hour

    A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.
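
    To support that correlation, redo volume per hour can be sampled alongside the switch counts. The sketch below uses the 'Redo Generated Per Sec' metric from v$sysmetric_history; note that this view retains only a limited window, so AWR history may be needed for longer periods.

    -- Sketch: average redo generation rate per hour (read-only)
    SELECT TO_CHAR(TRUNC(begin_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS sample_hour,
           ROUND(AVG(value)/1024/1024, 2) AS avg_redo_mb_per_sec
    FROM   v$sysmetric_history
    WHERE  metric_name = 'Redo Generated Per Sec'
    GROUP  BY TRUNC(begin_time, 'HH24')
    ORDER  BY 1;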


    Why No Changes Were Made

    In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.


    Recommendations

    • Continuously monitor redo switch frequency during peak windows
    • Use collected data to justify future redo log sizing via change management (current log group sizes can be listed as shown below)
    • Avoid unnecessary intervention when behavior is transient and non-impacting
    • Distinguish informational alert log messages from actionable errors
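
    For the sizing discussion, the current redo log layout is a useful data point. This read-only query lists group sizes and member counts:

    -- Current redo log groups: size, members, status (read-only)
    SELECT group#, thread#, bytes/1024/1024 AS size_mb, members, status
    FROM   v$log
    ORDER  BY thread#, group#;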

    Key Takeaways

    • ORA-00031 is expected and harmless — it confirms session termination
    • Redo allocation waits are transient under sustained load
    • Proper analysis prevents unnecessary production intervention
    • Not all alert log warnings indicate failure — some are early signals of workload growth
    • The goal is not to eliminate every alert, but to understand which ones matter

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle EBS Login Issue: Real-Time Production Incident and Fix (OACORE + WebLogic Filter)


    📌 Introduction

    In Oracle E-Business Suite (EBS) environments, login failures are often perceived as simple application issues. However, in complex architectures, they can originate from multiple interacting layers across the application and middleware stack.

    In this blog, I’ll walk through a real-world production incident where an OACORE JVM issue combined with WebLogic security filtering resulted in complete login inaccessibility.

    This case highlights the importance of analyzing both performance and security layers together when troubleshooting critical application outages.


    ⚠️ Issue Summary

    • Users were unable to access the EBS login page
    • Pages were hanging or not loading
    • WebLogic console reported:
    Connection rejected, filter blocked Socket

    🔍 Initial Observation

    From the application server:

    • Load Average: ~10+

    👉 This indicated:

    • High CPU utilization
    • System under heavy stress
    • Potential JVM thread contention

    🔬 Detailed Analysis

    • One of the OACORE managed server JVMs became unresponsive
    • Long-running threads caused thread pool exhaustion
    • Incoming user requests began queueing

    At the same time:

    • WebLogic connection filter was actively enforcing access rules
    • Legitimate requests were being rejected under stressed conditions

    🧠 Understanding the Components

    OACORE (Application Layer)

    Handles:

    • Login requests
    • Forms processing
    • Core application logic

    If JVM threads are exhausted:
    👉 Requests queue → login hangs


    Oracle WebLogic Server Connection Filter

    EBS environments may use:

    oracle.apps.ad.tools.configuration.wls.filter.EBSConnectionFilterImpl

    This filter:

    • Enforces IP-based access control
    • Overrides default allow rules

    If misconfigured or stressed:
    👉 Legitimate traffic may be blocked
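
    For reference, connection filter rules are maintained in the WebLogic console (Domain > Security > Filter). The rules below are purely illustrative; the general format is target address, local address, local port, action, and an optional protocol list, and the actual addresses must match the environment.

    # Illustrative rules only (addresses are examples, not recommendations)
    192.0.2.0/24  *  *  allow  t3 t3s http https
    0.0.0.0/0     *  *  deny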


    🎯 Root Cause Analysis (RCA)

    The login issue was not caused by a single failure point, but by a combination of application tier resource exhaustion and restrictive middleware-level access control.

    • High CPU utilization and long-running threads caused one OACORE JVM to become unresponsive
    • Thread pool exhaustion led to request queuing, preventing new login requests from being processed

    Simultaneously:

    • The WebLogic connection filter (EBSConnectionFilterImpl) enforced strict access control policies
    • Under high load conditions, legitimate client requests were rejected with “filter blocked Socket”

    This interaction between performance degradation and security enforcement amplified the impact, resulting in complete login inaccessibility despite partial system availability.


    🛠️ Resolution Approach (Controlled & Safe)

    The resolution approach focused on stabilizing the JVM layer while validating and correcting middleware-level access controls in a controlled manner.


    🔹 Step 1: Identify Unresponsive JVM

    ps -ef | grep oacore

    ✔ Identify JVM with abnormal CPU or stuck behavior
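
    A lightweight way to do this (assuming JDK tools are available on the node) is to correlate the high-CPU PID with a thread dump before taking any action. The commands below are a sketch and the output path is illustrative.

    # Spot the Java process consuming CPU (batch mode, single sample)
    top -b -n 1 | grep -i java

    # Capture a thread dump of the suspect OACORE JVM for stuck-thread analysis
    jstack <PID> > /tmp/oacore_threads.txt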


    🔹 Step 2: Handle Stuck JVM (Controlled Action)

    ⚠️ Important Note:

    Forcefully terminating JVM processes should NOT be performed without validation.

    ✔ Recommended Approach:

    • Confirm the process is unresponsive
    • Ensure no critical transactions are running
    • Prefer controlled shutdown where possible

    ✔ Example (Only if fully unresponsive and approved):

    kill -9 <PID>

    👉 Node Manager can restart the JVM automatically after termination


    🔹 Step 3: Rolling Restart of OACORE

    admanagedsrvctl.sh stop oacore_server1
    admanagedsrvctl.sh start oacore_server1

    ✔ Ensures clean JVM state
    ✔ Restores thread pool balance


    🔹 Step 4: Validate WebLogic Connection Filter

    ⚠️ Important Note:

    The WebLogic connection filter is a security control and should NOT be disabled permanently.

    ✔ In this case:

    • Filter behavior was validated as part of troubleshooting
    • Temporary relaxation was used to confirm impact

    ✔ Recommended Approach:

    • Review allowed IP ranges
    • Validate filter configuration
    • Re-enable filter after correction

    The filter was re-enabled after validation and correction of configuration.


    🔹 Step 5: Restart Admin Server (If Required)

    adadminsrvctl.sh stop
    adadminsrvctl.sh start

    ✔ Ensures configuration changes are applied


    ✅ Final Outcome

    • System load reduced from ~10 → ~1.5
    • OACORE JVMs stabilized
    • Login page restored
    • Users successfully accessed the application

    📊 Key Learnings

    1. OACORE JVM issues can manifest as complete application outages due to thread exhaustion and request queuing.
    2. WebLogic connection filters can unintentionally block legitimate traffic, especially under high load conditions.
    3. Multi-layer failures (application + middleware + security) significantly amplify incident impact.
    4. System load metrics (CPU, load average) provide early indicators of JVM stress and should be monitored proactively.

    🧩 Preventive Measures

    • Monitor JVM thread utilization proactively
    • Review connection filter configurations periodically
    • Plan for peak concurrency to avoid sudden load spikes
    • Implement alerting for high CPU / load conditions
    • Validate session and request handling patterns

    🔐 Governance Considerations

    • Avoid forceful JVM termination without proper validation
    • Do not disable security controls without understanding impact
    • Follow change management procedures in production
    • Coordinate with application and security teams before changes

    🏁 Conclusion

    This incident demonstrates that login failures in Oracle EBS are not always isolated to a single component but can result from complex interactions across application performance and middleware security layers.

    The combination of JVM resource exhaustion and connection filtering behavior created a compounded failure scenario, leading to complete login disruption.

    A structured, multi-layer troubleshooting approach—focused on performance, configuration, and governance—enabled effective resolution while minimizing risk.

    This reinforces the importance of analyzing both system behavior and security controls together when addressing critical production incidents.


    💡 Pro Tip

    When troubleshooting Oracle EBS login issues, always validate both:

    • JVM health (thread utilization, CPU load)
    • Middleware controls (connection filters, access rules)

    Ignoring either layer can lead to incomplete or misleading diagnosis.