Tag: dba

  • P1 Incident: /dbbackup Filesystem 100% Full — How We Traced, Fixed, and Recovered Two Failed RMAN Backups in One Night

    🔴 Incident Overview

    Severity P1 — Two production database backups failed
    Environment Oracle 19c (19.30) on Linux x86-64
    Backup Tool RMAN with Recovery Catalog
    Backup Volume /dbbackup — 1TB LVM filesystem
    Databases Affected DBPRO01 (12:30 failure), DBPRO02 (22:30 failure)
    Total Space Recovered ~492G (disk went from 100% → 52%)

    1. The Alerts — Two Failures, Same Root Cause

    It started with an RMAN failure at 22:30. The backup script for DBPRO02 fired on schedule and died within 2 minutes. The RMAN log told the story clearly:

    RMAN-03009: failure of backup command on c4 channel at 22:32:13
    ORA-19502: write error on file "/dbbackup/DBPRC02/rman/DiffInc_DBPRC02_4u4mi8bf"
    ORA-27072: File I/O error
    Additional information: 4
    

    Three more channels followed — c1, c2, c3 — all crashing at exactly 22:32:48. When multiple channels fail simultaneously at the same timestamp, it almost always means one thing: the destination filesystem just hit 100%.

    A quick check confirmed it:

    $ df -hP /dbbackup
    Filesystem                        Size  Used Avail Use%
    /dev/mapper/orabkupvg-orabkuplv1 1023G 1020G  3.3G 100%
    

    1TB volume. 3.3G free. Completely full.

    What we did not know yet: backup history would later show that DBPRO01 had already failed at 12:30 that same day, 10 hours earlier, for the same reason. Two databases were left unprotected on the same night.


    2. The Investigation — Folder by Folder

    The first step was understanding what was consuming the disk. One command gave us the top-level picture:

    $ du -sh /dbbackup/*
    744G    DBPRC01
    152G    ColdBackup_11April2026
     41G    DBPRC02
     26G    DBPRC03
     15G    DBPRO01
    7.7G    JAN2026_CPU
    7.7G    OCT2025_CPU
    5.3G    infra_arch
    

    744G inside DBPRC01 alone — 73% of the entire disk. That was our primary suspect.

    Drilling into DBPRC01

    $ du -sh /dbbackup/DBPRC01/rman/* | sort -rh | head -10
    7.5G    DiffInc_DBPRC01_fl4lumij
    7.5G    DiffInc_DBPRC01_eg4lrnh5
    7.5G    DiffInc_DBPRC01_b44lc7uh
    ...
    

    Every single file was a DiffInc_ or ArchivelogAll_ backup piece. No variety. No cleanup. Just backup after backup piling up.

    $ ls /dbbackup/DBPRC01/rman/ | wc -l
    1522
    

    1,522 backup pieces. We checked the oldest and newest:

    Oldest file on disk:  2022-05-07
    Newest file on disk:  2026-04-25
    

    Four years of backup files on disk — or so we thought.


    3. The RMAN Investigation — Where Things Got Interesting

    We connected RMAN to the database and ran the retention check:

    RMAN> SHOW RETENTION POLICY;
    CONFIGURE RETENTION POLICY TO REDUNDANCY 30;
    

    REDUNDANCY 30. This tells RMAN to keep the last 30 complete backup copies of every datafile before considering anything obsolete.

    Next logical step — check what RMAN considers obsolete:

    RMAN> REPORT OBSOLETE;
    no obsolete backups found
    

    Nothing? With 1,522 files on disk?

    We ran CROSSCHECK BACKUP — all 1,693 objects came back AVAILABLE. Then we checked the actual date range RMAN was tracking from the database control file:

    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    OLDEST          NEWEST          TOTAL_PIECES
    03-DEC-2025     25-APR-2026     1541
    

    The control file only tracks pieces from December 2025 onwards — about 5 months. The 2022/2023 files seen on disk were old directories and scripts, not backup pieces. All 1,541 current pieces were legitimate and RMAN considered every one of them necessary under REDUNDANCY 30.

    This was the key insight: RMAN was not broken. The retention policy itself was the problem.


    4. Root Cause — The Architecture Trap

    The deeper investigation revealed something unexpected. Looking at the actual RMAN backup script:

    connect target rman/password@DBPRO01
    ...
    format '/dbbackup/DBPRC01/rman/DiffInc_%d_%u'
    (database);
    ...
    delete obsolete;
    

    DBPRO01 (the production database) was backing up INTO the DBPRC01 directory. The directory names suggested one database but contained another database’s backups entirely. The naming convention was PRO to PRC — production database backups stored in the production-copy directory.

    This pattern existed for all three database pairs on the server. Each production database backed up into its corresponding copy directory.

    The delete obsolete command was in the script — but with REDUNDANCY 30 and weekly Level 0 backups, backups become obsolete only after 30 complete Level 0 cycles. That is 30 weeks, roughly 7 months of retention. Since the current tracking window was only 5 months, delete obsolete ran every night and found absolutely nothing to delete. (A read-only check that makes this visible is sketched after the table below.)

    The math:

    Retention policy REDUNDANCY 30
    Level 0 frequency Weekly (Sundays)
    Effective retention period ~30 weeks / ~7 months
    Backup tracking since December 2025 (~5 months)
    Result delete obsolete finds nothing — ever
    Daily backup size ~7–7.5G per run
    Total accumulated 744G
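
    A rough, read-only way to see why is to count how many Level 0 copies of each datafile the control file knows about: under REDUNDANCY 30, nothing becomes obsolete until a datafile has more than 30 of them. The sketch below approximates RMAN's bookkeeping rather than reproducing it; run it as SYSDBA on the target:

    # Count Level 0 copies per datafile (a sketch; approximates RMAN's redundancy logic)
    sqlplus -s / as sysdba <<'SQL'
    SET PAGESIZE 100 LINESIZE 120
    SELECT file#,
           COUNT(*)                                    AS level0_copies,
           TO_CHAR(MIN(completion_time),'DD-MON-YYYY') AS oldest_level0
    FROM   v$backup_datafile
    WHERE  incremental_level = 0
    GROUP  BY file#
    ORDER  BY file#;
    SQL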

    Adding fuel to the fire — the patching activity on April 18 triggered an extra Level 0 backup, followed by the regular Sunday Level 0 on April 19. Two large Level 0 runs (~27G each) within 24 hours wrote the final ~54G that pushed the disk over the edge.


    5. Secondary Findings During Investigation

    OCT2025 CPU Patch Artifacts (7.7G)

    The October 2025 CPU patch files (zip archives + extracted directories) were still sitting in /dbbackup/OCT2025_CPU/. A quick OPatch check confirmed the database had since been patched to 19.30 (January 2026 RU) — the October 2025 patches were fully superseded and rolled back from inventory. Safe to delete immediately.

    $ $ORACLE_HOME/OPatch/opatch lsinventory | grep -E "38291812|38194382"
    # Empty — neither Oct 2025 patch in inventory anymore
    

    5-Year-Old Pre-Migration Export Dumps

    Three directories contained Oracle 11.2.0.4 export dumps from January–March 2021 — taken before the migration to 19c. With the database now running 19.30, these had zero recovery value but occupied ~14G collectively. Flagged for manager approval before deletion.

    Recovery Catalog Version Mismatch

    The original RMAN log flagged this warning:

    PL/SQL package RMAN.DBMS_RCVCAT version 19.11.00.00 in RCVCAT database is not current
    PL/SQL package RMAN.DBMS_RCVMAN version 19.11.00.00 in RCVCAT database is not current
    

    The recovery catalog is running 19.11 packages while the RMAN client is now 19.30. Non-critical tonight but requires UPGRADE CATALOG in the next maintenance window.


    6. The Fix — Emergency Space Recovery

    With management approval obtained, we executed a time-based delete — keeping the last 30 days of backups and removing everything older:

    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    

    This command does three things in a single operation:

    1. Queries catalog/controlfile for all pieces completed before the cutoff date
    2. Deletes the physical files from disk
    3. Removes the records from RMAN catalog — no orphaned entries, no catalog drift

    The output scrolled for several minutes:

    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/DiffInc1_DBPRC01_6u4jaaih ...
    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/ArchivelogAll_DBPRC01_784jab6o ...
    ...
    Deleted 1072 objects
    

    1,072 backup pieces deleted. Catalog updated. Disk checked:

    BEFORE:  Used 1020G  Avail 3.3G  (100%)
    AFTER:   Used  536G  Avail 488G   (53%)
    

    Then the OCT2025_CPU directory was removed:

    $ rm -rf /dbbackup/OCT2025_CPU/
    $ df -hP /dbbackup
    Used 528G  Avail 495G  (52%)
    

    Final result: 495G free. Disk at 52%.

    Both failed backups were re-submitted immediately and ran successfully in parallel:

    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO01.sh &
    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO02.sh &
    
    $ jobs -l
    [1] Running   nohup sh ...rman_backup_DBPRO01.sh &
    [2] Running   nohup sh ...rman_backup_DBPRO02.sh &
    

    7. Incident Timeline

    12:30 DBPRO01 Level 0 backup fails — ORA-19502/ORA-27072 (disk full)
    22:30 DBPRO02 Level 0 backup fails — same errors, all 4 channels
    23:08 Investigation begins — df -hP /dbbackup confirms 100% full
    23:15 DBPRC01 directory identified as 744G consumer
    23:25 RMAN connected — REDUNDANCY 30 discovered
    23:35 Architecture confirmed — PRO databases backing up into PRC directories
    23:45 Root cause confirmed — ~7-month effective retention, delete obsolete finds nothing
    23:50 DELETE BACKUP COMPLETED BEFORE SYSDATE-30 executed
    23:51 1,072 pieces deleted — disk drops to 53%
    23:55 OCT2025_CPU removed — disk at 52%, 495G free
    00:00 Both backup jobs re-submitted and running successfully

    8. Permanent Fix Recommendations

    Fix 1 — Change Retention Policy to RECOVERY WINDOW

    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    

    REDUNDANCY 30 with weekly Level 0s means roughly 7 months of retention — far beyond what any production SLA requires. A 14-day recovery window keeps 2 weeks of backups regardless of backup frequency, and delete obsolete will actually find and remove old pieces going forward.
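
    Before changing the configured policy, the impact can be previewed read-only by asking RMAN what a 14-day window would consider obsolete (a sketch, assuming OS authentication to the target):

    # Preview obsolete pieces under a hypothetical 14-day window; nothing is deleted
    rman target / <<'RMAN'
    REPORT OBSOLETE RECOVERY WINDOW OF 14 DAYS;
    RMAN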

    Fix 2 — Add Pre-Backup Space Check to Cron Script

    #!/bin/bash
    BACKUP_FS="/dbbackup"
    THRESHOLD=20
    MAILTO="${MAILTO:-dba-team@example.com}"   # alert recipient (placeholder address)
    
    # Free space as a percentage: 100 minus the Use% column reported by df
    AVAIL_PCT=$(df -hP "$BACKUP_FS" | awk 'NR==2 {gsub(/%/,""); print 100-$5}')
    
    if [ "$AVAIL_PCT" -lt "$THRESHOLD" ]; then
      echo "ABORT: $BACKUP_FS is ${AVAIL_PCT}% free — below ${THRESHOLD}% threshold" \
        | mailx -s "BACKUP ABORTED: Low space on $BACKUP_FS" $MAILTO
      exit 1
    fi
    

    A failing backup that writes 3G before dying is worse than a backup that never starts — it wastes the last 3G of free space and leaves partial pieces on disk.

    Fix 3 — Upgrade the Recovery Catalog

    RMAN> CONNECT TARGET /
    RMAN> CONNECT CATALOG rman/password@rmancat
    RMAN> UPGRADE CATALOG;
    RMAN> UPGRADE CATALOG;   -- run twice as prompted
    

    The catalog schema packages are still at 19.11 while the RMAN client is now at 19.30, several Release Updates behind. Some catalog-dependent operations may start failing if this is left unaddressed.

    Fix 4 — Filesystem Monitoring Alert

    The FRA check scripts already email on FRA usage above 80%. The same pattern should exist for /dbbackup. A simple cron entry checking disk usage every hour with alert at 80% would have caught this days before the disk hit 100%.
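
    A minimal sketch of such a check is below; the script path, threshold, and mail recipient are assumptions to adapt to your environment:

    #!/bin/bash
    # check_dbbackup_space.sh: hourly space check for /dbbackup (a sketch)
    FS="/dbbackup"
    ALERT_PCT=80
    MAILTO="dba-team@example.com"          # placeholder recipient
    
    # Use% column of df, stripped of the % sign
    USED_PCT=$(df -hP "$FS" | awk 'NR==2 {gsub(/%/,""); print $5}')
    
    if [ "$USED_PCT" -ge "$ALERT_PCT" ]; then
      echo "WARNING: $FS is ${USED_PCT}% used (alert threshold ${ALERT_PCT}%)" \
        | mailx -s "DISK ALERT: $FS at ${USED_PCT}%" "$MAILTO"
    fi
    
    # Example crontab entry (hypothetical path):
    # 0 * * * * /opt/oracle/scripts/check_dbbackup_space.sh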


    9. Key Takeaways for Oracle DBAs

    REDUNDANCY N is not always safer than RECOVERY WINDOW. REDUNDANCY 30 with weekly Level 0 backups means roughly 7 months of retention — likely far beyond your RPO requirement and a silent space accumulator.

    • Always verify what delete obsolete actually deletes. If it finds nothing to delete every single night, that is a warning sign — not reassurance.
    • Check backup naming conventions carefully. When a directory named DBPRC01 contains DBPRO01 backups, it is easy to audit the wrong database: the RMAN configuration of the database that writes the backups controls the cleanup behavior, not the directory name.
    • Patching days generate oversized backups. A Level 0 taken manually on patch day plus the regular Sunday Level 0 the next day equals 2x the normal space consumption in 24 hours. Ensure extra headroom exists going into patch windows.
    • Use DELETE BACKUP COMPLETED BEFORE SYSDATE-N for emergency cleanup — not OS-level rm. RMAN deletes atomically update both the physical files and the catalog, preventing expired/orphaned piece confusion later.
    • Never use rm on RMAN backup pieces directly unless you follow up with CROSSCHECK BACKUP and DELETE EXPIRED BACKUP to sync the catalog.

    10. Commands Reference — Quick Cheat Sheet

    -- Check retention policy
    RMAN> SHOW RETENTION POLICY;
    
    -- Preview what would be deleted (dry run)
    RMAN> REPORT OBSOLETE;
    RMAN> LIST BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Emergency cleanup — delete pieces older than 30 days
    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Standard cleanup based on retention policy
    RMAN> DELETE NOPROMPT OBSOLETE;
    
    -- Sync catalog after any OS-level file operations
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE NOPROMPT EXPIRED BACKUP;
    
    -- Change to time-based retention (recommended)
    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    
    -- Check backup piece date range in control file
    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    -- Check backup history
    SELECT session_key, input_type, status,
           TO_CHAR(start_time,'YYYY-MM-DD HH24:MI:SS') start_time,
           output_bytes_display, time_taken_display
    FROM v$rman_backup_job_details
    ORDER BY start_time DESC;
    

    Conclusion

    What appeared to be a simple disk full incident turned out to involve a multi-database backup architecture, a misconfigured retention policy, and a cleanup mechanism that was technically running correctly but never finding anything to clean. The fix itself — one RMAN command — took under 5 minutes. The real work was the systematic investigation to understand exactly what was safe to delete and why.

    That is Oracle DBA work in a nutshell: the fix is often simple; understanding why it is safe to run is the real job.


    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More incident walkthroughs at syedanwarahmedoracle.blog.

  • Oracle EBS 12.2 — ADOP fs_clone Failure: Failed to Delete FMW_Home (Root Cause & Fix)

    Category: Oracle EBS 12.2  |  Topic: ADOP Patching  |  Difficulty: Intermediate  |  Oracle Support: Search ADOP fs_clone Failed to delete FMW_Home on My Oracle Support

    Introduction

    Oracle EBS 12.2 introduced Online Patching (ADOP), which relies on a dual file system architecture — a Run File System (fs2) where production runs, and a Patch File System (fs1) where patches are applied. The fs_clone phase synchronises fs1 from fs2 at the start of each patching cycle, making fs1 a fresh copy of the production file system.

    One of the most common issues encountered during fs_clone is a failure while trying to delete the FMW_Home directory on the Patch FS. This blog walks through a real production scenario on a 2-node RAC database with 4 application server nodes — covering the exact error, step-by-step diagnostic process, root cause identification, fix applied, and the final successful run with actual timings. All server-specific details have been anonymised.

    This issue applies to Oracle EBS 12.2.x on all platforms. For related Oracle Support articles, search “ADOP fs_clone Failed to delete FMW_Home” on My Oracle Support.

    Environment

    Parameter                 Value
    Application               Oracle E-Business Suite 12.2 (2-Node RAC DB + 4 Application Server Nodes)
    ADOP Version              C.Delta.13
    ADOP Session ID           129
    Run File System (fs2)     /u01/app/fs2 (Production — active)
    Patch File System (fs1)   /u01/app/fs1 (Inactive — patching target)
    Shared Storage            NFS-mounted shared volume (1.3T, 447G free)
    OS User                   applmgr

    Incident Timeline

    Event                   Timestamp                       Duration     Outcome
    1st run started         Apr 11, 2026 17:07:35                        time adop phase=fs_clone executed
    1st run failed          Apr 11, 2026 ~18:41             ~1h 34m      FATAL ERROR — FMW_Home deletion failed
    Diagnosis performed     Apr 11, 2026 18:45–19:44        ~59m         Root cause identified — root-owned OHS log files
    Fix applied (mv)        Apr 11, 2026 ~19:44             Seconds      FMW_Home renamed to dated backup as applmgr
    2nd run started         Apr 11, 2026 19:45                           time adop phase=fs_clone re-executed after fix
    2nd run completed       Apr 12, 2026 00:49:08           5h 21m 25s   SUCCESS — all 4 app nodes completed ✅
    Total session elapsed   Apr 11 17:07 → Apr 12 00:49     7h 46m 33s   Full session including failed run + fix + retry

    The Issue

    During time adop phase=fs_clone, the synchronisation process was progressing normally — staging the file system clone, detaching Oracle Homes, removing APPL_TOP and COMM_TOP — until it reached stage 6 (REMOVE-1012-ORACLE-HOME) inside the removeFMWHome() function, where it attempted to delete the FMW_Home directory on the Patch FS and hit a fatal error.

    Error on the Console

    fs_clone/remote_execution_result_level1.xml:
    *******FATAL ERROR*******
    PROGRAM : (.../fs2/EBSapps/appl/ad/12.0.0/patch/115/bin/txkADOPPreparePhaseSynchronize.pl)
    TIME    : Apr 11 18:41:40 2026
    FUNCTION: main::removeDirectory [ Level 1 ]
    ERRORMSG: Failed to delete the directory /u01/app/fs1/FMW_Home.
    [UNEXPECTED]fs_clone has failed

    Key Log File: txkADOPPreparePhaseSynchronize.log

    The primary log file is located at:

    $ADOP_LOG_DIR/<session_id>/<timestamp>/fs_clone/<node>/TXK_SYNC_create/
        txkADOPPreparePhaseSynchronize.log

    Inside this log, the clone status progression was clearly visible:

    ========================== Inside getCloneStatus()... ==========================
    clone_status             = REMOVE-1012-ORACLE-HOME
    clone_status_from_caller = 7
    clone_status_from_db     = 6
    Removing the directory: /u01/app/fs1/FMW_Home
    Failed to delete the directory /u01/app/fs1/FMW_Home.
    *******FATAL ERROR*******
    FUNCTION: main::removeDirectory [ Level 1 ]
    ERRORMSG: Failed to delete the directory /u01/app/fs1/FMW_Home.

    clone_status_from_db = 6 indicates the process had already completed: fs_clone staging, detach of Oracle Homes, removal of APPL_TOP, COMM_TOP, and 10.1.2 Oracle Home. It failed specifically and only while removing FMW_Home.

    ADOP fs_clone Stage Flow

    Stage (DB)   Clone Status              Description
    1            STARTED                   Session initialised
    2            FSCLONESTAGE-DONE         File system staging completed
    3            DEREGISTER-ORACLE-HOMES   Oracle Homes deregistered from inventory
    4            REMOVE-APPL-TOP           APPL_TOP removed from Patch FS
    5            REMOVE-COMM-TOP           COMM_TOP removed from Patch FS
    6            REMOVE-1012-ORACLE-HOME   Removing FMW_Home — ❌ FAILED HERE
    7+           (clone proceeds…)         Clone fs2 to fs1, re-register homes, config clone

    Diagnostic Steps

    Step 1 — Confirm You Are on the Correct File System

    Most critical check first. The target must be on Patch FS (fs1), never Run FS (fs2):

    echo $FILE_EDITION   # Must show: run
    echo $RUN_BASE       # Must show path to fs2
    $ echo $FILE_EDITION
    run
    $ echo $RUN_BASE
    /u01/app/fs2

    ⚠️ If FILE_EDITION shows patch, stop immediately — source the Run FS environment before proceeding.

    Step 2 — Check for Open File Handles

    lsof +D /u01/app/fs1/FMW_Home 2>/dev/null
    fuser -cu /u01/app/fs1/FMW_Home 2>&1

    In our case both commands returned empty output — no active process was holding FMW_Home open.

    Step 3 — Identify Root-Owned Files

    find /u01/app/fs1/FMW_Home ! -user applmgr -ls 2>/dev/null

    Output revealed multiple root-owned files under the OHS instance directories:

    drwxr-x---  3 root root 4096 Feb 15 06:48 .../EBS_web_OHS4/auditlogs/OHS
    -rw-------  1 root root    0 Feb 15 06:48 .../EBS_web_OHS4/diagnostics/logs/OHS/EBS_web/sec_audit_log
    -rw-r-----  1 root root 5670 Feb 15 06:49 .../EBS_web_OHS4/diagnostics/logs/OHS/EBS_web/EBS_web.log
    -rw-r-----  1 root root  249 Feb 15 06:49 .../EBS_web_OHS4/diagnostics/logs/OHS/EBS_web/access_log
    ... (same pattern for EBS_web_OHS2 and EBS_web_OHS3)

    All root-owned files were dated February 15 — nearly 2 months stale. This confirmed they were leftovers from OHS being incorrectly started as root during a previous patching cycle. No active process was involved.

    Step 4 — Attempt Manual Delete to Confirm the Error

    rm -rf /u01/app/fs1/FMW_Home 2>&1 | head -5
    rm: cannot remove '.../EBS_web_OHS4/diagnostics/logs/OHS/EBS_web/sec_audit_log': Permission denied

    This confirmed the issue was purely a file ownership/permission problem — not filesystem corruption or an NFS issue.

    Step 5 — Check Disk Space

    df -h /u01/app/fs1
    Filesystem      Size  Used Avail Use% Mounted on
    nfs_server:/vol  1.3T  844G  447G  66% /u01

    447GB free — sufficient to retain a backup of FMW_Home by renaming it.

    Root Cause Analysis

    The root cause was OHS (Oracle HTTP Server) being started as root on the Patch File System during a previous patching cycle in February 2026. This created log and audit files owned by root under:

    /u01/app/fs1/FMW_Home/webtier/instances/EBS_web_OHS2/auditlogs/OHS/
    /u01/app/fs1/FMW_Home/webtier/instances/EBS_web_OHS2/diagnostics/logs/OHS/EBS_web/
    /u01/app/fs1/FMW_Home/webtier/instances/EBS_web_OHS3/  (same structure)
    /u01/app/fs1/FMW_Home/webtier/instances/EBS_web_OHS4/  (same structure)

    Since fs_clone runs as applmgr, and applmgr has no write permission inside the root-owned directories (so it cannot remove them or the files they contain), the removeDirectory() function in txkADOPPreparePhaseSynchronize.pl failed with Permission Denied — surfaced as a fatal error.

    Why did OHS create root-owned files? If OHS start/stop scripts are executed as root or with sudo (instead of using applmgr-owned wrapper scripts), the resulting log and audit files are created with root ownership and persist on the Patch FS across patching cycles.

    Pre-Action Safety Checklist

    Check                                               Expected        Result
    FILE_EDITION = run                                  run             ✅ PASS
    RUN_BASE points to fs2                              /u01/app/fs2    ✅ PASS
    FMW_Home target is on fs1 (Patch FS only)           fs1 only        ✅ PASS
    lsof returns empty (no open handles)                Empty           ✅ PASS
    Root-owned files are stale (no active processes)    Stale only      ✅ PASS
    Sufficient disk space for backup rename             > 50GB free     ✅ PASS
    Production services confirmed running on fs2        fs2 up          ✅ PASS

    Solution — Move FMW_Home as Backup

    The safest approach on production is to move (rename) FMW_Home rather than deleting it. This avoids the need for root access entirely, completes in seconds, and preserves a backup.

    Why mv works even with root-owned files: on the same filesystem, mv is a single atomic rename of the top-level directory entry. It never touches or modifies the files inside the directory, so applmgr can rename FMW_Home even when files inside are owned by root; all it needs is write permission on the parent directory (/u01/app/fs1). This is fundamentally different from rm -rf, which must descend into the tree and remove each individual file.
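
    A minimal demonstration of the difference, using hypothetical /tmp paths and one sudo to stage root-owned files:

    mkdir -p /tmp/fmw_demo/FMW_Home
    sudo mkdir -p /tmp/fmw_demo/FMW_Home/diagnostics/logs/OHS    # root-owned directories
    sudo touch /tmp/fmw_demo/FMW_Home/diagnostics/logs/OHS/access_log
    
    rm -rf /tmp/fmw_demo/FMW_Home    # fails: cannot unlink inside the root-owned directories
    mv /tmp/fmw_demo/FMW_Home /tmp/fmw_demo/FMW_Home_bkp && echo "rename OK"
                                     # succeeds: only /tmp/fmw_demo itself must be writable
    sudo rm -rf /tmp/fmw_demo        # clean up the demo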

    Step 1 — Move FMW_Home as a Dated Backup

    mv /u01/app/fs1/FMW_Home /u01/app/fs1/FMW_Home_$(date +%d%b%Y)_bkp && echo "MOVE SUCCESSFUL"
    MOVE SUCCESSFUL

    Step 2 — Verify FMW_Home Is Gone

    ls -lrt /u01/app/fs1/

    Step 3 — Confirm You Are applmgr Before Retrying

    whoami
    # Expected output: applmgr

    ⚠️ Never run adop as root. Always confirm whoami shows applmgr before executing any adop command.

    Step 4 — Retry fs_clone

    time adop phase=fs_clone

    Running fs_clone Safely on Production

    time adop phase=fs_clone on a 2-node RAC with 4 application server nodes takes several hours. Never run it in a plain SSH/PuTTY session that could disconnect. Use one of the following:

    • VNC Session (Best): Network drops have zero impact on the running process.
    • nohup: nohup adop phase=fs_clone > /tmp/fsclone_$(date +%Y%m%d_%H%M%S).log 2>&1 &
    • screen: screen -S fsclone then time adop phase=fs_clone. Detach with Ctrl+A D, reattach with screen -r fsclone.

    Successful Run — 2nd Attempt

    After applying the fix, time adop phase=fs_clone was re-executed. The adopmon output confirmed all 4 application nodes progressing through validation, port blocking, clone steps, and config clone phases without any errors.

    ADOP (C.Delta.13)
    Session Id: 129
    Command:    status
    Node Name   Node Type  Phase        Status     Started               Finished              Elapsed
    ----------  ---------  -----------  ---------  --------------------  --------------------  -------
    app-node1   master     FS_CLONE     COMPLETED  2026/04/11 17:07:35   2026/04/12 00:49:08   7:46:33
    app-node2   slave      CONFIG_CLONE COMPLETED  2026/04/11 17:07:36   2026/04/12 01:01:55   7:47:19
    app-node3   slave      CONFIG_CLONE COMPLETED  2026/04/11 17:07:36   2026/04/12 01:01:25   7:47:49
    app-node4   slave      CONFIG_CLONE COMPLETED  2026/04/11 17:07:36   2026/04/12 01:02:16   7:47:40
    File System Synchronization Type: Full
    adop exiting with status = 0 (Success)
    Summary report for current adop session:
        Node app-node1:  - Fs_clone status: Completed successfully
        Node app-node2:  - Fs_clone status: Completed successfully
        Node app-node3:  - Fs_clone status: Completed successfully
        Node app-node4:  - Fs_clone status: Completed successfully
    adop exiting with status = 0 (Success)
    real    321m25.733s   (5 hours 21 minutes 25 seconds)
    user     40m1.142s
    sys      70m59.804s
    Node        Type     Started                 Finished                Elapsed
    app-node1   Master   Apr 11, 2026 17:07:35   Apr 12, 2026 00:49:08   7h 46m 33s
    app-node2   Slave    Apr 11, 2026 17:07:36   Apr 12, 2026 01:01:55   7h 47m 19s
    app-node3   Slave    Apr 11, 2026 17:07:36   Apr 12, 2026 01:01:25   7h 47m 49s
    app-node4   Slave    Apr 11, 2026 17:07:36   Apr 12, 2026 01:02:16   7h 47m 40s

    The 2nd run completed cleanly in 5 hours 21 minutes 25 seconds across all 4 application nodes. File System Synchronization Type: Full.

    Post-Resolution Cleanup

    After a successful fs_clone and full patching cycle, old FMW_Home backups can be removed. Keep the most recent backup until the next patching cycle completes, then clean up older ones as root (since they may contain root-owned files):

    ls -lrt /u01/app/fs1/FMW_Home*
    du -sh /u01/app/fs1/FMW_Home*
    # Remove old backups as root
    sudo rm -rf /u01/app/fs1/FMW_Home_<old_date>_bkp

    Prevention — Avoiding Recurrence

    • Never start OHS as root. Always use applmgr-owned wrapper scripts. Never use sudo or root to run adohs.sh or adadminsrvctl.sh.
    • Post-patching ownership check. After every adop finalize/cutover, run: find /u01/app/fs1 ! -user applmgr -ls 2>/dev/null | head -20
    • Pre-fs_clone health check. Verify no lingering adop sessions, confirm Run FS services are healthy, check disk space, and verify no root-owned files under fs1/FMW_Home before starting (a combined sketch follows below).
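
    A combined sketch of that pre-check, run as applmgr (the EBSapps.env location is assumed from the /u01/app layout used above):

    . /u01/app/EBSapps.env run                    # source the Run FS environment
    adop -status                                  # confirm no lingering adop session
    df -h /u01/app/fs1 /u01/app/fs2               # confirm free space on both editions
    find /u01/app/fs1/FMW_Home ! -user applmgr -ls 2>/dev/null | head -20   # root-owned leftovers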

    Summary

    Item                    Detail
    Phase                   adop phase=fs_clone
    Failing Function        main::removeDirectory inside removeFMWHome()
    Clone Stage             clone_status_from_db = 6 (REMOVE-1012-ORACLE-HOME)
    Root Cause              OHS started as root in a previous cycle — stale root-owned OHS log/audit files blocking applmgr deletion
    Production Impact       None — fs1 is Patch FS, production ran on fs2 throughout
    Fix Applied             mv FMW_Home to dated backup as applmgr — atomic rename, no root needed, completed in seconds. rm -rf was NOT used.
    1st Run Duration        ~1h 34m before fatal error (Apr 11 17:07 → 18:41)
    2nd Run Duration        5h 21m 25s — completed successfully (Apr 11 19:45 → Apr 12 00:49)
    Total Session Elapsed   7h 46m 33s (including failed run, diagnosis, fix, and retry)
    Final Status            adop exiting with status = 0 (Success) — all 4 app nodes completed ✅
    Prevention              Never start OHS as root; add post-patching ownership check to runbook
    Oracle Support          Search “ADOP fs_clone Failed to delete FMW_Home” on My Oracle Support

    Happy Debugging! All server-specific details have been anonymised. The diagnostic commands and fix are generic and applicable to any Oracle EBS 12.2.x environment. If this helped you, feel free to share with the community.

  • Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes

    Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.


    Observed Symptoms

    ORA-00031: session marked for kill
    Thread 1 cannot allocate new log
    Private strand flush not complete

    Phase 1: Interpreting ORA-00031 Correctly

    ORA-00031 is raised when a session terminated with ALTER SYSTEM KILL SESSION cannot be cleaned up immediately. Oracle marks the session for kill and PMON completes the cleanup asynchronously in the background. This is not an error — it is confirmation that the kill was accepted and the session is being terminated.
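
    A quick read-only confirmation is possible from v$session: sessions that raised ORA-00031 remain visible with STATUS = 'KILLED' until PMON finishes the cleanup. A sketch, run as SYSDBA:

    # List sessions still marked for kill and awaiting background cleanup
    sqlplus -s / as sysdba <<'SQL'
    SET LINESIZE 150 PAGESIZE 100
    SELECT sid, serial#, username, status, event
    FROM   v$session
    WHERE  status = 'KILLED';
    SQL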


    Phase 2: Identifying the True Performance Signal

    The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.
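
    The redo pressure can be corroborated read-only from cumulative wait statistics: the "log file switch" family of events in v$system_event shows how much time sessions have actually spent waiting on switches since instance startup. A sketch:

    # Cumulative waits on log file switch events (time_waited is in centiseconds)
    sqlplus -s / as sysdba <<'SQL'
    SET LINESIZE 150 PAGESIZE 100
    COLUMN event FORMAT A45
    SELECT event, total_waits, ROUND(time_waited/100) AS seconds_waited
    FROM   v$system_event
    WHERE  event LIKE 'log file switch%'
    ORDER  BY time_waited DESC;
    SQL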


    Phase 3: Evidence-Based Analysis (Read-Only)

    Redo switch frequency was analyzed to validate system behavior:

    SELECT
        TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
        COUNT(*) AS switches
    FROM v$log_history
    WHERE first_time > SYSDATE - 1
    GROUP BY TRUNC(first_time, 'HH24')
    ORDER BY 1;

    Findings

    Metric                Observation
    Average Switch Rate   5-7 per hour
    Peak Rate             8-10 per hour during business hours
    Off-Peak Rate         1-3 per hour

    A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.
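
    To support a future sizing request through change management, the current redo log configuration can also be captured read-only (a sketch using v$log):

    # Current redo log groups, sizes, and member counts
    sqlplus -s / as sysdba <<'SQL'
    SET LINESIZE 120 PAGESIZE 100
    SELECT group#, thread#, bytes/1024/1024 AS size_mb, members, status
    FROM   v$log
    ORDER  BY thread#, group#;
    SQL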


    Why No Changes Were Made

    In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.


    Recommendations

    • Continuously monitor redo switch frequency during peak windows
    • Use collected data to justify future redo log sizing via change management
    • Avoid unnecessary intervention when behavior is transient and non-impacting
    • Distinguish informational alert log messages from actionable errors

    Key Takeaways

    • ORA-00031 is expected and harmless — it confirms session termination
    • Redo allocation waits are transient under sustained load
    • Proper analysis prevents unnecessary production intervention
    • Not all alert log warnings indicate failure — some are early signals of workload growth
    • The goal is not to eliminate every alert, but to understand which ones matter

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn