
  • ORA-00257: Archiver Error in Oracle 19c CDB — Diagnosis and Resolution

    Background

    This post documents a real production incident where an Oracle E-Business Suite (EBS) environment became completely unresponsive due to a full Fast Recovery Area (FRA). Users were reporting that everything was either slow or hanging — a classic sign that something fundamental had broken at the database layer.

    Symptoms

    • EBS application users reporting sessions hanging or extremely slow response
    • sqlplus apps/<password> failing with:
    ORA-00257: Archiver error. Connect AS SYSDBA only until resolved.
    • Non-SYSDBA connections completely blocked
    • 30 FNDLIBR (Concurrent Manager) processes running on the application server — higher than expected for a QA environment

    Environment

    • Oracle Database 19c (CDB/PDB architecture)
    • PDB: Application database PDB
    • EBS 12.2 on AIX
    • FRA configured on ASM diskgroup (+RECO), size 4095 GB
    • No explicit log_archive_dest — archivelogs defaulting to FRA

    Step 1 — Identify the Root Cause

    Connected to the CDB as SYSDBA and confirmed the error:

    SHOW PARAMETER db_recovery_file_dest;
    NAME                        TYPE        VALUE
    --------------------------- ----------- ------
    db_recovery_file_dest       string      +RECO
    db_recovery_file_dest_size  big integer 4095G

    Checked FRA usage:

    COLUMN name           FORMAT A10
    COLUMN limit_gb       FORMAT 999,999.99 HEADING 'LIMIT GB'
    COLUMN used_gb        FORMAT 999,999.99 HEADING 'USED GB'
    COLUMN reclaimable_gb FORMAT 999,999.99 HEADING 'RECLAIMABLE GB'
    COLUMN number_of_files FORMAT 99999     HEADING 'FILES'
    
    SELECT name,
           ROUND(space_limit/1024/1024/1024,2)       limit_gb,
           ROUND(space_used/1024/1024/1024,2)        used_gb,
           ROUND(space_reclaimable/1024/1024/1024,2) reclaimable_gb,
           number_of_files
    FROM   v$recovery_file_dest;

    Output:

    NAME         LIMIT GB     USED GB  RECLAIMABLE GB  FILES
    ---------- ----------- ----------- -------------- ------
    +RECO         4,095.00    4,075.65            .00   2426

    FRA was at 99.5% — 4,075 GB used out of 4,095 GB, zero reclaimable, 2,426 archivelog files.

    This was the root cause. With the FRA full and nothing reclaimable, the archiver (ARCn) process could not write new archive logs, blocking all database activity.

    Step 2 — Assess Backup Status Before Taking Action

    Before deleting any archivelogs, it is critical to understand the backup posture. Blindly deleting archivelogs without knowing the backup state can leave the database unrecoverable.

    Started an RMAN session and ran a crosscheck first:

    rman target /
    RMAN> crosscheck archivelog all;

    All 2,416 objects returned “validation succeeded” — no expired logs. This meant the FRA was genuinely full of valid, undeleted archivelogs.

    Next, checked backup history:

    RMAN> list backup summary;

    Backups were listed daily going back several weeks — however, on closer inspection:

    SELECT session_key,
           TO_CHAR(start_time,'DD-MON-YY HH24:MI') start_time,
           TO_CHAR(end_time,'DD-MON-YY HH24:MI')   end_time,
           status,
           input_bytes,
           output_bytes
    FROM   v$rman_backup_job_details
    ORDER  BY start_time DESC
    FETCH FIRST 10 ROWS ONLY;

    Output revealed a critical finding:

    SESSION_KEY  START_TIME         END_TIME           STATUS    INPUT_BYTES  OUTPUT_BYTES
    -----------  -----------------  -----------------  --------  -----------  ------------
          19986  30-APR-26 01:00    30-APR-26 01:00    COMPLETED           0         98304
          19904  29-APR-26 01:00    29-APR-26 01:00    COMPLETED           0         98304
    • INPUT_BYTES = 0 — no datafiles were being backed up
    • OUTPUT_BYTES = 98304 (96 KB) — only a controlfile/SPFILE autobackup
    • Jobs completing in under a minute — impossible for a real full DB backup

    Additionally:

    RMAN> list backup of archivelog all;
    -- specification does not match any backup in the repository

    No archivelog backups existed at all. The scheduled backup job was not performing proper datafile or archivelog backups — a separate issue to be addressed post-incident.

    Step 3 — Escalation and Approval

    Given that there were no archivelog backups and the datafile backup was questionable, the decision to delete archivelogs was escalated rather than taken unilaterally. Approval was obtained with the instruction to start conservatively: delete older than 15 days first, then 7 days if needed.

    Step 4 — Resolution

    First attempt — delete older than 15 days:

    RMAN> delete noprompt archivelog until time 'SYSDATE-15';

    RMAN returned warnings:

    RMAN-08138: warning: archived log not deleted - must create more backups

    This is because the RMAN retention policy was blocking deletion of logs that had never been backed up. With Tier 3 approval, used the force option to override:

    RMAN> delete noprompt force archivelog until time 'SYSDATE-15';

    This ran successfully, deleting archivelogs from January through late April.

    FRA usage after deletion:

    USED_GB  RECLAIMABLE_GB  NUMBER_OF_FILES
    -------  --------------  ---------------
     246.71               0              149

    Over 3,800 GB freed — FRA dropped from 99.5% to ~6%.

    Step 5 — Verification

    Forced a log switch and verified archiver status:

    ALTER SYSTEM ARCHIVE LOG CURRENT;
    
    SELECT dest_id, status, error
    FROM   v$archive_dest
    WHERE  status != 'INACTIVE';

    Output:

    DEST_ID  STATUS    ERROR
    -------  --------  -----
          1  VALID

    Archiver resumed successfully. Application connectivity was restored and users confirmed sessions were working normally.

    Redo Log Status Check

    As part of the verification, also confirmed redo log health:

    SELECT thread#,
           sequence#,
           TO_CHAR(first_time,'DD-MON-YY HH24:MI:SS') first_time,
           first_change#,
           archived,
           status
    FROM   v$log
    ORDER  BY thread#, sequence#;

    Output:

    THREAD#  SEQUENCE#  FIRST_TIME                FIRST_CHANGE#  ARC  STATUS
    -------  ---------  ------------------------  -------------  ---  -------
          1       2417  05-MAY-26 01:56:44         1.7867E+10    NO   INACTIVE
          1       2418  05-MAY-26 03:27:20         1.7867E+10    NO   INACTIVE
          1       2419  05-MAY-26 07:57:10         1.7868E+10    NO   CURRENT

    No stuck or unarchived redo logs — database in healthy state.
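
    As an optional extra confirmation (not part of the original runbook), checking that fresh archivelogs are landing after the log switch proves the same thing from the v$archived_log side; a minimal sketch, assuming dest_id 1 is the FRA destination:

    sqlplus -s / as sysdba <<'EOF'
    -- Most recent archived sequence and when it completed (dest_id 1 assumed to be the FRA)
    SELECT MAX(sequence#) AS last_archived_seq,
           TO_CHAR(MAX(completion_time),'DD-MON-YY HH24:MI:SS') AS last_archive_time
    FROM   v$archived_log
    WHERE  dest_id = 1;
    EOF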

    Incident Summary

    Item              Detail
    ----------------  --------------------------------------------------------
    Error             ORA-00257: Archiver error
    Root Cause        FRA (+RECO ASM diskgroup) at 99.5% capacity
    FRA Before        4,075 GB used / 2,426 files
    FRA After         246 GB used / 149 files
    Action Taken      delete noprompt force archivelog until time 'SYSDATE-15'
    Space Freed       ~3,829 GB
    Resolution Time   ~45 minutes from identification to restoration

    Key Lessons Learned

    1. Monitor FRA Proactively

    Set up OEM 13c threshold alerts on FRA usage at 70% and 85%. Do not wait for ORA-00257 to discover the problem.

    SELECT ROUND(space_used/space_limit*100,2) pct_used
    FROM   v$recovery_file_dest;

    Alert when this exceeds 80%.
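
    Where an OEM threshold is not available, the same check can run from cron; a minimal sketch, assuming the oracle account's profile sets the database environment, with the threshold and mail recipient as placeholders:

    #!/bin/bash
    # Minimal cron-driven FRA usage check (threshold and recipient are placeholders)
    . /home/oracle/.bash_profile    # assumes ORACLE_SID/ORACLE_HOME are set in the profile
    THRESHOLD=80
    MAILTO="dba-team@example.com"

    PCT_USED=$(sqlplus -s / as sysdba <<'EOF'
    SET HEADING OFF FEEDBACK OFF PAGESIZE 0
    SELECT ROUND(space_used/space_limit*100) FROM v$recovery_file_dest;
    EOF
    )
    PCT_USED=${PCT_USED//[^0-9]/}   # keep digits only

    if [ "${PCT_USED:-0}" -ge "$THRESHOLD" ]; then
      echo "FRA usage is ${PCT_USED}% (threshold ${THRESHOLD}%)" \
        | mailx -s "WARNING: FRA usage at ${PCT_USED}%" "$MAILTO"
    fi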

    2. Always Check Backup Status Before Deleting Archivelogs

    In this incident, the backup job appeared to be running daily but was only backing up the controlfile (INPUT_BYTES=0). This is a serious gap — verify actual backup content, not just job status.

    3. Investigate the Backup Job

    A proper RMAN backup script for a 19c CDB should include:

    BACKUP DATABASE PLUS ARCHIVELOG DELETE INPUT;
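
    For context, the fuller shape of such a job might look like the sketch below; the channel count, FORMAT string, destination path, and the omission of a recovery catalog connection are illustrative assumptions rather than the site's actual script:

    #!/bin/bash
    # Illustrative weekly full backup wrapper for a 19c CDB (destination path is a placeholder)
    . /home/oracle/.bash_profile
    rman target / <<'EOF'
    CONFIGURE CONTROLFILE AUTOBACKUP ON;
    RUN {
      ALLOCATE CHANNEL c1 DEVICE TYPE DISK FORMAT '/backup/%d/rman/Full_%d_%U';
      ALLOCATE CHANNEL c2 DEVICE TYPE DISK FORMAT '/backup/%d/rman/Full_%d_%U';
      BACKUP DATABASE PLUS ARCHIVELOG DELETE INPUT;
    }
    DELETE NOPROMPT OBSOLETE;
    EOF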

    4. Consider an Archivelog Deletion Policy

    If archivelog backups are not being taken, configure an RMAN retention policy so that old backups and logs age out rather than accumulate in the FRA (a complementary archivelog deletion policy is sketched below):

    CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
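
    A related setting is RMAN's archivelog deletion policy, which determines when archived logs in the FRA become eligible for automatic deletion and therefore reclaimable under space pressure; a minimal sketch, assuming one disk backup is the requirement:

    # Sketch only: make FRA archivelogs reclaimable once backed up once to disk
    rman target / <<'EOF'
    CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO DEVICE TYPE DISK;
    SHOW ARCHIVELOG DELETION POLICY;
    EOF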

    5. FRA Sizing Review

    Even at 4095 GB, the FRA filled up due to months of uncleaned archivelogs. Review archivelog generation rate and size FRA accordingly, or implement regular cleanup.

    Conclusion

    ORA-00257 is one of those errors that brings an entire EBS environment to its knees instantly. The fix itself is straightforward — free up FRA space — but the investigation matters. Rushing to delete archivelogs without understanding the backup posture can result in an unrecoverable database.

    In this case, careful investigation revealed a deeper issue with the backup job that would have gone unnoticed had we not looked. Always verify, always escalate, and always get approval before deleting recovery-critical files.


    Syed Anwar Ahmed is an Oracle Apps DBA with over 11 years of production experience across Oracle EBS, Database, RAC, GoldenGate, and OEM environments. He writes about real-world Oracle incidents at syedanwarahmedoracle.blog.

  • P1 Incident: /dbbackup Filesystem 100% Full — How We Traced, Fixed, and Recovered Two Failed RMAN Backups in One Night

    🔴 Incident Overview

    Severity               P1 — Two production database backups failed
    Environment            Oracle 19c (19.30) on Linux x86-64
    Backup Tool            RMAN with Recovery Catalog
    Backup Volume          /dbbackup — 1TB LVM filesystem
    Databases Affected     DBPRO01 (12:30 failure), DBPRO02 (22:30 failure)
    Total Space Recovered  ~492G (disk went from 100% → 52%)

    1. The Alerts — Two Failures, Same Root Cause

    It started with an RMAN failure at 22:30. The backup script for DBPRO02 fired on schedule and died within 2 minutes. The RMAN log told the story clearly:

    RMAN-03009: failure of backup command on c4 channel at 22:32:13
    ORA-19502: write error on file "/dbbackup/DBPRC02/rman/DiffInc_DBPRC02_4u4mi8bf"
    ORA-27072: File I/O error
    Additional information: 4
    

    Three more channels followed — c1, c2, c3 — all crashing at exactly 22:32:48. When multiple channels fail simultaneously at the same timestamp, it almost always means one thing: the destination filesystem just hit 100%.

    A quick check confirmed it:

    $ df -hP /dbbackup
    Filesystem                        Size  Used Avail Use%
    /dev/mapper/orabkupvg-orabkuplv1 1023G 1020G  3.3G 100%
    

    1TB volume. 3.3G free. Completely full.

    What we did not know yet — digging into backup history would reveal that DBPRO01 had already failed at 12:30 that same day for the same reason, 10 hours earlier. Two databases unprotected on the same night.


    2. The Investigation — Folder by Folder

    The first step was understanding what was consuming the disk. One command gave us the top-level picture:

    $ du -sh /dbbackup/*
    744G    DBPRC01
    152G    ColdBackup_11April2026
     41G    DBPRC02
     26G    DBPRC03
     15G    DBPRO01
    7.7G    JAN2026_CPU
    7.7G    OCT2025_CPU
    5.3G    infra_arch
    

    744G inside DBPRC01 alone — 73% of the entire disk. That was our primary suspect.

    Drilling into DBPRC01

    $ du -sh /dbbackup/DBPRC01/rman/* | sort -rh | head -10
    7.5G    DiffInc_DBPRC01_fl4lumij
    7.5G    DiffInc_DBPRC01_eg4lrnh5
    7.5G    DiffInc_DBPRC01_b44lc7uh
    ...
    

    Every single file was a DiffInc_ or ArchivelogAll_ backup piece. No variety. No cleanup. Just backup after backup piling up.

    $ ls /dbbackup/DBPRC01/rman/ | wc -l
    1522
    

    1,522 backup pieces. We checked the oldest and newest:

    Oldest file on disk:  2022-05-07
    Newest file on disk:  2026-04-25
    

    Four years of backup files on disk — or so we thought.


    3. The RMAN Investigation — Where Things Got Interesting

    We connected RMAN to the database and ran the retention check:

    RMAN> SHOW RETENTION POLICY;
    CONFIGURE RETENTION POLICY TO REDUNDANCY 30;
    

    REDUNDANCY 30. This tells RMAN to keep the last 30 complete backup copies of every datafile before considering anything obsolete.
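
    A rough way to see how much REDUNDANCY 30 actually keeps is to count, per datafile, the level 0 backup records the control file is still tracking; a minimal sketch of that check:

    sqlplus -s / as sysdba <<'EOF'
    -- Level 0 backup records per datafile currently tracked in the control file
    SELECT file#, COUNT(*) AS level0_copies
    FROM   v$backup_datafile
    WHERE  incremental_level = 0
    AND    file# > 0              -- exclude control file entries (file# 0)
    GROUP  BY file#
    ORDER  BY file#;
    EOF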

    Next logical step — check what RMAN considers obsolete:

    RMAN> REPORT OBSOLETE;
    no obsolete backups found
    

    Nothing? With 1,522 files on disk?

    We ran CROSSCHECK BACKUP — all 1,693 objects came back AVAILABLE. Then we checked the actual date range RMAN was tracking from the database control file:

    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    OLDEST          NEWEST          TOTAL_PIECES
    03-DEC-2025     25-APR-2026     1541
    

    The control file only tracks pieces from December 2025 onwards — about 5 months. The 2022/2023 files seen on disk were old directories and scripts, not backup pieces. All 1,541 current pieces were legitimate and RMAN considered every one of them necessary under REDUNDANCY 30.

    This was the key insight: RMAN was not broken. The retention policy itself was the problem.


    4. Root Cause — The Architecture Trap

    The deeper investigation revealed something unexpected. Looking at the actual RMAN backup script:

    connect target rman/password@DBPRO01
    ...
    format '/dbbackup/DBPRC01/rman/DiffInc_%d_%u'
    (database);
    ...
    delete obsolete;
    

    DBPRO01 (the production database) was backing up INTO the DBPRC01 directory. The directory names suggested one database but contained another database’s backups entirely. The naming convention was PRO to PRC — production database backups stored in the production-copy directory.

    This pattern existed for all three database pairs on the server. Each production database backed up into its corresponding copy directory.

    The delete obsolete command was in the script — but with REDUNDANCY 30 and weekly Level 0 backups, obsolete only kicks in after 30 complete Level 0 cycles. That is 30 weeks, roughly seven months of retention. Since the current tracking window was only 5 months, delete obsolete ran every night and found absolutely nothing to delete.

    The math:

    Retention policy            REDUNDANCY 30
    Level 0 frequency           Weekly (Sundays)
    Effective retention period  ~30 weeks (about 7 months)
    Backup tracking since       December 2025 (~5 months)
    Result                      delete obsolete finds nothing — ever
    Daily backup size           ~7–7.5G per run
    Total accumulated           744G
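
    A quick sanity check here is to ask RMAN what would become obsolete under a hypothetical policy without changing the configured one; both commands below are previews only and delete nothing (the values 4 and 14 days are illustrative):

    rman target / <<'EOF'
    REPORT OBSOLETE REDUNDANCY 4;
    REPORT OBSOLETE RECOVERY WINDOW OF 14 DAYS;
    EOF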

    Adding fuel to the fire — the patching activity on April 18 triggered an extra Level 0 backup, followed by the regular Sunday Level 0 on April 19. Two large Level 0 runs (~27G each) within 24 hours wrote the final ~54G that pushed the disk over the edge.


    5. Secondary Findings During Investigation

    OCT2025 CPU Patch Artifacts (7.7G)

    The October 2025 CPU patch files (zip archives + extracted directories) were still sitting in /dbbackup/OCT2025_CPU/. A quick OPatch check confirmed the database had since been patched to 19.30 (January 2026 RU) — the October 2025 patches were fully superseded and rolled back from inventory. Safe to delete immediately.

    $ $ORACLE_HOME/OPatch/opatch lsinventory | grep -E "38291812|38194382"
    # Empty — neither Oct 2025 patch in inventory anymore
    

    5-Year-Old Pre-Migration Export Dumps

    Three directories contained Oracle 11.2.0.4 export dumps from January–March 2021 — taken before the migration to 19c. With the database now running 19.30, these had zero recovery value but occupied ~14G collectively. Flagged for manager approval before deletion.

    Recovery Catalog Version Mismatch

    The original RMAN log flagged this warning:

    PL/SQL package RMAN.DBMS_RCVCAT version 19.11.00.00 in RCVCAT database is not current
    PL/SQL package RMAN.DBMS_RCVMAN version 19.11.00.00 in RCVCAT database is not current
    

    The recovery catalog is running 19.11 packages while the RMAN client is now 19.30. Non-critical tonight but requires UPGRADE CATALOG in the next maintenance window.


    6. The Fix — Emergency Space Recovery

    With management approval obtained, we executed a time-based delete — keeping the last 30 days of backups and removing everything older:

    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    

    This command does three things atomically:

    1. Queries catalog/controlfile for all pieces completed before the cutoff date
    2. Deletes the physical files from disk
    3. Removes the records from RMAN catalog — no orphaned entries, no catalog drift

    The output scrolled for several minutes:

    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/DiffInc1_DBPRC01_6u4jaaih ...
    deleted backup piece
    backup piece handle=/dbbackup/DBPRC01/rman/ArchivelogAll_DBPRC01_784jab6o ...
    ...
    Deleted 1072 objects
    

    1,072 backup pieces deleted. Catalog updated. Disk checked:

    BEFORE:  Used 1020G  Avail 3.3G  (100%)
    AFTER:   Used  536G  Avail 488G   (53%)
    

    Then the OCT2025_CPU directory was removed:

    $ rm -rf /dbbackup/OCT2025_CPU/
    $ df -hP /dbbackup
    Used 528G  Avail 495G  (52%)
    

    Final result: 495G free. Disk at 52%.

    Both failed backups were re-submitted immediately and ran successfully in parallel:

    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO01.sh &
    $ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO02.sh &
    
    $ jobs -l
    [1] Running   nohup sh ...rman_backup_DBPRO01.sh &
    [2] Running   nohup sh ...rman_backup_DBPRO02.sh &
    

    7. Incident Timeline

    12:30 DBPRO01 Level 0 backup fails — ORA-19502/ORA-27072 (disk full)
    22:30 DBPRO02 Level 0 backup fails — same errors, all 4 channels
    23:08 Investigation begins — df -hP /dbbackup confirms 100% full
    23:15 DBPRC01 directory identified as 744G consumer
    23:25 RMAN connected — REDUNDANCY 30 discovered
    23:35 Architecture confirmed — PRO databases backing up into PRC directories
    23:45 Root cause confirmed — ~7-month effective retention, delete obsolete finds nothing
    23:50 DELETE BACKUP COMPLETED BEFORE SYSDATE-30 executed
    23:51 1,072 pieces deleted — disk drops to 53%
    23:55 OCT2025_CPU removed — disk at 52%, 495G free
    00:00 Both backup jobs re-submitted and running successfully

    8. Permanent Fix Recommendations

    Fix 1 — Change Retention Policy to RECOVERY WINDOW

    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    

    REDUNDANCY 30 with weekly Level 0s means roughly seven months of retention — far beyond what any production SLA requires. A 14-day recovery window keeps two weeks of backups regardless of backup frequency, and delete obsolete will actually find and remove old pieces going forward.
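
    Once the SLA owners agree the window, the switch and the first cleanup under the new rule fit in one short session; a sketch, assuming 14 days is the agreed figure:

    # Sketch only: apply the new policy, preview, then clean up under it
    rman target / <<'EOF'
    CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    REPORT OBSOLETE;
    DELETE NOPROMPT OBSOLETE;
    EOF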

    Fix 2 — Add Pre-Backup Space Check to Cron Script

    #!/bin/bash
    # Abort before writing anything if the backup filesystem is low on space
    BACKUP_FS="/dbbackup"
    THRESHOLD=20
    MAILTO="dba-team@example.com"   # DBA distribution list (site-specific placeholder)
    
    AVAIL_PCT=$(df -hP $BACKUP_FS | awk 'NR==2 {gsub(/%/,""); print 100-$5}')
    
    if [ "$AVAIL_PCT" -lt "$THRESHOLD" ]; then
      echo "ABORT: $BACKUP_FS is ${AVAIL_PCT}% free — below ${THRESHOLD}% threshold" \
        | mailx -s "BACKUP ABORTED: Low space on $BACKUP_FS" $MAILTO
      exit 1
    fi
    
    # ...normal RMAN backup invocation continues below this check...
    

    A failing backup that writes 3G before dying is worse than a backup that never starts — it wastes the last 3G of free space and leaves partial pieces on disk.

    Fix 3 — Upgrade the Recovery Catalog

    RMAN> CONNECT TARGET /
    RMAN> CONNECT CATALOG rman/password@rmancat
    RMAN> UPGRADE CATALOG;
    RMAN> UPGRADE CATALOG;   -- run twice as prompted
    

    The catalog schema (19.11) is several Release Updates behind the RMAN client (19.30). Some catalog-dependent operations may start failing if left unaddressed.

    Fix 4 — Filesystem Monitoring Alert

    The FRA check scripts already email on FRA usage above 80%. The same pattern should exist for /dbbackup. A simple cron entry checking disk usage every hour with alert at 80% would have caught this days before the disk hit 100%.
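
    A minimal sketch of such a check; the filesystem, threshold, and mail recipient are placeholders to adapt:

    #!/bin/bash
    # Hypothetical hourly /dbbackup usage check (values are placeholders)
    FS="/dbbackup"
    WARN=80
    MAILTO="dba-team@example.com"

    USE_PCT=$(df -hP "$FS" | awk 'NR==2 {gsub(/%/,""); print $5}')

    if [ "${USE_PCT:-0}" -ge "$WARN" ]; then
      echo "$FS is ${USE_PCT}% used (threshold ${WARN}%)" \
        | mailx -s "WARNING: $FS usage at ${USE_PCT}%" "$MAILTO"
    fi

    Scheduled hourly from cron (for example 0 * * * *), a check like this would have flagged the growth days before either backup failed.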


    9. Key Takeaways for Oracle DBAs

    REDUNDANCY N is not always safer than RECOVERY WINDOW. REDUNDANCY 30 with weekly Level 0 backups means roughly seven months of retention — likely far beyond your RPO requirement and a silent space accumulator.

    • Always verify what delete obsolete actually deletes. If it finds nothing to delete every single night, that is a warning sign — not reassurance.
    • Check backup naming conventions carefully. When a directory named DBPRC01 contains DBPRO01 backups, it is easy to reason about the wrong database; the RMAN configuration of the database that wrote the pieces, not the directory name, is what controls the cleanup behavior.
    • Patching days generate oversized backups. A Level 0 taken manually on patch day plus the regular Sunday Level 0 the next day equals 2x the normal space consumption in 24 hours. Ensure extra headroom exists going into patch windows.
    • Use DELETE BACKUP COMPLETED BEFORE SYSDATE-N for emergency cleanup — not OS-level rm. RMAN deletes atomically update both the physical files and the catalog, preventing expired/orphaned piece confusion later.
    • Never use rm on RMAN backup pieces directly unless you follow up with CROSSCHECK BACKUP and DELETE EXPIRED BACKUP to sync the catalog.

    10. Commands Reference — Quick Cheat Sheet

    -- Check retention policy
    RMAN> SHOW RETENTION POLICY;
    
    -- Preview what would be deleted (dry run)
    RMAN> REPORT OBSOLETE;
    RMAN> LIST BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Emergency cleanup — delete pieces older than 30 days
    RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
    
    -- Standard cleanup based on retention policy
    RMAN> DELETE NOPROMPT OBSOLETE;
    
    -- Sync catalog after any OS-level file operations
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE NOPROMPT EXPIRED BACKUP;
    
    -- Change to time-based retention (recommended)
    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
    
    -- Check backup piece date range in control file
    SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
           TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
           COUNT(*) total_pieces
    FROM v$backup_piece_details
    WHERE status = 'A';
    
    -- Check backup history
    SELECT session_key, input_type, status,
           TO_CHAR(start_time,'YYYY-MM-DD HH24:MI:SS') start_time,
           output_bytes_display, time_taken_display
    FROM v$rman_backup_job_details
    ORDER BY start_time DESC;
    

    Conclusion

    What appeared to be a simple disk full incident turned out to involve a multi-database backup architecture, a misconfigured retention policy, and a cleanup mechanism that was technically running correctly but never finding anything to clean. The fix itself — one RMAN command — took under 5 minutes. The real work was the systematic investigation to understand exactly what was safe to delete and why.

    That is Oracle DBA work in a nutshell: the fix is often simple; understanding why it is safe to run is the real job.


    If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More incident walkthroughs at syedanwarahmedoracle.blog.

  • Stuck Archiver on Oracle 11g RAC: How a Full FRA Brought Down Archiving — and How We Fixed It Live

    Introduction

    One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.

    This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.

    Environment Overview

    Component                Details
    -----------------------  -------------------------------------
    Database                 Oracle 11g R2 (11.2.0.4.0)
    Architecture             RAC — 2 Nodes
    Database Role            PRIMARY — READ WRITE, ARCHIVELOG mode
    OS                       Linux (x86_64)
    FRA Location             ASM Diskgroup +ARCHIVELOG
    FRA Size (pre-incident)  199 GB
    FRA Size (post-fix)      400 GB

    The Incident: What We Saw

    The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:

    ORA-00257: archiver error. Connect internal only, until freed.
    ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
    ARCH: Archival stopped, error occurred. Will continue retrying
    ORACLE Instance PRODDB1 - Archival Error

    The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.

    Step 1: Check Redo Log Status

    First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.

    SELECT group#, status, archived, members FROM v$log ORDER BY group#;
    
    GROUP#   STATUS           ARCHIVED   MEMBERS
    ------   ---------------  ---------  -------
    1        ACTIVE           NO         2
    2        CURRENT          NO         2
    3        INACTIVE         YES        2
    4        INACTIVE         YES        2

    Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.

    Step 2: Confirm the Archiver Status

    -- Check both RAC instances
    SELECT inst_id, instance_name, host_name, version,
           database_role, open_mode, log_mode
    FROM gv$instance ORDER BY inst_id;
    
    INST_ID  INSTANCE_N  VERSION       DATABASE_ROLE  OPEN_MODE   LOG_MODE
    -------  ----------  ------------  -------------  ----------  ----------
          1  PRODDB1     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
          2  PRODDB2     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
    
    SELECT instance_name, status, archiver FROM v$instance;
    
    INSTANCE_NAME    STATUS       ARCHIVER
    ---------------- ------------ --------
    PRODDB1          OPEN         STOPPED

    ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.

    Step 3: Check FRA Usage

    SELECT
        space_limit / (1024*1024*1024)       AS limit_gb,
        space_used  / (1024*1024*1024)       AS used_gb,
        space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
        ROUND((space_used / space_limit) * 100, 2) AS pct_used
    FROM v$recovery_file_dest;
    
    LIMIT_GB   USED_GB   RECLAIMABLE_GB   PCT_USED
    --------   -------   --------------   --------
    199        198.99     0.00            100.00

    100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.

    Step 4: Identify What Was Consuming the FRA

    SELECT u.file_type,
        -- v$recovery_area_usage reports percentages, so convert to GB via the FRA limit
        ROUND(u.percent_space_used * d.space_limit / 100 / (1024*1024*1024), 2)        AS used_gb,
        ROUND(u.percent_space_reclaimable * d.space_limit / 100 / (1024*1024*1024), 2) AS reclaimable_gb,
        u.number_of_files
    FROM v$recovery_area_usage u, v$recovery_file_dest d
    ORDER BY used_gb DESC;
    
    FILE_TYPE       USED_GB   RECLAIMABLE_GB   NUMBER_OF_FILES
    -----------     -------   --------------   ---------------
    ARCHIVED LOG    38.50     0.05             1,200+
    BACKUP PIECE    10.20     0.07             45
    FLASHBACK LOG    1.30     0.00             12

    Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.

    Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike

    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM v$archived_log
    WHERE dest_id=1
    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
    ORDER BY 1;

    Summarised output:

    HOUR                 ARCHIVED LOGS  GB GENERATED
    -------------------  -------------  ------------
    Daytime (typical)    2–26           < 1.3 GB
    01:00 AM (nightly)   440–446        ~31 GB ⚠
    03:00–04:00 AM       191–411        11–28 GB ⚠
    05:00–06:00 AM       537–796        34–50 GB ⚠

    Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.
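
    To turn those hourly spikes into a sizing figure, the same view can be rolled up per day over a recent window; a minimal sketch (the 14-day horizon and dest_id 1 are assumptions):

    sqlplus -s / as sysdba <<'EOF'
    -- Total redo archived per day over the last two weeks
    SELECT TO_CHAR(first_time,'YYYY-MM-DD') AS day,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM   v$archived_log
    WHERE  dest_id = 1
    AND    first_time > SYSDATE - 14
    GROUP  BY TO_CHAR(first_time,'YYYY-MM-DD')
    ORDER  BY 1;
    EOF

    Multiplying the worst day by the number of days of logs the FRA must hold, plus backups and flashback logs, gives a defensible db_recovery_file_dest_size rather than a guess.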

    The Resolution — Three Actions

    Action 1: Manually Delete Archive Logs from ASM

    Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:

    su - grid
    asmcmd
    
    ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
    # Recent logs retained to preserve recoverability

    After ASMCMD deletion, the RMAN catalog was resynced:

    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> DELETE EXPIRED ARCHIVELOG ALL;

    Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.

    Action 2: Increase FRA Size Online

    -- Increase FRA from 199GB to 400GB (online, no restart required)
    ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';

    SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:

    ARC1: Archival started
    ARC2: Archival started
    ARC3: Archival started
    ORACLE Instance PRODDB1 - Archival Restarted

    Applications reconnected and transactions resumed within seconds.

    Action 3: Verify Database and Cluster Status

    SELECT instance_name, archiver FROM v$instance;
    -- PRODDB1   STARTED
    
    SELECT round(space_used/1024/1024/1024,2) used_gb,
           round(space_limit/1024/1024/1024,2) limit_gb,
           round((space_used/space_limit)*100,2) used_pct
    FROM v$recovery_file_dest;
    -- 202.80 GB / 400 GB / 50.70%
    
    SELECT group#, status, archived FROM v$log ORDER BY group#;
    -- All groups INACTIVE/YES except CURRENT group

    Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.

    Root Cause Confirmed: Archive Log Backups Were Not Running

    With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.

    Complete Incident Chain:
    ARCH backup silently fails (missing Oracle env vars in cron) → 3,094 archived logs accumulate over multiple nights → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM) → FRA hits 100% (198.99 GB / 199 GB) → ARCn cannot write → Both instances enter INTERMEDIATE state → ORA-00257 → Applications hang → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB → Archiver restarts → Cron fix applied → ARCH backup reruns successfully → FRA space recovered → Cluster fully healthy

    Fix: Correct the Cron Environment

    Broken cron entry:

    05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    Fixed cron entry:

    05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    ARCH Backup Reruns and Completes

    After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:

    . /home/oracle/.bash_profile
    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH

    The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.

    Preventive Measures Put in Place

    1. Fix the cron environment — all RMAN jobs now source ~/.bash_profile before executing.
    2. RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status (a minimal sketch follows this list).
    3. FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted.
    4. FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
    5. RAC: always use SID='*' — and always verify ASM diskgroup free space before increasing the FRA limit.
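
    A minimal sketch of the exit-code alerting from point 2; the recipient address is a placeholder, and the log grep is an extra safety net on top of the exit status:

    #!/bin/bash
    # Hypothetical wrapper around the existing ARCH backup script
    . /home/oracle/.bash_profile
    LOG=/tmp/RMAN_ARCH.log

    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > "$LOG" 2>&1
    RC=$?

    # Alert on a non-zero exit status or on RMAN-/ORA- errors in the log
    if [ "$RC" -ne 0 ] || grep -Eq "RMAN-|ORA-" "$LOG"; then
      mailx -s "RMAN ARCH backup FAILED (rc=$RC) on $(hostname)" dba-team@example.com < "$LOG"
    fi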

    Key Takeaways for Oracle DBAs

    • ORA-00257 always points to the FRA being full — check v$recovery_file_dest and v$recovery_area_usage immediately.
    • When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run CROSSCHECK and DELETE EXPIRED to resync the catalog.
    • db_recovery_file_dest_size is dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.
    • Use SID='*' in RAC — applies the parameter change to all instances simultaneously.
    • The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
    • Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
    • Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
    • After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
    • Analyse redo generation patterns — v$archived_log reveals batch spikes that should directly inform FRA sizing decisions.

    Conclusion

    A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.

    The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.

    The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.

    Syed Anwar Ahmed | Oracle Apps DBA