Tag: archive_error

  • Stuck Archiver on Oracle 11g RAC: How a Full FRA Brought Down Archiving — and How We Fixed It Live

    Introduction

    One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.

    This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.

    Environment Overview

    Component                 Details
    ---------                 -------
    Database                  Oracle 11g R2 (11.2.0.4.0)
    Architecture              RAC — 2 Nodes
    Database Role             PRIMARY — READ WRITE, ARCHIVELOG mode
    OS                        Linux (x86_64)
    FRA Location              ASM Diskgroup +ARCHIVELOG
    FRA Size (pre-incident)   199 GB
    FRA Size (post-fix)       400 GB

    The Incident: What We Saw

    The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:

    ORA-00257: archiver error. Connect internal only, until freed.
    ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
    ARCH: Archival stopped, error occurred. Will continue retrying
    ORACLE Instance PRODDB1 - Archival Error

    The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.

    Step 1: Check Redo Log Status

    First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.

    SELECT group#, status, archived, members FROM v$log ORDER BY group#;
    
    GROUP#   STATUS           ARCHIVED   MEMBERS
    ------   ---------------  ---------  -------
    1        ACTIVE           NO         2
    2        CURRENT          NO         2
    3        INACTIVE         YES        2
    4        INACTIVE         YES        2

    Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.

    Step 2: Confirm the Archiver Status

    -- Check both RAC instances
    SELECT inst_id, instance_name, host_name, version,
           database_role, open_mode, log_mode
    FROM gv$instance ORDER BY inst_id;
    
    INST_ID  INSTANCE_N  VERSION       DATABASE_ROLE  OPEN_MODE   LOG_MODE
    -------  ----------  ------------  -------------  ----------  ----------
          1  PRODDB1     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
          2  PRODDB2     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
    
    SELECT instance_name, status, archiver FROM v$instance;
    
    INSTANCE_NAME    STATUS       ARCHIVER
    ---------------- ------------ --------
    PRODDB1          OPEN         STOPPED

    ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.

    Step 3: Check FRA Usage

    SELECT
        space_limit / (1024*1024*1024)       AS limit_gb,
        space_used  / (1024*1024*1024)       AS used_gb,
        space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
        ROUND((space_used / space_limit) * 100, 2) AS pct_used
    FROM v$recovery_file_dest;
    
    LIMIT_GB   USED_GB   RECLAIMABLE_GB   PCT_USED
    --------   -------   --------------   --------
    199        198.99     0.00            100.00

    100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.

    Step 4: Identify What Was Consuming the FRA

    SELECT file_type,
        ROUND(space_used / (1024*1024*1024), 2)        AS used_gb,
        ROUND(space_reclaimable / (1024*1024*1024), 2) AS reclaimable_gb,
        number_of_files
    FROM v$recovery_area_usage
    ORDER BY space_used DESC;
    
    FILE_TYPE       USED_GB   RECLAIMABLE_GB   NUMBER_OF_FILES
    -----------     -------   --------------   ---------------
    ARCHIVED LOG    38.50     0.05             1,200+
    BACKUP PIECE    10.20     0.07             45
    FLASHBACK LOG    1.30     0.00             12

    Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.

    Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike

    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM v$archived_log
    WHERE dest_id=1
    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
    ORDER BY 1;

    HOUR (PATTERN)         ARCHIVED LOGS   GB GENERATED
    -------------------    -------------   ------------
    Daytime (typical)      2–26            < 1.3 GB
    01:00 AM (nightly)     440–446         ~31 GB ⚠
    03:00–04:00 AM         191–411         11–28 GB ⚠
    05:00–06:00 AM         537–796         34–50 GB ⚠

    Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.

    The Resolution — Three Actions

    Action 1: Manually Delete Archive Logs from ASM

    Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:

    su - grid
    asmcmd
    
    ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
    # Recent logs retained to preserve recoverability

    After ASMCMD deletion, the RMAN catalog was resynced:

    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> DELETE EXPIRED ARCHIVELOG ALL;

    Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.

    Action 2: Increase FRA Size Online

    -- Increase FRA from 199GB to 400GB (online, no restart required)
    ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';

    SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:

    ARC1: Archival started
    ARC2: Archival started
    ARC3: Archival started
    ORACLE Instance PRODDB1 - Archival Restarted

    Applications reconnected and transactions resumed within seconds.

    Action 3: Verify Database and Cluster Status

    SELECT instance_name, archiver FROM v$instance;
    -- PRODDB1   STARTED
    
    SELECT round(space_used/1024/1024/1024,2) used_gb,
           round(space_limit/1024/1024/1024,2) limit_gb,
           round((space_used/space_limit)*100,2) used_pct
    FROM v$recovery_file_dest;
    -- 202.80 GB / 400 GB / 50.70%
    
    SELECT group#, status, archived FROM v$log ORDER BY group#;
    -- All groups INACTIVE/YES except CURRENT group

    Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.
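
    This post-incident cluster check can be scripted so it does not rely on eyeballing crsctl output. A minimal sketch — the helper name and the ora.gsd exclusion are illustrative, not the team's actual tooling:

```shell
#!/bin/bash
# check_crs_output: reads `crsctl status res -t` output on stdin and reports
# any OFFLINE resources other than ora.gsd (a legacy component, expected
# OFFLINE on 11g). Returns 0 when the cluster looks healthy.
check_crs_output() {
  bad=$(grep -i 'OFFLINE' | grep -v 'ora\.gsd')
  if [ -n "$bad" ]; then
    echo "WARNING: non-ONLINE resources:" >&2
    echo "$bad" >&2
    return 1
  fi
  echo "All monitored resources ONLINE (ora.gsd OFFLINE ignored)"
}

# Typical use on a cluster node (assumes the grid environment is set):
#   crsctl status res -t | check_crs_output
```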

    Root Cause Confirmed: Archive Log Backups Were Not Running

    With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.
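
    A cheap guard against this failure mode is to have the backup script verify its own environment before ever calling RMAN, so a missing profile fails loudly instead of silently. A minimal sketch — the function name and messages are hypothetical:

```shell
#!/bin/bash
# require_oracle_env: abort early, with a visible error, when the Oracle
# environment is missing -- exactly what happens when cron runs the script
# without sourcing the profile.
require_oracle_env() {
  missing=""
  [ -z "$ORACLE_SID" ]  && missing="$missing ORACLE_SID"
  [ -z "$ORACLE_HOME" ] && missing="$missing ORACLE_HOME"
  if [ -n "$missing" ]; then
    echo "FATAL: missing Oracle env vars:$missing -- aborting RMAN backup" >&2
    return 1
  fi
  return 0
}

# Placed at the top of a backup script such as RMAN_Hotbackup.ksh:
#   require_oracle_env || exit 1
```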

    Complete Incident Chain:
    ARCH backup silently fails (missing Oracle env vars in cron)
      → 3,094 archived logs accumulate over multiple nights
      → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM)
      → FRA hits 100% (198.99 GB / 199 GB)
      → ARCn cannot write → Both instances enter INTERMEDIATE state
      → ORA-00257 → Applications hang
      → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB
      → Archiver restarts → Cron fix applied → ARCH backup reruns successfully
      → FRA space recovered → Cluster fully healthy

    Fix: Correct the Cron Environment

    Broken cron entry:

    05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    Fixed cron entry:

    05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1
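
    Beyond sourcing the profile, the cron wrapper can also surface failures instead of swallowing them. A minimal sketch of that pattern — the alert here is a placeholder echo; a real job would mail the DBA team via mailx or similar:

```shell
#!/bin/bash
# run_with_alert: run a command and raise a visible alert on any non-zero
# exit, instead of letting the failure disappear into an empty log file.
run_with_alert() {
  "$@"
  rc=$?
  if [ $rc -ne 0 ]; then
    # In a real wrapper this would be something like:
    #   mailx -s "RMAN ARCH backup FAILED (rc=$rc)" dba-team@example.com < "$LOG"
    echo "ALERT: command failed with rc=$rc" >&2
  fi
  return $rc
}

# Cron would then call, after sourcing the profile:
#   . /home/oracle/.bash_profile
#   run_with_alert /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1
```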

    ARCH Backup Reruns and Completes

    After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:

    . /home/oracle/.bash_profile
    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH

    The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.

    Preventive Measures Put in Place

    1. Fix the cron environment — all RMAN jobs now source ~/.bash_profile before executing.
    2. RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status.
    3. FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted.
    4. FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
    5. RAC: always use SID='*' — and always verify ASM diskgroup free space before increasing the FRA limit.
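
    Preventive measure 3 can be sketched as a small shell check. The threshold comparison below is runnable as-is; the sqlplus fetch shown in the comment is an assumed shape, not the team's actual monitoring script:

```shell
#!/bin/bash
# check_fra_pct: compare FRA percent-used against an alert threshold.
# The pct value would come from v$recovery_file_dest via sqlplus.
FRA_THRESHOLD=70

check_fra_pct() {
  pct=$1
  # Compare as integers: strip any decimal part from the sqlplus output
  if [ "${pct%.*}" -ge "$FRA_THRESHOLD" ]; then
    echo "ALERT: FRA usage at ${pct}% (threshold ${FRA_THRESHOLD}%)"
    return 1
  fi
  echo "OK: FRA usage at ${pct}%"
}

# In the real cron job, pct would be fetched roughly like:
#   pct=$(sqlplus -s / as sysdba <<'EOF'
#   SET HEADING OFF FEEDBACK OFF
#   SELECT ROUND(space_used/space_limit*100) FROM v$recovery_file_dest;
#   EOF
#   )
#   check_fra_pct "$pct" || mailx -s "FRA alert" dba-team@example.com
```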

    Key Takeaways for Oracle DBAs

    • ORA-00257 always points to the FRA being full — check v$recovery_file_dest and v$recovery_area_usage immediately.
    • When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run CROSSCHECK and DELETE EXPIRED to resync the catalog.
    • db_recovery_file_dest_size is dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.
    • Use SID='*' in RAC — applies the parameter change to all instances simultaneously.
    • The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
    • Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
    • Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
    • After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
    • Analyse redo generation patterns — v$archived_log reveals batch spikes that should directly inform FRA sizing decisions.

    Conclusion

    A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.

    The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.

    The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.

    Syed Anwar Ahmed | Oracle Apps DBA