Tag: fra

  • ORA-00257: Archiver Error in Oracle 19c CDB — Diagnosis and Resolution

    Background

    This post documents a real production incident where an Oracle E-Business Suite (EBS) environment became completely unresponsive due to a full Fast Recovery Area (FRA). Users were reporting that everything was either slow or hanging — a classic sign that something fundamental had broken at the database layer.

    Symptoms

    • EBS application users reporting sessions hanging or extremely slow response
    • sqlplus apps/<password> failing with:
    ORA-00257: Archiver error. Connect AS SYSDBA only until resolved.
    • Non-SYSDBA connections completely blocked
    • 30 FNDLIBR (Concurrent Manager) processes running on the application server — higher than expected for a QA environment

    Environment

    • Oracle Database 19c (CDB/PDB architecture)
    • PDB: Application database PDB
    • EBS 12.2 on AIX
    • FRA configured on ASM diskgroup (+RECO), size 4095 GB
    • No explicit log_archive_dest — archivelogs defaulting to FRA

    Step 1 — Identify the Root Cause

    Connected to the CDB as SYSDBA and confirmed the error:

    SHOW PARAMETER db_recovery_file_dest;
    NAME                        TYPE        VALUE
    --------------------------- ----------- ------
    db_recovery_file_dest       string      +RECO
    db_recovery_file_dest_size  big integer 4095G

    Checked FRA usage:

    COLUMN name           FORMAT A10
    COLUMN limit_gb       FORMAT 999,999.99 HEADING 'LIMIT GB'
    COLUMN used_gb        FORMAT 999,999.99 HEADING 'USED GB'
    COLUMN reclaimable_gb FORMAT 999,999.99 HEADING 'RECLAIMABLE GB'
    COLUMN number_of_files FORMAT 99999     HEADING 'FILES'
    
    SELECT name,
           ROUND(space_limit/1024/1024/1024,2)       limit_gb,
           ROUND(space_used/1024/1024/1024,2)        used_gb,
           ROUND(space_reclaimable/1024/1024/1024,2) reclaimable_gb,
           number_of_files
    FROM   v$recovery_file_dest;

    Output:

    NAME         LIMIT GB     USED GB  RECLAIMABLE GB  FILES
    ---------- ----------- ----------- -------------- ------
    +RECO         4,095.00    4,075.65            .00   2426

    FRA was at 99.5% — 4,075 GB used out of 4,095 GB, zero reclaimable, 2,426 archivelog files.

    This was the root cause. With the FRA full and nothing reclaimable, the archiver (ARCn) process could not write new archive logs, blocking all database activity.

    Step 2 — Assess Backup Status Before Taking Action

    Before deleting any archivelogs, it is critical to understand the backup posture. Blindly deleting archivelogs without knowing the backup state can leave the database unrecoverable.

    Started an RMAN session and ran a crosscheck first:

    rman target /
    RMAN> crosscheck archivelog all;

    All 2,416 objects returned “validation succeeded” — no expired logs. This meant the FRA was genuinely full of valid, undeleted archivelogs.

    Next, checked backup history:

    RMAN> list backup summary;

    Backups were listed daily going back several weeks — however, on closer inspection:

    SELECT session_key,
           TO_CHAR(start_time,'DD-MON-YY HH24:MI') start_time,
           TO_CHAR(end_time,'DD-MON-YY HH24:MI')   end_time,
           status,
           input_bytes,
           output_bytes
    FROM   v$rman_backup_job_details
    ORDER  BY start_time DESC
    FETCH FIRST 10 ROWS ONLY;

    Output revealed a critical finding:

    SESSION_KEY  START_TIME         END_TIME           STATUS    INPUT_BYTES  OUTPUT_BYTES
    -----------  -----------------  -----------------  --------  -----------  ------------
          19986  30-APR-26 01:00    30-APR-26 01:00    COMPLETED           0         98304
          19904  29-APR-26 01:00    29-APR-26 01:00    COMPLETED           0         98304
    • INPUT_BYTES = 0 — no datafiles were being backed up
    • OUTPUT_BYTES = 98304 (96 KB) — only a controlfile/SPFILE autobackup
    • Jobs completing in under a minute — impossible for a real full DB backup

    Additionally:

    RMAN> list backup of archivelog all;
    -- specification does not match any backup in the repository

    No archivelog backups existed at all. The scheduled backup job was not performing proper datafile or archivelog backups — a separate issue to be addressed post-incident.

    Step 3 — Escalation and Approval

    Given that there were no archivelog backups and the datafile backup was questionable, the decision to delete archivelogs and  Approval was obtained with the instruction to start conservatively — delete older than 15 days first, then 7 days if needed.

    Step 4 — Resolution

    First attempt — delete older than 15 days:

    RMAN> delete noprompt archivelog until time 'SYSDATE-15';

    RMAN returned warnings:

    RMAN-08138: warning: archived log not deleted - must create more backups

    This is because the RMAN retention policy was blocking deletion of logs that had never been backed up. With Tier 3 approval, used the force option to override:

    RMAN> delete noprompt force archivelog until time 'SYSDATE-15';

    This ran successfully, deleting archivelogs from January through late April.

    FRA usage after deletion:

    USED_GB  RECLAIMABLE_GB  NUMBER_OF_FILES
    -------  --------------  ---------------
     246.71               0              149

    Over 3,800 GB freed — FRA dropped from 99.5% to ~6%.

    Step 5 — Verification

    Forced a log switch and verified archiver status:

    ALTER SYSTEM ARCHIVE LOG CURRENT;
    
    SELECT dest_id, status, error
    FROM   v$archive_dest
    WHERE  status != 'INACTIVE';

    Output:

    DEST_ID  STATUS    ERROR
    -------  --------  -----
          1  VALID

    Archiver resumed successfully. Application connectivity was restored and users confirmed sessions were working normally.

    Redo Log Status Check

    As part of the verification, also confirmed redo log health:

    SELECT thread#,
           sequence#,
           TO_CHAR(first_time,'DD-MON-YY HH24:MI:SS') first_time,
           first_change#,
           archived,
           status
    FROM   v$log
    ORDER  BY thread#, sequence#;
    THREAD#  SEQUENCE#  FIRST_TIME                FIRST_CHANGE#  ARC  STATUS
    -------  ---------  ------------------------  -------------  ---  -------
          1       2417  05-MAY-26 01:56:44         1.7867E+10    NO   INACTIVE
          1       2418  05-MAY-26 03:27:20         1.7867E+10    NO   INACTIVE
          1       2419  05-MAY-26 07:57:10         1.7868E+10    NO   CURRENT

    No stuck or unarchived redo logs — database in healthy state.

    Incident Summary

    Item Detail
    Error ORA-00257: Archiver error
    Root Cause FRA (+RECO ASM diskgroup) at 99.5% capacity
    FRA Before 4,075 GB used / 2,426 files
    FRA After 246 GB used / 149 files
    Action Taken delete noprompt force archivelog until time 'SYSDATE-15'
    Space Freed ~3,829 GB
    Resolution Time ~45 minutes from identification to restoration

    Key Lessons Learned

    1. Monitor FRA Proactively

    Set up OEM 13c threshold alerts on FRA usage at 70% and 85%. Do not wait for ORA-00257 to discover the problem.

    SELECT ROUND(space_used/space_limit*100,2) pct_used
    FROM   v$recovery_file_dest;

    Alert when this exceeds 80%.

    2. Always Check Backup Status Before Deleting Archivelogs

    In this incident, the backup job appeared to be running daily but was only backing up the controlfile (INPUT_BYTES=0). This is a serious gap — verify actual backup content, not just job status.

    3. Investigate the Backup Job

    A proper RMAN backup script for a 19c CDB should include:

    BACKUP DATABASE PLUS ARCHIVELOG DELETE INPUT;

    4. Consider an Archivelog Deletion Policy

    If archivelog backups are not being taken, configure an RMAN retention policy to prevent FRA accumulation:

    CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;

    5. FRA Sizing Review

    Even at 4095 GB, the FRA filled up due to months of uncleaned archivelogs. Review archivelog generation rate and size FRA accordingly, or implement regular cleanup.

    Conclusion

    ORA-00257 is one of those errors that brings an entire EBS environment to its knees instantly. The fix itself is straightforward — free up FRA space — but the investigation matters. Rushing to delete archivelogs without understanding the backup posture can result in an unrecoverable database.

    In this case, careful investigation revealed a deeper issue with the backup job that would have gone unnoticed had we not looked. Always verify, always escalate, and always get approval before deleting recovery-critical files.


    Syed Anwar Ahmed is an Oracle Apps DBA with over 11 years of production experience across Oracle EBS, Database, RAC, GoldenGate, and OEM environments. He writes about real-world Oracle incidents at syedanwarahmedoracle.blog.

  • Stuck Archiver on Oracle 11g RAC: How a Full FRA Brought Down Archiving — and How We Fixed It Live

    Introduction

    One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.

    This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.

    Environment Overview

    Component Details
    Database Oracle 11g R2 (11.2.0.4.0)
    Architecture RAC — 2 Nodes
    Database Role PRIMARY — READ WRITE, ARCHIVELOG mode
    OS Linux (x86_64)
    FRA Location ASM Diskgroup +ARCHIVELOG
    FRA Size (pre-incident) 199 GB
    FRA Size (post-fix) 400 GB

    The Incident: What We Saw

    The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:

    ORA-00257: archiver error. Connect internal only, until freed.
    ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
    ARCH: Archival stopped, error occurred. Will continue retrying
    ORACLE Instance PRODDB1 - Archival Error

    The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.

    Step 1: Check Redo Log Status

    First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.

    SELECT group#, status, archived, members FROM v$log ORDER BY group#;
    
    GROUP#   STATUS           ARCHIVED   MEMBERS
    ------   ---------------  ---------  -------
    1        ACTIVE           NO         2
    2        CURRENT          NO         2
    3        INACTIVE         YES        2
    4        INACTIVE         YES        2

    Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.

    Step 2: Confirm the Archiver Status

    -- Check both RAC instances
    SELECT inst_id, instance_name, host_name, version,
           database_role, open_mode, log_mode
    FROM gv$instance ORDER BY inst_id;
    
    INST_ID  INSTANCE_N  VERSION       DATABASE_ROLE  OPEN_MODE   LOG_MODE
    -------  ----------  ------------  -------------  ----------  ----------
          1  PRODDB1     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
          2  PRODDB2     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
    
    SELECT instance_name, status, archiver FROM v$instance;
    
    INSTANCE_NAME    STATUS       ARCHIVER
    ---------------- ------------ --------
    PRODDB1          OPEN         STOPPED

    ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.

    Step 3: Check FRA Usage

    SELECT
        space_limit / (1024*1024*1024)       AS limit_gb,
        space_used  / (1024*1024*1024)       AS used_gb,
        space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
        ROUND((space_used / space_limit) * 100, 2) AS pct_used
    FROM v$recovery_file_dest;
    
    LIMIT_GB   USED_GB   RECLAIMABLE_GB   PCT_USED
    --------   -------   --------------   --------
    199        198.99     0.00            100.00

    100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.

    Step 4: Identify What Was Consuming the FRA

    SELECT file_type,
        ROUND(space_used / (1024*1024*1024), 2)        AS used_gb,
        ROUND(space_reclaimable / (1024*1024*1024), 2) AS reclaimable_gb,
        number_of_files
    FROM v$recovery_area_usage
    ORDER BY space_used DESC;
    
    FILE_TYPE       USED_GB   RECLAIMABLE_GB   NUMBER_OF_FILES
    -----------     -------   --------------   ---------------
    ARCHIVED LOG    38.50     0.05             1,200+
    BACKUP PIECE    10.20     0.07             45
    FLASHBACK LOG    1.30     0.00             12

    Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.

    Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike

    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM v$archived_log
    WHERE dest_id=1
    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
    ORDER BY 1;
    Hour Archived Logs GB Generated
    Daytime (typical) 2–26 < 1.3 GB
    01:00 AM (nightly) 440–446 ~31 GB ⚠
    03:00–04:00 AM 191–411 11–28 GB ⚠
    05:00–06:00 AM 537–796 34–50 GB ⚠

    Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.

    The Resolution — Three Actions

    Action 1: Manually Delete Archive Logs from ASM

    Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:

    su - grid
    asmcmd
    
    ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
    # Recent logs retained to preserve recoverability

    After ASMCMD deletion, the RMAN catalog was resynced:

    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> DELETE EXPIRED ARCHIVELOG ALL;

    Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.

    Action 2: Increase FRA Size Online

    -- Increase FRA from 199GB to 400GB (online, no restart required)
    ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';

    SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:

    ARC1: Archival started
    ARC2: Archival started
    ARC3: Archival started
    ORACLE Instance PRODDB1 - Archival Restarted

    Applications reconnected and transactions resumed within seconds.

    Action 3: Verify Database and Cluster Status

    SELECT instance_name, archiver FROM v$instance;
    -- PRODDB1   STARTED
    
    SELECT round(space_used/1024/1024/1024,2) used_gb,
           round(space_limit/1024/1024/1024,2) limit_gb,
           round((space_used/space_limit)*100,2) used_pct
    FROM v$recovery_file_dest;
    -- 202.80 GB / 400 GB / 50.70%
    
    SELECT group#, status, archived FROM v$log ORDER BY group#;
    -- All groups INACTIVE/YES except CURRENT group

    Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.

    Root Cause Confirmed: Archive Log Backups Were Not Running

    With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.

    Complete Incident Chain:
    ARCH backup silently fails (missing Oracle env vars in cron) → 3,094 archived logs accumulate over multiple nights → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM) → FRA hits 100% (198.99 GB / 199 GB) → ARCn cannot write → Both instances enter INTERMEDIATE state → ORA-00257 → Applications hang → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB → Archiver restarts → Cron fix applied → ARCH backup reruns successfully → FRA space recovered → Cluster fully healthy

    Fix: Correct the Cron Environment

    Broken cron entry:

    05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    Fixed cron entry:

    05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    ARCH Backup Reruns and Completes

    After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:

    . /home/oracle/.bash_profile
    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH

    The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.

    Preventive Measures Put in Place

    1. Fix the cron environment — all RMAN jobs now source ~/.bash_profile before executing.
    2. RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status.
    3. FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted.
    4. FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
    5. RAC: always use SID=’*’ — and always verify ASM diskgroup free space before increasing the FRA limit.

    Key Takeaways for Oracle DBAs

    • ORA-00257 always points to the FRA being full — check v$recovery_file_dest and v$recovery_area_usage immediately.
    • When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run CROSSCHECK and DELETE EXPIRED to resync the catalog.
    • db_recovery_file_dest_size is dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.
    • Use SID='*' in RAC — applies the parameter change to all instances simultaneously.
    • The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
    • Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
    • Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
    • After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
    • Analyse redo generation patterns — v$archived_log reveals batch spikes that should directly inform FRA sizing decisions.

    Conclusion

    A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.

    The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.

    The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.

    Syed Anwar Ahmed | Oracle Apps DBA