Tag: archive_error

  • Stuck Archiver on Oracle 11g RAC: How a Full FRA Brought Down Archiving — and How We Fixed It Live

    Introduction

    One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.

    This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.

    Environment Overview

    Component                 Details
    ---------                 -------
    Database                  Oracle 11g R2 (11.2.0.4.0)
    Architecture              RAC — 2 Nodes
    Database Role             PRIMARY — READ WRITE, ARCHIVELOG mode
    OS                        Linux (x86_64)
    FRA Location              ASM Diskgroup +ARCHIVELOG
    FRA Size (pre-incident)   199 GB
    FRA Size (post-fix)       400 GB

    The Incident: What We Saw

    The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:

    ORA-00257: archiver error. Connect internal only, until freed.
    ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
    ARCH: Archival stopped, error occurred. Will continue retrying
    ORACLE Instance PRODDB1 - Archival Error

    The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.

    Step 1: Check Redo Log Status

    First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.

    SELECT group#, status, archived, members FROM v$log ORDER BY group#;
    
    GROUP#   STATUS           ARCHIVED   MEMBERS
    ------   ---------------  ---------  -------
    1        ACTIVE           NO         2
    2        CURRENT          NO         2
    3        INACTIVE         YES        2
    4        INACTIVE         YES        2

    Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.

    Step 2: Confirm the Archiver Status

    -- Check both RAC instances
    SELECT inst_id, instance_name, host_name, version,
           database_role, open_mode, log_mode
    FROM gv$instance ORDER BY inst_id;
    
    INST_ID  INSTANCE_N  VERSION       DATABASE_ROLE  OPEN_MODE   LOG_MODE
    -------  ----------  ------------  -------------  ----------  ----------
          1  PRODDB1     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
          2  PRODDB2     11.2.0.4.0    PRIMARY        READ WRITE  ARCHIVELOG
    
    SELECT instance_name, status, archiver FROM v$instance;
    
    INSTANCE_NAME    STATUS       ARCHIVER
    ---------------- ------------ --------
    PRODDB1          OPEN         STOPPED

    ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.

    Step 3: Check FRA Usage

    SELECT
        space_limit / (1024*1024*1024)       AS limit_gb,
        space_used  / (1024*1024*1024)       AS used_gb,
        space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
        ROUND((space_used / space_limit) * 100, 2) AS pct_used
    FROM v$recovery_file_dest;
    
    LIMIT_GB   USED_GB   RECLAIMABLE_GB   PCT_USED
    --------   -------   --------------   --------
    199        198.99     0.00            100.00

    100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.

    Step 4: Identify What Was Consuming the FRA

    SELECT file_type,
        ROUND(space_used / (1024*1024*1024), 2)        AS used_gb,
        ROUND(space_reclaimable / (1024*1024*1024), 2) AS reclaimable_gb,
        number_of_files
    FROM v$recovery_area_usage
    ORDER BY space_used DESC;
    
    FILE_TYPE       USED_GB   RECLAIMABLE_GB   NUMBER_OF_FILES
    -----------     -------   --------------   ---------------
    ARCHIVED LOG    38.50     0.05             1,200+
    BACKUP PIECE    10.20     0.07             45
    FLASHBACK LOG    1.30     0.00             12

    Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.

    Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike

    SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
           COUNT(*) AS logs,
           ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
    FROM v$archived_log
    WHERE dest_id=1
    GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
    ORDER BY 1;

    HOUR (PATTERN)         ARCHIVED LOGS   GB GENERATED
    -------------------    -------------   ------------
    Daytime (typical)      2–26            < 1.3 GB
    01:00 AM (nightly)     440–446         ~31 GB ⚠
    03:00–04:00 AM         191–411         11–28 GB ⚠
    05:00–06:00 AM         537–796         34–50 GB ⚠

    Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.

    The Resolution — Three Actions

    Action 1: Manually Delete Archive Logs from ASM

    Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:

    su - grid
    asmcmd
    
    ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
    ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
    # Recent logs retained to preserve recoverability

    After ASMCMD deletion, the RMAN catalog was resynced:

    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> DELETE EXPIRED ARCHIVELOG ALL;

    Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.

    Action 2: Increase FRA Size Online

    -- Increase FRA from 199GB to 400GB (online, no restart required)
    ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';

    SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:

    ARC1: Archival started
    ARC2: Archival started
    ARC3: Archival started
    ORACLE Instance PRODDB1 - Archival Restarted

    Applications reconnected and transactions resumed within seconds.

    Action 3: Verify Database and Cluster Status

    SELECT instance_name, archiver FROM v$instance;
    -- PRODDB1   STARTED
    
    SELECT round(space_used/1024/1024/1024,2) used_gb,
           round(space_limit/1024/1024/1024,2) limit_gb,
           round((space_used/space_limit)*100,2) used_pct
    FROM v$recovery_file_dest;
    -- 202.80 GB / 400 GB / 50.70%
    
    SELECT group#, status, archived FROM v$log ORDER BY group#;
    -- All groups INACTIVE/YES except CURRENT group

    Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.
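
    This post-incident cluster check can be scripted so it does not rely on eyeballing crsctl output. A minimal sketch — the helper name and the ora.gsd exclusion are illustrative, not the team's actual tooling:

```shell
#!/bin/bash
# check_crs_output: reads `crsctl status res -t` output on stdin and reports
# any OFFLINE resources other than ora.gsd (a legacy component, expected
# OFFLINE on 11g). Returns 0 when the cluster looks healthy.
check_crs_output() {
  bad=$(grep -i 'OFFLINE' | grep -v 'ora\.gsd')
  if [ -n "$bad" ]; then
    echo "WARNING: non-ONLINE resources:" >&2
    echo "$bad" >&2
    return 1
  fi
  echo "All monitored resources ONLINE (ora.gsd OFFLINE ignored)"
}

# Typical use on a cluster node (assumes the grid environment is set):
#   crsctl status res -t | check_crs_output
```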

    Root Cause Confirmed: Archive Log Backups Were Not Running

    With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.
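
    A cheap guard against this failure mode is to have the backup script verify its own environment before ever calling RMAN, so a missing profile fails loudly instead of silently. A minimal sketch — the function name and messages are hypothetical:

```shell
#!/bin/bash
# require_oracle_env: abort early, with a visible error, when the Oracle
# environment is missing -- exactly what happens when cron runs the script
# without sourcing the profile.
require_oracle_env() {
  missing=""
  [ -z "$ORACLE_SID" ]  && missing="$missing ORACLE_SID"
  [ -z "$ORACLE_HOME" ] && missing="$missing ORACLE_HOME"
  if [ -n "$missing" ]; then
    echo "FATAL: missing Oracle env vars:$missing -- aborting RMAN backup" >&2
    return 1
  fi
  return 0
}

# Placed at the top of a backup script such as RMAN_Hotbackup.ksh:
#   require_oracle_env || exit 1
```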

    Complete Incident Chain:
    ARCH backup silently fails (missing Oracle env vars in cron)
      → 3,094 archived logs accumulate over multiple nights
      → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM)
      → FRA hits 100% (198.99 GB / 199 GB)
      → ARCn cannot write → Both instances enter INTERMEDIATE state
      → ORA-00257 → Applications hang
      → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB
      → Archiver restarts → Cron fix applied → ARCH backup reruns successfully
      → FRA space recovered → Cluster fully healthy

    Fix: Correct the Cron Environment

    Broken cron entry:

    05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1

    Fixed cron entry:

    05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1
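
    Beyond sourcing the profile, the cron wrapper can also surface failures instead of swallowing them. A minimal sketch of that pattern — the alert here is a placeholder echo; a real job would mail the DBA team via mailx or similar:

```shell
#!/bin/bash
# run_with_alert: run a command and raise a visible alert on any non-zero
# exit, instead of letting the failure disappear into an empty log file.
run_with_alert() {
  "$@"
  rc=$?
  if [ $rc -ne 0 ]; then
    # In a real wrapper this would be something like:
    #   mailx -s "RMAN ARCH backup FAILED (rc=$rc)" dba-team@example.com < "$LOG"
    echo "ALERT: command failed with rc=$rc" >&2
  fi
  return $rc
}

# Cron would then call, after sourcing the profile:
#   . /home/oracle/.bash_profile
#   run_with_alert /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1
```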

    ARCH Backup Reruns and Completes

    After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:

    . /home/oracle/.bash_profile
    /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH

    The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.

    Preventive Measures Put in Place

    1. Fix the cron environment — all RMAN jobs now source ~/.bash_profile before executing.
    2. RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status.
    3. FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted.
    4. FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
    5. RAC: always use SID='*' — and always verify ASM diskgroup free space before increasing the FRA limit.
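
    Preventive measure 3 can be sketched as a small shell check. The threshold comparison below is runnable as-is; the sqlplus fetch shown in the comment is an assumed shape, not the team's actual monitoring script:

```shell
#!/bin/bash
# check_fra_pct: compare FRA percent-used against an alert threshold.
# The pct value would come from v$recovery_file_dest via sqlplus.
FRA_THRESHOLD=70

check_fra_pct() {
  pct=$1
  # Compare as integers: strip any decimal part from the sqlplus output
  if [ "${pct%.*}" -ge "$FRA_THRESHOLD" ]; then
    echo "ALERT: FRA usage at ${pct}% (threshold ${FRA_THRESHOLD}%)"
    return 1
  fi
  echo "OK: FRA usage at ${pct}%"
}

# In the real cron job, pct would be fetched roughly like:
#   pct=$(sqlplus -s / as sysdba <<'EOF'
#   SET HEADING OFF FEEDBACK OFF
#   SELECT ROUND(space_used/space_limit*100) FROM v$recovery_file_dest;
#   EOF
#   )
#   check_fra_pct "$pct" || mailx -s "FRA alert" dba-team@example.com
```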

    Key Takeaways for Oracle DBAs

    • ORA-00257 always points to the FRA being full — check v$recovery_file_dest and v$recovery_area_usage immediately.
    • When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run CROSSCHECK and DELETE EXPIRED to resync the catalog.
    • db_recovery_file_dest_size is dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.
    • Use SID='*' in RAC — applies the parameter change to all instances simultaneously.
    • The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
    • Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
    • Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
    • After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
    • Analyse redo generation patterns — v$archived_log reveals batch spikes that should directly inform FRA sizing decisions.

    Conclusion

    A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.

    The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.

    The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.

    Syed Anwar Ahmed | Oracle Apps DBA