Introduction
One of the most disruptive incidents a DBA can face in a production RAC environment is a Stuck Archiver — a condition where Oracle’s archiver process (ARCn) is unable to archive online redo logs, effectively stalling all database activity. Transactions pile up, applications hang, and the business takes a hit.
This post documents a real-world Stuck Archiver incident on a two-node Oracle 11g RAC cluster, caused by Fast Recovery Area (FRA) exhaustion. I’ll walk through the exact symptoms, diagnosis queries, root cause analysis, and the live fix — along with preventive measures to ensure this doesn’t repeat.
Environment Overview
| Component | Details |
|---|---|
| Database | Oracle 11g R2 (11.2.0.4.0) |
| Architecture | RAC — 2 Nodes |
| Database Role | PRIMARY — READ WRITE, ARCHIVELOG mode |
| OS | Linux (x86_64) |
| FRA Location | ASM Diskgroup +ARCHIVELOG |
| FRA Size (pre-incident) | 199 GB |
| FRA Size (post-fix) | 400 GB |
The Incident: What We Saw
The alert came in the early hours of the morning. Applications were hanging, users were reporting timeouts, and both RAC instances had entered an INTERMEDIATE state. The alert log was flooded with:
ORA-00257: archiver error. Connect internal only, until freed.
ARC0: Error 19809 Creating archive log file to '<FRA_PATH>'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance PRODDB1 - Archival Error
The archiver was stuck on both instances, and Oracle had locked down the database to internal connections only. The incident was open for approximately 45 minutes before the fix was applied and applications recovered.
Step 1: Check Redo Log Status
First priority — verify the state of online redo log groups. If LGWR cannot overwrite ACTIVE logs because they are unarchived, the database will hang regardless of any other fix.
SELECT group#, status, archived, members FROM v$log ORDER BY group#;
GROUP# STATUS ARCHIVED MEMBERS
------ --------------- --------- -------
1 ACTIVE NO 2
2 CURRENT NO 2
3 INACTIVE YES 2
4 INACTIVE YES 2
Groups 1 and 2 needed archiving. LGWR cannot overwrite ACTIVE or CURRENT redo logs until they are archived — this confirmed the database hang was directly caused by the stuck archiver.
Step 2: Confirm the Archiver Status
-- Check both RAC instances
SELECT inst_id, instance_name, host_name, version,
database_role, open_mode, log_mode
FROM gv$instance ORDER BY inst_id;
INST_ID INSTANCE_N VERSION DATABASE_ROLE OPEN_MODE LOG_MODE
------- ---------- ------------ ------------- ---------- ----------
1 PRODDB1 11.2.0.4.0 PRIMARY READ WRITE ARCHIVELOG
2 PRODDB2 11.2.0.4.0 PRIMARY READ WRITE ARCHIVELOG
SELECT instance_name, status, archiver FROM v$instance;
INSTANCE_NAME STATUS ARCHIVER
---------------- ------------ --------
PRODDB1 OPEN STOPPED
ARCHIVER = STOPPED confirmed it. Both RAC instances were in INTERMEDIATE state — the entire cluster was effectively down for application workloads.
Step 3: Check FRA Usage
SELECT
space_limit / (1024*1024*1024) AS limit_gb,
space_used / (1024*1024*1024) AS used_gb,
space_reclaimable / (1024*1024*1024) AS reclaimable_gb,
ROUND((space_used / space_limit) * 100, 2) AS pct_used
FROM v$recovery_file_dest;
LIMIT_GB USED_GB RECLAIMABLE_GB PCT_USED
-------- ------- -------------- --------
199 198.99 0.00 100.00
100% FRA utilization with zero reclaimable space. The FRA on ASM diskgroup +ARCHIVELOG was completely full with 3,094 archived log files. Before proceeding, the ASM diskgroup was checked — +ARCHIVELOG had 299 GB free, confirming sufficient underlying storage for a size increase.
Step 4: Identify What Was Consuming the FRA
SELECT file_type,
ROUND(space_used / (1024*1024*1024), 2) AS used_gb,
ROUND(space_reclaimable / (1024*1024*1024), 2) AS reclaimable_gb,
number_of_files
FROM v$recovery_area_usage
ORDER BY space_used DESC;
FILE_TYPE USED_GB RECLAIMABLE_GB NUMBER_OF_FILES
----------- ------- -------------- ---------------
ARCHIVED LOG 38.50 0.05 1,200+
BACKUP PIECE 10.20 0.07 45
FLASHBACK LOG 1.30 0.00 12
Over 1,200 archived log files had piled up because the RMAN ARCH backup job had been silently failing for multiple nights.
Step 5: Redo Volume Analysis — Identifying the Nightly Batch Spike
SELECT TO_CHAR(first_time,'YYYY-MM-DD HH24') AS hour,
COUNT(*) AS logs,
ROUND(SUM(blocks*block_size)/1024/1024/1024,2) AS gb
FROM v$archived_log
WHERE dest_id=1
GROUP BY TO_CHAR(first_time,'YYYY-MM-DD HH24')
ORDER BY 1;
| Hour | Archived Logs | GB Generated |
|---|---|---|
| Daytime (typical) | 2–26 | < 1.3 GB |
| 01:00 AM (nightly) | 440–446 | ~31 GB ⚠ |
| 03:00–04:00 AM | 191–411 | 11–28 GB ⚠ |
| 05:00–06:00 AM | 537–796 | 34–50 GB ⚠ |
Clear Pattern — Nightly Batch Job Is the Culprit: Every night between 01:00 AM and 06:00 AM, redo generation spikes to 30–50 GB per hour. Daytime activity rarely exceeds 0.5 GB/hour. With the RMAN ARCH backup silently failing for multiple nights, these logs were never backed up or removed — they simply accumulated until the FRA hit 100%.
The Resolution — Three Actions
Action 1: Manually Delete Archive Logs from ASM
Because RMAN could not run against a full FRA and there was no reclaimable space, older archived logs were manually deleted via ASMCMD:
su - grid
asmcmd
ASMCMD> ls +ARCHIVELOG/PRODDB/ARCHIVELOG/
ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_04/*
ASMCMD> rm +ARCHIVELOG/PRODDB/ARCHIVELOG/2026_04_05/*
# Recent logs retained to preserve recoverability
After ASMCMD deletion, the RMAN catalog was resynced:
RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> DELETE EXPIRED ARCHIVELOG ALL;
Important: ASMCMD deletion removes files from ASM but the control file still holds references. Always follow up with CROSSCHECK and DELETE EXPIRED to leave the catalog in a consistent state.
Action 2: Increase FRA Size Online
-- Increase FRA from 199GB to 400GB (online, no restart required)
ALTER SYSTEM SET db_recovery_file_dest_size = 400G SCOPE=BOTH SID='*';
SID='*' applies the change to both RAC instances simultaneously. Immediately after, the archiver restarted automatically:
ARC1: Archival started
ARC2: Archival started
ARC3: Archival started
ORACLE Instance PRODDB1 - Archival Restarted
Applications reconnected and transactions resumed within seconds.
Action 3: Verify Database and Cluster Status
SELECT instance_name, archiver FROM v$instance;
-- PRODDB1 STARTED
SELECT round(space_used/1024/1024/1024,2) used_gb,
round(space_limit/1024/1024/1024,2) limit_gb,
round((space_used/space_limit)*100,2) used_pct
FROM v$recovery_file_dest;
-- 202.80 GB / 400 GB / 50.70%
SELECT group#, status, archived FROM v$log ORDER BY group#;
-- All groups INACTIVE/YES except CURRENT group
Full CRS cluster check via crsctl status res -t confirmed all resources ONLINE — both instances Open, TAF service live on both nodes, all ASM diskgroups healthy. Note: ora.gsd OFFLINE is expected for Oracle 11g RAC — it is a legacy component not required in this configuration.
Root Cause Confirmed: Archive Log Backups Were Not Running
With the database stable, investigation confirmed the RMAN ARCH backup had been silently failing for multiple nights. The cron job fired the script on schedule, but without Oracle environment variables set in the cron context (ORACLE_SID, ORACLE_HOME, PATH all unset), the script exited immediately without connecting to RMAN. The log at /tmp/RMAN_ARCH.log appeared empty — no RMAN error, nothing surfacing to the team.
Complete Incident Chain:
ARCH backup silently fails (missing Oracle env vars in cron) → 3,094 archived logs accumulate over multiple nights → Heavy batch redo (30–50 GB/hr, 01:00–06:00 AM) → FRA hits 100% (198.99 GB / 199 GB) → ARCn cannot write → Both instances enter INTERMEDIATE state → ORA-00257 → Applications hang → Archive logs manually deleted via ASMCMD → FRA expanded to 400 GB → Archiver restarts → Cron fix applied → ARCH backup reruns successfully → FRA space recovered → Cluster fully healthy
Fix: Correct the Cron Environment
Broken cron entry:
05 * * * * /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1
Fixed cron entry:
05 * * * * . /home/oracle/.bash_profile && /home/oracle/scripts/RMAN_Hotbackup.ksh ARCH > /tmp/RMAN_ARCH.log 2>&1
ARCH Backup Reruns and Completes
After the cron fix, the ARCH backup was immediately re-run manually to clear the remaining backlog and validate the fix end-to-end:
. /home/oracle/.bash_profile
/home/oracle/scripts/RMAN_Hotbackup.ksh ARCH
The backup completed successfully. RMAN backed up accumulated archived logs, marked them as obsolete per retention policy, and removed them from the FRA — freeing additional space beyond what the manual ASMCMD deletion had already recovered.
Preventive Measures Put in Place
- Fix the cron environment — all RMAN jobs now source
~/.bash_profilebefore executing. - RMAN exit code alerting — script updated to email the DBA team on any non-zero RMAN exit status.
- FRA monitoring at 70% threshold — SQL script runs every 30 minutes via cron and alerts when FRA crosses 70%, giving enough runway before the database is impacted.
- FRA sized on measured data — 400 GB based on actual peak batch redo (30–50 GB/hr), not an arbitrary estimate.
- RAC: always use SID=’*’ — and always verify ASM diskgroup free space before increasing the FRA limit.
Key Takeaways for Oracle DBAs
- ORA-00257 always points to the FRA being full — check
v$recovery_file_destandv$recovery_area_usageimmediately. - When RMAN cannot run against a full FRA, use ASMCMD to manually delete older archive logs — then run
CROSSCHECKandDELETE EXPIREDto resync the catalog. db_recovery_file_dest_sizeis dynamic — increase it online without a restart. Always confirm ASM diskgroup free space first.- Use
SID='*'in RAC — applies the parameter change to all instances simultaneously. - The archiver restarts automatically after FRA space is freed — no manual ARCn intervention needed.
- Silent cron failures are dangerous — always validate Oracle environment variables in cron context, and capture RMAN exit codes with alerting.
- Monitor FRA proactively at 70% — this gives enough runway for corrective action before hitting 100%.
- After a manual ASMCMD deletion, always follow up with a proper RMAN backup run to leave the FRA in a consistent state.
- Analyse redo generation patterns —
v$archived_logreveals batch spikes that should directly inform FRA sizing decisions.
Conclusion
A Stuck Archiver on a production RAC database is a high-severity incident, but a structured approach resolves it quickly. The immediate fix required two steps: manually deleting archived logs via ASMCMD to unblock the archiver, and increasing db_recovery_file_dest_size online to prevent immediate recurrence.
The deeper fix was identifying why the RMAN ARCH backup had been silently failing for multiple nights — a missing Oracle environment in the cron context. Once found, the fix was a single line in crontab. Rerunning the backup manually and watching it complete successfully, with FRA usage dropping as a result, confirmed the incident was fully resolved rather than just patched.
The combination of a corrected cron environment, RMAN exit code alerting, and proactive FRA monitoring at 70% ensures this class of incident does not repeat.
Syed Anwar Ahmed | Oracle Apps DBA
Leave a comment