🔴 Incident Overview
| Item | Detail |
| --- | --- |
| Severity | P1 — Two production database backups failed |
| Environment | Oracle 19c (19.30) on Linux x86-64 |
| Backup Tool | RMAN with Recovery Catalog |
| Backup Volume | /dbbackup — 1TB LVM filesystem |
| Databases Affected | DBPRO01 (12:30 failure), DBPRO02 (22:30 failure) |
| Total Space Recovered | ~492G (disk went from 100% → 52%) |
1. The Alerts — Two Failures, Same Root Cause
It started with an RMAN failure at 22:30. The backup script for DBPRO02 fired on schedule and died within 2 minutes. The RMAN log told the story clearly:
RMAN-03009: failure of backup command on c4 channel at 22:32:13
ORA-19502: write error on file "/dbbackup/DBPRC02/rman/DiffInc_DBPRC02_4u4mi8bf"
ORA-27072: File I/O error
Additional information: 4
Three more channels followed — c1, c2, c3 — all crashing at exactly 22:32:48. When multiple channels fail simultaneously at the same timestamp, it almost always means one thing: the destination filesystem just hit 100%.
A quick check confirmed it:
$ df -hP /dbbackup
Filesystem Size Used Avail Use%
/dev/mapper/orabkupvg-orabkuplv1 1023G 1020G 3.3G 100%
1TB volume. 3.3G free. Completely full.
What we did not yet know: digging into the backup history would reveal that DBPRO01 had already failed at 12:30 the same day, 10 hours earlier, for the same reason. Two databases were left unprotected on the same night.
2. The Investigation — Folder by Folder
The first step was understanding what was consuming the disk. One command gave us the top-level picture:
$ du -sh /dbbackup/*
744G DBPRC01
152G ColdBackup_11April2026
41G DBPRC02
26G DBPRC03
15G DBPRO01
7.7G JAN2026_CPU
7.7G OCT2025_CPU
5.3G infra_arch
744G inside DBPRC01 alone — 73% of the entire disk. That was our primary suspect.
Drilling into DBPRC01
$ du -sh /dbbackup/DBPRC01/rman/* | sort -rh | head -10
7.5G DiffInc_DBPRC01_fl4lumij
7.5G DiffInc_DBPRC01_eg4lrnh5
7.5G DiffInc_DBPRC01_b44lc7uh
...
Every single file was a DiffInc_ or ArchivelogAll_ backup piece. No variety. No cleanup. Just backup after backup piling up.
$ ls /dbbackup/DBPRC01/rman/ | wc -l
1522
1,522 backup pieces. We checked the oldest and newest:
Oldest file on disk: 2022-05-07
Newest file on disk: 2026-04-25
Four years of backup files on disk — or so we thought.
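For the record, that age check is a one-liner. A minimal sketch assuming GNU find; adjust the path to whatever directory you are inspecting:
$ find /dbbackup/DBPRC01/ -type f -printf '%T+ %p\n' | sort | head -1   # oldest file
$ find /dbbackup/DBPRC01/ -type f -printf '%T+ %p\n' | sort | tail -1   # newest file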
3. The RMAN Investigation — Where Things Got Interesting
We connected RMAN to the database and ran the retention check:
RMAN> SHOW RETENTION POLICY;
CONFIGURE RETENTION POLICY TO REDUNDANCY 30;
REDUNDANCY 30. This tells RMAN to keep the last 30 complete backup copies of every datafile before considering anything obsolete.
Next logical step — check what RMAN considers obsolete:
RMAN> REPORT OBSOLETE;
no obsolete backups found
Nothing? With 1,522 files on disk?
We ran CROSSCHECK BACKUP — all 1,693 objects came back AVAILABLE. Then we checked the actual date range RMAN was tracking from the database control file:
SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
COUNT(*) total_pieces
FROM v$backup_piece_details
WHERE status = 'A';
OLDEST NEWEST TOTAL_PIECES
03-DEC-2025 25-APR-2026 1541
The control file only tracks pieces from December 2025 onwards — about 5 months. The 2022/2023 files seen on disk were old directories and scripts, not backup pieces. All 1,541 current pieces were legitimate and RMAN considered every one of them necessary under REDUNDANCY 30.
This was the key insight: RMAN was not broken. The retention policy itself was the problem.
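(A side note for anyone hitting the same wall: the length of the control file's tracking window is governed by CONTROL_FILE_RECORD_KEEP_TIME, which sets the minimum number of days reusable RMAN records are kept in the control file before they can be overwritten; the recovery catalog keeps the longer history. Worth checking when the repository looks shorter than the files on disk suggest:)
SELECT value FROM v$parameter WHERE name = 'control_file_record_keep_time';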
4. Root Cause — The Architecture Trap
The deeper investigation revealed something unexpected. Looking at the actual RMAN backup script:
connect target rman/password@DBPRO01
...
format '/dbbackup/DBPRC01/rman/DiffInc_%d_%u'
(database);
...
delete obsolete;
DBPRO01 (the production database) was backing up INTO the DBPRC01 directory. The directory name suggested one database, but it held another database's backups entirely: the PRO (production) databases were writing into the PRC (production-copy) directories.
This pattern existed for all three database pairs on the server. Each production database backed up into its corresponding copy directory.
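A quick way to confirm which database a piece on disk really belongs to is to ask RMAN about the file itself while connected to the suspected target; the path below is one of the pieces from the du listing above:
RMAN> LIST BACKUPPIECE '/dbbackup/DBPRC01/rman/DiffInc_DBPRC01_fl4lumij';
If the connected target owns the piece, RMAN lists its backup set, level, and completion time, which removes any doubt about who wrote it.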
The delete obsolete command was in the script — but with REDUNDANCY 30 and weekly Level 0 backups, obsolete only kicks in after 30 complete Level 0 cycles. That is 30 weeks = 7.5 months of retention. Since the current tracking window was only 5 months, delete obsolete ran every night and found absolutely nothing to delete.
The math:
| Item | Value |
| --- | --- |
| Retention policy | REDUNDANCY 30 |
| Level 0 frequency | Weekly (Sundays) |
| Effective retention period | ~30 weeks / 7.5 months |
| Backup tracking since | December 2025 (~5 months) |
| Result | delete obsolete finds nothing — ever |
| Daily backup size | ~7–7.5G per run |
| Total accumulated | 744G |
Adding fuel to the fire — the patching activity on April 18 triggered an extra Level 0 backup, followed by the regular Sunday Level 0 on April 19. Two large Level 0 runs (~27G each) within 24 hours wrote the final ~54G that pushed the disk over the edge.
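A spike like that is easy to verify after the fact from the backup job history. A sketch, with the date window adjusted to the patch weekend:
SELECT TO_CHAR(start_time,'YYYY-MM-DD HH24:MI') started,
       input_type, status, output_bytes_display
FROM   v$rman_backup_job_details
WHERE  start_time >= DATE '2026-04-18'
ORDER  BY start_time;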
5. Secondary Findings During Investigation
OCT2025 CPU Patch Artifacts (7.7G)
The October 2025 CPU patch files (zip archives + extracted directories) were still sitting in /dbbackup/OCT2025_CPU/. A quick OPatch check confirmed the database had since been patched to 19.30 (January 2026 RU) — the October 2025 patches were fully superseded and rolled back from inventory. Safe to delete immediately.
$ $ORACLE_HOME/OPatch/opatch lsinventory | grep -E "38291812|38194382"
# Empty — neither Oct 2025 patch in inventory anymore
5-Year-Old Pre-Migration Export Dumps
Three directories contained Oracle 11.2.0.4 export dumps from January–March 2021 — taken before the migration to 19c. With the database now running 19.30, these had zero recovery value but occupied ~14G collectively. Flagged for manager approval before deletion.
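Sizing candidates like these is straightforward. A sketch assuming GNU find; the real directory names are not shown here, so adjust the pattern and age to your own case:
$ find /dbbackup -name '*.dmp' -mtime +365 -printf '%s\n' \
    | awk '{s+=$1} END {printf "%.1f GB total\n", s/1024/1024/1024}'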
Recovery Catalog Version Mismatch
The original RMAN log flagged this warning:
PL/SQL package RMAN.DBMS_RCVCAT version 19.11.00.00 in RCVCAT database is not current
PL/SQL package RMAN.DBMS_RCVMAN version 19.11.00.00 in RCVCAT database is not current
The recovery catalog is running 19.11 packages while the RMAN client is now 19.30. Non-critical tonight but requires UPGRADE CATALOG in the next maintenance window.
6. The Fix — Emergency Space Recovery
With management approval obtained, we executed a time-based delete — keeping the last 30 days of backups and removing everything older:
RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
This command does three things in one pass:
- Queries catalog/controlfile for all pieces completed before the cutoff date
- Deletes the physical files from disk
- Removes the records from RMAN catalog — no orphaned entries, no catalog drift
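For a delete of this size it is still worth previewing the candidate pieces with the same predicate before pulling the trigger (the command also appears in the cheat sheet at the end):
RMAN> LIST BACKUP COMPLETED BEFORE 'SYSDATE-30';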
The DELETE output scrolled for several minutes:
deleted backup piece
backup piece handle=/dbbackup/DBPRC01/rman/DiffInc1_DBPRC01_6u4jaaih ...
deleted backup piece
backup piece handle=/dbbackup/DBPRC01/rman/ArchivelogAll_DBPRC01_784jab6o ...
...
Deleted 1072 objects
1,072 backup pieces deleted. Catalog updated. Disk checked:
BEFORE: Used 1020G Avail 3.3G (100%)
AFTER: Used 536G Avail 488G (53%)
Then the OCT2025_CPU directory was removed:
$ rm -rf /dbbackup/OCT2025_CPU/
$ df -hP /dbbackup
Used 528G Avail 495G (52%)
Final result: 495G free. Disk at 52%.
Both failed backups were re-submitted immediately and ran successfully in parallel:
$ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO01.sh &
$ nohup sh /opt/oracle/scripts/rman/rman_backup_DBPRO02.sh &
$ jobs -l
[1] Running nohup sh ...rman_backup_DBPRO01.sh &
[2] Running nohup sh ...rman_backup_DBPRO02.sh &
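To keep an eye on the re-submitted jobs without tailing logs, the standard long-operations query works; run it against each target:
SELECT opname, ROUND(sofar/totalwork*100,1) pct_done
FROM   v$session_longops
WHERE  opname LIKE 'RMAN%'
AND    totalwork > 0
AND    sofar < totalwork;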
7. Incident Timeline
| Time | Event |
| --- | --- |
| 12:30 | DBPRO01 Level 0 backup fails — ORA-19502/ORA-27072 (disk full) |
| 22:30 | DBPRO02 Level 0 backup fails — same errors, all 4 channels |
| 23:08 | Investigation begins — df -hP /dbbackup confirms 100% full |
| 23:15 | DBPRC01 directory identified as 744G consumer |
| 23:25 | RMAN connected — REDUNDANCY 30 discovered |
| 23:35 | Architecture confirmed — PRO databases backing up into PRC directories |
| 23:45 | Root cause confirmed — 7.5-month retention, delete obsolete finds nothing |
| 23:50 | DELETE BACKUP COMPLETED BEFORE SYSDATE-30 executed |
| 23:51 | 1,072 pieces deleted — disk drops to 53% |
| 23:55 | OCT2025_CPU removed — disk at 52%, 495G free |
| 00:00 | Both backup jobs re-submitted and running successfully |
8. Permanent Fix Recommendations
Fix 1 — Change Retention Policy to RECOVERY WINDOW
RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
REDUNDANCY 30 with weekly Level 0s means 7.5 months of retention — far beyond what any production SLA requires. A 14-day recovery window keeps 2 weeks of backups regardless of backup frequency, and delete obsolete will actually find and remove old pieces going forward.
Fix 2 — Add Pre-Backup Space Check to Cron Script
#!/bin/bash
# Abort the backup run if the destination filesystem is low on space
BACKUP_FS="/dbbackup"
THRESHOLD=20                        # minimum % free required to proceed
MAILTO="dba-team@example.com"       # alert address (set to your distribution list)
AVAIL_PCT=$(df -hP "$BACKUP_FS" | awk 'NR==2 {gsub(/%/,""); print 100-$5}')
if [ "$AVAIL_PCT" -lt "$THRESHOLD" ]; then
  echo "ABORT: $BACKUP_FS is only ${AVAIL_PCT}% free (below ${THRESHOLD}% threshold)" \
    | mailx -s "BACKUP ABORTED: Low space on $BACKUP_FS" "$MAILTO"
  exit 1
fi
A failing backup that writes 3G before dying is worse than a backup that never starts — it wastes the last 3G of free space and leaves partial pieces on disk.
Fix 3 — Upgrade the Recovery Catalog
RMAN> CONNECT TARGET /
RMAN> CONNECT CATALOG rman/password@rmancat
RMAN> UPGRADE CATALOG;
RMAN> UPGRADE CATALOG; -- run twice as prompted
The catalog schema packages (19.11) are several Release Updates behind the RMAN client (19.30). Catalog-dependent operations from the newer client can start failing against the older schema if this is left unaddressed.
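After the upgrade, the catalog schema version can be confirmed from the catalog owner's RCVER table (assuming the owner is rman, as in the connect string above):
SELECT version FROM rman.rcver;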
Fix 4 — Filesystem Monitoring Alert
The FRA check scripts already email on FRA usage above 80%. The same pattern should exist for /dbbackup. A simple cron entry checking disk usage every hour with alert at 80% would have caught this days before the disk hit 100%.
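A minimal sketch of such a check; the script path, threshold, and mail address are assumptions to adapt to your own standards:
# /etc/cron.d/dbbackup_space (assumed location), run hourly as oracle
0 * * * * oracle /opt/oracle/scripts/check_dbbackup_space.sh

#!/bin/bash
# check_dbbackup_space.sh: warn when /dbbackup usage crosses 80%
FS="/dbbackup"
LIMIT=80
MAILTO="dba-team@example.com"    # assumed alert address
USED_PCT=$(df -hP "$FS" | awk 'NR==2 {gsub(/%/,""); print $5}')
if [ "$USED_PCT" -ge "$LIMIT" ]; then
  echo "$FS is ${USED_PCT}% used (threshold ${LIMIT}%)" \
    | mailx -s "WARNING: $FS at ${USED_PCT}% used" "$MAILTO"
fi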
9. Key Takeaways for Oracle DBAs
- REDUNDANCY N is not always safer than RECOVERY WINDOW. REDUNDANCY 30 with weekly Level 0 backups means 7.5 months of retention — likely far beyond your RPO requirement and a silent space accumulator.
- Always verify what delete obsolete actually deletes. If it finds nothing to delete every single night, that is a warning sign — not reassurance.
- Check backup naming conventions carefully. When a directory named DBPRC01 actually holds DBPRO01 backups, it is easy to inspect the wrong database's RMAN configuration; the database writing the backups, not the directory name, controls retention and cleanup.
- Patching days generate oversized backups. A Level 0 taken manually on patch day plus the regular Sunday Level 0 the next day equals 2x the normal space consumption in 24 hours. Ensure extra headroom exists going into patch windows.
- Use DELETE BACKUP COMPLETED BEFORE SYSDATE-N for emergency cleanup — not OS-level rm. RMAN deletes atomically update both the physical files and the catalog, preventing expired/orphaned piece confusion later.
- Never use rm on RMAN backup pieces directly unless you follow up with CROSSCHECK BACKUP and DELETE EXPIRED BACKUP to sync the catalog.
10. Commands Reference — Quick Cheat Sheet
-- Check retention policy
RMAN> SHOW RETENTION POLICY;
-- Preview what would be deleted (dry run)
RMAN> REPORT OBSOLETE;
RMAN> LIST BACKUP COMPLETED BEFORE 'SYSDATE-30';
-- Emergency cleanup — delete pieces older than 30 days
RMAN> DELETE NOPROMPT BACKUP COMPLETED BEFORE 'SYSDATE-30';
-- Standard cleanup based on retention policy
RMAN> DELETE NOPROMPT OBSOLETE;
-- Sync catalog after any OS-level file operations
RMAN> CROSSCHECK BACKUP;
RMAN> DELETE NOPROMPT EXPIRED BACKUP;
-- Change to time-based retention (recommended)
RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 14 DAYS;
-- Check backup piece date range in control file
SELECT TO_CHAR(MIN(completion_time),'DD-MON-YYYY') oldest,
TO_CHAR(MAX(completion_time),'DD-MON-YYYY') newest,
COUNT(*) total_pieces
FROM v$backup_piece_details
WHERE status = 'A';
-- Check backup history
SELECT session_key, input_type, status,
TO_CHAR(start_time,'YYYY-MM-DD HH24:MI:SS') start_time,
output_bytes_display, time_taken_display
FROM v$rman_backup_job_details
ORDER BY start_time DESC;
Conclusion
What appeared to be a simple disk full incident turned out to involve a multi-database backup architecture, a misconfigured retention policy, and a cleanup mechanism that was technically running correctly but never finding anything to clean. The fix itself — one RMAN command — took under 5 minutes. The real work was the systematic investigation to understand exactly what was safe to delete and why.
That is Oracle DBA work in a nutshell: the fix is often simple; understanding why it is safe to run is the real job.
If you found this useful, connect with me on LinkedIn or explore more Oracle DBA scripts on my GitHub. More incident walkthroughs at syedanwarahmedoracle.blog.