Tag: technology

Rethinking UNDO Tablespace Monitoring: When 97% Usage Is NOT a Problem
A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.

Initial Observations

Standard checks showed Tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB free space. This strongly suggested the UNDO tablespace was nearly full. However, deeper analysis revealed a different story.

The Turning Point

Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:
```
SELECT status,
       ROUND(SUM(bytes)/1024/1024,2) MB
FROM dba_undo_extents
GROUP BY status;
```
Results showed: ACTIVE ~0 MB, UNEXPIRED ~7.3 GB, EXPIRED ~8.6 GB.

Key Insight

EXPIRED UNDO extents are fully reusable by Oracle. This means ~55% of the tablespace was immediately reusable and no actual space pressure existed.

Real Usage Calculation
```
SELECT ROUND(
       SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
       / SUM(bytes) * 100, 2) actual_used_pct
FROM dba_undo_extents;
```
Actual usage was only ~45% — not 97%.

Root Cause of the False Alert

The monitoring system was using DBA_FREE_SPACE which does not include reusable EXPIRED extents. This is incorrect for UNDO tablespaces and leads to false critical alerts, unnecessary escalations, and wasted DBA effort.

The Correct Monitoring Approach
```
SELECT tablespace_name,
       ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
/ SUM(bytes) * 100, 2) actual_used_pct
FROM dba_undo_extents
GROUP BY tablespace_name;
```
Only alert when actual usage exceeds 90% AND long-running transactions exist. The full script is available on GitLab: gitlab.com/sdanwarahmed/oracle-dba-scripts

Key Takeaways
- UNDO is not a regular tablespace — it has its own lifecycle
- DBA_FREE_SPACE is misleading for UNDO monitoring
- Always analyze UNDO extents by status: ACTIVE, UNEXPIRED, EXPIRED
- Build context-aware monitoring that aligns with Oracle internals
- Not all “97% full” alerts are real problems — sometimes it’s just the wrong metric
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
Connect: sdanwarahmed@gmail.com | LinkedIn | GitLab
30/03/2026
Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes
Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.

Observed Symptoms
```
ORA-00031: session marked for kill
Thread 1 cannot allocate new log
Private strand flush not complete
```
Phase 1: Interpreting ORA-00031 Correctly

ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.

Phase 2: Identifying the True Performance Signal

The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.

Phase 3: Evidence-Based Analysis (Read-Only)

Redo switch frequency was analyzed to validate system behavior:
```
SELECT
    TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
    COUNT(*) AS switches
FROM v$log_history
WHERE first_time > SYSDATE - 1
GROUP BY TRUNC(first_time, 'HH24')
ORDER BY 1;
```
Findings

Metric Observation
Average Switch Rate 5-7 per hour
Peak Rate 8-10 per hour during business hours
Off-Peak Rate 1-3 per hour

A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.

Why No Changes Were Made

In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.

Recommendations
- Continuously monitor redo switch frequency during peak windows
- Use collected data to justify future redo log sizing via change management
- Avoid unnecessary intervention when behavior is transient and non-impacting
- Distinguish informational alert log messages from actionable errors
Key Takeaways
- ORA-00031 is expected and harmless — it confirms session termination
- Redo allocation waits are transient under sustained load
- Proper analysis prevents unnecessary production intervention
- Not all alert log warnings indicate failure — some are early signals of workload growth
- The goal is not to eliminate every alert, but to understand which ones matter
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
Connect: sdanwarahmed@gmail.com | LinkedIn
26/03/2026
Resolving Critical FRA Space Issue Using RMAN Optimization (Real-World DBA Scenario)
During routine monitoring, a critical issue was identified where the Fast Recovery Area (FRA) utilization reached 95%, putting the database at risk of ORA-19809, archiver process failure, and potential database hang. Despite having an RMAN retention policy configured, the FRA continued to grow uncontrollably.

Investigation
```
-- Check FRA usage
SELECT name, space_limit/1024/1024/1024 limit_gb,
       space_used/1024/1024/1024 used_gb,
       ROUND(space_used/space_limit*100,2) pct_used
FROM v$recovery_file_dest;

-- Check what is consuming space
SELECT file_type, percent_space_used, number_of_files
FROM v$flash_recovery_area_usage
ORDER BY percent_space_used DESC;

-- Review RMAN configuration
SHOW ALL;
```
Findings: backupsets were consuming the majority of FRA space. RMAN had a 7-day retention policy configured but backup optimization was disabled and archivelog deletion policy was not set.

Root Cause

The retention policy was defined but obsolete backups were not being automatically deleted. There was no scheduled execution of DELETE OBSOLETE, backup optimization was disabled causing redundant data, and archivelog lifecycle was not controlled. This led to continuous accumulation of backupsets and archivelogs.

Immediate Fix
```
-- Run RMAN cleanup
RMAN> CROSSCHECK BACKUP;
RMAN> DELETE EXPIRED BACKUP;
RMAN> DELETE OBSOLETE;

-- Verify FRA usage after cleanup
SELECT ROUND(space_used/space_limit*100,2) pct_used
FROM v$recovery_file_dest;
```
FRA utilization reduced from 95% to 83% after cleanup.

Configuration Fixes
```
-- Enable backup optimization
CONFIGURE BACKUP OPTIMIZATION ON;

-- Set archivelog deletion policy
CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO DISK;

-- Enable compressed backups
CONFIGURE DEVICE TYPE DISK BACKUP TYPE TO COMPRESSED BACKUPSET;
```
Automation

Scheduled daily execution of DELETE OBSOLETE via cron:
```
-- Add to RMAN backup script
DELETE NOPROMPT OBSOLETE;

# Crontab entry for daily cleanup
0 2 * * * /path/to/rman_cleanup.sh >> /tmp/rman_cleanup.log 2>&1
```
Key Takeaways
- RMAN retention policy does not enforce deletion automatically — you must schedule DELETE OBSOLETE
- Backup lifecycle management must include automation
- Monitor FRA usage proactively to prevent saturation
- Enable backup optimization to avoid redundant data
- Set archivelog deletion policy to control archivelog lifecycle
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
Connect: sdanwarahmed@gmail.com | LinkedIn
25/03/2026

Metric	Observation
Average Switch Rate	5-7 per hour
Peak Rate	8-10 per hour during business hours
Off-Peak Rate	1-3 per hour

Tag: technology

Rethinking UNDO Tablespace Monitoring: When 97% Usage Is NOT a Problem

Initial Observations

The Turning Point

Key Insight

Real Usage Calculation

Root Cause of the False Alert

The Correct Monitoring Approach

Key Takeaways

Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes

Observed Symptoms

Phase 1: Interpreting ORA-00031 Correctly

Phase 2: Identifying the True Performance Signal

Phase 3: Evidence-Based Analysis (Read-Only)

Findings

Why No Changes Were Made

Recommendations

Key Takeaways

Resolving Critical FRA Space Issue Using RMAN Optimization (Real-World DBA Scenario)

Investigation

Root Cause

Immediate Fix

Configuration Fixes

Automation

Key Takeaways