Tag: technology

  • Rethinking UNDO Tablespace Monitoring: When 97% Usage Is NOT a Problem

    A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.


    Initial Observations

    Standard checks showed Tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB free space. This strongly suggested the UNDO tablespace was nearly full. However, deeper analysis revealed a different story.


    The Turning Point

    Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:

    SELECT status,
           ROUND(SUM(bytes)/1024/1024,2) MB
    FROM dba_undo_extents
    GROUP BY status;

    Results showed: ACTIVE ~0 MB, UNEXPIRED ~7.3 GB, EXPIRED ~8.6 GB.


    Key Insight

    EXPIRED UNDO extents are fully reusable by Oracle. This means ~55% of the tablespace was immediately reusable and no actual space pressure existed.


    Real Usage Calculation

    SELECT ROUND(
           SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
           / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents;

    Actual usage was only ~45% — not 97%.


    Root Cause of the False Alert

    The monitoring system was using DBA_FREE_SPACE which does not include reusable EXPIRED extents. This is incorrect for UNDO tablespaces and leads to false critical alerts, unnecessary escalations, and wasted DBA effort.


    The Correct Monitoring Approach

    SELECT tablespace_name,
           ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
    / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents
    GROUP BY tablespace_name;

    Only alert when actual usage exceeds 90% AND long-running transactions exist. The full script is available on GitLab: gitlab.com/sdanwarahmed/oracle-dba-scripts


    Key Takeaways

    • UNDO is not a regular tablespace — it has its own lifecycle
    • DBA_FREE_SPACE is misleading for UNDO monitoring
    • Always analyze UNDO extents by status: ACTIVE, UNEXPIRED, EXPIRED
    • Build context-aware monitoring that aligns with Oracle internals
    • Not all “97% full” alerts are real problems — sometimes it’s just the wrong metric

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn  |  GitLab

  • Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes

    Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.


    Observed Symptoms

    ORA-00031: session marked for kill
    Thread 1 cannot allocate new log
    Private strand flush not complete

    Phase 1: Interpreting ORA-00031 Correctly

    ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.


    Phase 2: Identifying the True Performance Signal

    The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.


    Phase 3: Evidence-Based Analysis (Read-Only)

    Redo switch frequency was analyzed to validate system behavior:

    SELECT
        TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
        COUNT(*) AS switches
    FROM v$log_history
    WHERE first_time > SYSDATE - 1
    GROUP BY TRUNC(first_time, 'HH24')
    ORDER BY 1;

    Findings

    MetricObservation
    Average Switch Rate5-7 per hour
    Peak Rate8-10 per hour during business hours
    Off-Peak Rate1-3 per hour

    A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.


    Why No Changes Were Made

    In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.


    Recommendations

    • Continuously monitor redo switch frequency during peak windows
    • Use collected data to justify future redo log sizing via change management
    • Avoid unnecessary intervention when behavior is transient and non-impacting
    • Distinguish informational alert log messages from actionable errors

    Key Takeaways

    • ORA-00031 is expected and harmless — it confirms session termination
    • Redo allocation waits are transient under sustained load
    • Proper analysis prevents unnecessary production intervention
    • Not all alert log warnings indicate failure — some are early signals of workload growth
    • The goal is not to eliminate every alert, but to understand which ones matter

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Resolving Critical FRA Space Issue Using RMAN Optimization (Real-World DBA Scenario)

    During routine monitoring, a critical issue was identified where the Fast Recovery Area (FRA) utilization reached 95%, putting the database at risk of ORA-19809, archiver process failure, and potential database hang. Despite having an RMAN retention policy configured, the FRA continued to grow uncontrollably.


    Investigation

    -- Check FRA usage
    SELECT name, space_limit/1024/1024/1024 limit_gb,
           space_used/1024/1024/1024 used_gb,
           ROUND(space_used/space_limit*100,2) pct_used
    FROM v$recovery_file_dest;
    
    -- Check what is consuming space
    SELECT file_type, percent_space_used, number_of_files
    FROM v$flash_recovery_area_usage
    ORDER BY percent_space_used DESC;
    
    -- Review RMAN configuration
    SHOW ALL;

    Findings: backupsets were consuming the majority of FRA space. RMAN had a 7-day retention policy configured but backup optimization was disabled and archivelog deletion policy was not set.


    Root Cause

    The retention policy was defined but obsolete backups were not being automatically deleted. There was no scheduled execution of DELETE OBSOLETE, backup optimization was disabled causing redundant data, and archivelog lifecycle was not controlled. This led to continuous accumulation of backupsets and archivelogs.


    Immediate Fix

    -- Run RMAN cleanup
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE EXPIRED BACKUP;
    RMAN> DELETE OBSOLETE;
    
    -- Verify FRA usage after cleanup
    SELECT ROUND(space_used/space_limit*100,2) pct_used
    FROM v$recovery_file_dest;

    FRA utilization reduced from 95% to 83% after cleanup.


    Configuration Fixes

    -- Enable backup optimization
    CONFIGURE BACKUP OPTIMIZATION ON;
    
    -- Set archivelog deletion policy
    CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO DISK;
    
    -- Enable compressed backups
    CONFIGURE DEVICE TYPE DISK BACKUP TYPE TO COMPRESSED BACKUPSET;

    Automation

    Scheduled daily execution of DELETE OBSOLETE via cron:

    -- Add to RMAN backup script
    DELETE NOPROMPT OBSOLETE;
    
    # Crontab entry for daily cleanup
    0 2 * * * /path/to/rman_cleanup.sh >> /tmp/rman_cleanup.log 2>&1

    Key Takeaways

    • RMAN retention policy does not enforce deletion automatically — you must schedule DELETE OBSOLETE
    • Backup lifecycle management must include automation
    • Monitor FRA usage proactively to prevent saturation
    • Enable backup optimization to avoid redundant data
    • Set archivelog deletion policy to control archivelog lifecycle

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn