Category: Troubleshooting

Oracle Database and EBS real-world troubleshooting guides and RCA

  • Rethinking UNDO Tablespace Monitoring: When 97% Usage Is NOT a Problem

    A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.


    Initial Observations

    Standard checks showed tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB of free space, strongly suggesting the UNDO tablespace was nearly full.
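    For context, the kind of conventional free-space check that produces such a reading can be sketched as follows (illustrative; the exact monitoring query was not captured):

    ```sql
    -- Conventional free-space check: this is the view of the world
    -- that reported 16 GB total with ~2 MB free for UNDOTBS1
    SELECT df.tablespace_name,
           ROUND(df.total_mb, 2) total_mb,
           ROUND(NVL(fs.free_mb, 0), 2) free_mb,
           ROUND((1 - NVL(fs.free_mb, 0) / df.total_mb) * 100, 2) pct_used
    FROM   (SELECT tablespace_name, SUM(bytes)/1024/1024 total_mb
            FROM dba_data_files GROUP BY tablespace_name) df,
           (SELECT tablespace_name, SUM(bytes)/1024/1024 free_mb
            FROM dba_free_space GROUP BY tablespace_name) fs
    WHERE  df.tablespace_name = fs.tablespace_name (+)
    AND    df.tablespace_name = 'UNDOTBS1';
    ```

    For a regular tablespace this query is fine; for UNDO, as shown next, it is the wrong lens.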


    The Turning Point

    Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:

    SELECT status,
           ROUND(SUM(bytes)/1024/1024,2) MB
    FROM dba_undo_extents
    GROUP BY status;

    Results showed: ACTIVE ~0 MB, UNEXPIRED ~7.3 GB, EXPIRED ~8.6 GB.


    Key Insight

    EXPIRED UNDO extents are fully reusable by Oracle. This means ~55% of the tablespace was immediately reusable and no actual space pressure existed.


    Real Usage Calculation

    SELECT ROUND(
           SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
           / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents;

    Actual usage was only ~45% — not 97%.


    Root Cause of the False Alert

    The monitoring system was using DBA_FREE_SPACE, which does not count reusable EXPIRED extents as free. For UNDO tablespaces this is the wrong metric, and it leads to false critical alerts, unnecessary escalations, and wasted DBA effort.


    The Correct Monitoring Approach

    SELECT tablespace_name,
           ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
    / SUM(bytes) * 100, 2) actual_used_pct
    FROM dba_undo_extents
    GROUP BY tablespace_name;

    Only alert when actual usage exceeds 90% AND long-running transactions exist. The full script is available on GitLab: gitlab.com/sdanwarahmed/oracle-dba-scripts
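    As a sketch, the two alert conditions can be combined in a single query (the 90% threshold and the one-hour cutoff below are illustrative choices, not fixed rules):

    ```sql
    -- Fire only when real usage is high AND a long-running transaction exists
    SELECT CASE
             WHEN u.actual_used_pct > 90 AND t.long_txns > 0 THEN 'CRITICAL'
             ELSE 'OK'
           END alert_status,
           u.actual_used_pct,
           t.long_txns
    FROM   (SELECT ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED')
                                  THEN bytes ELSE 0 END)
                         / SUM(bytes) * 100, 2) actual_used_pct
            FROM dba_undo_extents) u,
           (SELECT COUNT(*) long_txns
            FROM v$transaction
            WHERE start_date < SYSDATE - 1/24) t;  -- running > 1 hour
    ```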


    Key Takeaways

    • UNDO is not a regular tablespace — it has its own lifecycle
    • DBA_FREE_SPACE is misleading for UNDO monitoring
    • Always analyze UNDO extents by status: ACTIVE, UNEXPIRED, EXPIRED
    • Build context-aware monitoring that aligns with Oracle internals
    • Not all “97% full” alerts are real problems — sometimes it’s just the wrong metric

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn  |  GitLab

  • Rethinking UNDO Tablespace Monitoring in Multitenant Environments: A Case Study on APPS_UNDOTS1

    Introduction

    A production alert indicating UNDO tablespace usage at ~99% is typically treated as a critical issue. However, in Oracle Multitenant environments with local UNDO enabled, this metric can be misleading when evaluated using traditional tablespace monitoring techniques.

    This article presents a real-world scenario involving the PDB-level UNDO tablespace APPS_UNDOTS1, where utilization reached ~99%, yet no actual resource contention existed.


    The Incident

    An alert was triggered showing:

    Tablespace              TOTAL_MB     USED_MB     FREE_MB PCT_USED
    -------------------- ----------- ----------- ----------- --------
    APPS_UNDOTS1           28,863.25   28,681.56      181.69    99.37

    At first glance, this indicated:

    • UNDO nearly full
    • Minimal free space
    • Risk of ORA-30036 / ORA-01555

    Environment Context

    This issue occurred in a Multitenant environment with Local UNDO enabled:

    • UNDO tablespace: APPS_UNDOTS1
    • Scope: PDB-level
    • Each PDB maintains its own UNDO

    UNDO behavior must always be analyzed at the PDB level, not just at the CDB level.
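    To confirm the configuration, local UNDO mode can be checked from the root, and the extent analysis should then be run while connected to the PDB (the container placeholder below must be replaced with the actual PDB name):

    ```sql
    -- Verify that local UNDO is enabled (run from CDB$ROOT)
    SELECT property_name, property_value
    FROM   database_properties
    WHERE  property_name = 'LOCAL_UNDO_ENABLED';

    -- Then analyze UNDO from inside the PDB that owns APPS_UNDOTS1:
    -- ALTER SESSION SET CONTAINER = <pdb_name>;
    ```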


    Investigation Approach

    Rather than relying on % used, a state-based analysis was performed.


    Step 1: Active Transactions

    SELECT COUNT(*) FROM v$transaction;

    Observation:

    • Minimal active transactions
    • No ongoing workload pressure

    Step 2: UNDO Extent Analysis

    SELECT status, SUM(bytes)/1024/1024 MB
    FROM dba_undo_extents
    GROUP BY status;

    Key Observation

    STATUS             MB       PCT
    -----------  --------  --------
    ACTIVE          ~3 MB    ~0.01%
    UNEXPIRED      ~51 MB    ~0.18%
    EXPIRED        ~28 GB   ~99.81%

    💡 Critical Insight

    ~99.81% of the UNDO in APPS_UNDOTS1 was EXPIRED and fully reusable.

    This means:

    • No real space pressure
    • No transaction risk
    • Tablespace is effectively available

    Understanding UNDO Behavior

    UNDO extents follow a lifecycle:

    ACTIVE → UNEXPIRED → EXPIRED → REUSED

    State        Description                    Reusable
    ----------   ----------------------------   --------
    ACTIVE       Used by active transactions    ❌
    UNEXPIRED    Retained for consistency       ⚠️
    EXPIRED      No longer needed               ✅

    UNDO extents are not physically freed but logically reused by Oracle based on demand.
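    Retention is part of this lifecycle: UNDO_RETENTION is a target that Oracle tunes dynamically, and v$undostat exposes the value actually in effect:

    ```sql
    -- Recent undo statistics: tuned retention per 10-minute window
    SELECT TO_CHAR(begin_time, 'HH24:MI') snapshot_time,
           tuned_undoretention,  -- retention (seconds) Oracle is currently honoring
           maxquerylen           -- longest query (seconds) in the window
    FROM   v$undostat
    ORDER  BY begin_time DESC
    FETCH FIRST 6 ROWS ONLY;
    ```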


    Why the Alert Was Misleading

    The alert was based on:

    SELECT * FROM dba_free_space;

    Limitation

    • Shows only physically free space
    • Does not reflect reusable UNDO
    • Ignores Oracle’s internal reuse mechanism

    Root Cause

    A long-running concurrent request:

    • Ran for ~30 hours
    • Performed heavy DELETE operations
    • Generated significant UNDO

    During execution:

    • UNDO reached ~99% → real pressure

    After completion:

    • UNDO became EXPIRED
    • Space became reusable
    • Alert persisted → false positive

    When UNDO Is Actually a Problem

    • ACTIVE > 20–30%
    • UNEXPIRED very high
    • EXPIRED very low

    UNDO becomes a real issue only when both EXPIRED and UNEXPIRED extents are exhausted and no reusable space remains.
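    Those exhaustion symptoms leave direct fingerprints in v$undostat, so a read-only check can distinguish them from harmless high utilization:

    ```sql
    -- Signs of genuine UNDO pressure, per 10-minute window
    SELECT TO_CHAR(begin_time, 'HH24:MI') snapshot_time,
           unxpstealcnt,   -- unexpired extents stolen: retention under pressure
           ssolderrcnt,    -- ORA-01555 (snapshot too old) errors in the window
           nospaceerrcnt   -- out-of-space errors (ORA-30036 territory)
    FROM   v$undostat
    ORDER  BY begin_time DESC
    FETCH FIRST 12 ROWS ONLY;
    ```

    Non-zero values in these columns, not a high percent-used figure, are the signal worth alerting on.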


    Recommended Monitoring Approach

    WITH undo AS (
      SELECT status, SUM(bytes)/1024/1024 mb
      FROM dba_undo_extents
      GROUP BY status
    ),
    total AS (
      SELECT SUM(mb) total_mb FROM undo
    )
    SELECT u.status,
           ROUND(u.mb, 2) AS mb,
           ROUND((u.mb / t.total_mb) * 100, 2) AS pct
    FROM undo u, total t;

    Improved Alert Logic

    • ACTIVE > 20% → Critical
    • UNEXPIRED > 80% → Warning
    • Otherwise → Normal
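    A minimal sketch of that alert logic in SQL (thresholds as above; the NVLs guard against states with no extents at all):

    ```sql
    WITH undo AS (
      SELECT status, SUM(bytes) bytes
      FROM   dba_undo_extents
      GROUP  BY status
    ), pct AS (
      SELECT MAX(CASE WHEN status = 'ACTIVE'    THEN bytes END) / SUM(bytes) * 100 active_pct,
             MAX(CASE WHEN status = 'UNEXPIRED' THEN bytes END) / SUM(bytes) * 100 unexpired_pct
      FROM   undo
    )
    SELECT CASE
             WHEN NVL(active_pct, 0)    > 20 THEN 'CRITICAL'
             WHEN NVL(unexpired_pct, 0) > 80 THEN 'WARNING'
             ELSE 'NORMAL'
           END alert_level,
           ROUND(NVL(active_pct, 0), 2)    active_pct,
           ROUND(NVL(unexpired_pct, 0), 2) unexpired_pct
    FROM   pct;
    ```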

    Key Takeaways

    • UNDO is state-driven, not space-driven
    • EXPIRED undo = reusable capacity
    • % used is misleading
    • Always analyze at PDB level in multitenant setups
    • Monitoring must align with Oracle internals

    Conclusion

    UNDO tablespace monitoring must evolve from:

    ❌ Static space-based metrics  ➡️  ✅ Dynamic lifecycle-based analysis

    A tablespace showing 99% utilization can still be completely healthy if most of its extents are reusable.


    Expert-Level Insight

    “UNDO is not a storage problem — it is a lifecycle problem. The difference between false alarms and accurate diagnosis lies in understanding how Oracle transitions undo extents.”


    Final Thought

    👉 Don’t ask:
    “How full is UNDO?”

    👉 Ask:
    “How much of it is actually in use?”


    Author

    Syed Anwar Ahmed
    Oracle Apps DBA | Oracle EBS | Performance & Troubleshooting

  • Oracle EBS Performance Issue: Inactive Forms Session Holding TX Locks (RCA and Resolution)

    In Oracle E-Business Suite (EBS) environments, performance issues are often attributed to high workload or system resource constraints. However, some of the most critical slowdowns originate from less obvious sources — inactive sessions holding uncommitted transactions. This post walks through a real-world production incident where an inactive Oracle Forms session caused cascading blocking across multiple users due to TX row-level locks.


    Observed Symptoms

    • Oracle Forms screens becoming unresponsive in Order Management and Shipping modules
    • Concurrent programs stuck in running state
    • Increased database wait event: enq: TX - row lock contention

    Investigation

    Analysis of v$session, v$transaction, and dba_wait_chains revealed a single inactive Oracle Forms session (frmweb) holding an active transaction with multiple downstream sessions waiting on TX row lock contention.

    -- Identify blocking sessions
    SELECT blocking_session, sid, wait_class, event
    FROM v$session
    WHERE blocking_session IS NOT NULL;
    
    -- Detect inactive sessions with active transactions
    SELECT s.sid, s.serial#, s.status, s.program,
           s.username, t.start_time,
           ROUND(s.last_call_et/3600,2) hrs_inactive
    FROM v$session s, v$transaction t
    WHERE s.saddr = t.ses_addr
    AND s.status = 'INACTIVE'
    ORDER BY hrs_inactive DESC;
    
    -- Analyze full wait chain
    SELECT * FROM dba_wait_chains;

    The root blocking session showed STATUS = INACTIVE and EVENT = 'SQL*Net message from client' but had an active transaction in v$transaction — confirming it was idle at the application level but actively holding locks at the database level.


    Root Cause

    An Oracle Forms session executed a SELECT ... FOR UPDATE NOWAIT on WSH_DELIVERY_DETAILS, then became idle without committing or rolling back. This held exclusive row locks that blocked other sessions attempting to access the same rows, creating a cascading blocking chain.

    -- The problematic SQL pattern
    SELECT *
    FROM WSH_DELIVERY_DETAILS
    WHERE ROWID = :B1
    FOR UPDATE NOWAIT;
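    The TX enqueue itself can be cross-checked in v$lock: the holder appears with lmode = 6 (exclusive) and block = 1, while each waiter shows request = 6 on the same id1/id2 pair:

    ```sql
    -- TX enqueue holders and waiters on the same transaction
    SELECT sid, type, id1, id2, lmode, request, block
    FROM   v$lock
    WHERE  type = 'TX'
    ORDER  BY id1, request;
    ```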

    Resolution

    The root blocking session was identified and confirmed to be carrying no legitimate in-flight business work. Approvals were obtained, and the session was terminated:

    -- Kill blocking session (only after full validation and approval)
    ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;

    Locks were released immediately, the blocking chain resolved, and application responsiveness was restored.


    Preventive Measures

    • Implement idle session timeout policies
    • Educate users on proper transaction handling in Oracle Forms
    • Review custom code using FOR UPDATE — keep transactions short and commit promptly
    • Monitor long-running and idle transactions proactively
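    As one example of an idle-timeout policy, a resource profile can cap idle time. The profile name and limit below are illustrative, and note that terminating an idle session rolls back its open transaction:

    ```sql
    -- IDLE_TIME is enforced only when resource_limit is TRUE
    ALTER SYSTEM SET resource_limit = TRUE;

    -- Hypothetical profile for Forms users; limit is in minutes
    ALTER PROFILE forms_users LIMIT IDLE_TIME 60;
    ```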

    Key Takeaways

    • An inactive session can still hold active transactions and critical locks
    • Always identify the root blocker — intermediate sessions are symptoms, not the cause
    • Application-level inactivity does not mean database-level inactivity
    • In Oracle EBS, the most disruptive issues are often caused by inactive sessions holding uncommitted transactions

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Database Session Spike Mystery: When Connection Pools Collide

    In a production Oracle database environment, a sudden spike to more than 1,000 sessions triggered alerts and concern. Interestingly, the system recovered automatically without any database-level intervention. At first glance, this appeared to be a database issue — but deeper analysis revealed a different story.


    The Incident

    An automated alert reported session count exceeding threshold (1000+), with the majority in INACTIVE state from middleware connection pool accounts. Despite the spike there were no blocking sessions, no performance degradation, and no database errors.

    -- Quick session count check
    SELECT COUNT(*) FROM gv$session;
    -- Result: 216 (already returning to normal)
    
    -- Session breakdown by status
    SELECT status, COUNT(*) cnt
    FROM gv$session
    GROUP BY status
    ORDER BY cnt DESC;

    Root Cause

    Multiple production mid-tier servers created new connection pools within the same time window. The new pools opened fresh database sessions while the existing pools kept their sessions alive (INACTIVE) pending graceful termination — resulting in a temporary overlap:

    Old Sessions (Inactive) + New Sessions (Active) = Session Surge

    As older pools were cleaned up, inactive sessions terminated automatically and the count returned to baseline. This was not a database problem — it was connection pool lifecycle behavior in the mid-tier layer.
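    To confirm this pattern, sessions can be attributed to their pools by grouping on mid-tier host and logon hour; an overlap shows up as two distinct logon-time clusters from the same machine:

    ```sql
    -- Pool overlap appears as old + new logon clusters per mid-tier host
    SELECT machine,
           TO_CHAR(TRUNC(logon_time, 'HH24'), 'DD-MON HH24:MI') logon_hour,
           status,
           COUNT(*) sessions
    FROM   gv$session
    WHERE  type = 'USER'
    GROUP  BY machine, TRUNC(logon_time, 'HH24'), status
    ORDER  BY machine, logon_hour;
    ```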


    Recommendations

    • Stagger connection pool refresh across mid-tier servers to avoid simultaneous spikes
    • Monitor inactive session trends to detect abnormal accumulation early
    • Configure appropriate idle timeout, maximum pool size, and session reuse settings

    Key Takeaways

    • Not all session spikes are database problems — check middleware behavior first
    • High session count does not necessarily indicate database stress
    • Transient issues still require analysis as they reveal architectural inefficiencies
    • Database alerts can originate from upstream connection management behavior

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle Alert Log Deep Dive: Interpreting ORA-00031 and Redo Log Pressure Without Production Changes

    Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.


    Observed Symptoms

    ORA-00031: session marked for kill
    Thread 1 cannot allocate new log
    Private strand flush not complete

    Phase 1: Interpreting ORA-00031 Correctly

    ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.
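    Sessions in this state remain visible until PMON completes the cleanup, and can be listed with a simple read-only check:

    ```sql
    -- Sessions marked for kill, pending background cleanup
    SELECT sid, serial#, username, status, event
    FROM   v$session
    WHERE  status = 'KILLED';
    ```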


    Phase 2: Identifying the True Performance Signal

    The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.
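    A useful companion check — also read-only — is the redo log configuration itself, since undersized groups amplify these waits under heavy DML:

    ```sql
    -- Redo log group sizes and states
    SELECT group#, thread#, bytes/1024/1024 size_mb, members, status
    FROM   v$log
    ORDER  BY group#;
    ```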


    Phase 3: Evidence-Based Analysis (Read-Only)

    Redo switch frequency was analyzed to validate system behavior:

    SELECT
        TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
        COUNT(*) AS switches
    FROM v$log_history
    WHERE first_time > SYSDATE - 1
    GROUP BY TRUNC(first_time, 'HH24')
    ORDER BY 1;

    Findings

    Metric                 Observation
    -------------------    -----------------------------------
    Average switch rate    5-7 per hour
    Peak rate              8-10 per hour during business hours
    Off-peak rate          1-3 per hour

    A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.


    Why No Changes Were Made

    In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.


    Recommendations

    • Continuously monitor redo switch frequency during peak windows
    • Use collected data to justify future redo log sizing via change management
    • Avoid unnecessary intervention when behavior is transient and non-impacting
    • Distinguish informational alert log messages from actionable errors

    Key Takeaways

    • ORA-00031 is expected and harmless — it confirms session termination
    • Redo allocation waits are transient under sustained load
    • Proper analysis prevents unnecessary production intervention
    • Not all alert log warnings indicate failure — some are early signals of workload growth
    • The goal is not to eliminate every alert, but to understand which ones matter

    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn

  • Oracle DB Health Check Scripts: Real-World DBA Monitoring Guide

    In day-to-day Oracle Database and E-Business Suite (EBS) administration, proactive monitoring is critical to ensure system stability, performance, and availability. As part of my real-world DBA experience, I created a set of health check scripts to quickly assess the status of key database components.


    Objective

    The goal of these scripts is to provide a quick and efficient health check covering database status, instance status, tablespace usage, invalid objects, and RMAN backup status — all from a single SQL*Plus session.


    Solution Overview

    I developed a SQL-based script (hc.sql) that gathers essential health metrics from Oracle dynamic views and DBA tables. It is designed to be lightweight and fast, delivering immediately actionable insights.


    GitHub Repository

    The complete script is available on GitHub:
    https://github.com/Syedanwarahmed/scripts_for_healthcheck


    Key Features

    • Database and instance status check
    • Tablespace usage monitoring
    • Active session tracking
    • Invalid object detection
    • RMAN backup job verification
    • Easy execution using SQL*Plus

    How to Use

    sqlplus / as sysdba
    @hc.sql

    Sample Checks Included

    Database Status

    SELECT NAME, OPEN_MODE, DATABASE_ROLE FROM V$DATABASE;

    Tablespace Usage

    -- Note: used_space and tablespace_size are in database blocks;
    -- the 8192 multiplier assumes an 8 KB block size
    SELECT tablespace_name,
           ROUND(used_space * 8192 / 1024 / 1024, 2) used_mb,
           ROUND(tablespace_size * 8192 / 1024 / 1024, 2) total_mb,
           ROUND(used_percent, 2) pct_used
    FROM dba_tablespace_usage_metrics
    ORDER BY used_percent DESC;

    Active Sessions

    SELECT COUNT(*) active_sessions FROM V$SESSION WHERE STATUS='ACTIVE';
    
    -- Session breakdown by status
    SELECT status, COUNT(*) cnt
    FROM v$session
    GROUP BY status
    ORDER BY cnt DESC;

    Invalid Objects

    SELECT owner, object_type, COUNT(*) cnt
    FROM dba_objects
    WHERE status = 'INVALID'
    GROUP BY owner, object_type
    ORDER BY cnt DESC;

    RMAN Backup Status

    SELECT status, start_time, end_time,
           ROUND((output_bytes/1024/1024/1024),2) output_gb
    FROM V$RMAN_BACKUP_JOB_DETAILS
    ORDER BY start_time DESC
    FETCH FIRST 5 ROWS ONLY;

    Real-World Value

    These scripts are derived from real-time production support scenarios where quick diagnosis is required during high CPU issues, backup failures, tablespace alerts, and performance degradation. Having a single script to validate system health saves valuable time during critical situations.


    Future Enhancements

    I plan to enhance this repository further by adding shell automation scripts, alerting mechanisms, integration with monitoring tools, and additional EBS-specific checks.


    Conclusion

    A well-designed health check script is an essential tool for every DBA. It not only helps in proactive monitoring but also ensures faster troubleshooting and improved system reliability. Feel free to explore the repository, use the scripts, and share your feedback.


    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn  |  GitHub

  • Real-World Oracle Database and EBS Troubleshooting Guides

    A collection of real-world Oracle Database and EBS troubleshooting scenarios from production environments. These guides cover common issues encountered during day-to-day DBA operations.


    Resolving RMAN ORA-19502 Backup Failure Due to Disk Full

    RMAN backup job failed with the following errors:

    ORA-19502: write error on file
    ORA-27072: File I/O error
    Linux-x86_64 Error: 28: No space left on device

    Environment

    Oracle Database 19c, RMAN Disk Backup, Linux Environment

    Diagnosis

    Checked RMAN logs and identified the backup destination filesystem was full:

    # Check filesystem usage at the OS level (shell)
    df -h
    
    -- Check FRA usage from the database
    SELECT name, space_limit/1024/1024/1024 limit_gb,
           space_used/1024/1024/1024 used_gb,
           ROUND(space_used/space_limit*100,2) pct_used
    FROM v$recovery_file_dest;

    The backup mount point reached 100% utilization.

    Resolution

    Cleaned older backup files after validating they were no longer needed, then reran the backup job:

    -- Use RMAN to safely clean up obsolete backups (preferred over OS delete)
    RMAN> CROSSCHECK BACKUP;
    RMAN> DELETE EXPIRED BACKUP;
    RMAN> DELETE OBSOLETE;

    After freeing space, the RMAN backup completed successfully.

    Key Lesson

    Always monitor backup destinations proactively. Schedule DELETE OBSOLETE as part of your regular RMAN maintenance to prevent FRA saturation.
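    For example, a retention policy can be configured once and then enforced by the scheduled cleanup (the seven-day window below is illustrative; choose it to match your recovery SLA):

    ```
    RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
    RMAN> SHOW RETENTION POLICY;
    RMAN> DELETE OBSOLETE;
    ```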


    Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience.
    Connect: sdanwarahmed@gmail.com  |  LinkedIn