A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.
Initial Observations
Standard checks showed tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB of free space, strongly suggesting the UNDO tablespace was nearly full.
The Turning Point
Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:
SELECT status,
ROUND(SUM(bytes)/1024/1024,2) MB
FROM dba_undo_extents
GROUP BY status;
The results showed that the majority of extents were in EXPIRED status. EXPIRED UNDO extents are fully reusable by Oracle, which means ~55% of the tablespace was immediately reusable and no actual space pressure existed.
Real Usage Calculation
SELECT ROUND(
SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
/ SUM(bytes) * 100, 2) actual_used_pct
FROM dba_undo_extents;
Actual usage was only ~45% — not 97%.
Root Cause of the False Alert
The monitoring system was using DBA_FREE_SPACE which does not include reusable EXPIRED extents. This is incorrect for UNDO tablespaces and leads to false critical alerts, unnecessary escalations, and wasted DBA effort.
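For reference, a free-space check of the kind that typically drives such alerts is sketched below. The actual monitoring query was not captured, so the join and column choices here are assumptions built on the standard DBA_DATA_FILES and DBA_FREE_SPACE views.
-- Illustrative free-space check of the type that raised the alert
-- (the real monitoring query is an assumption; views and columns are standard)
SELECT df.tablespace_name,
       ROUND((df.total_mb - NVL(fs.free_mb, 0)) / df.total_mb * 100, 2) AS pct_used
FROM   (SELECT tablespace_name, SUM(bytes)/1024/1024 total_mb
        FROM dba_data_files
        GROUP BY tablespace_name) df,
       (SELECT tablespace_name, SUM(bytes)/1024/1024 free_mb
        FROM dba_free_space
        GROUP BY tablespace_name) fs
WHERE  df.tablespace_name = fs.tablespace_name(+)
AND    df.tablespace_name = 'UNDOTBS1';
Because EXPIRED undo extents are still allocated from this view's perspective, such a query reports near 100% usage even when most of the tablespace is reusable.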
The Correct Monitoring Approach
SELECT tablespace_name,
ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
/ SUM(bytes) * 100, 2) actual_used_pct
FROM dba_undo_extents
GROUP BY tablespace_name;
A production alert indicating UNDO tablespace usage at ~99% is typically treated as a critical issue. However, in Oracle Multitenant environments with local UNDO enabled, this metric can be misleading when evaluated using traditional tablespace monitoring techniques.
This article presents a real-world scenario involving the PDB-level UNDO tablespace APPS_UNDOTS1, where utilization reached ~99%, yet no actual resource contention existed.
This issue occurred in a Multitenant environment with Local UNDO enabled:
UNDO tablespace: APPS_UNDOTS1
Scope: PDB-level
Each PDB maintains its own UNDO
UNDO behavior must always be analyzed at the PDB level, not just at the CDB level.
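To confirm the undo configuration before digging into extents, local undo mode and the per-PDB undo tablespaces can be verified with standard dictionary views. A minimal sketch, run from the CDB root and applicable to 12.2 and later:
-- Is local undo enabled for this CDB?
SELECT property_name, property_value
FROM   database_properties
WHERE  property_name = 'LOCAL_UNDO_ENABLED';

-- Undo tablespaces per container
SELECT con_id, tablespace_name
FROM   cdb_tablespaces
WHERE  contents = 'UNDO'
ORDER  BY con_id;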
Investigation Approach
Rather than relying on % used, a state-based analysis was performed.
Step 1: Active Transactions
SELECT COUNT(*) FROM v$transaction;
Observation:
Minimal active transactions
No ongoing workload pressure
Step 2: UNDO Extent Analysis
SELECT status, SUM(bytes)/1024/1024 MB FROM dba_undo_extents GROUP BY status;
Key Observation
STATUS        SIZE       PCT
-----------   --------   -------
ACTIVE        ~3 MB      ~0.01%
UNEXPIRED     ~51 MB     ~0.18%
EXPIRED       ~28 GB     ~99.81%
💡 Critical Insight
~99.81% of UNDO (APPS_UNDOTS1) was EXPIRED and fully reusable
This means:
No real space pressure
No transaction risk
Tablespace is effectively available
Understanding UNDO Behavior
UNDO extents follow a lifecycle:
ACTIVE → UNEXPIRED → EXPIRED → REUSED
State        Description                     Reusable
ACTIVE       Used by active transactions     ❌
UNEXPIRED    Retained for consistency        ⚠️
EXPIRED      No longer needed                ✅
UNDO extents are not physically freed but logically reused by Oracle based on demand.
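Retention settings control how quickly UNEXPIRED extents become EXPIRED; the configured and tuned values can be checked read-only. A sketch using the standard V$UNDOSTAT view:
-- Configured undo retention (seconds)
SHOW PARAMETER undo_retention

-- Retention Oracle has actually been tuning to over the last day
SELECT MAX(tuned_undoretention) AS tuned_retention_sec
FROM   v$undostat
WHERE  begin_time > SYSDATE - 1;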
Why the Alert Was Misleading
The alert was based on:
SELECT * FROM dba_free_space;
Limitations of this check:
Shows only physically free space
Does not reflect reusable UNDO
Ignores Oracle’s internal reuse mechanism
Root Cause
A long-running concurrent request:
Ran for ~30 hours
Performed heavy DELETE operations
Generated significant UNDO
During execution:
UNDO reached ~99% → real pressure
After completion:
UNDO became EXPIRED
Space became reusable
Alert persisted → false positive
When UNDO Is Actually a Problem
ACTIVE > 20–30%
UNEXPIRED very high
EXPIRED very low
UNDO becomes a real issue only when both EXPIRED and UNEXPIRED extents are exhausted and no reusable space remains.
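V$UNDOSTAT also exposes counters that flag genuine pressure directly; nonzero values here, rather than a high percent-used figure, are the signals worth alerting on. A read-only sketch:
-- Real undo pressure indicators over the last 24 hours
SELECT begin_time,
       unxpstealcnt,    -- unexpired extents stolen from other transactions
       ssolderrcnt,     -- ORA-01555 (snapshot too old) occurrences
       nospaceerrcnt    -- out-of-space errors while extending undo
FROM   v$undostat
WHERE  begin_time > SYSDATE - 1
ORDER  BY begin_time;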
Recommended Monitoring Approach
WITH undo AS (
  SELECT status, SUM(bytes)/1024/1024 mb
  FROM dba_undo_extents
  GROUP BY status
), total AS (
  SELECT SUM(mb) total_mb FROM undo
)
SELECT u.status,
       ROUND(u.mb, 2) AS mb,
       ROUND((u.mb / t.total_mb) * 100, 2) AS pct
FROM undo u, total t;
A tablespace showing 99% utilization can still be completely healthy if most of its extents are reusable.
Expert-Level Insight
“UNDO is not a storage problem — it is a lifecycle problem. The difference between false alarms and accurate diagnosis lies in understanding how Oracle transitions undo extents.”
During an Oracle E-Business Suite ADOP patching cycle in a multi-node environment, the apply phase failed on one node while completing successfully on others. Despite retries — including downtime mode — the issue persisted, pointing to a deeper inconsistency within the patching framework.
Symptoms Observed
ADOP session status: FAILED
Patch applied successfully on some nodes, failed on admin node
Repeated failures even with restart=no, abandon=yes, and downtime mode
No immediate actionable error from standard logs
Timeline of Events
T0 -- Patch execution initiated (ADOP apply phase)
T1 -- Failure observed on admin node
T2 -- Retry using downtime mode -- Failure persists
T3 -- ADOP session review shows inconsistent state
T4 -- Internal metadata tables analyzed
T5 -- Cleanup performed (tables + restart directory)
T6 -- Patch re-executed -- Success across all nodes
Investigation
Step 1: Check ADOP Session State
Query the ADOP session status to understand the current state across all nodes:
-- Check current ADOP session status
SELECT session_id, node_name, phase, status,
start_date, end_date
FROM applsys.ad_adop_sessions
ORDER BY start_date DESC;
-- Check apply phase status per node
SELECT s.session_id, n.node_name, p.phase_code,
p.status, p.start_date, p.end_date
FROM applsys.ad_adop_sessions s,
applsys.ad_adop_session_phases p,
applsys.fnd_nodes n
WHERE s.session_id = p.session_id
AND p.node_id = n.node_id
ORDER BY p.start_date DESC;
The existing session showed status FAILED with the apply phase partially completed — a clear indicator of inconsistent execution state across nodes.
Step 2: Check adalldefaults.txt
Reviewed the defaults file (adalldefaults.txt) for relevant configuration entries. Modifying the suspected parameter and retrying had no impact, confirming the issue was not translation-related.
Step 3: Check Install Processes Table
-- Check for stale entries in FND_INSTALL_PROCESSES
SELECT COUNT(*) FROM applsys.fnd_install_processes;
-- View stale entries in detail
SELECT process_status, process_name, last_update_date
FROM applsys.fnd_install_processes
ORDER BY last_update_date DESC;
-- Check AD_DEFERRED_JOBS
SELECT COUNT(*) FROM applsys.ad_deferred_jobs;
SELECT * FROM applsys.ad_deferred_jobs;
Observation: FND_INSTALL_PROCESSES contained stale entries from the failed session; AD_DEFERRED_JOBS was empty.
Root Cause
The failure was caused by stale and inconsistent ADOP metadata tables — specifically APPLSYS.FND_INSTALL_PROCESSES and APPLSYS.AD_DEFERRED_JOBS. ADOP internally relies on these tables to track patch progress checkpoints, deferred job execution, and restart state management. When these tables retain entries from failed or incomplete sessions, ADOP assumes an incorrect execution state, leading to patch reconciliation failure, apply phase breakdown, and node-level inconsistencies.
Resolution Steps
Step 1: Backup Critical Tables
-- Always backup before any cleanup
CREATE TABLE applsys.fnd_install_processes_bak AS
SELECT * FROM applsys.fnd_install_processes;
CREATE TABLE applsys.ad_deferred_jobs_bak AS
SELECT * FROM applsys.ad_deferred_jobs;
-- Verify backups
SELECT COUNT(*) FROM applsys.fnd_install_processes_bak;
SELECT COUNT(*) FROM applsys.ad_deferred_jobs_bak;
Step 2: Drop Stale Metadata Tables
Dropping these tables forces ADOP to rebuild clean metadata during the next run:
DROP TABLE applsys.fnd_install_processes;
DROP TABLE applsys.ad_deferred_jobs;
Step 3: Reset the Restart Directory
The restart directory can silently preserve failure states. Back it up and create a fresh one:
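The commands below are a minimal sketch of that reset, assuming the standard $APPL_TOP/admin/<SID>/restart location used by the patching tools; confirm the path for the specific environment before running anything.
# Back up and recreate the restart directory (location is an assumption)
cd $APPL_TOP/admin/$TWO_TASK
mv restart restart_bak_$(date +%Y%m%d)
mkdir restart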
The patch completed successfully across all nodes after the metadata cleanup.
Before vs After
Component            Before Fix     After Fix
ADOP Session         Failed         Successful
Node Consistency     Partial        Full
Restart Behavior     Stuck          Clean
Patch Execution      Incomplete     Completed
Key Takeaways
ADOP is state-driven — even when logs appear clean, internal metadata drives execution decisions
Partial success is a clue — if some nodes succeed and one fails, focus on local metadata, not the patch itself
The restart directory matters — it can silently preserve failure states and must be validated before retrying
Downtime mode is not a fix-all — even in downtime, ADOP still reads metadata tables; corruption persists unless cleaned
Always backup before cleanup — never drop tables without creating a backup first
When NOT to Use This Approach
Avoid applying this fix if the issue is caused by missing database patches (ETCC warnings), file system or permission issues, incorrect patch sequencing, or environment misconfiguration. Always validate the root cause before performing any metadata cleanup.
This scenario highlights a subtle but critical behavior in ADOP — sometimes patch failures are not caused by the patch itself, but by what the system remembers about past attempts. By resetting stale metadata, we allow ADOP to re-evaluate the environment cleanly, leading to successful execution.
In Oracle E-Business Suite (EBS) environments, performance issues are often attributed to high workload or system resource constraints. However, some of the most critical slowdowns originate from less obvious sources — inactive sessions holding uncommitted transactions. This post walks through a real-world production incident where an inactive Oracle Forms session caused cascading blocking across multiple users due to TX row-level locks.
Observed Symptoms
Oracle Forms screens becoming unresponsive in Order Management and Shipping modules
Analysis of v$session, v$transaction, and dba_wait_chains revealed a single inactive Oracle Forms session (frmweb) holding an active transaction with multiple downstream sessions waiting on TX row lock contention.
-- Identify blocking sessions
SELECT blocking_session, sid, wait_class, event
FROM v$session
WHERE blocking_session IS NOT NULL;
-- Detect inactive sessions with active transactions
SELECT s.sid, s.serial#, s.status, s.program,
s.username, t.start_time,
ROUND(s.last_call_et/3600,2) hrs_inactive
FROM v$session s, v$transaction t
WHERE s.saddr = t.ses_addr
AND s.status = 'INACTIVE'
ORDER BY hrs_inactive DESC;
-- Analyze full wait chain
SELECT * FROM dba_wait_chains;
The root blocking session showed STATUS = INACTIVE and EVENT = 'SQL*Net message from client' but had an active transaction in v$transaction — confirming it was idle at the application level but actively holding locks at the database level.
Root Cause
An Oracle Forms session executed a SELECT ... FOR UPDATE NOWAIT on WSH_DELIVERY_DETAILS, then became idle without committing or rolling back. This held exclusive row locks that blocked other sessions attempting to access the same rows, creating a cascading blocking chain.
-- The problematic SQL pattern
SELECT *
FROM WSH_DELIVERY_DETAILS
WHERE ROWID = :B1
FOR UPDATE NOWAIT;
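To confirm exactly which objects the idle session was holding, its locks can be mapped back to object names with standard views. A hedged sketch; filter on the blocker's SID as needed:
-- Map locked objects back to their owning sessions
SELECT lo.session_id,
       o.owner,
       o.object_name,
       lo.locked_mode
FROM   v$locked_object lo,
       dba_objects     o
WHERE  lo.object_id = o.object_id
ORDER  BY lo.session_id;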
Resolution
The root blocking session was identified and confirmed to be holding no business-critical pending work; after approvals were obtained, the session was terminated:
-- Kill blocking session (only after full validation and approval)
ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;
Locks were released immediately, the blocking chain resolved, and application responsiveness was restored.
Preventive Measures
Implement idle session timeout policies
Educate users on proper transaction handling in Oracle Forms
Review custom code using FOR UPDATE — keep transactions short and commit promptly
Monitor long-running and idle transactions proactively
Key Takeaways
An inactive session can still hold active transactions and critical locks
Always identify the root blocker — intermediate sessions are symptoms, not the cause
Application-level inactivity does not mean database-level inactivity
In Oracle EBS, the most disruptive issues are often caused by inactive sessions holding uncommitted transactions
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience. Connect: sdanwarahmed@gmail.com | LinkedIn
In a production Oracle database environment, a sudden spike in session count exceeding 1000+ sessions triggered alerts and concern. Interestingly, the system recovered automatically without any database-level intervention. At first glance this appeared to be a database issue — but deeper analysis revealed a different story.
The Incident
An automated alert reported session count exceeding threshold (1000+), with the majority in INACTIVE state from middleware connection pool accounts. Despite the spike there were no blocking sessions, no performance degradation, and no database errors.
-- Quick session count check
SELECT COUNT(*) FROM gv$session;
-- Result: 216 (already returning to normal)
-- Session breakdown by status
SELECT status, COUNT(*) cnt
FROM gv$session
GROUP BY status
ORDER BY cnt DESC;
Root Cause
Multiple production mid-tier servers created new connection pools within the same time window. The new pools opened new database sessions while the existing pools kept their sessions alive (INACTIVE) pending graceful termination, resulting in a temporary overlap:
Old Sessions (Inactive) + New Sessions (Active) = Session Surge
As older pools were cleaned up, inactive sessions terminated automatically and the count returned to baseline. This was not a database problem — it was connection pool lifecycle behavior in the mid-tier layer.
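Attributing the surge to the mid-tier is straightforward from the session metadata itself; grouping by machine, program, and logon time shows whether sessions arrived in a burst from the pool accounts. An illustrative query:
-- Break sessions down by mid-tier host, program, and status
SELECT machine,
       program,
       status,
       COUNT(*)        AS cnt,
       MIN(logon_time) AS oldest_logon,
       MAX(logon_time) AS newest_logon
FROM   gv$session
WHERE  type = 'USER'
GROUP  BY machine, program, status
ORDER  BY cnt DESC;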
Recommendations
Stagger connection pool refresh across mid-tier servers to avoid simultaneous spikes
Monitor inactive session trends to detect abnormal accumulation early
Configure appropriate idle timeout, maximum pool size, and session reuse settings
Key Takeaways
Not all session spikes are database problems — check middleware behavior first
High session count does not necessarily indicate database stress
Transient issues still require analysis as they reveal architectural inefficiencies
Database alerts can originate from upstream connection management behavior
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience. Connect: sdanwarahmed@gmail.com | LinkedIn
Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.
Observed Symptoms
ORA-00031: session marked for kill
Thread 1 cannot allocate new log
Private strand flush not complete
Phase 1: Interpreting ORA-00031 Correctly
ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.
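Sessions in this state remain visible until PMON completes the cleanup; a quick read-only check confirms they are simply awaiting asynchronous removal:
-- Sessions already marked for kill (the source of ORA-00031 messages)
SELECT sid, serial#, username, status, event
FROM   v$session
WHERE  status = 'KILLED';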
Phase 2: Identifying the True Performance Signal
The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.
Phase 3: Evidence-Based Analysis (Read-Only)
Redo switch frequency was analyzed to validate system behavior:
SELECT
TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
COUNT(*) AS switches
FROM v$log_history
WHERE first_time > SYSDATE - 1
GROUP BY TRUNC(first_time, 'HH24')
ORDER BY 1;
Findings
Metric                 Observation
Average Switch Rate    5-7 per hour
Peak Rate              8-10 per hour during business hours
Off-Peak Rate          1-3 per hour
A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.
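To support any future sizing proposal through change management, the current redo log layout can also be captured read-only. A minimal sketch:
-- Current redo log groups and sizes
SELECT group#, thread#, bytes/1024/1024 AS size_mb, members, status
FROM   v$log
ORDER  BY group#;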
Why No Changes Were Made
In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.
Recommendations
Continuously monitor redo switch frequency during peak windows
Use collected data to justify future redo log sizing via change management
Avoid unnecessary intervention when behavior is transient and non-impacting
Distinguish informational alert log messages from actionable errors
Key Takeaways
ORA-00031 is expected and harmless — it confirms session termination
Redo allocation waits are transient under sustained load
Proper analysis prevents unnecessary production intervention
Not all alert log warnings indicate failure — some are early signals of workload growth
The goal is not to eliminate every alert, but to understand which ones matter
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience. Connect: sdanwarahmed@gmail.com | LinkedIn
During routine monitoring, a critical issue was identified where the Fast Recovery Area (FRA) utilization reached 95%, putting the database at risk of ORA-19809, archiver process failure, and potential database hang. Despite having an RMAN retention policy configured, the FRA continued to grow uncontrollably.
Investigation
-- Check FRA usage
SELECT name, space_limit/1024/1024/1024 limit_gb,
space_used/1024/1024/1024 used_gb,
ROUND(space_used/space_limit*100,2) pct_used
FROM v$recovery_file_dest;
-- Check what is consuming space
SELECT file_type, percent_space_used, number_of_files
FROM v$flash_recovery_area_usage
ORDER BY percent_space_used DESC;
-- Review RMAN configuration
RMAN> SHOW ALL;
Findings: backupsets were consuming the majority of FRA space. RMAN had a 7-day retention policy configured but backup optimization was disabled and archivelog deletion policy was not set.
Root Cause
The retention policy was defined but obsolete backups were not being automatically deleted. There was no scheduled execution of DELETE OBSOLETE, backup optimization was disabled causing redundant data, and archivelog lifecycle was not controlled. This led to continuous accumulation of backupsets and archivelogs.
Immediate Fix
-- Run RMAN cleanup
RMAN> CROSSCHECK BACKUP;
RMAN> DELETE EXPIRED BACKUP;
RMAN> DELETE OBSOLETE;
-- Verify FRA usage after cleanup
SELECT ROUND(space_used/space_limit*100,2) pct_used
FROM v$recovery_file_dest;
FRA utilization reduced from 95% to 83% after cleanup.
Configuration Fixes
-- Enable backup optimization
CONFIGURE BACKUP OPTIMIZATION ON;
-- Set archivelog deletion policy
CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO DISK;
-- Enable compressed backups
CONFIGURE DEVICE TYPE DISK BACKUP TYPE TO COMPRESSED BACKUPSET;
Automation
Scheduled daily execution of DELETE OBSOLETE via cron:
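A wrapper script along these lines can be scheduled; the script name, SID, paths, and log destination below are placeholders, and the schedule should follow local backup windows.
#!/bin/bash
# rman_fra_cleanup.sh -- illustrative sketch; environment values are placeholders
export ORACLE_SID=your_sid
export ORACLE_HOME=/path/to/oracle_home
export PATH=$ORACLE_HOME/bin:$PATH

rman target / log=/tmp/rman_fra_cleanup_$(date +%Y%m%d).log <<EOF
CROSSCHECK BACKUP;
CROSSCHECK ARCHIVELOG ALL;
DELETE NOPROMPT EXPIRED BACKUP;
DELETE NOPROMPT OBSOLETE;
EXIT;
EOF

# Example crontab entry (daily at 02:00; path is a placeholder):
# 0 2 * * * /path/to/rman_fra_cleanup.sh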
In Oracle E-Business Suite (EBS) environments, login failures are often perceived as simple application issues. However, in complex architectures, they can originate from multiple interacting layers across the application and middleware stack.
In this blog, I’ll walk through a real-world production incident where an OACORE JVM issue combined with WebLogic security filtering resulted in complete login inaccessibility.
This case highlights the importance of analyzing both performance and security layers together when troubleshooting critical application outages.
⚠️ Issue Summary
Users were unable to access the EBS login page
Pages were hanging or not loading
WebLogic console reported:
Connection rejected, filter blocked Socket
🔍 Initial Observation
From the application server:
Load Average: ~10+
👉 This indicated:
High CPU utilization
System under heavy stress
Potential JVM thread contention
🔬 Detailed Analysis
One of the OACORE managed server JVMs became unresponsive
Long-running threads caused thread pool exhaustion
Incoming user requests began queueing
At the same time:
WebLogic connection filter was actively enforcing access rules
Legitimate requests were being rejected under stressed conditions
🧠 Understanding the Components
OACORE (Application Layer)
Handles:
Login requests
Forms processing
Core application logic
If JVM threads are exhausted: 👉 Requests queue → login hangs
WebLogic Connection Filter (Security Layer)
Enforces connection-level access rules for incoming requests.
If misconfigured or evaluated under stressed conditions: 👉 Legitimate traffic may be blocked
🎯 Root Cause Analysis (RCA)
The login issue was not caused by a single failure point, but by a combination of application tier resource exhaustion and restrictive middleware-level access control.
High CPU utilization and long-running threads caused one OACORE JVM to become unresponsive
Thread pool exhaustion led to request queuing, preventing new login requests from being processed
Simultaneously:
The WebLogic connection filter (EBSConnectionFilterImpl) enforced strict access control policies
Under high load conditions, legitimate client requests were rejected with “filter blocked Socket”
This interaction between performance degradation and security enforcement amplified the impact, resulting in complete login inaccessibility despite partial system availability.
🛠️ Resolution Approach (Controlled & Safe)
The resolution approach focused on stabilizing the JVM layer while validating and correcting middleware-level access controls in a controlled manner.
🔹 Step 1: Identify Unresponsive JVM
ps -ef | grep oacore
✔ Identify JVM with abnormal CPU or stuck behavior
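Before any action, the suspect JVM can be examined without stopping it; a thread dump is usually more informative than process status alone. A hedged sketch (the PID is a placeholder; dump output lands in the managed server's .out log):
# Check CPU consumption of the suspect JVM (PID is a placeholder)
top -b -n 1 -p <PID>

# Request a thread dump without terminating the process (SIGQUIT)
kill -3 <PID>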
🔹 Step 2: Handle Stuck JVM (Controlled Action)
⚠️ Important Note:
Forcefully terminating JVM processes should NOT be performed without validation.
✔ Recommended Approach:
Confirm the process is unresponsive
Ensure no critical transactions are running
Prefer controlled shutdown where possible
✔ Example (Only if fully unresponsive and approved):
kill -9 <PID>
👉 Node Manager can restart the JVM automatically after termination
Avoid forceful JVM termination without proper validation
Do not disable security controls without understanding impact
Follow change management procedures in production
Coordinate with application and security teams before changes
🏁 Conclusion
This incident demonstrates that login failures in Oracle EBS are not always isolated to a single component but can result from complex interactions across application performance and middleware security layers.
The combination of JVM resource exhaustion and connection filtering behavior created a compounded failure scenario, leading to complete login disruption.
A structured, multi-layer troubleshooting approach—focused on performance, configuration, and governance—enabled effective resolution while minimizing risk.
This reinforces the importance of analyzing both system behavior and security controls together when addressing critical production incidents.
💡 Pro Tip
When troubleshooting Oracle EBS login issues, always validate both the application performance layer (JVM health, thread pools, CPU) and the middleware security layer (WebLogic connection filters and access rules).
In Oracle E-Business Suite (EBS) environments, application tier stability is critical to ensure seamless user experience. However, there are scenarios where managed servers behave unexpectedly and require manual intervention. This post walks through a real-world production issue where an OACORE managed server entered a FAILED_NOT_RESTARTABLE state, its impact, root cause analysis, and how it was resolved.
Environment Details
Oracle E-Business Suite: R12.2.x
Application Tier: WebLogic Managed Servers
Component Impacted: OACORE Server (oacore_server1)
Environment Type: Production
Problem Statement
An alert was received indicating oacore_server1 was in FAILED_NOT_RESTARTABLE state. Upon verification, the server was running but Node Manager could not auto-restart it.
Key Observations
Despite the OACORE server being in a failed state, the application remained accessible and functional — traffic was being handled by other OACORE servers. This is due to the multi-OACORE architecture with load balancing via OHS/Web tier. However, this creates a hidden risk: load redistribution increases pressure on remaining servers and can lead to cascading failures if not addressed promptly.
Detailed Analysis
Managed server restart attempts failed during initialization
Bulk concurrent requests were actively running
CPU utilization spiked on the application tier
JVM resources were under pressure
Understanding FAILED_NOT_RESTARTABLE
In Oracle WebLogic Server, a managed server is marked as FAILED_NOT_RESTARTABLE after repeated unsuccessful restart attempts. This is a protective mechanism designed to prevent unstable restart loops when the server cannot recover successfully.
Root Cause Analysis
The OACORE managed server entered FAILED_NOT_RESTARTABLE state due to repeated startup failures following an unclean or resource-constrained shutdown. High CPU utilization and heavy concurrent workload placed JVM resources under pressure, preventing a clean restart cycle. Residual runtime artifacts (such as incomplete shutdown state or resource locks) prevented successful reinitialization, causing WebLogic to mark the server as FAILED_NOT_RESTARTABLE.
Resolution
cd $ADMIN_SCRIPTS_HOME
./admanagedsrvctl.sh stop oacore_server1
./admanagedsrvctl.sh start oacore_server1
After the controlled restart, the server returned to RUNNING state with all deployments active and the application stable.
Identify Inactive Forms Sessions
Inactive sessions holding resources can contribute to JVM pressure. Use this query to identify them safely — do not terminate without proper validation and approvals:
SELECT s.sid,
s.serial#,
s.username,
s.status,
s.program,
s.machine,
ROUND(s.last_call_et/3600,2) AS hours_inactive
FROM v$session s
WHERE s.status = 'INACTIVE'
AND s.username = 'APPS'
AND s.program LIKE 'frmweb%'
AND s.last_call_et > 28800 -- 8 hours
ORDER BY hours_inactive DESC;
Reference only — do NOT execute without validation:
ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;
Automate Session Monitoring
Use this script to monitor inactive sessions every 8 hours via cron:
#!/bin/bash
export ORACLE_SID=your_sid
export ORACLE_HOME=/path/to/oracle_home
export PATH=$ORACLE_HOME/bin:$PATH
sqlplus -s / as sysdba <<EOF
SET LINES 200
SET PAGES 200
SELECT COUNT(*) AS inactive_sessions
FROM v\$session
WHERE status='INACTIVE'
AND username='APPS'
AND program LIKE 'frmweb%'
AND last_call_et > 28800;
EXIT;
EOF
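An example crontab entry for the 8-hour schedule might look like the following; the script name and log path are placeholders.
# m h dom mon dow command
0 */8 * * * /path/to/check_inactive_sessions.sh >> /tmp/inactive_sessions.log 2>&1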
In day-to-day Oracle Database and E-Business Suite (EBS) administration, proactive monitoring is critical to ensure system stability, performance, and availability. As part of my real-world DBA experience, I created a set of health check scripts to quickly assess the status of key database components.
Objective
The goal of these scripts is to provide a quick and efficient health check covering database status, instance status, tablespace usage, invalid objects, and RMAN backup status — all from a single SQL*Plus session.
Solution Overview
I developed a SQL-based script (hc.sql) that gathers essential health metrics from Oracle dynamic views and DBA tables. The scripts are designed to be lightweight, fast, and provide immediate actionable insights.
Database Status
SELECT NAME, OPEN_MODE, DATABASE_ROLE FROM V$DATABASE;
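The objective also lists instance status; a minimal check along these lines could accompany the database status query (this snippet is an addition, not part of the original hc.sql excerpt):
-- Instance status and uptime
SELECT instance_name,
       host_name,
       status,
       TO_CHAR(startup_time, 'YYYY-MM-DD HH24:MI') AS startup_time
FROM   v$instance;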
Tablespace Usage
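-- Note: the 8192 multiplier assumes an 8K block size; adjust if DB_BLOCK_SIZE differs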
SELECT tablespace_name,
ROUND(used_space * 8192 / 1024 / 1024, 2) used_mb,
ROUND(tablespace_size * 8192 / 1024 / 1024, 2) total_mb,
ROUND(used_percent, 2) pct_used
FROM dba_tablespace_usage_metrics
ORDER BY used_percent DESC;
Active Sessions
SELECT COUNT(*) active_sessions FROM V$SESSION WHERE STATUS='ACTIVE';
-- Session breakdown by status
SELECT status, COUNT(*) cnt
FROM v$session
GROUP BY status
ORDER BY cnt DESC;
Invalid Objects
SELECT owner, object_type, COUNT(*) cnt
FROM dba_objects
WHERE status = 'INVALID'
GROUP BY owner, object_type
ORDER BY cnt DESC;
RMAN Backup Status
SELECT status, start_time, end_time,
ROUND((output_bytes/1024/1024/1024),2) output_gb
FROM V$RMAN_BACKUP_JOB_DETAILS
ORDER BY start_time DESC
FETCH FIRST 5 ROWS ONLY;
Real-World Value
These scripts are derived from real-time production support scenarios where quick diagnosis is required during high CPU issues, backup failures, tablespace alerts, and performance degradation. Having a single script to validate system health saves valuable time during critical situations.
Future Enhancements
I plan to enhance this repository further by adding shell automation scripts, alerting mechanisms, integration with monitoring tools, and additional EBS-specific checks.
Conclusion
A well-designed health check script is an essential tool for every DBA. It not only helps in proactive monitoring but also ensures faster troubleshooting and improved system reliability. Feel free to explore the repository, use the scripts, and share your feedback.