A critical alert was triggered in production showing UNDO tablespace usage at 97% with near-zero free space. At first glance, this looked like an immediate outage risk. However, deeper analysis revealed a completely different story.
Initial Observations
Standard checks showed tablespace UNDOTBS1 with a total size of 16 GB and only ~2 MB of free space, strongly suggesting the UNDO tablespace was nearly full.
The Turning Point
Instead of relying on DBA_FREE_SPACE, we analyzed UNDO internals using extent status:
SELECT status,
ROUND(SUM(bytes)/1024/1024,2) MB
FROM dba_undo_extents
GROUP BY status;
The results showed that the majority of extents were in EXPIRED status. EXPIRED UNDO extents are fully reusable by Oracle, which means ~55% of the tablespace was immediately reusable and no actual space pressure existed.
Real Usage Calculation
SELECT ROUND(
SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
/ SUM(bytes) * 100, 2) actual_used_pct
FROM dba_undo_extents;
Actual usage was only ~45% — not 97%.
Root Cause of the False Alert
The monitoring system was using DBA_FREE_SPACE which does not include reusable EXPIRED extents. This is incorrect for UNDO tablespaces and leads to false critical alerts, unnecessary escalations, and wasted DBA effort.
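For reference, a free-space check of the kind that typically drives such alerts is sketched below. The actual monitoring query was not captured, so the join and column choices here are assumptions built on the standard DBA_DATA_FILES and DBA_FREE_SPACE views.
-- Illustrative free-space check of the type that raised the alert
-- (the real monitoring query is an assumption; views and columns are standard)
SELECT df.tablespace_name,
       ROUND((df.total_mb - NVL(fs.free_mb, 0)) / df.total_mb * 100, 2) AS pct_used
FROM   (SELECT tablespace_name, SUM(bytes)/1024/1024 total_mb
        FROM dba_data_files
        GROUP BY tablespace_name) df,
       (SELECT tablespace_name, SUM(bytes)/1024/1024 free_mb
        FROM dba_free_space
        GROUP BY tablespace_name) fs
WHERE  df.tablespace_name = fs.tablespace_name(+)
AND    df.tablespace_name = 'UNDOTBS1';
Because EXPIRED undo extents are still allocated from this view's perspective, such a query reports near 100% usage even when most of the tablespace is reusable.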
The Correct Monitoring Approach
SELECT tablespace_name,
ROUND(SUM(CASE WHEN status IN ('ACTIVE','UNEXPIRED') THEN bytes ELSE 0 END)
/ SUM(bytes) * 100, 2) actual_used_pct
FROM dba_undo_extents
GROUP BY tablespace_name;
A production alert indicating UNDO tablespace usage at ~99% is typically treated as a critical issue. However, in Oracle Multitenant environments with local UNDO enabled, this metric can be misleading when evaluated using traditional tablespace monitoring techniques.
This article presents a real-world scenario involving the PDB-level UNDO tablespace APPS_UNDOTS1, where utilization reached ~99%, yet no actual resource contention existed.
This issue occurred in a Multitenant environment with Local UNDO enabled:
UNDO tablespace: APPS_UNDOTS1
Scope: PDB-level
Each PDB maintains its own UNDO
UNDO behavior must always be analyzed at the PDB level, not just at the CDB level.
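To confirm the undo configuration before digging into extents, local undo mode and the per-PDB undo tablespaces can be verified with standard dictionary views. A minimal sketch, run from the CDB root and applicable to 12.2 and later:
-- Is local undo enabled for this CDB?
SELECT property_name, property_value
FROM   database_properties
WHERE  property_name = 'LOCAL_UNDO_ENABLED';

-- Undo tablespaces per container
SELECT con_id, tablespace_name
FROM   cdb_tablespaces
WHERE  contents = 'UNDO'
ORDER  BY con_id;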
Investigation Approach
Rather than relying on % used, a state-based analysis was performed.
Step 1: Active Transactions
SELECT COUNT(*) FROM v$transaction;
Observation:
Minimal active transactions
No ongoing workload pressure
Step 2: UNDO Extent Analysis
SELECT status, SUM(bytes)/1024/1024 MB FROM dba_undo_extents GROUP BY status;
Key Observation
STATUS        SIZE       PCT
-----------   --------   -------
ACTIVE        ~3 MB      ~0.01%
UNEXPIRED     ~51 MB     ~0.18%
EXPIRED       ~28 GB     ~99.81%
💡 Critical Insight
~99.81% of UNDO (APPS_UNDOTS1) was EXPIRED and fully reusable
This means:
No real space pressure
No transaction risk
Tablespace is effectively available
Understanding UNDO Behavior
UNDO extents follow a lifecycle:
ACTIVE → UNEXPIRED → EXPIRED → REUSED
State        Description                     Reusable
ACTIVE       Used by active transactions     ❌
UNEXPIRED    Retained for consistency        ⚠️
EXPIRED      No longer needed                ✅
UNDO extents are not physically freed but logically reused by Oracle based on demand.
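Retention settings control how quickly UNEXPIRED extents become EXPIRED; the configured and tuned values can be checked read-only. A sketch using the standard V$UNDOSTAT view:
-- Configured undo retention (seconds)
SHOW PARAMETER undo_retention

-- Retention Oracle has actually been tuning to over the last day
SELECT MAX(tuned_undoretention) AS tuned_retention_sec
FROM   v$undostat
WHERE  begin_time > SYSDATE - 1;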
Why the Alert Was Misleading
The alert was based on:
SELECT * FROM dba_free_space;
Limitations of this check:
Shows only physically free space
Does not reflect reusable UNDO
Ignores Oracle’s internal reuse mechanism
Root Cause
A long-running concurrent request:
Ran for ~30 hours
Performed heavy DELETE operations
Generated significant UNDO
During execution:
UNDO reached ~99% → real pressure
After completion:
UNDO became EXPIRED
Space became reusable
Alert persisted → false positive
When UNDO Is Actually a Problem
ACTIVE > 20–30%
UNEXPIRED very high
EXPIRED very low
UNDO becomes a real issue only when both EXPIRED and UNEXPIRED extents are exhausted and no reusable space remains.
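V$UNDOSTAT also exposes counters that flag genuine pressure directly; nonzero values here, rather than a high percent-used figure, are the signals worth alerting on. A read-only sketch:
-- Real undo pressure indicators over the last 24 hours
SELECT begin_time,
       unxpstealcnt,    -- unexpired extents stolen from other transactions
       ssolderrcnt,     -- ORA-01555 (snapshot too old) occurrences
       nospaceerrcnt    -- out-of-space errors while extending undo
FROM   v$undostat
WHERE  begin_time > SYSDATE - 1
ORDER  BY begin_time;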
Recommended Monitoring Approach
WITH undo AS (
  SELECT status, SUM(bytes)/1024/1024 mb
  FROM dba_undo_extents
  GROUP BY status
), total AS (
  SELECT SUM(mb) total_mb FROM undo
)
SELECT u.status,
       ROUND(u.mb, 2) AS mb,
       ROUND((u.mb / t.total_mb) * 100, 2) AS pct
FROM undo u, total t;
A tablespace showing 99% utilization can still be completely healthy if most of its extents are reusable.
Expert-Level Insight
“UNDO is not a storage problem — it is a lifecycle problem. The difference between false alarms and accurate diagnosis lies in understanding how Oracle transitions undo extents.”
During an Oracle E-Business Suite ADOP patching cycle in a multi-node environment, the apply phase failed on one node while completing successfully on others. Despite retries — including downtime mode — the issue persisted, pointing to a deeper inconsistency within the patching framework.
Symptoms Observed
ADOP session status: FAILED
Patch applied successfully on some nodes, failed on admin node
Repeated failures even with restart=no, abandon=yes, and downtime mode
No immediate actionable error from standard logs
Timeline of Events
T0 -- Patch execution initiated (ADOP apply phase)
T1 -- Failure observed on admin node
T2 -- Retry using downtime mode -- Failure persists
T3 -- ADOP session review shows inconsistent state
T4 -- Internal metadata tables analyzed
T5 -- Cleanup performed (tables + restart directory)
T6 -- Patch re-executed -- Success across all nodes
Investigation
Step 1: Check ADOP Session State
Query the ADOP session status to understand the current state across all nodes:
-- Check current ADOP session status
SELECT session_id, node_name, phase, status,
start_date, end_date
FROM applsys.ad_adop_sessions
ORDER BY start_date DESC;
-- Check apply phase status per node
SELECT s.session_id, n.node_name, p.phase_code,
p.status, p.start_date, p.end_date
FROM applsys.ad_adop_sessions s,
applsys.ad_adop_session_phases p,
applsys.fnd_nodes n
WHERE s.session_id = p.session_id
AND p.node_id = n.node_id
ORDER BY p.start_date DESC;
The existing session showed status FAILED with the apply phase partially completed — a clear indicator of inconsistent execution state across nodes.
Step 2: Check adalldefaults.txt
Reviewed the defaults file (adalldefaults.txt) for relevant configuration entries. Modifying the suspected parameter and retrying had no impact, confirming the issue was not translation-related.
Step 3: Check Install Processes Table
-- Check for stale entries in FND_INSTALL_PROCESSES
SELECT COUNT(*) FROM applsys.fnd_install_processes;
-- View stale entries in detail
SELECT process_status, process_name, last_update_date
FROM applsys.fnd_install_processes
ORDER BY last_update_date DESC;
-- Check AD_DEFERRED_JOBS
SELECT COUNT(*) FROM applsys.ad_deferred_jobs;
SELECT * FROM applsys.ad_deferred_jobs;
Observation: FND_INSTALL_PROCESSES contained stale entries from the failed session; AD_DEFERRED_JOBS was empty.
Root Cause
The failure was caused by stale and inconsistent ADOP metadata tables — specifically APPLSYS.FND_INSTALL_PROCESSES and APPLSYS.AD_DEFERRED_JOBS. ADOP internally relies on these tables to track patch progress checkpoints, deferred job execution, and restart state management. When these tables retain entries from failed or incomplete sessions, ADOP assumes an incorrect execution state, leading to patch reconciliation failure, apply phase breakdown, and node-level inconsistencies.
Resolution Steps
Step 1: Backup Critical Tables
-- Always backup before any cleanup
CREATE TABLE applsys.fnd_install_processes_bak AS
SELECT * FROM applsys.fnd_install_processes;
CREATE TABLE applsys.ad_deferred_jobs_bak AS
SELECT * FROM applsys.ad_deferred_jobs;
-- Verify backups
SELECT COUNT(*) FROM applsys.fnd_install_processes_bak;
SELECT COUNT(*) FROM applsys.ad_deferred_jobs_bak;
Step 2: Drop Stale Metadata Tables
Dropping these tables forces ADOP to rebuild clean metadata during the next run:
DROP TABLE applsys.fnd_install_processes;
DROP TABLE applsys.ad_deferred_jobs;
Step 3: Reset the Restart Directory
The restart directory can silently preserve failure states. Back it up and create a fresh one:
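The commands below are a minimal sketch of that reset, assuming the standard $APPL_TOP/admin/<SID>/restart location used by the patching tools; confirm the path for the specific environment before running anything.
# Back up and recreate the restart directory (location is an assumption)
cd $APPL_TOP/admin/$TWO_TASK
mv restart restart_bak_$(date +%Y%m%d)
mkdir restart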
The patch completed successfully across all nodes after the metadata cleanup.
Before vs After
Component            Before Fix     After Fix
ADOP Session         Failed         Successful
Node Consistency     Partial        Full
Restart Behavior     Stuck          Clean
Patch Execution      Incomplete     Completed
Key Takeaways
ADOP is state-driven — even when logs appear clean, internal metadata drives execution decisions
Partial success is a clue — if some nodes succeed and one fails, focus on local metadata, not the patch itself
The restart directory matters — it can silently preserve failure states and must be validated before retrying
Downtime mode is not a fix-all — even in downtime, ADOP still reads metadata tables; corruption persists unless cleaned
Always backup before cleanup — never drop tables without creating a backup first
When NOT to Use This Approach
Avoid applying this fix if the issue is caused by missing database patches (ETCC warnings), file system or permission issues, incorrect patch sequencing, or environment misconfiguration. Always validate the root cause before performing any metadata cleanup.
This scenario highlights a subtle but critical behavior in ADOP — sometimes patch failures are not caused by the patch itself, but by what the system remembers about past attempts. By resetting stale metadata, we allow ADOP to re-evaluate the environment cleanly, leading to successful execution.
In Oracle E-Business Suite (EBS) environments, performance issues are often attributed to high workload or system resource constraints. However, some of the most critical slowdowns originate from less obvious sources — inactive sessions holding uncommitted transactions. This post walks through a real-world production incident where an inactive Oracle Forms session caused cascading blocking across multiple users due to TX row-level locks.
Observed Symptoms
Oracle Forms screens becoming unresponsive in Order Management and Shipping modules
Analysis of v$session, v$transaction, and dba_wait_chains revealed a single inactive Oracle Forms session (frmweb) holding an active transaction with multiple downstream sessions waiting on TX row lock contention.
-- Identify blocking sessions
SELECT blocking_session, sid, wait_class, event
FROM v$session
WHERE blocking_session IS NOT NULL;
-- Detect inactive sessions with active transactions
SELECT s.sid, s.serial#, s.status, s.program,
s.username, t.start_time,
ROUND(s.last_call_et/3600,2) hrs_inactive
FROM v$session s, v$transaction t
WHERE s.saddr = t.ses_addr
AND s.status = 'INACTIVE'
ORDER BY hrs_inactive DESC;
-- Analyze full wait chain
SELECT * FROM dba_wait_chains;
The root blocking session showed STATUS = INACTIVE and EVENT = 'SQL*Net message from client' but had an active transaction in v$transaction — confirming it was idle at the application level but actively holding locks at the database level.
Root Cause
An Oracle Forms session executed a SELECT ... FOR UPDATE NOWAIT on WSH_DELIVERY_DETAILS, then became idle without committing or rolling back. This held exclusive row locks that blocked other sessions attempting to access the same rows, creating a cascading blocking chain.
-- The problematic SQL pattern
SELECT *
FROM WSH_DELIVERY_DETAILS
WHERE ROWID = :B1
FOR UPDATE NOWAIT;
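To confirm exactly which objects the idle session was holding, its locks can be mapped back to object names with standard views. A hedged sketch; filter on the blocker's SID as needed:
-- Map locked objects back to their owning sessions
SELECT lo.session_id,
       o.owner,
       o.object_name,
       lo.locked_mode
FROM   v$locked_object lo,
       dba_objects     o
WHERE  lo.object_id = o.object_id
ORDER  BY lo.session_id;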
Resolution
The root blocking session was identified and confirmed to be holding no business-critical pending work; after approvals were obtained, the session was terminated:
-- Kill blocking session (only after full validation and approval)
ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;
Locks were released immediately, the blocking chain resolved, and application responsiveness was restored.
Preventive Measures
Implement idle session timeout policies
Educate users on proper transaction handling in Oracle Forms
Review custom code using FOR UPDATE — keep transactions short and commit promptly
Monitor long-running and idle transactions proactively
Key Takeaways
An inactive session can still hold active transactions and critical locks
Always identify the root blocker — intermediate sessions are symptoms, not the cause
Application-level inactivity does not mean database-level inactivity
In Oracle EBS, the most disruptive issues are often caused by inactive sessions holding uncommitted transactions
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience. Connect: sdanwarahmed@gmail.com | LinkedIn
In a production Oracle database environment, a sudden spike in session count exceeding 1000+ sessions triggered alerts and concern. Interestingly, the system recovered automatically without any database-level intervention. At first glance this appeared to be a database issue — but deeper analysis revealed a different story.
The Incident
An automated alert reported session count exceeding threshold (1000+), with the majority in INACTIVE state from middleware connection pool accounts. Despite the spike there were no blocking sessions, no performance degradation, and no database errors.
-- Quick session count check
SELECT COUNT(*) FROM gv$session;
-- Result: 216 (already returning to normal)
-- Session breakdown by status
SELECT status, COUNT(*) cnt
FROM gv$session
GROUP BY status
ORDER BY cnt DESC;
Root Cause
Multiple production mid-tier servers created new connection pools within the same time window. The new pools opened new database sessions while the existing pools kept their sessions alive (INACTIVE) pending graceful termination, resulting in a temporary overlap:
Old Sessions (Inactive) + New Sessions (Active) = Session Surge
As older pools were cleaned up, inactive sessions terminated automatically and the count returned to baseline. This was not a database problem — it was connection pool lifecycle behavior in the mid-tier layer.
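Attributing the surge to the mid-tier is straightforward from the session metadata itself; grouping by machine, program, and logon time shows whether sessions arrived in a burst from the pool accounts. An illustrative query:
-- Break sessions down by mid-tier host, program, and status
SELECT machine,
       program,
       status,
       COUNT(*)        AS cnt,
       MIN(logon_time) AS oldest_logon,
       MAX(logon_time) AS newest_logon
FROM   gv$session
WHERE  type = 'USER'
GROUP  BY machine, program, status
ORDER  BY cnt DESC;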
Recommendations
Stagger connection pool refresh across mid-tier servers to avoid simultaneous spikes
Monitor inactive session trends to detect abnormal accumulation early
Configure appropriate idle timeout, maximum pool size, and session reuse settings
Key Takeaways
Not all session spikes are database problems — check middleware behavior first
High session count does not necessarily indicate database stress
Transient issues still require analysis as they reveal architectural inefficiencies
Database alerts can originate from upstream connection management behavior
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience. Connect: sdanwarahmed@gmail.com | LinkedIn
Production alert logs often contain messages that appear critical but are, in reality, indicators of normal database behavior under load. This article presents a real-world Oracle database investigation where repeated ORA-00031: session marked for kill messages and redo log allocation waits were observed. Using read-only analysis techniques, we demonstrate how to distinguish between expected behavior and actionable signals without performing any intrusive changes.
Observed Symptoms
ORA-00031: session marked for kill
Thread 1 cannot allocate new log
Private strand flush not complete
Phase 1: Interpreting ORA-00031 Correctly
ORA-00031 is generated when sessions are terminated using ALTER SYSTEM KILL SESSION. Oracle marks the session for cleanup and handles it asynchronously via background processes. This is not an error — it is a confirmation of successful session termination.
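Sessions in this state remain visible until PMON completes the cleanup; a quick read-only check confirms they are simply awaiting asynchronous removal:
-- Sessions already marked for kill (the source of ORA-00031 messages)
SELECT sid, serial#, username, status, event
FROM   v$session
WHERE  status = 'KILLED';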
Phase 2: Identifying the True Performance Signal
The more critical messages were Thread 1 cannot allocate new log and Private strand flush not complete. These occur when LGWR attempts a redo log switch but active redo strands are still flushing. Oracle briefly delays the log switch until consistency is ensured — this is a redo allocation wait, typically seen under sustained transactional load.
Phase 3: Evidence-Based Analysis (Read-Only)
Redo switch frequency was analyzed to validate system behavior:
SELECT
TO_CHAR(TRUNC(first_time, 'HH24'), 'YYYY-MM-DD HH24:MI') AS switch_hour,
COUNT(*) AS switches
FROM v$log_history
WHERE first_time > SYSDATE - 1
GROUP BY TRUNC(first_time, 'HH24')
ORDER BY 1;
Findings
Metric                 Observation
Average Switch Rate    5-7 per hour
Peak Rate              8-10 per hour during business hours
Off-Peak Rate          1-3 per hour
A direct correlation was observed between log switch spikes and high DML activity, confirming a cause-effect relationship rather than random errors.
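To support any future sizing proposal through change management, the current redo log layout can also be captured read-only. A minimal sketch:
-- Current redo log groups and sizes
SELECT group#, thread#, bytes/1024/1024 AS size_mb, members, status
FROM   v$log
ORDER  BY group#;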
Why No Changes Were Made
In this scenario, production environment restrictions were in place, no user impact was observed, and the behavior was transient and self-resolving. A monitoring-first approach was adopted instead of immediate tuning.
Recommendations
Continuously monitor redo switch frequency during peak windows
Use collected data to justify future redo log sizing via change management
Avoid unnecessary intervention when behavior is transient and non-impacting
Distinguish informational alert log messages from actionable errors
Key Takeaways
ORA-00031 is expected and harmless — it confirms session termination
Redo allocation waits are transient under sustained load
Proper analysis prevents unnecessary production intervention
Not all alert log warnings indicate failure — some are early signals of workload growth
The goal is not to eliminate every alert, but to understand which ones matter
Written by Syed Anwar Ahmed — Oracle Apps DBA with 11 years of production experience. Connect: sdanwarahmed@gmail.com | LinkedIn
During routine monitoring, a critical issue was identified where the Fast Recovery Area (FRA) utilization reached 95%, putting the database at risk of ORA-19809, archiver process failure, and potential database hang. Despite having an RMAN retention policy configured, the FRA continued to grow uncontrollably.
Investigation
-- Check FRA usage
SELECT name, space_limit/1024/1024/1024 limit_gb,
space_used/1024/1024/1024 used_gb,
ROUND(space_used/space_limit*100,2) pct_used
FROM v$recovery_file_dest;
-- Check what is consuming space
SELECT file_type, percent_space_used, number_of_files
FROM v$flash_recovery_area_usage
ORDER BY percent_space_used DESC;
-- Review RMAN configuration
RMAN> SHOW ALL;
Findings: backupsets were consuming the majority of FRA space. RMAN had a 7-day retention policy configured but backup optimization was disabled and archivelog deletion policy was not set.
Root Cause
The retention policy was defined but obsolete backups were not being automatically deleted. There was no scheduled execution of DELETE OBSOLETE, backup optimization was disabled causing redundant data, and archivelog lifecycle was not controlled. This led to continuous accumulation of backupsets and archivelogs.
Immediate Fix
-- Run RMAN cleanup
RMAN> CROSSCHECK BACKUP;
RMAN> DELETE EXPIRED BACKUP;
RMAN> DELETE OBSOLETE;
-- Verify FRA usage after cleanup
SELECT ROUND(space_used/space_limit*100,2) pct_used
FROM v$recovery_file_dest;
FRA utilization reduced from 95% to 83% after cleanup.
Configuration Fixes
-- Enable backup optimization
CONFIGURE BACKUP OPTIMIZATION ON;
-- Set archivelog deletion policy
CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO DISK;
-- Enable compressed backups
CONFIGURE DEVICE TYPE DISK BACKUP TYPE TO COMPRESSED BACKUPSET;
Automation
Scheduled daily execution of DELETE OBSOLETE via cron:
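A wrapper script along these lines can be scheduled; the script name, SID, paths, and log destination below are placeholders, and the schedule should follow local backup windows.
#!/bin/bash
# rman_fra_cleanup.sh -- illustrative sketch; environment values are placeholders
export ORACLE_SID=your_sid
export ORACLE_HOME=/path/to/oracle_home
export PATH=$ORACLE_HOME/bin:$PATH

rman target / log=/tmp/rman_fra_cleanup_$(date +%Y%m%d).log <<EOF
CROSSCHECK BACKUP;
CROSSCHECK ARCHIVELOG ALL;
DELETE NOPROMPT EXPIRED BACKUP;
DELETE NOPROMPT OBSOLETE;
EXIT;
EOF

# Example crontab entry (daily at 02:00; path is a placeholder):
# 0 2 * * * /path/to/rman_fra_cleanup.sh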
In Oracle E-Business Suite (EBS) environments, login failures are often perceived as simple application issues. However, in complex architectures, they can originate from multiple interacting layers across the application and middleware stack.
In this blog, I’ll walk through a real-world production incident where an OACORE JVM issue combined with WebLogic security filtering resulted in complete login inaccessibility.
This case highlights the importance of analyzing both performance and security layers together when troubleshooting critical application outages.
⚠️ Issue Summary
Users were unable to access the EBS login page
Pages were hanging or not loading
WebLogic console reported:
Connection rejected, filter blocked Socket
🔍 Initial Observation
From the application server:
Load Average: ~10+
👉 This indicated:
High CPU utilization
System under heavy stress
Potential JVM thread contention
🔬 Detailed Analysis
One of the OACORE managed server JVMs became unresponsive
Long-running threads caused thread pool exhaustion
Incoming user requests began queueing
At the same time:
WebLogic connection filter was actively enforcing access rules
Legitimate requests were being rejected under stressed conditions
🧠 Understanding the Components
OACORE (Application Layer)
Handles:
Login requests
Forms processing
Core application logic
If JVM threads are exhausted: 👉 Requests queue → login hangs
WebLogic Connection Filter (Security Layer)
Enforces connection-level access rules for incoming requests.
If misconfigured or evaluated under stressed conditions: 👉 Legitimate traffic may be blocked
🎯 Root Cause Analysis (RCA)
The login issue was not caused by a single failure point, but by a combination of application tier resource exhaustion and restrictive middleware-level access control.
High CPU utilization and long-running threads caused one OACORE JVM to become unresponsive
Thread pool exhaustion led to request queuing, preventing new login requests from being processed
Simultaneously:
The WebLogic connection filter (EBSConnectionFilterImpl) enforced strict access control policies
Under high load conditions, legitimate client requests were rejected with “filter blocked Socket”
This interaction between performance degradation and security enforcement amplified the impact, resulting in complete login inaccessibility despite partial system availability.
🛠️ Resolution Approach (Controlled & Safe)
The resolution approach focused on stabilizing the JVM layer while validating and correcting middleware-level access controls in a controlled manner.
🔹 Step 1: Identify Unresponsive JVM
ps -ef | grep oacore
✔ Identify JVM with abnormal CPU or stuck behavior
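Before any action, the suspect JVM can be examined without stopping it; a thread dump is usually more informative than process status alone. A hedged sketch (the PID is a placeholder; dump output lands in the managed server's .out log):
# Check CPU consumption of the suspect JVM (PID is a placeholder)
top -b -n 1 -p <PID>

# Request a thread dump without terminating the process (SIGQUIT)
kill -3 <PID>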
🔹 Step 2: Handle Stuck JVM (Controlled Action)
⚠️ Important Note:
Forcefully terminating JVM processes should NOT be performed without validation.
✔ Recommended Approach:
Confirm the process is unresponsive
Ensure no critical transactions are running
Prefer controlled shutdown where possible
✔ Example (Only if fully unresponsive and approved):
kill -9 <PID>
👉 Node Manager can restart the JVM automatically after termination
Avoid forceful JVM termination without proper validation
Do not disable security controls without understanding impact
Follow change management procedures in production
Coordinate with application and security teams before changes
🏁 Conclusion
This incident demonstrates that login failures in Oracle EBS are not always isolated to a single component but can result from complex interactions across application performance and middleware security layers.
The combination of JVM resource exhaustion and connection filtering behavior created a compounded failure scenario, leading to complete login disruption.
A structured, multi-layer troubleshooting approach—focused on performance, configuration, and governance—enabled effective resolution while minimizing risk.
This reinforces the importance of analyzing both system behavior and security controls together when addressing critical production incidents.
💡 Pro Tip
When troubleshooting Oracle EBS login issues, always validate both the application performance layer (JVM health, thread pools, CPU) and the middleware security layer (WebLogic connection filters and access rules).
In Oracle E-Business Suite (EBS) environments, application tier stability is critical to ensure seamless user experience. However, there are scenarios where managed servers behave unexpectedly and require manual intervention. This post walks through a real-world production issue where an OACORE managed server entered a FAILED_NOT_RESTARTABLE state, its impact, root cause analysis, and how it was resolved.
Environment Details
Oracle E-Business Suite: R12.2.x
Application Tier: WebLogic Managed Servers
Component Impacted: OACORE Server (oacore_server1)
Environment Type: Production
Problem Statement
An alert was received indicating oacore_server1 was in FAILED_NOT_RESTARTABLE state. Upon verification, the server was running but Node Manager could not auto-restart it.
Key Observations
Despite the OACORE server being in a failed state, the application remained accessible and functional — traffic was being handled by other OACORE servers. This is due to the multi-OACORE architecture with load balancing via OHS/Web tier. However, this creates a hidden risk: load redistribution increases pressure on remaining servers and can lead to cascading failures if not addressed promptly.
Detailed Analysis
Managed server restart attempts failed during initialization
Bulk concurrent requests were actively running
CPU utilization spiked on the application tier
JVM resources were under pressure
Understanding FAILED_NOT_RESTARTABLE
In Oracle WebLogic Server, a managed server is marked as FAILED_NOT_RESTARTABLE after repeated unsuccessful restart attempts. This is a protective mechanism designed to prevent unstable restart loops when the server cannot recover successfully.
Root Cause Analysis
The OACORE managed server entered FAILED_NOT_RESTARTABLE state due to repeated startup failures following an unclean or resource-constrained shutdown. High CPU utilization and heavy concurrent workload placed JVM resources under pressure, preventing a clean restart cycle. Residual runtime artifacts (such as incomplete shutdown state or resource locks) prevented successful reinitialization, causing WebLogic to mark the server as FAILED_NOT_RESTARTABLE.
Resolution
cd $ADMIN_SCRIPTS_HOME
./admanagedsrvctl.sh stop oacore_server1
./admanagedsrvctl.sh start oacore_server1
After the controlled restart, the server returned to RUNNING state with all deployments active and the application stable.
Identify Inactive Forms Sessions
Inactive sessions holding resources can contribute to JVM pressure. Use this query to identify them safely — do not terminate without proper validation and approvals:
SELECT s.sid,
s.serial#,
s.username,
s.status,
s.program,
s.machine,
ROUND(s.last_call_et/3600,2) AS hours_inactive
FROM v$session s
WHERE s.status = 'INACTIVE'
AND s.username = 'APPS'
AND s.program LIKE 'frmweb%'
AND s.last_call_et > 28800 -- 8 hours
ORDER BY hours_inactive DESC;
Reference only — do NOT execute without validation:
ALTER SYSTEM KILL SESSION 'SID,SERIAL#' IMMEDIATE;
Automate Session Monitoring
Use this script to monitor inactive sessions every 8 hours via cron:
#!/bin/bash
export ORACLE_SID=your_sid
export ORACLE_HOME=/path/to/oracle_home
export PATH=$ORACLE_HOME/bin:$PATH
sqlplus -s / as sysdba <<EOF
SET LINES 200
SET PAGES 200
SELECT COUNT(*) AS inactive_sessions
FROM v\$session
WHERE status='INACTIVE'
AND username='APPS'
AND program LIKE 'frmweb%'
AND last_call_et > 28800;
EXIT;
EOF
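An example crontab entry for the 8-hour schedule might look like the following; the script name and log path are placeholders.
# m h dom mon dow command
0 */8 * * * /path/to/check_inactive_sessions.sh >> /tmp/inactive_sessions.log 2>&1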
In day-to-day Oracle Database and E-Business Suite (EBS) administration, proactive monitoring is critical to ensure system stability, performance, and availability. As part of my real-world DBA experience, I created a set of health check scripts to quickly assess the status of key database components.
Objective
The goal of these scripts is to provide a quick and efficient health check covering database status, instance status, tablespace usage, invalid objects, and RMAN backup status — all from a single SQL*Plus session.
Solution Overview
I developed a SQL-based script (hc.sql) that gathers essential health metrics from Oracle dynamic views and DBA tables. The scripts are designed to be lightweight, fast, and provide immediate actionable insights.
Database Status
SELECT NAME, OPEN_MODE, DATABASE_ROLE FROM V$DATABASE;
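The objective also lists instance status; a minimal check along these lines could accompany the database status query (this snippet is an addition, not part of the original hc.sql excerpt):
-- Instance status and uptime
SELECT instance_name,
       host_name,
       status,
       TO_CHAR(startup_time, 'YYYY-MM-DD HH24:MI') AS startup_time
FROM   v$instance;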
Tablespace Usage
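-- Note: the 8192 multiplier assumes an 8K block size; adjust if DB_BLOCK_SIZE differs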
SELECT tablespace_name,
ROUND(used_space * 8192 / 1024 / 1024, 2) used_mb,
ROUND(tablespace_size * 8192 / 1024 / 1024, 2) total_mb,
ROUND(used_percent, 2) pct_used
FROM dba_tablespace_usage_metrics
ORDER BY used_percent DESC;
Active Sessions
SELECT COUNT(*) active_sessions FROM V$SESSION WHERE STATUS='ACTIVE';
-- Session breakdown by status
SELECT status, COUNT(*) cnt
FROM v$session
GROUP BY status
ORDER BY cnt DESC;
Invalid Objects
SELECT owner, object_type, COUNT(*) cnt
FROM dba_objects
WHERE status = 'INVALID'
GROUP BY owner, object_type
ORDER BY cnt DESC;
RMAN Backup Status
SELECT status, start_time, end_time,
ROUND((output_bytes/1024/1024/1024),2) output_gb
FROM V$RMAN_BACKUP_JOB_DETAILS
ORDER BY start_time DESC
FETCH FIRST 5 ROWS ONLY;
Real-World Value
These scripts are derived from real-time production support scenarios where quick diagnosis is required during high CPU issues, backup failures, tablespace alerts, and performance degradation. Having a single script to validate system health saves valuable time during critical situations.
Future Enhancements
I plan to enhance this repository further by adding shell automation scripts, alerting mechanisms, integration with monitoring tools, and additional EBS-specific checks.
Conclusion
A well-designed health check script is an essential tool for every DBA. It not only helps in proactive monitoring but also ensures faster troubleshooting and improved system reliability. Feel free to explore the repository, use the scripts, and share your feedback.