Category: Oracle RAC | Troubleshooting | AHF/TFA | Level: Intermediate to Advanced
Background
We recently had a critical production incident on our two-node Oracle 11g RAC cluster where the Fast Recovery Area (FRA) hit capacity, causing both instances to enter an INTERMEDIATE state due to a Stuck Archiver condition. Oracle Support raised an SR and asked for CRS diagnostic data collected using TFA (Trace File Analyzer).
That’s when we discovered a second problem — TFA was completely non-functional on both nodes with the infamous TFA-00002 error. This post documents the full journey of diagnosing and fixing TFA, and how we manually collected the CRS logs for the SR in the meantime.
The SR Request
Oracle Support requested the following:
- CRS alert log from all nodes: <ORACLE_BASE>/diag/crs/*/crs/trace/alert.log
- All CRS-related trace files updated during the incident period
Step 1 — Finding the CRS Alert Log
The first challenge was locating the CRS logs. This cluster runs Grid Infrastructure under a separate OS user (grid) from the database software (oracle), each with its own home.
[oracle@racnode1 ~]$ echo $ORACLE_BASE
/u01/oradb/oracle
Switching to the grid user:
su - grid
echo $ORACLE_HOME
# /u01/oragrid/11.2/grid
ORACLE_BASE was not set for the grid user, so we used the orabase binary:
$ORACLE_HOME/bin/orabase
# /u01/oragrid/oracle
However, the ADR path (/u01/oragrid/oracle/diag/crs/*/crs/trace/alert.log) didn’t exist. This is because Oracle 11.2 Grid Infrastructure uses a different log location — not the ADR diag tree. The correct format is:
$GRID_HOME/log/<hostname>/alert<hostname>.log
The correct path on our cluster:
ls -lh /u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log
# -rw-rw-r--. 1 grid oinstall 14M Apr 12 12:53 alertracnode1.log
⚠️ Key takeaway: On 11.2 GI, CRS alert logs live under $GRID_HOME/log/<hostname>/ — not in the ADR structure used by 12c and later.
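If you administer a mix of GI versions, a small helper like the following can probe both locations (a sketch of our own; the function name and layout assumptions are ours, not an Oracle tool):

```shell
#!/bin/sh
# Probe both known CRS alert log locations: 11.2 keeps the log under
# $GRID_HOME/log/<hostname>/, while 12c+ uses the ADR diag tree.
crs_alert_log() {
    grid_home=$1
    oracle_base=$2
    host=$3
    old_style="$grid_home/log/$host/alert$host.log"              # 11.2 layout
    adr_style="$oracle_base/diag/crs/$host/crs/trace/alert.log"  # 12c+ layout
    if [ -f "$old_style" ]; then
        echo "$old_style"
    elif [ -f "$adr_style" ]; then
        echo "$adr_style"
    else
        echo "no CRS alert log found for $host" >&2
        return 1
    fi
}
```

On this cluster that would be `crs_alert_log /u01/oragrid/11.2/grid /u01/oragrid/oracle racnode1`.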
Step 2 — Manually Collecting CRS Logs
The log directory structure under $GRID_HOME/log/<hostname>/ includes:
alertracnode1.log ← Main CRS alert log
crsd/ ← CRSD rotating logs (crsd.log, crsd.l01, crsd.l02 ...)
cssd/ ← CSS daemon logs
ohasd/ ← Oracle High Availability Services logs
ctssd/ ← Cluster Time Sync Service logs
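Since Support asked for "all trace files updated during the incident period", a time-bounded sweep of the log tree helps decide what to archive. A minimal sketch, assuming GNU find (the `-newermt` test is a GNU extension) and placeholder timestamps you would replace with your own window:

```shell
# Files under a log tree modified within [start, end).
# Relies on GNU find's -newermt test, standard on Linux.
updated_between() {
    find "$1" -type f -newermt "$2" ! -newermt "$3"
}

# Example for this cluster (the window here is a placeholder --
# substitute the actual incident period):
# updated_between /u01/oragrid/11.2/grid/log/racnode1 \
#     "2026-04-12 09:00" "2026-04-12 13:00"
```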
Since TFA was broken, we collected manually:
On node 1:
tar cvf /tmp/crstrace.racnode1.$(date +%Y%m%d%H%M%S).tar \
/u01/oragrid/11.2/grid/log/racnode1/crsd/crsd.log \
/u01/oragrid/11.2/grid/log/racnode1/crsd/crsd.l01 \
/u01/oragrid/11.2/grid/log/racnode1/crsd/crsdOUT.log \
/u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log \
/u01/oragrid/11.2/grid/log/racnode1/cssd/ \
/u01/oragrid/11.2/grid/log/racnode1/ohasd/
zip /tmp/crstrace.racnode1.zip /tmp/crstrace.racnode1.*.tar
On node 2:
ssh racnode2 "tar cvf /tmp/crstrace.racnode2.$(date +%Y%m%d%H%M%S).tar \
/u01/oragrid/11.2/grid/log/racnode2/crsd/crsd.log \
/u01/oragrid/11.2/grid/log/racnode2/crsd/crsd.l01 \
/u01/oragrid/11.2/grid/log/racnode2/alertracnode2.log \
/u01/oragrid/11.2/grid/log/racnode2/cssd/ \
/u01/oragrid/11.2/grid/log/racnode2/ohasd/ && \
zip /tmp/crstrace.racnode2.zip /tmp/crstrace.racnode2.*.tar"
scp racnode2:/tmp/crstrace.racnode2.zip /tmp/
Note: The crsd logs use a rotating format (.log, .l01, .l02, ...), not .trc files. The incident-period data was in crsd.l01.
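To confirm which rotated file covers the incident window, printing each log's first and last lines is usually enough, since CRS log entries begin with a timestamp. A small helper we sketched for this:

```shell
# Show the time span a log file covers by printing its first and
# last lines (CRS log entries start with a timestamp).
log_span() {
    printf 'first: %s\nlast:  %s\n' "$(head -n 1 "$1")" "$(tail -n 1 "$1")"
}

# Example: check every rotated crsd log on this cluster
# for f in /u01/oragrid/11.2/grid/log/racnode1/crsd/crsd.l*; do
#     echo "== $f"; log_span "$f"
# done
```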
Step 3 — Diagnosing TFA-00002
With the SR logs uploaded, we turned to fixing TFA. Here’s what we found:
tfactl status
# TFA-00002 Oracle Trace File Analyzer (TFA) is not running
# TFA-00107 TFA failed to start after multiple attempts of start (retries from init.tfa)
Checking AHF Installation Layout
cat /etc/oracle.ahf.loc
# /opt/oracle.ahf
cat /opt/oracle.ahf/install.properties
# AHF_HOME=/opt/oracle.ahf
# BUILD_VERSION=2603000
# BUILD_DATE=202604061821
# TFA_HOME=/opt/oracle.ahf/tfa
# DATA_DIR=/u01/oragrid/oracle/oracle.ahf/data
The AHF binaries were at /opt/oracle.ahf/ and data at /u01/oragrid/oracle/oracle.ahf/data/ — a non-default split layout.
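With a split layout like this, it is worth confirming that every path recorded in install.properties actually exists on disk. A quick sketch, assuming the `*_HOME`/`*_DIR` key naming shown above:

```shell
# Verify that each directory recorded in install.properties exists
# (keys ending in _HOME or _DIR, per the file shown above).
check_ahf_paths() {
    awk -F= '/_HOME=|_DIR=/ { print $2 }' "$1" | while read -r dir; do
        if [ -d "$dir" ]; then
            echo "OK   $dir"
        else
            echo "MISS $dir"
        fi
    done
}

# Example: check_ahf_paths /opt/oracle.ahf/install.properties
```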
The Actual Error — AHF-07250
Checking the systemd journal revealed the real error:
journalctl -u oracle-tfa --no-pager | tail -20
init.tfa: AHF-07250: Cannot establish connection with TFA Server.
init.tfa: Cause: Cannot establish connection with TFA server on 5000.
init.tfa: Action: Ensure that communication is open on port 5000 and
that no firewall is blocking port 5000.
init.tfa: ERROR: TFAMain is spawning too fast, Human intervention required!!!
init.tfa: Disabling TFA at : ...
What We Ruled Out
| Check | Result |
|---|---|
| Port 5000 blocked by iptables | Not blocked — policy ACCEPT |
| SELinux enforcing | Disabled |
| Java missing/incompatible | Java 11.0.30 — fine |
| Disk space | 16GB free on /, 410GB on /u01 |
| portmapping.txt / ssl.properties missing | Missing — but not the root cause |
The TFA Java process was crashing before it could bind to port 5000. The AHF upgrade had left TFA in an unrecoverable broken state on both nodes.
Attempted Fix — tfactl syncnodes
tfactl syncnodes
# Generating new TFA Certificates...
# Successfully generated certificates.
# ...
# TFA-00002 Oracle Trace File Analyzer (TFA) is not running
Certificates were synced successfully but TFA still wouldn’t start. The issue was deeper than certificate mismatches.
Step 4 — The Fix: Clean AHF Reinstall
Uninstall on node 1
ahfctl uninstall -local
# AHF will be uninstalled on: racnode1
# Do you want to continue with AHF uninstall ? [Y]|N : Y
# ...
# CHA is disabled
Note: Uninstalling AHF does NOT remove the data/repository directory, so historical collections and diag data are preserved.
Download AHF Installer
Download AHF-LINUX_v26.x.x.zip from My Oracle Support and stage it to /tmp/ on node 1.
🔗 MOS Doc ID 2550798.1 — Autonomous Health Framework (AHF) Download
Reinstall on both nodes from node 1
unzip /tmp/AHF-LINUX_v26.3.0.zip -d /tmp/ahf_install
cd /tmp/ahf_install
./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir /u01/oragrid/oracle/oracle.ahf/data
Answer N to the email-notification prompt, and Y when asked whether to install on the other cluster nodes.
Node 2 needed a separate local reinstall
The cluster-wide install didn’t fully fix node 2. We reinstalled locally using the -local flag:
# From node 1
scp /tmp/AHF-LINUX_v26.3.0.zip racnode2:/tmp/
ssh racnode2 "ahfctl uninstall -local"
ssh racnode2 "unzip /tmp/AHF-LINUX_v26.3.0.zip -d /tmp/ahf_install && \
cd /tmp/ahf_install && \
./ahf_setup -ahf_loc /opt/oracle.ahf \
-data_dir /u01/oragrid/oracle/oracle.ahf/data -local"
The -local flag skips cluster coordination and installs cleanly on the local node only.
Final Verification
tfactl print status
| Host | Status of TFA | PID | Port | Version | Inventory Status |
|----------|---------------|-------|------|------------|------------------|
| racnode1 | RUNNING | 6355 | 5000 | 26.3.0.0.0 | COMPLETE |
| racnode2 | RUNNING | 28301 | 5000 | 26.3.0.0.0 | COMPLETE |
Both nodes RUNNING with COMPLETE inventory status. ✅
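With TFA healthy again, the original SR collection could have been done the intended way with a time-scoped diagcollect. The helper below just assembles the command line (the helper itself is ours; `-crs`, `-from`, and `-to` are real tfactl diagcollect flags, but confirm the expected date format with `tfactl diagcollect -h` on your AHF build, and treat the window as a placeholder):

```shell
# Assemble a time-scoped CRS diagcollect command line.
# Flags -crs/-from/-to exist in tfactl diagcollect; verify the
# date format for your AHF version before running.
diagcollect_cmd() {
    printf 'tfactl diagcollect -crs -from "%s" -to "%s"' "$1" "$2"
}

# Example (placeholder window):
# eval "$(diagcollect_cmd 'Apr/12/2026 09:00:00' 'Apr/12/2026 13:00:00')"
```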
Summary
| Problem | Root Cause | Fix |
|---|---|---|
| CRS alert log not found at ADR path | 11.2 GI uses $GRID_HOME/log/<hostname>/, not ADR | Collect from $GRID_HOME/log/ directly |
| TFA-00002 on both nodes | AHF upgrade left TFA in broken state | Clean uninstall + reinstall of AHF 26.3.0 |
| TFA not starting after syncnodes | Deeper corruption beyond cert mismatch | Full reinstall with -local flag on each node |
Key Commands Reference
# Find CRS alert log on 11.2 GI
ls $GRID_HOME/log/$(hostname)/alert$(hostname).log
# Collect CRS logs manually
tar cvf /tmp/crstrace.$(hostname).tar \
$GRID_HOME/log/$(hostname)/crsd/crsd.log \
$GRID_HOME/log/$(hostname)/crsd/crsd.l01 \
$GRID_HOME/log/$(hostname)/alert$(hostname).log \
$GRID_HOME/log/$(hostname)/cssd/ \
$GRID_HOME/log/$(hostname)/ohasd/
# Check TFA status
tfactl print status
# Check actual TFA error
journalctl -u oracle-tfa --no-pager | tail -30
# Uninstall AHF
ahfctl uninstall -local
# Reinstall AHF (cluster-wide)
./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir <data_dir>
# Reinstall AHF (local node only)
./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir <data_dir> -local
# Collect TFA diagnostics for Support
tfactl diagnosetfa