Tag: AHF

  • How I Fixed Oracle AHF/TFA Not Starting on an 11g RAC Cluster (TFA-00002 / AHF-07250)

    Category: Oracle RAC | Troubleshooting | AHF/TFA  |  Level: Intermediate to Advanced


    Background

    We recently had a critical production incident on our two-node Oracle 11g RAC cluster where the Fast Recovery Area (FRA) hit capacity, causing both instances to enter an INTERMEDIATE state due to a Stuck Archiver condition. Oracle Support raised an SR and asked for CRS diagnostic data collected using TFA (Trace File Analyzer).

    That’s when we discovered a second problem — TFA was completely non-functional on both nodes with the infamous TFA-00002 error. This post documents the full journey of diagnosing and fixing TFA, and how we manually collected the CRS logs for the SR in the meantime.


    The SR Request

    Oracle Support requested the following:

    1. CRS alert log from all nodes: <ORACLE_BASE>/diag/crs/*/crs/trace/alert.log
    2. All CRS-related trace files updated during the incident period

    Step 1 — Finding the CRS Alert Log

    The first challenge was locating the CRS logs. This cluster has a separate Grid Infrastructure installation with a different OS user (grid) from the database (oracle).

    [oracle@racnode1 ~]$ echo $ORACLE_BASE
    /u01/oradb/oracle

    Switching to the grid user:

    su - grid
    echo $ORACLE_HOME
    # /u01/oragrid/11.2/grid

    ORACLE_BASE was not set for the grid user, so we used the orabase binary:

    $ORACLE_HOME/bin/orabase
    # /u01/oragrid/oracle

    However, the ADR path (/u01/oragrid/oracle/diag/crs/*/crs/trace/alert.log) didn’t exist. This is because Oracle 11.2 Grid Infrastructure uses a different log location — not the ADR diag tree. The correct format is:

    $GRID_HOME/log/<hostname>/alert<hostname>.log

    The correct path on our cluster:

    ls -lh /u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log
    # -rw-rw-r--. 1 grid oinstall 14M Apr 12 12:53 alertracnode1.log

    ⚠️ Key takeaway: On 11.2 GI, CRS alert logs live under $GRID_HOME/log/<hostname>/ — not in the ADR structure used by 12c and later.
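    The two layouts can be captured in a tiny helper. This is my own sketch (the function name and arguments are made up, not an Oracle tool); it just encodes the version split described above:

```shell
# Hypothetical helper: print where the CRS alert log should live for a
# given GI version. Pre-12c keeps it under $GRID_HOME/log/<host>/;
# 12c and later moved it into the ADR tree under ORACLE_BASE.
crs_alert_path() {
  local version="$1" grid_home="$2" host="$3" oracle_base="$4"
  case "$version" in
    10.*|11.*) echo "$grid_home/log/$host/alert$host.log" ;;
    *)         echo "$oracle_base/diag/crs/$host/crs/trace/alert.log" ;;
  esac
}

# On this cluster:
crs_alert_path 11.2 /u01/oragrid/11.2/grid racnode1 /u01/oragrid/oracle
# /u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log
```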


    Step 2 — Manually Collecting CRS Logs

    The log directory structure under $GRID_HOME/log/<hostname>/ includes:

    alertracnode1.log    ← Main CRS alert log
    crsd/                ← CRSD rotating logs (crsd.log, crsd.l01, crsd.l02 ...)
    cssd/                ← CSS daemon logs
    ohasd/               ← Oracle High Availability Services logs
    ctssd/               ← Cluster Time Sync Service logs

    Since TFA was broken, we collected manually:

    On node 1:

    tar cvf /tmp/crstrace.racnode1.$(date +%Y%m%d%H%M%S).tar \
      /u01/oragrid/11.2/grid/log/racnode1/crsd/crsd.log \
      /u01/oragrid/11.2/grid/log/racnode1/crsd/crsd.l01 \
      /u01/oragrid/11.2/grid/log/racnode1/crsd/crsdOUT.log \
      /u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log \
      /u01/oragrid/11.2/grid/log/racnode1/cssd/ \
      /u01/oragrid/11.2/grid/log/racnode1/ohasd/
    
    zip /tmp/crstrace.racnode1.zip /tmp/crstrace.racnode1.*.tar

    On node 2:

    ssh racnode2 "tar cvf /tmp/crstrace.racnode2.$(date +%Y%m%d%H%M%S).tar \
      /u01/oragrid/11.2/grid/log/racnode2/crsd/crsd.log \
      /u01/oragrid/11.2/grid/log/racnode2/crsd/crsd.l01 \
      /u01/oragrid/11.2/grid/log/racnode2/alertracnode2.log \
      /u01/oragrid/11.2/grid/log/racnode2/cssd/ \
      /u01/oragrid/11.2/grid/log/racnode2/ohasd/ && \
      zip /tmp/crstrace.racnode2.zip /tmp/crstrace.racnode2.*.tar"
    
    scp racnode2:/tmp/crstrace.racnode2.zip /tmp/

    Note: The crsd logs use a rotating format (.log, .l01, .l02 …) — not .trc files. The incident-period data was in crsd.l01.
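    To avoid shipping every rotated generation, you can filter by modification time instead of naming files one by one. The helper below is my own sketch (the function name and window timestamps are placeholders, not from the SR) and assumes GNU find and GNU tar:

```shell
# Sketch: tar up only the files in a log tree that were modified during
# the incident window. Requires GNU find (-newermt) and GNU tar (--null -T -).
collect_window() {
  local dir="$1" start="$2" end="$3" out="$4"
  find "$dir" -type f -newermt "$start" ! -newermt "$end" -print0 |
    tar cf "$out" --null -T -
}

# Placeholder window; substitute the timestamps Oracle Support asked about:
# collect_window /u01/oragrid/11.2/grid/log/racnode1 \
#   "2026-04-12 10:00" "2026-04-12 13:00" /tmp/crstrace.window.tar
```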


    Step 3 — Diagnosing TFA-00002

    With the SR logs uploaded, we turned to fixing TFA. Here’s what we found:

    tfactl status
    # TFA-00002 Oracle Trace File Analyzer (TFA) is not running
    # TFA-00107 TFA failed to start after multiple attempts of start (retries from init.tfa)

    Checking AHF Installation Layout

    cat /etc/oracle.ahf.loc
    # /opt/oracle.ahf
    
    cat /opt/oracle.ahf/install.properties
    # AHF_HOME=/opt/oracle.ahf
    # BUILD_VERSION=2603000
    # BUILD_DATE=202604061821
    # TFA_HOME=/opt/oracle.ahf/tfa
    # DATA_DIR=/u01/oragrid/oracle/oracle.ahf/data

    The AHF binaries were at /opt/oracle.ahf/ and data at /u01/oragrid/oracle/oracle.ahf/data/ — a non-default split layout.
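    With a split layout like this, scripts shouldn't hard-code either path. A small reader for install.properties (my own helper, not an AHF utility) keeps both locations in one place:

```shell
# Hypothetical helper: read one key=value entry from AHF's install.properties.
ahf_prop() {
  local key="$1" file="${2:-/opt/oracle.ahf/install.properties}"
  awk -F= -v k="$key" '$1 == k { print $2 }' "$file"
}

# Usage:
# TFA_HOME=$(ahf_prop TFA_HOME)
# DATA_DIR=$(ahf_prop DATA_DIR)
```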

    The Actual Error — AHF-07250

    Checking the systemd journal revealed the real error:

    journalctl -u oracle-tfa --no-pager | tail -20
    
    init.tfa: AHF-07250: Cannot establish connection with TFA Server.
    init.tfa: Cause: Cannot establish connection with TFA server on 5000.
    init.tfa: Action: Ensure that communication is open on port 5000 and
              that no firewall is blocking port 5000.
    init.tfa: ERROR: TFAMain is spawning too fast, Human intervention required!!!
    init.tfa: Disabling TFA at : ...

    What We Ruled Out

    | Check                                    | Result                           |
    |------------------------------------------|----------------------------------|
    | Port 5000 blocked by iptables            | Not blocked — policy ACCEPT      |
    | SELinux enforcing                        | Disabled                         |
    | Java missing/incompatible                | Java 11.0.30 — fine              |
    | Disk space                               | 16GB free on /, 410GB on /u01    |
    | portmapping.txt / ssl.properties missing | Missing — but not the root cause |

    The TFA Java process was crashing before it could bind to port 5000. The AHF upgrade had left TFA in an unrecoverably broken state on both nodes.
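    A quick way to confirm that pattern (my own check, not from the SR): if nothing is listening on 5000 and no TFAMain JVM is alive, the firewall hint in AHF-07250 is a red herring:

```shell
# Returns success if some process is listening on the given TCP port
# (inspects the local-address column of `ss -tln`).
port_in_use() {
  ss -tln 2>/dev/null | awk -v p=":$1" '$4 ~ p "$"' | grep -q .
}

if port_in_use 5000; then
  echo "port 5000 is occupied"
else
  echo "port 5000 is free - the listener died before binding"
fi
```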

    Attempted Fix — tfactl syncnodes

    tfactl syncnodes
    # Generating new TFA Certificates...
    # Successfully generated certificates.
    # ...
    # TFA-00002 Oracle Trace File Analyzer (TFA) is not running

    Certificates were synced successfully but TFA still wouldn’t start. The issue was deeper than certificate mismatches.


    Step 4 — The Fix: Clean AHF Reinstall

    Uninstall on node 1

    ahfctl uninstall -local
    # AHF will be uninstalled on: racnode1
    # Do you want to continue with AHF uninstall ? [Y]|N : Y
    # ...
    # CHA is disabled

    Note: Uninstalling AHF does NOT remove the data/repository directory, so historical collections and diag data are preserved.

    Download AHF Installer

    Download AHF-LINUX_v26.3.0.zip from My Oracle Support and stage it to /tmp/ on node 1.

    🔗 MOS Doc ID 2550798.1 — Autonomous Health Framework (AHF) Download

    Reinstall on both nodes from node 1

    unzip /tmp/AHF-LINUX_v26.3.0.zip -d /tmp/ahf_install
    cd /tmp/ahf_install
    ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir /u01/oragrid/oracle/oracle.ahf/data

    Answer N to email notification. Answer Y to install on cluster nodes when prompted.

    Node 2 needed a separate local reinstall

    The cluster-wide install didn’t fully fix node 2. We reinstalled locally using the -local flag:

    # From node 1
    scp /tmp/AHF-LINUX_v26.3.0.zip racnode2:/tmp/
    
    ssh racnode2 "ahfctl uninstall -local"
    
    ssh racnode2 "unzip /tmp/AHF-LINUX_v26.3.0.zip -d /tmp/ahf_install && \
      cd /tmp/ahf_install && \
      ./ahf_setup -ahf_loc /opt/oracle.ahf \
      -data_dir /u01/oragrid/oracle/oracle.ahf/data -local"

    The -local flag skips cluster coordination and installs cleanly on the local node only.


    Final Verification

    tfactl print status
    
    | Host     | Status of TFA | PID   | Port | Version    | Inventory Status |
    |----------|---------------|-------|------|------------|------------------|
    | racnode1 | RUNNING       |  6355 | 5000 | 26.3.0.0.0 | COMPLETE         |
    | racnode2 | RUNNING       | 28301 | 5000 | 26.3.0.0.0 | COMPLETE         |

    Both nodes RUNNING with COMPLETE inventory status. ✅


    Summary

    | Problem                             | Root Cause                                      | Fix                                        |
    |-------------------------------------|-------------------------------------------------|--------------------------------------------|
    | CRS alert log not found at ADR path | 11.2 GI uses $GRID_HOME/log/<hostname>/, not ADR | Collect from $GRID_HOME/log/ directly      |
    | TFA-00002 on both nodes             | AHF upgrade left TFA in broken state            | Clean uninstall + reinstall of AHF 26.3.0  |
    | TFA not starting after syncnodes    | Deeper corruption beyond cert mismatch          | Full reinstall with -local flag on each node |

    Key Commands Reference

    # Find CRS alert log on 11.2 GI
    ls $GRID_HOME/log/$(hostname)/alert$(hostname).log
    
    # Collect CRS logs manually
    tar cvf /tmp/crstrace.$(hostname).tar \
      $GRID_HOME/log/$(hostname)/crsd/crsd.log \
      $GRID_HOME/log/$(hostname)/crsd/crsd.l01 \
      $GRID_HOME/log/$(hostname)/alert$(hostname).log \
      $GRID_HOME/log/$(hostname)/cssd/ \
      $GRID_HOME/log/$(hostname)/ohasd/
    
    # Check TFA status
    tfactl print status
    
    # Check actual TFA error
    journalctl -u oracle-tfa --no-pager | tail -30
    
    # Uninstall AHF
    ahfctl uninstall -local
    
    # Reinstall AHF (cluster-wide)
    ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir <data_dir>
    
    # Reinstall AHF (local node only)
    ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir <data_dir> -local
    
    # Collect TFA diagnostics for Support
    tfactl diagnosetfa

    References