Tag: AHF

  • How I Fixed Oracle AHF/TFA Not Starting on an 11g RAC Cluster (TFA-00002 / AHF-07250)

    Category: Oracle RAC | Troubleshooting | AHF/TFA  |  Level: Intermediate to Advanced


    Background

    We recently had a critical production incident on our two-node Oracle 11g RAC cluster where the Fast Recovery Area (FRA) hit capacity, causing both instances to enter an INTERMEDIATE state due to a Stuck Archiver condition. Oracle Support raised an SR and asked for CRS diagnostic data collected using TFA (Trace File Analyzer).

    That’s when we discovered a second problem — TFA was completely non-functional on both nodes with the infamous TFA-00002 error. This post documents the full journey of diagnosing and fixing TFA, and how we manually collected the CRS logs for the SR in the meantime.


    The SR Request

    Oracle Support requested the following:

    1. CRS alert log from all nodes: <ORACLE_BASE>/diag/crs/*/crs/trace/alert.log
    2. All CRS-related trace files updated during the incident period

    Step 1 — Finding the CRS Alert Log

    The first challenge was locating the CRS logs. This cluster has a separate Grid Infrastructure installation with a different OS user (grid) from the database (oracle).

    [oracle@racnode1 ~]$ echo $ORACLE_BASE
    /u01/oradb/oracle

    Switching to the grid user:

    su - grid
    echo $ORACLE_HOME
    # /u01/oragrid/11.2/grid

    ORACLE_BASE was not set for the grid user, so we used the orabase binary:

    $ORACLE_HOME/bin/orabase
    # /u01/oragrid/oracle

    However, the ADR path (/u01/oragrid/oracle/diag/crs/*/crs/trace/alert.log) didn’t exist. This is because Oracle 11.2 Grid Infrastructure uses a different log location — not the ADR diag tree. The correct format is:

    $GRID_HOME/log/<hostname>/alert<hostname>.log

    The correct path on our cluster:

    ls -lh /u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log
    # -rw-rw-r--. 1 grid oinstall 14M Apr 12 12:53 alertracnode1.log

    ⚠️ Key takeaway: On 11.2 GI, CRS alert logs live under $GRID_HOME/log/<hostname>/ — not in the ADR structure used by 12c and later.
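    The two layouts can be captured in a tiny helper. This is my own sketch (the function name and arguments are made up, not an Oracle tool); it just encodes the version split described above:

```shell
# Hypothetical helper: print where the CRS alert log should live for a
# given GI version. Pre-12c keeps it under $GRID_HOME/log/<host>/;
# 12c and later moved it into the ADR tree under ORACLE_BASE.
crs_alert_path() {
  local version="$1" grid_home="$2" host="$3" oracle_base="$4"
  case "$version" in
    10.*|11.*) echo "$grid_home/log/$host/alert$host.log" ;;
    *)         echo "$oracle_base/diag/crs/$host/crs/trace/alert.log" ;;
  esac
}

# On this cluster:
crs_alert_path 11.2 /u01/oragrid/11.2/grid racnode1 /u01/oragrid/oracle
# /u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log
```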


    Step 2 — Manually Collecting CRS Logs

    The log directory structure under $GRID_HOME/log/<hostname>/ includes:

    alertracnode1.log    ← Main CRS alert log
    crsd/                ← CRSD rotating logs (crsd.log, crsd.l01, crsd.l02 ...)
    cssd/                ← CSS daemon logs
    ohasd/               ← Oracle High Availability Services logs
    ctssd/               ← Cluster Time Sync Service logs

    Since TFA was broken, we collected manually:

    On node 1:

    tar cvf /tmp/crstrace.racnode1.$(date +%Y%m%d%H%M%S).tar \
      /u01/oragrid/11.2/grid/log/racnode1/crsd/crsd.log \
      /u01/oragrid/11.2/grid/log/racnode1/crsd/crsd.l01 \
      /u01/oragrid/11.2/grid/log/racnode1/crsd/crsdOUT.log \
      /u01/oragrid/11.2/grid/log/racnode1/alertracnode1.log \
      /u01/oragrid/11.2/grid/log/racnode1/cssd/ \
      /u01/oragrid/11.2/grid/log/racnode1/ohasd/
    
    zip /tmp/crstrace.racnode1.zip /tmp/crstrace.racnode1.*.tar

    On node 2:

    ssh racnode2 "tar cvf /tmp/crstrace.racnode2.$(date +%Y%m%d%H%M%S).tar \
      /u01/oragrid/11.2/grid/log/racnode2/crsd/crsd.log \
      /u01/oragrid/11.2/grid/log/racnode2/crsd/crsd.l01 \
      /u01/oragrid/11.2/grid/log/racnode2/alertracnode2.log \
      /u01/oragrid/11.2/grid/log/racnode2/cssd/ \
      /u01/oragrid/11.2/grid/log/racnode2/ohasd/ && \
      zip /tmp/crstrace.racnode2.zip /tmp/crstrace.racnode2.*.tar"
    
    scp racnode2:/tmp/crstrace.racnode2.zip /tmp/

    Note: The crsd logs use a rotating format (.log, .l01, .l02 …) — not .trc files. The incident-period data was in crsd.l01.
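    To avoid shipping every rotated generation, you can filter by modification time instead of naming files one by one. The helper below is my own sketch (the function name and window timestamps are placeholders, not from the SR) and assumes GNU find and GNU tar:

```shell
# Sketch: tar up only the files in a log tree that were modified during
# the incident window. Requires GNU find (-newermt) and GNU tar (--null -T -).
collect_window() {
  local dir="$1" start="$2" end="$3" out="$4"
  find "$dir" -type f -newermt "$start" ! -newermt "$end" -print0 |
    tar cf "$out" --null -T -
}

# Placeholder window; substitute the timestamps Oracle Support asked about:
# collect_window /u01/oragrid/11.2/grid/log/racnode1 \
#   "2026-04-12 10:00" "2026-04-12 13:00" /tmp/crstrace.window.tar
```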


    Step 3 — Diagnosing TFA-00002

    With the SR logs uploaded, we turned to fixing TFA. Here’s what we found:

    tfactl status
    # TFA-00002 Oracle Trace File Analyzer (TFA) is not running
    # TFA-00107 TFA failed to start after multiple attempts of start (retries from init.tfa)

    Checking AHF Installation Layout

    cat /etc/oracle.ahf.loc
    # /opt/oracle.ahf
    
    cat /opt/oracle.ahf/install.properties
    # AHF_HOME=/opt/oracle.ahf
    # BUILD_VERSION=2603000
    # BUILD_DATE=202604061821
    # TFA_HOME=/opt/oracle.ahf/tfa
    # DATA_DIR=/u01/oragrid/oracle/oracle.ahf/data

    The AHF binaries were at /opt/oracle.ahf/ and data at /u01/oragrid/oracle/oracle.ahf/data/ — a non-default split layout.
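    With a split layout like this, scripts shouldn't hard-code either path. A small reader for install.properties (my own helper, not an AHF utility) keeps both locations in one place:

```shell
# Hypothetical helper: read one key=value entry from AHF's install.properties.
ahf_prop() {
  local key="$1" file="${2:-/opt/oracle.ahf/install.properties}"
  awk -F= -v k="$key" '$1 == k { print $2 }' "$file"
}

# Usage:
# TFA_HOME=$(ahf_prop TFA_HOME)
# DATA_DIR=$(ahf_prop DATA_DIR)
```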

    The Actual Error — AHF-07250

    Checking the systemd journal revealed the real error:

    journalctl -u oracle-tfa --no-pager | tail -20
    
    init.tfa: AHF-07250: Cannot establish connection with TFA Server.
    init.tfa: Cause: Cannot establish connection with TFA server on 5000.
    init.tfa: Action: Ensure that communication is open on port 5000 and
              that no firewall is blocking port 5000.
    init.tfa: ERROR: TFAMain is spawning too fast, Human intervention required!!!
    init.tfa: Disabling TFA at : ...

    What We Ruled Out

    | Check                                    | Result                           |
    |------------------------------------------|----------------------------------|
    | Port 5000 blocked by iptables            | Not blocked — policy ACCEPT      |
    | SELinux enforcing                        | Disabled                         |
    | Java missing/incompatible                | Java 11.0.30 — fine              |
    | Disk space                               | 16GB free on /, 410GB on /u01    |
    | portmapping.txt / ssl.properties missing | Missing — but not the root cause |

    The TFA Java process was crashing before it could bind to port 5000. The AHF upgrade had left TFA in an unrecoverably broken state on both nodes.
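    A quick way to confirm that pattern (my own check, not from the SR): if nothing is listening on 5000 and no TFAMain JVM is alive, the firewall hint in AHF-07250 is a red herring:

```shell
# Returns success if some process is listening on the given TCP port
# (inspects the local-address column of `ss -tln`).
port_in_use() {
  ss -tln 2>/dev/null | awk -v p=":$1" '$4 ~ p "$"' | grep -q .
}

if port_in_use 5000; then
  echo "port 5000 is occupied"
else
  echo "port 5000 is free - the listener died before binding"
fi
```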

    Attempted Fix — tfactl syncnodes

    tfactl syncnodes
    # Generating new TFA Certificates...
    # Successfully generated certificates.
    # ...
    # TFA-00002 Oracle Trace File Analyzer (TFA) is not running

    Certificates were synced successfully but TFA still wouldn’t start. The issue was deeper than certificate mismatches.


    Step 4 — The Fix: Clean AHF Reinstall

    Uninstall on node 1

    ahfctl uninstall -local
    # AHF will be uninstalled on: racnode1
    # Do you want to continue with AHF uninstall ? [Y]|N : Y
    # ...
    # CHA is disabled

    Note: Uninstalling AHF does NOT remove the data/repository directory, so historical collections and diag data are preserved.

    Download AHF Installer

    Download AHF-LINUX_v26.3.0.zip from My Oracle Support and stage it to /tmp/ on node 1.

    🔗 MOS Doc ID 2550798.1 — Autonomous Health Framework (AHF) Download

    Reinstall on both nodes from node 1

    unzip /tmp/AHF-LINUX_v26.3.0.zip -d /tmp/ahf_install
    cd /tmp/ahf_install
    ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir /u01/oragrid/oracle/oracle.ahf/data

    Answer N to email notification. Answer Y to install on cluster nodes when prompted.

    Node 2 needed a separate local reinstall

    The cluster-wide install didn’t fully fix node 2. We reinstalled locally using the -local flag:

    # From node 1
    scp /tmp/AHF-LINUX_v26.3.0.zip racnode2:/tmp/
    
    ssh racnode2 "ahfctl uninstall -local"
    
    ssh racnode2 "unzip /tmp/AHF-LINUX_v26.3.0.zip -d /tmp/ahf_install && \
      cd /tmp/ahf_install && \
      ./ahf_setup -ahf_loc /opt/oracle.ahf \
      -data_dir /u01/oragrid/oracle/oracle.ahf/data -local"

    The -local flag skips cluster coordination and installs cleanly on the local node only.


    Final Verification

    tfactl print status
    
    | Host     | Status of TFA | PID   | Port | Version    | Inventory Status |
    |----------|---------------|-------|------|------------|------------------|
    | racnode1 | RUNNING       |  6355 | 5000 | 26.3.0.0.0 | COMPLETE         |
    | racnode2 | RUNNING       | 28301 | 5000 | 26.3.0.0.0 | COMPLETE         |

    Both nodes RUNNING with COMPLETE inventory status. ✅


    Summary

    | Problem                             | Root Cause                                      | Fix                                        |
    |-------------------------------------|-------------------------------------------------|--------------------------------------------|
    | CRS alert log not found at ADR path | 11.2 GI uses $GRID_HOME/log/<hostname>/, not ADR | Collect from $GRID_HOME/log/ directly      |
    | TFA-00002 on both nodes             | AHF upgrade left TFA in broken state            | Clean uninstall + reinstall of AHF 26.3.0  |
    | TFA not starting after syncnodes    | Deeper corruption beyond cert mismatch          | Full reinstall with -local flag on each node |

    Key Commands Reference

    # Find CRS alert log on 11.2 GI
    ls $GRID_HOME/log/$(hostname)/alert$(hostname).log
    
    # Collect CRS logs manually
    tar cvf /tmp/crstrace.$(hostname).tar \
      $GRID_HOME/log/$(hostname)/crsd/crsd.log \
      $GRID_HOME/log/$(hostname)/crsd/crsd.l01 \
      $GRID_HOME/log/$(hostname)/alert$(hostname).log \
      $GRID_HOME/log/$(hostname)/cssd/ \
      $GRID_HOME/log/$(hostname)/ohasd/
    
    # Check TFA status
    tfactl print status
    
    # Check actual TFA error
    journalctl -u oracle-tfa --no-pager | tail -30
    
    # Uninstall AHF
    ahfctl uninstall -local
    
    # Reinstall AHF (cluster-wide)
    ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir <data_dir>
    
    # Reinstall AHF (local node only)
    ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir <data_dir> -local
    
    # Collect TFA diagnostics for Support
    tfactl diagnosetfa

    References