jomof
New Contributor III

Examination of HA log reveals a critical error

Hello Expert,

 

I have two 400E boxes configured as an active/passive cluster.

Recently we have been seeing a lot of lost heartbeat errors in the HA log.

For redundancy, we quickly added another heartbeat interface to the HA configuration.

We are still receiving the error messages (see screenshots).

[Screenshots: Screenshot 2024-05-19 144817.png, Screenshot 2024-05-19 144625.png]

 

I humbly request some urgent feedback on the way forward.

 

Thank you.


ozkanaltas
Contributor III

Hi @jomof ,

 

In my experience, this error is caused by the cable. Have you tried changing the cable connected to the HA port?

If you have found a solution, please like and accept it to make it easily accessible to others.
NSE 4-5-6-7 OT Sec - ENT FW
jomof
New Contributor III

Yes, we changed the cable, but the error is the same. I am unsure about my next step.

 

Thanks

amuda
Staff

Hi @jomof ,

 

Are the HA units directly connected, or do they go through a switch, etc.?

 

You may refer to the article below on how to troubleshoot HA heartbeat packet loss:

 

https://community.fortinet.com/t5/FortiGate/Troubleshooting-Tip-How-to-troubleshoot-HA-Heartbeat-pac...
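
One check that may help here is to look at the error counters on the heartbeat interfaces of both units. This is only a sketch: the interface names ha and port16 below are assumptions (use the ports listed under hbdev in your own configuration), and the exact counter names in the output vary by model and FortiOS version.

```
diagnose hardware deviceinfo nic ha
diagnose hardware deviceinfo nic port16
```

If the error or drop counters on a heartbeat port keep increasing between two runs, that points toward the link or port rather than toward load on the unit.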

Amerul
APAC TAC
jomof
New Contributor III

Hello @Ramu 

 

"Is this HA directly connected or they are going through a switch etc?"

They are connected directly.

 

Regards

srajeswaran
Staff

Can you run "get system performance status" to see if any specific cores have high CPU usage? You may also check the CPU graphs for the last 24 hours to see if there were any CPU spikes matching the heartbeat messages.

 

Can you also check the heartbeat interval?

Example configuration:

 

config system ha
    set hb-lost-threshold 6
    set hello-holddown 20
    set hb-interval 2
end

Ref: https://community.fortinet.com/t5/FortiGate/Technical-Tip-Changing-the-HA-heartbeat-timers-to-preven...
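
To see which heartbeat timers are currently in effect before changing them, something like the following can be used. It is a sketch that assumes the CLI grep filter is available on your FortiOS version.

```
show full-configuration system ha | grep hb-
```

As a rough guide, assuming hb-interval is in its usual units of 100 ms, the time needed to declare the peer lost is hb-interval x 100 ms x hb-lost-threshold. With the example values above that is 2 x 100 ms x 6 = 1.2 seconds, so raising either value gives the cluster more tolerance for short bursts of delay at the cost of slower failure detection.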

Regards,

Suraj

- Have you found a solution? Then give your helper a "Kudos" and mark the solution.

jomof
New Contributor III

Hello @srajeswaran 

 

Guy-Office-1 # get system performance status
CPU states: 2% user 2% system 0% nice 96% idle 0% iowait 0% irq 0% softirq
CPU0 states: 2% user 2% system 0% nice 95% idle 0% iowait 0% irq 1% softirq
CPU1 states: 2% user 0% system 0% nice 98% idle 0% iowait 0% irq 0% softirq
CPU2 states: 3% user 1% system 0% nice 96% idle 0% iowait 0% irq 0% softirq
CPU3 states: 3% user 1% system 0% nice 95% idle 0% iowait 0% irq 1% softirq
CPU4 states: 1% user 3% system 0% nice 96% idle 0% iowait 0% irq 0% softirq
CPU5 states: 3% user 4% system 0% nice 93% idle 0% iowait 0% irq 0% softirq
Memory: 8040408k total, 3803304k used (47.3%), 3520720k free (43.8%), 716384k freeable (8.9%)
Average network usage: 50566 / 50380 kbps in 1 minute, 50901 / 51482 kbps in 10 minutes, 52656 / 53475 kbps in 30 minutes
Maximal network usage: 78752 / 75084 kbps in 1 minute, 84941 / 85680 kbps in 10 minutes, 148915 / 155399 kbps in 30 minutes
Average sessions: 14999 sessions in 1 minute, 15144 sessions in 10 minutes, 15484 sessions in 30 minutes
Maximal sessions: 16574 sessions in 1 minute, 16574 sessions in 10 minutes, 18803 sessions in 30 minutes
Average session setup rate: 210 sessions per second in last 1 minute, 186 sessions per second in last 10 minutes, 195 sessions per second in last 30 minutes
Maximal session setup rate: 726 sessions per second in last 1 minute, 734 sessions per second in last 10 minutes, 734 sessions per second in last 30 minutes
Average NPU sessions: 6398 sessions in last 1 minute, 6463 sessions in last 10 minutes, 6393 sessions in last 30 minutes
Maximal NPU sessions: 6531 sessions in last 1 minute, 6726 sessions in last 10 minutes, 7059 sessions in last 30 minutes
Average nTurbo sessions: 6329 sessions in last 1 minute, 6394 sessions in last 10 minutes, 6324 sessions in last 30 minutes
Maximal nTurbo sessions: 6462 sessions in last 1 minute, 6656 sessions in last 10 minutes, 6992 sessions in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 692 days, 19 hours, 53 minutes

srajeswaran

This output looks normal, and we would not expect heartbeat misses under this load. We need to check this output while the issue is happening. Can you check the CPU usage history to see if there were any spikes matching the timeframe of the reported heartbeat misses?

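Also, we can try increasing hb-interval to 3 for a couple of days and check the behavior. Note that hb-interval is normally expressed in units of 100 ms, so a value of 3 means 300 ms between heartbeats rather than 3 seconds.

config system ha
    set hb-lost-threshold 6
    set hello-holddown 20
    set hb-interval 3
end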

Regards,

Suraj


fricci_FTNT
Staff

Hi @jomof ,

 

My understanding is that you are seeing those missing heartbeat messages at random and that the 2 units are directly connected by a cable. You have already attempted the steps below:

- replaced the HA cable with a brand new one, and the issue is still there

- added a second port to the HA config settings, and the messages are still there

 

I can see that the error is not very frequent. In my experience, when there is a hardware fault those messages are usually more frequent. Did you notice whether the message appears when there is a particular peak in traffic?
Has the traffic increased in the past weeks?
In addition to what my colleague Suraj suggested, please check/monitor the number of sessions, especially while the packet lost message is seen.

get sys performance status
get sys performance firewall statistics
diag sys session stat
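
To line the session numbers up with the heartbeat events, the HA state and event history can be captured at the same time as the commands above. This is a minimal sketch, assuming these commands are available on your FortiOS version (the output format differs between releases):

```
get system ha status
diagnose sys ha history read
```

Comparing the timestamps in the history output with the times of the session peaks should show whether the two actually coincide.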


If you notice too many sessions when the messages are seen, you may try to implement a delay in session synchronization:

config system ha
 set session-pickup-delay enable
end

or dedicate a second port just to the session sync ("set session-sync-dev portX").
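
A minimal sketch of the dedicated session sync port option, assuming session pickup is in use and that port10 is a free interface cabled directly between the two units (the port name is purely illustrative):

```
config system ha
    set session-pickup enable
    set session-sync-dev "port10"
end
```

With this in place, session synchronization traffic uses the named interface instead of sharing the heartbeat link.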

The article below might help:
https://docs.fortinet.com/document/fortigate/6.0.0/handbook/495912/improving-session-sync-performanc...

Best regards,

 

---
If you have found a useful article or a solution, please like and accept it to make it easily accessible to others.
jomof
New Contributor III

Hello Fricci,

 

Thanks for the information. Indeed, we notice a huge spike in sessions when the errors occur.

 

See the results below.

 

Guy-Office-1 # get sys performance firewall statistics
getting traffic statistics...
Browsing: 119867117877 packets, 61785353517897 bytes
DNS: 1568453097 packets, 194894504435 bytes
E-Mail: 24416031 packets, 3368362182 bytes
FTP: 12721284 packets, 10862589103 bytes
Gaming: 0 packets, 0 bytes
IM: 3717 packets, 195104 bytes
Newsgroups: 1322 packets, 68336 bytes
P2P: 16659 packets, 878305 bytes
Streaming: 28082537 packets, 1141430125 bytes
TFTP: 235305 packets, 46127443 bytes
VoIP: 141822 packets, 39461303 bytes
Generic TCP: 96858224209 packets, 65140834221344 bytes
Generic UDP: 40119232847 packets, 24020129057933 bytes
Generic ICMP: 7513076877 packets, 359249990329 bytes
Generic IP: 39421561 packets, 2530947544 bytes

 

Guy-Office-1 # diag sys session stat
misc info: session_count=14425 setup_rate=135 exp_count=430 clash=782
memory_tension_drop=0 ephemeral=0/588800 removeable=0
npu_session_count=6429
nturbo_session_count=6360
delete=758971, flush=60840, dev_down=230/174 ses_walkers=0
TCP sessions:
430 in NONE state
6202 in ESTABLISHED state
371 in SYN_SENT state
4 in SYN_RECV state
12 in FIN_WAIT state
198 in TIME_WAIT state
291 in CLOSE state
167 in CLOSE_WAIT state
firewall error stat:
error1=00000000
error2=00000000
error3=00000000
error4=00000000
tt=00000000
cont=00000000
ips_recv=3efb4c31
url_recv=00000000
av_recv=00ffbf62
fqdn_count=0000001e
fqdn6_count=00000000
global: ses_limit=0 ses6_limit=0 rt_limit=0 rt6_limit=0
