FortiSIEM
FortiSIEM provides Security Information and Event Management (SIEM) and User and Entity Behavior Analytics (UEBA)
FortiKoala
Staff
Article Id 194646
Description

How Do I Safely Reboot the Elasticsearch Node(s)?


Scope

Installation and Administration


Solution

Preparing for Reboot

To safely reboot one of the Elasticsearch nodes, you must first ensure that there is no data throughput. Do this by switching off the Collector Server (CS) on the Windows server(s). The CS can be managed through Internet Information Services (IIS) Manager on the Windows server: select the Collector Server site and, under Manage Website, click Stop.



With the CS stopped, the agents will no longer send events to the server. Instead, they will cache events locally until the CS is back online.


Next, stop the Business Layer. In Task Manager, go to the Services tab and locate the two services ZF.BL and ZF.BL.Nesper. Right-click each of them and click Stop, then confirm that their status reports as Stopped.



Next, stop the Logstash service on the processing node(s). Run the command:

systemctl stop logstash.service

(or ‘service logstash stop’ if you are running Ubuntu 14).

Check that it has stopped successfully with:

systemctl status logstash.service 

(or ‘service logstash status’ on Ubuntu 14).

The last step is to flush the indices to disk, ensuring that all data held in memory is written to disk ahead of the reboot. This prevents data corruption caused by the reboot. Run the following command for each Elasticsearch node:

curl http://<node_machine_name>:9200/_flush
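With several nodes, the flush command can be wrapped in a small loop. A minimal sketch, assuming node names `es-node-01`/`es-node-02` as placeholders you would replace with your own; `flush_url` is a hypothetical helper, not part of the product:

```shell
#!/bin/sh
# Build the flush endpoint for a given node name (hypothetical helper).
flush_url() { printf 'http://%s:9200/_flush' "$1"; }

# Placeholder node names - substitute your own.
for node in es-node-01 es-node-02; do
  echo "flushing $(flush_url "$node")"
  # The real call (requires network access to the node):
  # curl -s "$(flush_url "$node")"
done
```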

You can also perform this with the kopf plugin. See this guide: https://zonefox.freshdesk.com/support/solutions/articles/26000021273-use-elasticsearch-kopf-to-post-...



Reboot

Now you are ready to safely reboot the Elasticsearch nodes as needed. You can monitor the status of your Elasticsearch cluster with the kopf plugin at http://<node_machine_name>:9200/_plugin/kopf



When the top bar is green, your Elasticsearch cluster is ready to write to. If it is red, the cluster is not ready; this will usually happen shortly after you have rebooted one of the nodes, and it may resolve itself after a few minutes.
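If kopf is not available, the same green/yellow/red status can be read from Elasticsearch's cluster health API. A minimal sketch; the node name is a placeholder, and `health_status` is a hypothetical helper for use in scripts:

```shell
#!/bin/sh
# The real query (requires network access to a node):
#   curl -s "http://<node_machine_name>:9200/_cluster/health?pretty"
#
# Hypothetical helper: extract the "status" field from the JSON reply.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Example with a captured reply:
echo '{"cluster_name":"ZoneFox_Cluster","status":"green","number_of_nodes":3}' | health_status
# prints "green"
```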


Ensure that all nodes are listed in the left-hand column. If any are missing (and those nodes have finished their reboot and are back online), SSH into them and run this command:

systemctl status elasticsearch-es-01.service

(‘service elasticsearch-es-01 status’ for Ubuntu 14).
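The cluster's member list can also be read from Elasticsearch's `_cat/nodes` API, which prints one line per node, so counting lines gives the node count. A sketch with sample captured output; `count_nodes` is a hypothetical helper:

```shell
#!/bin/sh
# Real query (requires network access to a node):
#   curl -s "http://<node_machine_name>:9200/_cat/nodes" | count_nodes
#
# Hypothetical helper: count non-empty lines, i.e. one per node.
count_nodes() { grep -c . ; }

# Example with two captured node lines:
printf '10.0.0.1 heap disk es-01\n10.0.0.2 heap disk es-02\n' | count_nodes
# prints "2"
```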

If the service is down, the Elasticsearch service failed to recover from the reboot as expected. The log file /var/log/elasticsearch/ZoneFox_Cluster.log will record any errors that occurred. If you are unable to identify or resolve the issue, please open a support ticket explaining what happened, with this log attached. Upload this log before any attempt to restart the service, as a restart attempt may cause a catastrophic failure.


If you believe that the issue that prevented the service from starting has been resolved or is no longer present, you can attempt to start the Elasticsearch service with this command:

systemctl start elasticsearch-es-01.service

(‘service elasticsearch-es-01 start’ for Ubuntu 14).

Use the status command again to check that the service started successfully. You should see the node rejoin the cluster in kopf. The cluster will enter the red state while it brings indices back online. Depending on the size of the indices, it could take anywhere from several minutes to several hours before the cluster returns to green.


Do not start the next section until the status of the Elasticsearch cluster is green and all nodes have rejoined the cluster. Data loss may occur if you do not follow these instructions. If the cluster continues to fail to recover, please contact ZoneFox support.
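Rather than watching kopf by hand, the wait for green can be scripted against the cluster health API. A minimal sketch; `fetch_health` below is a stub standing in for the real `curl -s "http://<node_machine_name>:9200/_cluster/health"` call, and `wait_for_green` is a hypothetical helper:

```shell
#!/bin/sh
# Stub for demonstration only - replace with the real curl call above.
fetch_health() { echo '{"cluster_name":"ZoneFox_Cluster","status":"green"}'; }

# Poll until the cluster reports green, or give up after N attempts.
wait_for_green() {
  attempts=$1
  status=unknown
  while [ "$attempts" -gt 0 ]; do
    status=$(fetch_health | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p')
    if [ "$status" = "green" ]; then
      echo "green"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 5   # recovery can take minutes to hours; poll gently
  done
  echo "timed out (last status: $status)"
  return 1
}

wait_for_green 3   # prints "green" with the stub above
```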


Recovering After Reboot

First, restart the Logstash service. On the processing node(s), run the command:

systemctl start logstash.service 

(‘service logstash start’ on Ubuntu 14).

Check that the Logstash service has started successfully with the command:

systemctl status logstash.service

(‘service logstash status’ on Ubuntu 14).


Next, restart the Business Layer on the Windows server(s). In Task Manager, go to the Services tab. Right-click the service ZF.BL.Nesper and click Start, and wait until its status is Running. Then do the same for the service ZF.BL and ensure that it is running. The order in which you do this is important: if you started them in the wrong order, stop both services and start them again in the correct order.


Finally, start the CS. Go back to Internet Information Services (IIS) Manager, select the Collector Server site, and click Start. To ensure that the CS is running correctly, go to the URL https://<windows_server>:<port(8080_by_default)>. It should return JSON output.



If it does not, check that the machine and port are accessible. If it returns a different output, such as a server error, please contact ZoneFox support.
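This check can be scripted with curl's `%{http_code}` write-out. A sketch: the host is a placeholder, 8080 is the documented default port, and `classify_cs` is a hypothetical helper that maps the HTTP code to the outcomes described above:

```shell
#!/bin/sh
# Real probe (requires network access; -k skips certificate checks):
#   code=$(curl -k -s -o /dev/null -w '%{http_code}' "https://<windows_server>:8080/")
#
# Hypothetical helper: interpret the HTTP code curl reports.
classify_cs() {
  case "$1" in
    200) echo "CS running" ;;
    000) echo "machine or port unreachable" ;;
    5??) echo "server error - contact support" ;;
    *)   echo "unexpected response ($1)" ;;
  esac
}

classify_cs 200   # prints "CS running"
```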



Checking Recovery

Finally, ensure that the system has recovered to its previous state by confirming that there is a throughput of events and alerts. The simplest and most effective way of doing this is to monitor the Threat Hunting, Policy Alerts, and AI Alerts pages for new items. However, for systems with multiple Windows or processing nodes, this may only confirm that some of those nodes have throughput.


You can also check the system status page in your ZoneFox console. Go to Admin - System and expand the graphs for the different components to check their throughput.


