17 Sep 2015

Log Insight 2.5 - The worker node sending this alert was unable to contact the standalone node

I have identified an issue in Log Insight 2.5 where alerts sent by email or to vROps contain the following text in the message:

“Notification event – The worker node sending this alert was unable to contact the standalone node. You may receive duplicate notifications for this alert.”

I confirmed that forward DNS resolution and reverse lookup both work as expected. I was also able to reproduce the issue in a lab environment where DNS was working correctly.
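For anyone who wants to repeat the DNS check, here is a minimal sketch. It assumes `getent` is available on the node (`nslookup` or `dig` would do the same job), and `check_dns` is just a helper name I made up for illustration. Note that `getent hosts` follows the operating system's resolver order, so an entry in /etc/hosts will also satisfy it.

```shell
# Hypothetical helper: check forward and reverse resolution for one node.
# Arguments: host name, then the IP address it is expected to have.
check_dns() {
    name="$1"; ip="$2"
    # Forward lookup: first address returned for the name.
    fwd=$(getent hosts "$name" | awk '{ print $1; exit }')
    # Reverse lookup: first name returned for the IP.
    rev=$(getent hosts "$ip" | awk '{ print $2; exit }')
    echo "forward: $name -> ${fwd:-no answer}"
    echo "reverse: $ip -> ${rev:-no answer}"
    # Succeed only if both lookups returned something.
    [ -n "$fwd" ] && [ -n "$rev" ]
}

# Example with the lab values from this post (substitute your own):
#   check_dns vrlimn01.spiesr.com 192.168.1.33
```

A zero exit status only tells you that both lookups answered. Compare the printed values by eye to confirm they match what you expect.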

The following information was collected from the lab environment:

<LOGINSIGHTNODE>/storage/var/loginsight/runtime.log shows:

[2015-09-16 17:04:07.981+0000] [ScheduledQueryServiceThread/192.168.1.33 ERROR] [com.vmware.loginsight.notifications.AlertNotifier] [Failed to send alert to standalone, 2 retries remaining.]
        at com.vmware.loginsight.notifications.AlertNotifier.relayToMaster(AlertNotifier.java:181)
        at com.vmware.loginsight.notifications.AlertNotifier.sendAlertNotification(AlertNotifier.java:127)
        at com.vmware.loginsight.notifications.AlertNotifier.sendAlertNotification(AlertNotifier.java:98)
        at com.vmware.loginsight.web.background.ScheduledQueryService$ScheduledQueryServiceImpl.searchAndRaiseAlertIfNeeded(ScheduledQueryService.java:372)

[2015-09-16 17:04:07.982+0000] [ScheduledQueryServiceThread/192.168.1.33 ERROR] [com.vmware.loginsight.notifications.AlertNotifier] [Failed to send alert to standalone, 1 retries remaining.]
        at com.vmware.loginsight.notifications.AlertNotifier.relayToMaster(AlertNotifier.java:181)
        at com.vmware.loginsight.notifications.AlertNotifier.sendAlertNotification(AlertNotifier.java:127)
        at com.vmware.loginsight.notifications.AlertNotifier.sendAlertNotification(AlertNotifier.java:98)
        at com.vmware.loginsight.web.background.ScheduledQueryService$ScheduledQueryServiceImpl.searchAndRaiseAlertIfNeeded(ScheduledQueryService.java:372)

[2015-09-16 17:04:07.983+0000] [ScheduledQueryServiceThread/192.168.1.33 ERROR] [com.vmware.loginsight.notifications.AlertNotifier] [Failed to send alert to standalone, 0 retries remaining.]
        at com.vmware.loginsight.notifications.AlertNotifier.relayToMaster(AlertNotifier.java:181)
        at com.vmware.loginsight.notifications.AlertNotifier.sendAlertNotification(AlertNotifier.java:127)
        at com.vmware.loginsight.notifications.AlertNotifier.sendAlertNotification(AlertNotifier.java:98)
        at com.vmware.loginsight.web.background.ScheduledQueryService$ScheduledQueryServiceImpl.searchAndRaiseAlertIfNeeded(ScheduledQueryService.java:372)

[2015-09-16 17:04:07.984+0000] [ScheduledQueryServiceThread/192.168.1.33 INFO] [com.vmware.loginsight.notifications.AlertNotifier] [Could not connect to Master, sending alert notifications directly.]
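To see how often a node hits this, the two messages above can be counted straight from the log. This is only a sketch: `count_relay_failures` is a made-up helper name, and the default path is the runtime.log location mentioned above.

```shell
# Hypothetical helper: count relay failures and direct-send fallbacks.
# Argument: path to runtime.log (defaults to the location used in this post).
count_relay_failures() {
    log="${1:-/storage/var/loginsight/runtime.log}"
    # grep -c prints 0 when nothing matches, so both values are always set.
    failures=$(grep -c 'Failed to send alert to standalone' "$log")
    fallbacks=$(grep -c 'Could not connect to Master' "$log")
    echo "relay failures: $failures, direct-send fallbacks: $fallbacks"
}
```

In the excerpt above, each alert produces three failed retries followed by one fallback, so on an affected node I would expect the first number to be roughly three times the second.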

The original Log Insight configuration file contains:

(File is located at: /storage/core/loginsight/config/loginsight-config.xml#33)

  <distributed overwrite-children="true">
    <daemon host="vrlimn01.spiesr.com" port="16520" token="d015a445-76c0-42a4-807c-c68f1485642c">
      <service-group name="standalone" />
    </daemon>
    <daemon host="192.168.1.34" port="16520" token="f3a3d23d-8d37-4e15-a4ee-451044841cbd">
      <service-group name="workernode" />
    </daemon>
  </distributed>

When the cluster is created by joining a new worker node through the UI, the configuration file records the master (standalone) node by its FQDN. However, there appears to be a bug where the alert thread fails to resolve that host name, even when DNS is configured and working properly. Strangely, most of the errors in our environment are logged by the master (standalone) node itself, indicating that it was unable to contact the standalone node (itself!).
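A quick way to see how a node refers to the standalone node is to pull the host attribute out of the configuration file. The sketch below assumes the layout shown above, with each `<daemon>` opening tag on one line and its `<service-group>` on a following line. `standalone_host` is a hypothetical helper name, and this is line-based matching rather than a real XML parser.

```shell
# Hypothetical helper: print the host of the daemon in the "standalone" group.
# Argument: path to a loginsight-config.xml#NN file.
standalone_host() {
    awk '
        /<daemon / {
            # Remember the host attribute of the most recent daemon element.
            if (match($0, /host="[^"]*"/))
                host = substr($0, RSTART + 6, RLENGTH - 7)
        }
        /<service-group name="standalone"/ { print host; exit }
    ' "$1"
}
```

Against the original file above this prints `vrlimn01.spiesr.com`. An IP address here means the node already refers to the standalone node by address.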

The really strange thing is that I have been through all of the documentation and searched high and low on the internet for a fix, yet I could not find anyone else who had documented this issue. None of the VMware PSO and GSS staff working alongside us on site had seen it before either. So I had to go away and do some testing.

After some testing and digging, I found that changing the “standalone” node address from the FQDN to the node’s IP address makes the issue go away.

The fix:

1. Using the Admin UI, place the worker nodes in maintenance mode
2. Stop the Log Insight service on each of the worker nodes
service loginsight stop

3. Stop the Log Insight service on the master node
service loginsight stop

4. On each node, edit the configuration file and change the host of the standalone node from the FQDN to the IP address
vi /storage/core/loginsight/config/loginsight-config.xml#NN

(where NN is the highest number among the loginsight-config.xml#NN files)

  <distributed overwrite-children="true">
    <daemon host="192.168.1.33" port="16520" token="d015a445-76c0-42a4-807c-c68f1485642c">
      <service-group name="standalone" />
    </daemon>
    <daemon host="192.168.1.34" port="16520" token="f3a3d23d-8d37-4e15-a4ee-451044841cbd">
      <service-group name="workernode" />
    </daemon>
  </distributed>

5. Start the Log Insight service on the master node
service loginsight start

6. Start the Log Insight service on each worker node
service loginsight start

7. With the configuration file changed and the services restarted, log in to the Admin UI and re-apply the license key by removing it and then adding it back into Log Insight
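The edit in step 4 can also be scripted, which helps when there are several nodes to change. This is a sketch under a few assumptions. Both function names are hypothetical. The backup goes outside the config directory purely as a precaution, since I have not tested whether Log Insight minds extra files there. The Log Insight service must already be stopped as in steps 2 and 3.

```shell
# Hypothetical helper: print the loginsight-config.xml#NN file with the highest NN.
latest_config() {
    dir="${1:-/storage/core/loginsight/config}"
    best=""; best_n=-1
    for f in "$dir"/loginsight-config.xml#*; do
        n=${f##*\#}                                   # text after the last '#'
        case "$n" in ''|*[!0-9]*) continue ;; esac    # skip non-numeric suffixes
        if [ "$n" -gt "$best_n" ]; then best="$f"; best_n="$n"; fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# Hypothetical helper: replace host="<fqdn>" with host="<ip>", keeping a backup.
# Arguments: config file, FQDN, IP, optional backup directory (default /tmp).
set_standalone_ip() {
    cfg="$1"; fqdn="$2"; ip="$3"; backup_dir="${4:-/tmp}"
    mkdir -p "$backup_dir" || return 1
    backup="$backup_dir/$(basename "$cfg").bak"
    cp -p "$cfg" "$backup" || return 1
    # Escape dots so sed matches the FQDN literally.
    pattern=$(printf '%s' "$fqdn" | sed 's/\./\\./g')
    sed "s/host=\"$pattern\"/host=\"$ip\"/" "$backup" > "$cfg"
}

# Example with the lab values from this post (substitute your own):
#   set_standalone_ip "$(latest_config)" vrlimn01.spiesr.com 192.168.1.33
```

Only the host attribute that exactly matches the FQDN is rewritten, so the worker entry and both tokens are left alone. Check the result against the backup with diff before starting the services again.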

The cluster status in the UI view will show (I know I only have 2 nodes where 3 is recommended, but this was just a quick test):

Messages now arrive as:

Errors in the runtime event logs relating to the “standalone” node not being contactable no longer appear.
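To confirm this on your own nodes, you can count only the failures logged after the restart. The sketch below relies on the timestamp format seen in my runtime.log (`[YYYY-MM-DD HH:MM:SS...` at the start of the line, in UTC). `failures_since` is a hypothetical helper name.

```shell
# Hypothetical helper: count "Failed to send alert to standalone" entries
# logged at or after a given time.
# Arguments: "YYYY-MM-DD HH:MM:SS" (UTC), then the log path (optional).
failures_since() {
    since="$1"; log="${2:-/storage/var/loginsight/runtime.log}"
    awk -v since="$since" '
        /Failed to send alert to standalone/ {
            ts = substr($0, 2, 19)    # the 19 characters after the leading "["
            if (ts >= since) n++      # plain string comparison works for this format
        }
        END { print n + 0 }
    ' "$log"
}

# Example: failures_since "2015-09-16 18:00:00"
```

A result of 0 after a few alert cycles suggests the relay to the standalone node is working again.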

[UPDATE 18 September 2015 13:52 BST]:

VMware has now confirmed that this change is supported. I expect VMware to release a KB article for this issue soon.

Last modified on Monday, 06 June 2016 10:11