Self-healing

Introduction

Self-healing provides the capability to autonomously recover parts of the system, at both the hardware and software level, that have been affected by failures or abnormal states. If necessary, it can also restart the system and return it to its pre-established routine schedule.

Features

Place in architecture

Self-healing is one of the two self-* capabilities that can interact directly with the hardware of an Infrastructure Element (IE), as depicted in Figure 1.


Figure 1: aerOS self-* capabilities

User guide

1. Sensor Failure

  • Scenario: A sensor failure can be identified by reading the values that the sensor provides to the device/RPi:
    • No measurement at the device: this indicates a failure in the sensing part (given that all other functionalities are normal). In this case, self-healing would report the failure to self-diagnose, since the RPi is an IE with limited capabilities in aerOS nomenclature (as a first step, it will just print a message).

    • A sensor measurement that is flagged as an outlier by an internal procedure in the device or in the diagnosis component.

  • Action: Healing in this scenario could be applied by creating and sending alert messages so that the sensor is excluded from the set of sensors that provide input to the system.
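The two sensor-failure conditions can be illustrated with a minimal Python sketch. It is not the component's actual implementation: the function name `check_sensor`, the z-score outlier test, the threshold of 3.0, and the alert fields are all assumptions made for illustration.

```python
import json
import statistics
from typing import Optional, Sequence


def _is_outlier(reading: float, history: Sequence[float], z_threshold: float) -> bool:
    """Assumed outlier rule: z-score of the reading against recent history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # All past readings are identical, so any different value is suspicious.
        return reading != mean
    return abs(reading - mean) / stdev > z_threshold


def check_sensor(sensor_id: str,
                 reading: Optional[float],
                 history: Sequence[float],
                 z_threshold: float = 3.0) -> Optional[str]:
    """Return a JSON alert string if the sensor looks faulty, otherwise None.

    `reading` is None when no measurement arrived at the device.
    """
    if reading is None:
        reason = "no_measurement"
    elif len(history) >= 2 and _is_outlier(reading, history, z_threshold):
        reason = "outlier"
    else:
        return None
    # Hypothetical alert format asking the system to exclude the sensor.
    return json.dumps({"sensor_id": sensor_id,
                       "reason": reason,
                       "action": "exclude_sensor"})


if __name__ == "__main__":
    alert = check_sensor("temp-1", None, [])
    if alert is not None:
        # As a first step, the alert is only printed.
        print(alert)
```

In a real deployment the alert would be handed to the diagnosis component rather than printed, and the outlier rule could live either on the device or in that component.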

2. Device Power Alert

  • Scenario: Similarly to scenario 1, the power levels of the device can be measured and reported. Unlike in scenario 1, the stimulus comes from the device itself, and the potential failure is more severe because it affects the entire IE rather than a part of it (e.g., one of the sensors).

  • Action: Healing in this scenario could be applied by creating and sending alert messages that request recharging or battery replacement.
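A power check of this kind could look like the following Python sketch. The thresholds (30% and 10%), the `PowerAlert` structure, and the function name `check_power` are hypothetical and chosen only to show the idea; how the battery level is actually read depends on the device.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PowerAlert:
    """Hypothetical alert emitted when the device power level is low."""
    device_id: str
    level_percent: float
    severity: str   # "warning" or "critical"
    action: str


def check_power(device_id: str,
                level_percent: float,
                warn_below: float = 30.0,
                critical_below: float = 10.0) -> Optional[PowerAlert]:
    """Return an alert if the reported power level is below a threshold."""
    if not 0.0 <= level_percent <= 100.0:
        raise ValueError(f"power level out of range: {level_percent}")
    if level_percent < critical_below:
        severity = "critical"
    elif level_percent < warn_below:
        severity = "warning"
    else:
        return None
    return PowerAlert(device_id, level_percent, severity,
                      action="recharge_or_replace_battery")


if __name__ == "__main__":
    print(check_power("rpi-1", 8.5))
```

Because the failure concerns the whole IE, a critical alert would typically be sent while the device still has enough power to transmit it.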

3. Network Protocol Violation

  • Scenario: A link-level protocol may operate in unlicensed bands (e.g., WiFi, LoRa) and may therefore be subject to Duty Cycle (DC) limitations. Monitoring agents could be deployed at the gateway (GW) to check for potential violations and command the GW to reconfigure the DC value. Typical DC values include 0.1%, 1%, and 10%. We envisage network protocol violations as a family of abnormal scenarios, and we see great value in detecting such problems.

  • Action: Healing in this scenario could be applied by enforcing reconfiguration of the IE.
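As an illustration of duty-cycle monitoring, the Python sketch below tracks transmission airtime over a sliding window and compares it with a configured limit. It assumes that the monitoring agent can observe the start time and airtime of each transmission; the class name `DutyCycleMonitor`, the one-hour window, and the simplification of attributing each transmission to the window by its start time are assumptions, not part of the component.

```python
from collections import deque
from typing import Deque, Tuple


class DutyCycleMonitor:
    """Hypothetical sliding-window duty-cycle monitor for a gateway."""

    def __init__(self, limit: float, window_s: float = 3600.0) -> None:
        self.limit = limit          # e.g. 0.01 for a 1% duty cycle
        self.window_s = window_s    # observation window in seconds
        self._tx: Deque[Tuple[float, float]] = deque()  # (start_s, airtime_s)

    def record(self, start_s: float, airtime_s: float) -> None:
        """Register one observed transmission."""
        self._tx.append((start_s, airtime_s))

    def usage(self, now_s: float) -> float:
        """Fraction of the window spent transmitting, as seen at `now_s`."""
        # Drop transmissions that started before the window began.
        while self._tx and self._tx[0][0] < now_s - self.window_s:
            self._tx.popleft()
        return sum(airtime for _, airtime in self._tx) / self.window_s

    def violated(self, now_s: float) -> bool:
        """True if the observed duty cycle exceeds the configured limit."""
        return self.usage(now_s) > self.limit


if __name__ == "__main__":
    monitor = DutyCycleMonitor(limit=0.01)   # 1% of 3600 s = 36 s of airtime
    monitor.record(start_s=0.0, airtime_s=20.0)
    monitor.record(start_s=100.0, airtime_s=10.0)
    monitor.record(start_s=150.0, airtime_s=10.0)
    if monitor.violated(now_s=200.0):
        print("duty cycle exceeded: request reconfiguration")
```

With a 1% limit and a one-hour window, the airtime budget is 36 seconds, so the 40 seconds recorded above count as a violation that would trigger the reconfiguration action.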

5. Communication Failure Indication (no messages received by IE)

  • Scenario: This is a critical failure that cannot be addressed easily, especially if communication is lost due to network or hardware issues on the IE side. However, an indication of communication failure does not necessarily mean that something is wrong: the IE may simply have nothing to send.

  • Action: Possibly, a dedicated channel (e.g., a WiFi connection) could be set up to send polling (check-if-alive) messages to the targeted IE.
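The distinction between a silent IE and an unreachable one can be sketched in Python as follows. The function name `classify_silence`, the three result labels, the retry count, and the `probe` callable (standing in for a check-if-alive message over the dedicated channel) are assumptions introduced for illustration.

```python
from typing import Callable


def classify_silence(last_message_s: float,
                     now_s: float,
                     silence_timeout_s: float,
                     probe: Callable[[], bool],
                     retries: int = 3) -> str:
    """Classify an IE as "ok", "idle", or "unreachable".

    `probe` is a hypothetical callable that sends one check-if-alive
    message over the dedicated channel and returns True on a reply.
    """
    if now_s - last_message_s <= silence_timeout_s:
        return "ok"            # messages are still arriving normally
    for _ in range(retries):
        if probe():
            return "idle"      # alive, it simply had nothing to send
    return "unreachable"       # no reply on the dedicated channel either


if __name__ == "__main__":
    # Simulated probe that never gets a reply.
    state = classify_silence(last_message_s=0.0, now_s=600.0,
                             silence_timeout_s=300.0, probe=lambda: False)
    print(state)
```

Only the "unreachable" outcome would be reported as a communication failure; "idle" covers the case where the IE is healthy but had no data to transmit.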

Prerequisites

Installation

Configuration options

Developer guide

Authors

License

The software is licensed under the Apache License, Version 2.0.

Notice (dependencies)