Self-healing
Introduction
Self-healing is the capability to autonomously recover parts of the system, at both the hardware and software level, that have been affected by failures or abnormal states. If necessary, it can also restart the system and return it to its pre-established scheduled routines.
Features
The aerOS Self-Healing module provides automated recovery mechanisms to detect and mitigate failures in Infrastructure Elements (IEs). It continuously monitors system health and applies corrective actions when failures occur.
The module currently monitors the following scenarios:
1. Sensor Failure
Scenario This can be identified by reading the values that the sensor provides to the device/RPi:
- No measurement at the device: this indicates a failure in the sensing part, given that all other functionalities are normal. In this case, self-healing reports the failure to the Trust-Manager, considering that the RPi is an IE with limited capabilities in aerOS nomenclature (initially, it will just print a message).
- A sensor measurement that is flagged as an outlier, through an internal procedure in the device or in the diagnosis component.
Action Healing in this scenario could be applied by creating and sending alert messages for excluding the sensor from the set of those that provide input to the system.
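The two checks in this scenario can be sketched as follows. This is only an illustration under stated assumptions, not the module's actual procedure: the function name `classify_reading`, the z-score rule, and the default threshold of 3.0 are hypothetical, since the text only says that outliers are flagged "through an internal procedure".

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def classify_reading(
    value: Optional[float],
    history: Sequence[float],
    z_threshold: float = 3.0,  # assumed cut-off, not taken from aerOS
) -> str:
    """Return 'no_measurement', 'outlier' or 'ok' for one sensor reading."""
    if value is None:
        # Nothing arrived from the sensor while the device itself is running.
        return "no_measurement"
    if len(history) < 2:
        # Too little history to estimate the spread; accept the reading.
        return "ok"
    spread = stdev(history)
    if spread == 0:
        # Constant history: any different value counts as an outlier.
        return "ok" if value == history[0] else "outlier"
    z_score = abs(value - mean(history)) / spread
    return "outlier" if z_score > z_threshold else "ok"


history = [21.0, 21.4, 20.9, 21.2, 21.1]
print(classify_reading(None, history))  # no_measurement
print(classify_reading(21.3, history))  # ok
print(classify_reading(35.0, history))  # outlier
```

A reading classified as `no_measurement` or `outlier` would then trigger the alert message described in the action above.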
2. Device Power Alert
Scenario Similarly to scenario 1, the power levels of the device can be measured and reported. Compared to scenario 1, the stimulus comes from the device itself and the potential failure is more severe, since it affects the entire IE component and not just a part of it (e.g., one of the sensors).
Action Healing in this scenario could be applied by creating and sending alert messages for recharging / battery replacement.
3. Network Protocol Violation
Scenario A link-level protocol may operate in unlicensed bands (e.g., WiFi, LoRa) and may therefore be subject to Duty Cycle (DC) limitations. We could set monitoring agents at the GW to check for potential violations and command the GW to reconfigure the DC value. Typical values of DC include 0.1%, 1%, and 10%. We envisage that the network violation scenario could be treated as a family of abnormal scenarios, and we see great value in detecting such problems.
Action Healing in this scenario could be applied by enforcing reconfiguration of the IE.
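A duty-cycle check of the kind a monitoring agent at the GW could run might look like the sketch below. It assumes the agent has a log of transmissions as (start time, airtime) pairs in seconds; the function names and the log format are hypothetical and not part of the aerOS API.

```python
from typing import Iterable, Tuple


def duty_cycle(
    transmissions: Iterable[Tuple[float, float]],
    window_start: float,
    window_length: float,
) -> float:
    """Fraction of the observation window spent transmitting.

    Each transmission is a (start_time, airtime) pair in seconds.
    """
    window_end = window_start + window_length
    busy = 0.0
    for start, airtime in transmissions:
        # Count only the part of the transmission that overlaps the window.
        overlap = min(start + airtime, window_end) - max(start, window_start)
        if overlap > 0:
            busy += overlap
    return busy / window_length


def violates_limit(measured: float, limit: float) -> bool:
    """True when the measured duty cycle exceeds the allowed fraction."""
    return measured > limit


# One-hour window with a 1% limit allows at most 36 s of airtime.
tx_log = [(0.0, 10.0), (600.0, 15.0), (1800.0, 20.0)]
dc = duty_cycle(tx_log, window_start=0.0, window_length=3600.0)
print(f"{dc:.4%}", violates_limit(dc, 0.01))  # 1.2500% True
```

Here 45 s of airtime in one hour exceeds the 1% limit, so the agent would request a reconfiguration of the IE.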
4. Link Quality Issues
Scenario In this scenario, the radio values of the IE communication link (e.g., to a Gateway, Base Station, or Access Point) are reported and stored. Once these values drop below an expected threshold (which is decided based on past values), this is reported to the Trust-Manager.
Action The healing can be applied by sending commands to the GW to reconfigure link parameters such as the spreading factor (SF) and the data rate.
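One way to derive a threshold from past values is to place it a few standard deviations below the historical mean. The sketch below assumes RSSI in dBm as the radio value and a factor `k = 2.0`; both choices, as well as the function names, are illustrative assumptions rather than the rule the module uses.

```python
from statistics import mean, stdev
from typing import Sequence


def link_threshold(past_rssi: Sequence[float], k: float = 2.0) -> float:
    """Lower bound for acceptable RSSI (dBm), derived from past values.

    Needs at least two past values so that the spread can be estimated.
    """
    return mean(past_rssi) - k * stdev(past_rssi)


def link_degraded(
    current_rssi: float, past_rssi: Sequence[float], k: float = 2.0
) -> bool:
    """True when the current value falls below the history-based threshold."""
    return current_rssi < link_threshold(past_rssi, k)


past = [-92.0, -95.0, -93.0, -94.0, -96.0]
print(round(link_threshold(past), 2))  # -97.16
print(link_degraded(-94.0, past))      # False
print(link_degraded(-105.0, past))     # True
```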
5. Communication Failure Indication (no messages received by IE)
Scenario This is a critical failure that cannot be addressed easily, especially if the communication is lost due to network or hardware issues at the IE side. However, an indication of communication failure may also appear when there is no actual issue, e.g., because the IE has nothing to send.
Action Possibly, we could set up a dedicated channel (e.g., a WiFi connection) to send polling (check-if-alive) messages to the targeted IE.
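The distinction between a silent IE and a failed IE can be sketched as a two-step check: first a silence timeout, then a poll over the dedicated channel. The function `find_unreachable`, the timeout value, and the `poll` callback are hypothetical; `fake_poll` merely stands in for a real check-if-alive probe.

```python
from typing import Callable, Dict, List


def find_unreachable(
    last_seen: Dict[str, float],
    now: float,
    silence_timeout: float,
    poll: Callable[[str], bool],
) -> List[str]:
    """Return MAC addresses of IEs that are silent and do not answer a poll."""
    unreachable = []
    for mac, seen_at in last_seen.items():
        if now - seen_at <= silence_timeout:
            continue  # a message was received recently
        # Silence alone is ambiguous: the IE may simply have nothing to send.
        if not poll(mac):
            unreachable.append(mac)
    return unreachable


def fake_poll(mac: str) -> bool:
    # Stand-in for an "are you alive?" probe over the dedicated channel.
    return mac != "aa:bb:cc:00:00:02"


last_seen = {
    "aa:bb:cc:00:00:01": 950.0,  # recently active
    "aa:bb:cc:00:00:02": 100.0,  # silent and not answering the poll
    "aa:bb:cc:00:00:03": 200.0,  # silent but alive
}
print(find_unreachable(last_seen, now=1000.0, silence_timeout=300.0, poll=fake_poll))
# ['aa:bb:cc:00:00:02']
```

Only the IE that is both silent and unresponsive to polling is reported, which avoids raising an alert for a device that is merely idle.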
Place in architecture
Self-healing is one of the two self-* capabilities that can interact directly with IE’s hardware, as depicted in Figure 1.
Figure 1: aerOS self-* capabilities
User guide
This component provides a FastAPI-based REST API interface that allows retrieving self-healing alerts. The API returns structured alert messages, enabling other components to retrieve failure events efficiently.
The following table describes the available API endpoint:
| Method | Endpoint | Description | Payload (if needed) | Response format |
|---|---|---|---|---|
| GET | /alerts | Returns self-healing alert messages since a given timestamp | since: Optional timestamp (e.g., "2024-02-19T10:00:00") | `{ "alerts": [ { "timestamp": "YYYY-MM-DDTHH:MM:SS.ssssss", "scenario": "<string>", "message": "<string>", "mac_address": "<string>" } ] }` |
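A client for the /alerts endpoint could be sketched as below, using only the Python standard library. Several details are assumptions rather than documented facts: the base URL `http://localhost:8000`, passing `since` as a query parameter, and the sample values in the example response (only the field names come from the table above).

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumed host and port


def alerts_url(base_url: str, since: Optional[str] = None) -> str:
    """Build the /alerts URL, adding `since` as a query parameter if given."""
    url = f"{base_url.rstrip('/')}/alerts"
    if since is not None:
        url += "?" + urlencode({"since": since})
    return url


def parse_alerts(body: str) -> list:
    """Extract the list of alerts from a JSON response body."""
    return json.loads(body).get("alerts", [])


def fetch_alerts(base_url: str = BASE_URL, since: Optional[str] = None) -> list:
    """Call the running module; requires the API to be reachable."""
    with urlopen(alerts_url(base_url, since), timeout=10) as response:
        return parse_alerts(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body.
print(alerts_url(BASE_URL, "2024-02-19T10:00:00"))
sample_body = json.dumps({
    "alerts": [{
        "timestamp": "2024-02-19T10:05:00.000000",
        "scenario": "example scenario",     # placeholder value
        "message": "example message",       # placeholder value
        "mac_address": "aa:bb:cc:00:00:01",  # placeholder value
    }]
})
for alert in parse_alerts(sample_body):
    print(alert["timestamp"], alert["mac_address"], alert["message"])
```

With the module running, `fetch_alerts(since="2024-02-19T10:00:00")` would return the same kind of list from the live API.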
Prerequisites
To start using the self-healing module, please visit the Common deployments repository for more information.
The module itself does not require other components to be deployed first (i.e., deploying self-adaptation or self-optimization before self-healing is not needed, and their absence will not trigger any errors). Once deployed, it automatically starts monitoring IoT devices directly connected to Infrastructure Elements (IEs) and detecting abnormal states associated with typical IoT scenarios.
For complete communication, the component should be deployed in an IE that also runs self-api and Trust-Manager.
Installation
Configuration options
Developer guide
To test the module locally:
Download or clone the self-healing repository:
git clone https://gitlab.aeros-project.eu/wp3/t3.5/self-healing
Create and activate a Python virtual environment:
python3 -m venv self-healing-env
source self-healing-env/bin/activate
Install the required dependencies:
pip install -r requirements.txt
Run the module:
python src/self_healing_app/main.py
Notes
If you use a different command to run Python (e.g., python instead of python3), adjust the commands accordingly. If you are using a different OS, you might need a different command to activate the virtual environment.
License
The software is licensed under Apache License v2.0