#####################
Self-healing
#####################

.. contents::
   :local:
   :depth: 1

Introduction
============

Self-healing provides the capability of autonomously recovering parts of the system, at both the hardware and the software level, that are affected by failures or abnormal states. If necessary, it can also restart the system to resume its pre-established routine scheduling.

Features
========

The aerOS Self-Healing module provides automated recovery mechanisms to detect and mitigate failures in Infrastructure Elements (IEs). It continuously monitors system health and applies corrective actions when failures occur. The module currently monitors the following scenarios:

1. Sensor Failure
-----------------

- **Scenario**

  This can be identified by reading the values that the sensor provides to the device (e.g., a Raspberry Pi, RPi):

  - No measurement at the device: this indicates a failure in the sensing part, given that all the other functionalities are normal. In this case, the failure would be reported by self-healing to the Trust-Manager, considering that the RPi is an IE with limited capabilities in aerOS nomenclature (as a start, it will just print a message).
  - A sensor measurement that is flagged as an outlier, through an internal procedure in the device or in the diagnosis component.

- **Action**

  Healing in this scenario could be applied by creating and sending alert messages to exclude the sensor from the set of sensors that provide input to the system.

2. Device Power Alert
---------------------

- **Scenario**

  Similarly to scenario 1, the power levels of the device can be measured and reported. Compared to scenario 1, the stimulus comes from the device itself and the potential failure is more severe, since it concerns the entire IE and not just a part of it (e.g., one of the sensors).

- **Action**

  Healing in this scenario could be applied by creating and sending alert messages requesting recharging or battery replacement.

3. Network Protocol Violation
-----------------------------

- **Scenario**

  A link-level protocol may operate in unlicensed bands (e.g., WiFi, LoRa); thus, it may be subject to Duty Cycle (DC) limitations. Monitoring agents could be set at the gateway (GW) to check for potential violations and command the GW to reconfigure the DC value. Typical values of DC include 0.1%, 1%, and 10%. We envisage that the network violation scenario could be defined as a family of abnormal scenarios, and we see great value in detecting such problems.

- **Action**

  Healing in this scenario could be applied by enforcing reconfiguration of the IE.

4. Link Quality Issues
----------------------

- **Scenario**

  In this scenario, the radio values of the IE communication (e.g., to a Gateway, Base Station, or Access Point) are reported and stored. Once these values drop below an expected threshold (the threshold is decided based on past values), this is reported to the Trust-Manager.

- **Action**

  Healing can be applied by sending commands to the GW to reconfigure link parameters such as the spreading factor (SF) and the rate.

5. Communication Failure Indication (no messages received by IE)
----------------------------------------------------------------

- **Scenario**

  This is a critical failure that cannot be addressed easily, especially if the communication is lost due to network or hardware issues at the IE side. However, an indication of communication failure may also appear when nothing is wrong, e.g., because the IE has nothing to send.

- **Action**

  Possibly, a dedicated channel (e.g., a WiFi connection) could be set up for polling messages (checking whether the targeted IE is alive).

Place in architecture
=====================

Self-healing is one of the two self-* capabilities that can interact directly with the IE's hardware, as depicted in Figure 1.

.. image:: ./self_capabilities_relationships.png
   :alt: aerOS self-* capabilities
   :align: center

*Figure 1: aerOS self-\* capabilities*

User guide
==========

This component provides a FastAPI-based REST API that allows retrieving self-healing alerts. The API returns structured alert messages, enabling other components to retrieve failure events efficiently.

The following table describes the available API endpoint:

+--------+----------+----------------------------+------------------------+--------------------------------+
| Method | Endpoint | Description                | Payload (if needed)    | Response format                |
+========+==========+============================+========================+================================+
| GET    | /alerts  | Returns self-healing alert | ``since``: Optional    | { "alerts": [ { "timestamp":   |
|        |          | messages since a given     | timestamp (e.g.,       | "YYYY-MM-DDTHH:MM:SS.ssssss",  |
|        |          | timestamp                  | "2024-02-19T10:00:00") | "scenario": "", "message": "", |
|        |          |                            |                        | "mac_address": "" } ] }        |
+--------+----------+----------------------------+------------------------+--------------------------------+

Prerequisites
=============

To start using the self-healing module, please visit the Common deployments repository for more information. The module itself does not require other components to be deployed first (i.e., deploying *self-adaptation* or *self-optimization* before *self-healing* is not necessary, and doing so will not trigger any errors).
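
As a quick check after deployment, the ``/alerts`` endpoint described in the user guide can be queried from Python. The following is a minimal sketch and not part of the module: the base URL ``http://localhost:8000`` and the helper names ``build_alerts_url``, ``parse_alerts`` and ``fetch_alerts`` are hypothetical, and it assumes that ``since`` is passed as a query parameter.

.. code-block:: python

   import json
   from typing import Optional
   from urllib.parse import urlencode
   from urllib.request import urlopen

   # Assumption: hypothetical host and port; use the address of your deployment.
   BASE_URL = "http://localhost:8000"


   def build_alerts_url(base_url: str, since: Optional[str] = None) -> str:
       """Build the /alerts URL, assuming ``since`` is a query parameter."""
       url = base_url.rstrip("/") + "/alerts"
       if since is not None:
           # urlencode percent-encodes the colons in the timestamp.
           url += "?" + urlencode({"since": since})
       return url


   def parse_alerts(body: str) -> list:
       """Extract the list of alert dictionaries from a JSON response body."""
       return json.loads(body).get("alerts", [])


   def fetch_alerts(base_url: str = BASE_URL, since: Optional[str] = None) -> list:
       """Query a running module and return its alerts (needs network access)."""
       with urlopen(build_alerts_url(base_url, since), timeout=5) as response:
           return parse_alerts(response.read().decode("utf-8"))

With the module running, a call such as ``fetch_alerts(since="2024-02-19T10:00:00")`` would return the alerts as a list of dictionaries with the ``timestamp``, ``scenario``, ``message`` and ``mac_address`` keys listed in the user guide.
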
Once deployed, it will automatically start monitoring the IoT devices directly connected to Infrastructure Elements (IEs) and detecting abnormal states associated with typical IoT scenarios. For complete communication, the component should be deployed within the IE, working with *self-api* and *Trust-manager*.

Installation
============

Configuration options
=====================

Developer guide
===============

To test the module locally:

1. **Download or clone the self-healing repository**:

   .. code-block:: bash

      git clone https://gitlab.aeros-project.eu/wp3/t3.5/self-healing

2. **Create and activate a Python virtual environment**:

   .. code-block:: bash

      python3 -m venv self-healing-env
      source self-healing-env/bin/activate

3. **Install the required dependencies**:

   .. code-block:: bash

      pip install -r requirements.txt

4. **Run the module**:

   .. code-block:: bash

      python src/self_healing_app/main.py

Notes
-----

If you use a different command to run Python (e.g., ``python`` instead of ``python3``), proceed accordingly.

If you are using a different OS, you might need a different command to activate the virtual environment.

Authors
=======

Fogus Innovations & Services

License
=======

The software is licensed under Apache License v2.0.

Notice (dependencies)
=====================