#####################
AI Local Execution
#####################

.. contents::
   :local:
   :depth: 1

Introduction
============
The AI Local Executor is a service that enables the execution of AI tasks on Infrastructure Elements. It is part of a set of AI services (alongside the AI Task Controller, whose components are the FL Training Collector, FL Repository, and FL Controller) that should be deployed in order to run a distributed AI task. Ideally, all services should be deployed, but it is also possible to run a task directly on AI Local Execution by configuring it through its dedicated REST API.

Features
========
AI Local Execution encapsulates the functionality of a federated learning (FL) client: it connects to the training initiated by the FL server (FL Training Collector), periodically provides it with local weights and obtains new global weights, and downloads any necessary components from the FL Repository (database). Beyond the classic functionality of an FL client, AI Local Execution enables the local inference deployment of a selected model (which can function as a standalone container). It uses flexible configurations, basic format verification, and pluggable components.

Place in architecture
=====================
AI Local Execution is an auxiliary AI service that can be deployed using general aerOS mechanisms.

User guide
==========
TBD

Prerequisites
=============
The FL Repository must be running and properly initialized with any models or transformations that AI Local Execution might need.

Installation
============
AI Local Execution should be deployed along with the AI Task Controller components to provide the full functionality of FL training and local inference. The respective services expose REST APIs to allow communication among themselves and with external parties. AI Local Execution can be run using Docker Compose or deployed on a Kubernetes cluster with a dedicated Helm chart.

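The REST-based configuration mentioned above can be illustrated with a minimal Python sketch that prepares, but does not send, a request to the ``POST /job/config/{training_id}/`` endpoint described in the Configuration options section. The base address assumes the default local port (9050), the helper name ``build_job_config_request`` is hypothetical, and the payload is a heavily trimmed subset of the sample configuration, so it may not be accepted by the service as-is.

```python
import json
import urllib.request

# Assumption: default address of a locally running instance; adjust as needed.
BASE_URL = "http://127.0.0.1:9050"


def build_job_config_request(training_id, config, base_url=BASE_URL):
    """Prepare (without sending) a POST request for /job/config/{training_id}/."""
    url = f"{base_url.rstrip('/')}/job/config/{training_id}/"
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical, trimmed payload; the full set of keys is listed in the
# Configuration options section and the service may require more of them.
example_config = {
    "client_type_id": "local1",
    "server_address": "fl-training-collector-trainingmain-1",
    "num_rounds": 15,
    "training_id": "1",
    "model_name": "md_keras",
    "model_version": "v1",
}

request = build_job_config_request("1", example_config)
print(request.get_method(), request.full_url)

# Sending the request requires a running AI Local Execution instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes it possible to inspect the URL and body before any service is contacted.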

Running using Docker (locally)
------------------------------
The following command can be used in the terminal to build a new Docker image and start the containers:

.. code-block:: bash

   USER_INDEX=1 FL_LOCAL_OP_DATA_FOLDER="./data" docker compose up --force-recreate --build -d

Alternatively, the ``start-local-docker.sh`` script can be used to do this automatically:

.. code-block:: bash

   sh start-local-docker.sh [NUMBER OF LE]

Setting ``[NUMBER OF LE]`` to 1 will run one AI Local Execution instance. Once the containers are built and running, check them with the ``docker ps`` command. For one instance of AI Local Execution, the output should look like this:

+--------------+---------------------------------------------+-----+--------------+-----------------------------------------------------------------------+--------------------------+
| CONTAINER ID | IMAGE                                       | ... | STATUS       | PORTS                                                                 | NAME                     |
+==============+=============================================+=====+==============+=======================================================================+==========================+
| 8c4744c648c0 | aeros/ai\_local\_executor:latest            | ... | Up 5 minutes | 0.0.0.0:9050->9050/tcp, 0.0.0.0:30080->80/tcp, 0.0.0.0:9003->9000/tcp | appv0\-local\_executor-1 |
+--------------+---------------------------------------------+-----+--------------+-----------------------------------------------------------------------+--------------------------+
| 24964eafadc9 | aeros/ai\_local\_executor\_inference:latest | ... | Up 5 minutes | 0.0.0.0:9001->9000/tcp, 0.0.0.0:50052->50051/tcp                      | appv0-inferenceapp-1     |
+--------------+---------------------------------------------+-----+--------------+-----------------------------------------------------------------------+--------------------------+
| 57de2b0a09e4 | appv0-db                                    | ... | Up 5 minutes | 27017/tcp                                                             | db-0                     |
+--------------+---------------------------------------------+-----+--------------+-----------------------------------------------------------------------+--------------------------+

The Swagger documentation of the REST API should be available at http://127.0.0.1:9050/docs (if the default port configuration is preserved).

Note 1: When running using Docker, make sure that all other containers (FL Local Execution, FL Repository, FL Training Collector) that are to be used to run the FL task are in the same network. The following commands can be used:

- ``docker inspect -f '{{range $key, $value := .NetworkSettings.Networks}}{{$key}} {{end}}' [CONTAINER_ID]`` - check the network of a given container
- ``docker network inspect -f '{{range .Containers}}{{.Name}} {{end}}' [NETWORK e.g. appv0_default]`` - check all containers in a given network
- ``docker network connect [NETWORK] [CONTAINER_ID]`` - add a given container to a given network

Note 2: Before starting the training, at least the FL Repository should be started and properly initialized with any required transformations and model metadata. For more details, see the FL Repository documentation.

Deployment on Kubernetes
------------------------
The AI Local Execution service has been developed with the assumption that it will be deployed on a Kubernetes cluster with a dedicated Helm chart. To do so, run ``helm install ailocalexecution``. If multiple AI Local Execution instances are to be deployed in one Kubernetes cluster, a different name should be chosen for each deployment. If only the inference component should be deployed, run ``helm install fllocalexecutionlocal fllocalexecution --set inferenceapp.fullDeployment.enabled=false``.

To make sure that the service has been configured properly, check the 3 ConfigMaps that are deployed alongside the service. Their names change depending on the name of the deployment.

1. The ConfigMap whose name starts with ``ailocalexec-config-map`` contains the environment variables necessary to deploy the AI Local Execution instance. Check especially the ``REPOSITORY_ADDRESS`` field (the address of the FL Repository instance) and the ``CONTROLLER_SVR_ADDRESS`` field (the address of the FL Controller).
2. The ConfigMap whose name starts with ``fltraining-config-map`` describes the configuration necessary to run the training app component with pluggable transformations.
3. The ConfigMap whose name starts with ``flinference-config-map`` is used to flexibly set and change the configuration of the inference component.

Configuration options
=====================

- **POST /model/** Receive new training model metadata for local storage.
- **PUT /model/{name}/{version}** Update the weights and structure of the locally stored training model.
- **GET /job/status** Get the statuses of the current jobs.
- **GET /job/total** Get the number of currently running jobs.
- **GET /capabilities** Get the computational capabilities of the machine that AI Local Execution is running on.
- **GET /format** Get the format of the data that a given instance currently has access to.

To initiate the training, a JSON document with the following configuration should be sent to the endpoint shown below. The most important keys and their meaning are explained further down.

**POST /job/config/{training_id}/**

.. code-block:: json

   {
     "client_type_id": "string",
     "server_address": "string",
     "eval_metrics": ["string"],
     "eval_func": "string",
     "num_classes": 0,
     "num_rounds": 0,
     "shape": [0],
     "training_id": 0,
     "model_name": "string",
     "model_version": "string",
     "config": [
       {
         "config_id": "string",
         "batch_size": 0,
         "steps_per_epoch": 0,
         "epochs": 0,
         "learning_rate": 0
       }
     ],
     "optimizer_config": {
       "optimizer": "string",
       "lr": 0,
       "rho": 0,
       "eps": 0,
       "foreach": true,
       "maximize": true,
       "lr_decay": 0,
       "betas": ["string", "string"],
       "etas": ["string", "string"],
       "step_sizes": ["string", "string"],
       "lambd": 0,
       "alpha": 0,
       "t0": 0,
       "max_iter": 0,
       "max_eval": 0,
       "tolerance_grad": 0,
       "tolerance_change": 0,
       "history_size": 0,
       "line_search_fn": "string",
       "momentum_decay": 0,
       "dampening": 0,
       "centered": true,
       "nesterov": true,
       "momentum": 0,
       "weight_decay": 0,
       "amsgrad": true,
       "learning_rate": 0,
       "name": "string",
       "clipnorm": 0,
       "global_clipnorm": 0,
       "use_ema": true,
       "ema_momentum": 0,
       "ema_overwrite_frequency": 0,
       "jit_compile": true,
       "epsilon": 0,
       "clipvalue": 0,
       "initial_accumulator_value": 0,
       "beta_1": 0,
       "beta_2": 0,
       "beta_2_decay": 0,
       "epsilon_1": 0,
       "epsilon_2": 0,
       "learning_rate_power": 0,
       "l1_regularization_strength": 0,
       "l2_regularization_strength": 0,
       "l2_shrinkage_regularization_strength": 0,
       "beta": 0
     },
     "scheduler_config": {
       "scheduler": "string",
       "step_size": 0,
       "gamma": 0,
       "last_epoch": 0,
       "verbose": true,
       "milestones": [0],
       "factor": 0,
       "total_iters": 0,
       "start_factor": 0,
       "end_factor": 0,
       "monitor": "string",
       "min_delta": 0,
       "patience": 0,
       "mode": "string",
       "baseline": 0,
       "restore_best_weights": true,
       "start_from_epoch": 0,
       "cooldown": 0,
       "min_lr": 0
     },
     "warmup_config": {
       "scheduler": "string",
       "warmup_iters": 0,
       "warmup_epochs": 0,
       "warmup_factor": 0,
       "scheduler_conf": {
         "scheduler": "string",
         "step_size": 0,
         "gamma": 0,
         "last_epoch": 0,
         "verbose": true,
         "milestones": [0],
         "factor": 0,
         "total_iters": 0,
         "start_factor": 0,
         "end_factor": 0,
         "monitor": "string",
         "min_delta": 0,
         "patience": 0,
         "mode": "string",
         "baseline": 0,
         "restore_best_weights": true,
         "start_from_epoch": 0,
         "cooldown": 0,
         "min_lr": 0
       }
     },
     "privacy-mechanisms": {}
   }

The definitions:

- **client_type_id** Specifies the ID of the client. For testing purposes, the keyword "base" bypasses the pluggability modules of the PyTorch builder.
- **server_address** The address of the Flower server that the FL client should try to connect to.
- **eval_metrics** The evaluation metrics that the FL client will gather during the evaluation process.
- **eval_func** The evaluation function that the model will use as the loss throughout the training process.
- **num_classes** The number of classes in classification problems.
- **num_rounds** The number of rounds that the training should run for.
- **shape** The shape of the data. Currently, it is recommended to change this parameter through the ConfigMaps instead.
- **training_id** The ID of the training process being conducted.
- **model_name** The name of the model that will be used in the training. The name should be the same as the one stored in the FL Repository.
- **model_version** The version of the model that will be used in the training. The version should be the same as the one stored in the FL Repository.
- **config** The configuration specifying how the FL training process will be conducted on the client, containing important parameters such as the batch size or the learning rate.
- **optimizer_config** The configuration of the optimizer.

  - **optimizer** The name of the optimizer. For the Keras model and client, the optimizer can be one of:

    .. code-block:: text

       "sgd": tf.keras.optimizers.SGD,
       "rmsprop": tf.keras.optimizers.RMSprop,
       "adam": tf.keras.optimizers.Adam,
       "adadelta": tf.keras.optimizers.Adadelta,
       "adagrad": tf.keras.optimizers.Adagrad,
       "adamax": tf.keras.optimizers.Adamax,
       "nadam": tf.keras.optimizers.Nadam,
       "ftrl": tf.keras.optimizers.Ftrl

    For the PyTorch model and client, the optimizer can be one of:

    .. code-block:: text

       "adadelta": torch.optim.Adadelta,
       "adagrad": torch.optim.Adagrad,
       "adam": torch.optim.Adam,
       "adamw": torch.optim.AdamW,
       "sparseadam": torch.optim.SparseAdam,
       "adamax": torch.optim.Adamax,
       "asgd": torch.optim.ASGD,
       "lbfgs": torch.optim.LBFGS,
       "nadam": torch.optim.NAdam,
       "radam": torch.optim.RAdam,
       "rmsprop": torch.optim.RMSprop,
       "rprop": torch.optim.Rprop,
       "sgd": torch.optim.SGD

  Other fields indicate the arguments that should be passed to the optimizer.

- **scheduler_config** The configuration of the scheduler.

  - **scheduler** The name of the scheduler. For the Keras model and client, the scheduler (more accurately, a Keras callback) can be one of:

    .. code-block:: text

       "earlystopping": tf.keras.callbacks.EarlyStopping,
       "reducelronplateau": tf.keras.callbacks.ReduceLROnPlateau,
       "terminateonnan": tf.keras.callbacks.TerminateOnNaN

    For the PyTorch model and client, the scheduler can be one of:

    .. code-block:: text

       "lambdalr": torch.optim.lr_scheduler.LambdaLR,
       "multiplicativelr": torch.optim.lr_scheduler.MultiplicativeLR,
       "steplr": torch.optim.lr_scheduler.StepLR,
       "multisteplr": torch.optim.lr_scheduler.MultiStepLR,
       "constantlr": torch.optim.lr_scheduler.ConstantLR,
       "linearlr": torch.optim.lr_scheduler.LinearLR,
       "exponentiallr": torch.optim.lr_scheduler.ExponentialLR,
       "cosineannealinglr": torch.optim.lr_scheduler.CosineAnnealingLR,
       "chainedscheduler": torch.optim.lr_scheduler.ChainedScheduler,
       "sequentiallr": torch.optim.lr_scheduler.SequentialLR,
       "reducelronplateau": torch.optim.lr_scheduler.ReduceLROnPlateau,
       "cycliclr": torch.optim.lr_scheduler.CyclicLR,
       "onecyclelr": torch.optim.lr_scheduler.OneCycleLR,
       "cosineannealingwarmrestarts": torch.optim.lr_scheduler.CosineAnnealingWarmRestarts

  Other fields indicate the arguments that should be passed to the scheduler.

- **warmup_config** The configuration of an (optional) warmup. This configuration is valid only for the PyTorch builder. It specifies a special scheduler that is used only for a selected number of epochs to provide warmup.

A sample test configuration can be seen here:

.. code-block:: json

   {
     "client_type_id": "local1",
     "server_address": "fl-training-collector-trainingmain-1",
     "eval_metrics": ["accuracy"],
     "eval_func": "categorical_crossentropy",
     "num_classes": 10,
     "num_rounds": 15,
     "shape": [32, 32, 3],
     "training_id": "1",
     "model_name": "md_keras",
     "model_version": "v1",
     "config": [
       {
         "config_id": "min_effort",
         "batch_size": "64",
         "steps_per_epoch": "32",
         "epochs": "1",
         "learning_rate": "0.001"
       }
     ],
     "optimizer_config": {
       "optimizer": "adam",
       "learning_rate": "0.005",
       "amsgrad": "True"
     },
     "scheduler_config": {
       "scheduler": "reducelronplateau",
       "factor": "0.5",
       "min_delta": "0.0003"
     },
     "privacy-mechanisms": {}
   }

Developer guide
===============
TBD

Authors
=======
The AI Local Execution service is a continuation of research conducted within the Horizon 2020 ASSIST-IoT project.

Systems Research Institute, Polish Academy of Sciences, Warsaw

License
=======
AI Local Execution is released under the Apache 2.0 license (available at http://www.apache.org/licenses/LICENSE-2.0), as we have internally concluded that we are not “offering the functionality of MongoDB, or modified versions of MongoDB, to third parties as a service”. However, potential future commercial adopters should be aware that our project uses MongoDB, so that they can accurately determine the license most applicable to their projects.

Notice (dependencies)
=====================