AI Local Execution

Introduction 

The AI Local Executor is a service that enables the execution of AI tasks on Infrastructure Elements. It is one of a set of AI services that should be deployed in order to run a distributed AI task; the others are the AI Task Controller components (FL Training Collector, FL Repository, FL Controller). Ideally all services should be deployed, but it is also possible to run a task directly on AI Local Execution by configuring it through its dedicated REST API.

Features 

AI Local Execution encapsulates the functionality of a federated learning (FL) client. It connects to the training initiated by the FL server (FL Training Collector), periodically sends it local weights and receives new global weights, and downloads any necessary components from the FL Repository (database).

Beyond the classic functionality of an FL client, AI Local Execution enables the local inference deployment of a selected model (that can function as a standalone container). It uses flexible configurations, basic format verification and pluggable components.

Place in architecture 

AI Local Execution is an auxiliary AI service that can be deployed using general aerOS mechanisms.

User guide 

TBD

Prerequisites 

The FL Repository must be running and properly initialized with any models or transformations that AI Local Execution might need.

Installation 

AI Local Execution should be deployed along with the AI Task Controller components to provide the full functionality of FL training and local inference. The respective services expose REST APIs to allow communication among themselves and with external parties.

AI Local Execution can be run using docker-compose or deployed on a Kubernetes cluster with a dedicated Helm chart.

Running using Docker (locally)

The following command can be used in the terminal to build a new Docker image and start the containers:

USER_INDEX=1 FL_LOCAL_OP_DATA_FOLDER="./data" docker compose up --force-recreate --build -d

Alternatively, the following script can be used to do this automatically:

sh start-local-docker.sh [NUMBER OF LE]

Setting [NUMBER OF LE] to 1 will run one AI Local Execution instance.

Once the containers are built and running, check them with the command docker ps. For one instance of AI Local Execution the output should look like this:

CONTAINER ID	IMAGE	…	STATUS	PORTS	NAME
8c4744c648c0	aeros/ai_local_executor:latest	…	Up 5 minutes	0.0.0.0:9050->9050/tcp, 0.0.0.0:30080->80/tcp, 0.0.0.0:9003->9000/tcp	appv0-local_executor-1
24964eafadc9	aeros/ai_local_executor_inference:latest	…	Up 5 minutes	0.0.0.0:9001->9000/tcp, 0.0.0.0:50052->50051/tcp	appv0-inferenceapp-1
57de2b0a09e4	appv0-db	…	Up 5 minutes	27017/tcp	db-0

The Swagger documentation of the REST API should be available at http://127.0.0.1:9050/docs (if the default port configuration is preserved).

Note 1: When running with Docker, make sure that all the containers that will be used to run the FL task (FL Local Execution, FL Repository, FL Training Collector) are in the same network.

The following commands can be used:

docker inspect -f '{{range $key, $value := .NetworkSettings.Networks}}{{$key}} {{end}}' [CONTAINER_ID] - check network of a given container

docker network inspect -f '{{range .Containers}}{{.Name}} {{end}}' [NETWORK e.g. appv0_default] - check all containers in a given network

docker network connect [NETWORK] [CONTAINER_ID] - add a given container to a given network
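
The network check described above can also be scripted. The following is a minimal sketch, assuming the Docker CLI is available on the PATH; the function names are illustrative and not part of the project, and the container and network names used in the usage note are the ones from the example output above, which may differ in your setup.

```python
import subprocess


def parse_names(inspect_output: str) -> set[str]:
    """Split the space-separated names printed by the Go template into a set."""
    return set(inspect_output.split())


def containers_in_network(network: str) -> set[str]:
    """Run `docker network inspect` and return the container names it lists."""
    result = subprocess.run(
        ["docker", "network", "inspect", "-f",
         "{{range .Containers}}{{.Name}} {{end}}", network],
        capture_output=True, text=True, check=True,
    )
    return parse_names(result.stdout)


def connect_commands(network: str, required: set[str], present: set[str]) -> list[str]:
    """Build one `docker network connect` command per required container that is missing."""
    return [f"docker network connect {network} {name}"
            for name in sorted(required - present)]
```

For example, connect_commands("appv0_default", {"appv0-local_executor-1", "db-0"}, containers_in_network("appv0_default")) would return the commands still needed to attach the listed containers. The sketch only prints nothing and executes no connect command itself, so the result can be reviewed before running it.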

Note 2: Before starting the training, at least the FL Repository should be started and properly initialized with any required transformations and model metadata. For more details see the FL Repository documentation.

Deployment on Kubernetes

The AI Local Execution service has been developed with the assumption that it will be deployed on a Kubernetes cluster with a dedicated Helm chart. To do so, run helm install <deployment name> ailocalexecution. If multiple AI Local Execution instances are deployed in one Kubernetes cluster, each deployment should be given a different name. To deploy only the inference component, run helm install fllocalexecutionlocal fllocalexecution --set inferenceapp.fullDeployment.enabled=false.

To make sure that the service has been configured properly, check the 3 ConfigMaps that are deployed alongside the service. Their names change depending on the name of the deployment.

The ConfigMap whose name starts with ailocalexec-config-map contains the environment variables necessary to deploy the AI Local Execution instance. Check especially the fields REPOSITORY_ADDRESS (the address of the FL Repository instance) and CONTROLLER_SVR_ADDRESS (the address of the FL Controller).
The ConfigMap whose name starts with fltraining-config-map describes the configuration necessary to run the training app component with pluggable transformations.
The ConfigMap whose name starts with flinference-config-map serves to flexibly set and change the configuration for the inference component.
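
The check of the two address fields can be expressed as a small helper. This is a sketch under the assumption that the ConfigMap values reach the running container as environment variables; the function name and the example values in the usage note are hypothetical.

```python
import os
from typing import Mapping

# The two fields singled out above; other variables are not checked here.
REQUIRED_VARIABLES = ("REPOSITORY_ADDRESS", "CONTROLLER_SVR_ADDRESS")


def missing_variables(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are absent or blank in `env`."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name, "").strip()]
```

Calling missing_variables(os.environ) inside the container returns an empty list when both addresses are set, and the names of the unset ones otherwise.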

Configuration options 

POST /model/ Receive new training model metadata for local storage.
PUT /model/{name}/{version} Update the weights and structure of the locally stored training model.
GET /job/status Get the statuses of the current jobs.
GET /job/total Get the number of currently running jobs.
GET /capabilities Get the computational capabilities of the machine that AI Local Execution is running on.
GET /format Get the format of the data that a given instance has currently access to.
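
The read-only endpoints listed above can be queried from a script. The sketch below uses only the Python standard library and assumes the default address from the Docker setup (http://127.0.0.1:9050) and that the responses are JSON, which the Swagger page should confirm; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9050"  # assumed default; adjust to your deployment

# GET endpoints from the list above that take no path parameters.
READ_ONLY_ENDPOINTS = ("/job/status", "/job/total", "/capabilities", "/format")


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """GET one endpoint and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(endpoint_url(path, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running instance, fetch_json("/capabilities") would return the decoded response of that endpoint.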

In order to initiate the training, a JSON encompassing the following configuration should be sent to the endpoint shown below. The most important available keys and their meaning will be explained further down.

POST /job/config/{training_id}/

{
  "client_type_id": "string",
  "server_address": "string",
  "eval_metrics": [
    "string"
  ],
  "eval_func": "string",
  "num_classes": 0,
  "num_rounds": 0,
  "shape": [
    0
  ],
  "training_id": 0,
  "model_name": "string",
  "model_version": "string",
  "config": [
    {
      "config_id": "string",
      "batch_size": 0,
      "steps_per_epoch": 0,
      "epochs": 0,
      "learning_rate": 0
    }
  ],
  "optimizer_config": {
    "optimizer": "string",
    "lr": 0,
    "rho": 0,
    "eps": 0,
    "foreach": true,
    "maximize": true,
    "lr_decay": 0,
    "betas": [
      "string",
      "string"
    ],
    "etas": [
      "string",
      "string"
    ],
    "step_sizes": [
      "string",
      "string"
    ],
    "lambd": 0,
    "alpha": 0,
    "t0": 0,
    "max_iter": 0,
    "max_eval": 0,
    "tolerance_grad": 0,
    "tolerance_change": 0,
    "history_size": 0,
    "line_search_fn": "string",
    "momentum_decay": 0,
    "dampening": 0,
    "centered": true,
    "nesterov": true,
    "momentum": 0,
    "weight_decay": 0,
    "amsgrad": true,
    "learning_rate": 0,
    "name": "string",
    "clipnorm": 0,
    "global_clipnorm": 0,
    "use_ema": true,
    "ema_momentum": 0,
    "ema_overwrite_frequency": 0,
    "jit_compile": true,
    "epsilon": 0,
    "clipvalue": 0,
    "initial_accumulator_value": 0,
    "beta_1": 0,
    "beta_2": 0,
    "beta_2_decay": 0,
    "epsilon_1": 0,
    "epsilon_2": 0,
    "learning_rate_power": 0,
    "l1_regularization_strength": 0,
    "l2_regularization_strength": 0,
    "l2_shrinkage_regularization_strength": 0,
    "beta": 0
  },
  "scheduler_config": {
    "scheduler": "string",
    "step_size": 0,
    "gamma": 0,
    "last_epoch": 0,
    "verbose": true,
    "milestones": [
      0
    ],
    "factor": 0,
    "total_iters": 0,
    "start_factor": 0,
    "end_factor": 0,
    "monitor": "string",
    "min_delta": 0,
    "patience": 0,
    "mode": "string",
    "baseline": 0,
    "restore_best_weights": true,
    "start_from_epoch": 0,
    "cooldown": 0,
    "min_lr": 0
  },
  "warmup_config": {
    "scheduler": "string",
    "warmup_iters": 0,
    "warmup_epochs": 0,
    "warmup_factor": 0,
    "scheduler_conf": {
      "scheduler": "string",
      "step_size": 0,
      "gamma": 0,
      "last_epoch": 0,
      "verbose": true,
      "milestones": [
        0
      ],
      "factor": 0,
      "total_iters": 0,
      "start_factor": 0,
      "end_factor": 0,
      "monitor": "string",
      "min_delta": 0,
      "patience": 0,
      "mode": "string",
      "baseline": 0,
      "restore_best_weights": true,
      "start_from_epoch": 0,
      "cooldown": 0,
      "min_lr": 0
    }
  },
  "privacy-mechanisms": {}
}

The definitions:

client_type_id Specifies the ID of the client. For testing purposes, the keyword “base” allows bypassing the pluggability modules for the PyTorch builder.
server_address The address of the Flower server that the FL client should try to connect to.
eval_metrics The evaluation metrics which will be gathered through the evaluation process by the FL client.
eval_func The evaluation function that the model will use as the loss throughout the training process.
num_classes The number of classes in classification problems.
num_rounds The number of rounds that the training should run for.
shape The shape of the data. Currently, it is recommended to change this parameter through the ConfigMaps instead.
training_id The ID of the training process being conducted.
model_name The name of the model that will be used in the training. The name should be the same as the one stored in the FL Repository.
model_version The version of the model that will be used in the training. The version should be the same as the one stored in the FL Repository.
config The configuration specifying how the FL training process will be conducted on the client, including parameters such as batch_size and learning_rate.
optimizer_config The configuration of the optimizer.

optimizer The name of the optimizer.

For the Keras model and client, the optimizer can be one of:

"sgd": tf.keras.optimizers.SGD,
"rmsprop": tf.keras.optimizers.RMSprop,
"adam": tf.keras.optimizers.Adam,
"adadelta": tf.keras.optimizers.Adadelta,
"adagrad": tf.keras.optimizers.Adagrad,
"adamax": tf.keras.optimizers.Adamax,
"nadam": tf.keras.optimizers.Nadam,
"ftrl": tf.keras.optimizers.Ftrl

For the PyTorch model and client, the optimizer can be one of:

"adadelta": torch.optim.Adadelta,
"adagrad": torch.optim.Adagrad,
"adam": torch.optim.Adam,
"adamw": torch.optim.AdamW,
"sparseadam": torch.optim.SparseAdam,
"adamax": torch.optim.Adamax,
"asgd": torch.optim.ASGD,
"lbfgs": torch.optim.LBFGS,
"nadam": torch.optim.NAdam,
"radam": torch.optim.RAdam,
"rmsprop": torch.optim.RMSprop,
"rprop": torch.optim.Rprop,
"sgd": torch.optim.SGD

Other fields indicate the arguments that should be passed to the optimizer.

scheduler_config The configuration of the scheduler.

scheduler The name of the scheduler.

For the Keras model and client, the scheduler (more appropriately called a Keras callback in this case) can be one of:

"earlystopping": tf.keras.callbacks.EarlyStopping,
"reducelronplateau": tf.keras.callbacks.ReduceLROnPlateau,
"terminateonnan": tf.keras.callbacks.TerminateOnNaN

For the PyTorch model and client, the scheduler can be one of:

"lambdalr": torch.optim.lr_scheduler.LambdaLR,
"multiplicativelr": torch.optim.lr_scheduler.MultiplicativeLR,
"steplr": torch.optim.lr_scheduler.StepLR,
"multisteplr": torch.optim.lr_scheduler.MultiStepLR,
"constantlr": torch.optim.lr_scheduler.ConstantLR,
"linearlr": torch.optim.lr_scheduler.LinearLR,
"exponentiallr": torch.optim.lr_scheduler.ExponentialLR,
"cosineannealinglr": torch.optim.lr_scheduler.CosineAnnealingLR,
"chainedscheduler": torch.optim.lr_scheduler.ChainedScheduler,
"sequentiallr": torch.optim.lr_scheduler.SequentialLR,
"reducelronplateau": torch.optim.lr_scheduler.ReduceLROnPlateau,
"cycliclr": torch.optim.lr_scheduler.CyclicLR,
"onecyclelr": torch.optim.lr_scheduler.OneCycleLR,
"cosineannealingwarmrestarts": torch.optim.lr_scheduler.CosineAnnealingWarmRestarts

Other fields indicate the arguments that should be passed to the scheduler.

warmup_config The configuration of an (optional) warmup. This configuration is valid only for the PyTorch builder. It specifies a special scheduler, which can be used only for a selected number of epochs to provide warmup throughout the process.
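
The name-to-class tables above, together with the rule that the other fields are passed as arguments, suggest a simple lookup pattern. The following sketch illustrates that pattern only; it is not taken from the service's code. FakeSGD is a hypothetical stand-in for torch.optim.SGD so that the example runs without PyTorch installed, and build_component is an illustrative name.

```python
from typing import Any, Callable, Mapping


def build_component(registry: Mapping[str, Callable[..., Any]],
                    config: Mapping[str, Any],
                    name_key: str,
                    *args: Any) -> Any:
    """Look up config[name_key] in the registry and pass the other fields as kwargs."""
    name = config[name_key]
    if name not in registry:
        raise ValueError(f"unknown {name_key} '{name}', expected one of {sorted(registry)}")
    kwargs = {key: value for key, value in config.items() if key != name_key}
    return registry[name](*args, **kwargs)


class FakeSGD:
    """Hypothetical stand-in for torch.optim.SGD (same first arguments, no behavior)."""

    def __init__(self, params, lr=0.01, momentum=0.0):
        self.params, self.lr, self.momentum = params, lr, momentum


registry = {"sgd": FakeSGD}  # the real tables map names to framework classes
optimizer = build_component(
    registry, {"optimizer": "sgd", "lr": 0.1, "momentum": 0.9}, "optimizer", [])
print(type(optimizer).__name__, optimizer.lr, optimizer.momentum)
```

With this pattern, a field that the chosen class does not accept raises a TypeError at construction time, so a configuration should contain only the arguments relevant to the selected optimizer or scheduler.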

A sample test configuration can be seen here:

{
  "client_type_id": "local1",
  "server_address": "fl-training-collector-trainingmain-1",
  "eval_metrics": [
    "accuracy"
  ],
  "eval_func": "categorical_crossentropy",
  "num_classes": 10,
  "num_rounds": 15,
  "shape": [
    32, 32, 3
  ],
  "training_id": "1",
  "model_name": "md_keras",
  "model_version": "v1",
  "config": [
    {
      "config_id": "min_effort",
      "batch_size": "64",
      "steps_per_epoch": "32",
      "epochs": "1",
      "learning_rate": "0.001"
    }
  ],
  "optimizer_config": {
    "optimizer": "adam",
    "learning_rate": "0.005",
    "amsgrad": "True"
  },
  "scheduler_config": {
    "scheduler": "reducelronplateau",
    "factor": "0.5",
    "min_delta": "0.0003"
  },
  "privacy-mechanisms": {}
}
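
A configuration like the sample above can be submitted from a script. The sketch below prepares the POST request with the Python standard library without sending it; the base URL is the assumed default from the Docker setup, build_config_request is an illustrative name, and the dictionary is abbreviated here, so the full sample should be used in practice.

```python
import json
import urllib.request


def build_config_request(base_url: str, training_id: str, config: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that carries the job configuration."""
    url = f"{base_url.rstrip('/')}/job/config/{training_id}/"
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


# Abbreviated for illustration; send the complete sample configuration in practice.
sample = {
    "client_type_id": "local1",
    "training_id": "1",
    "model_name": "md_keras",
    "model_version": "v1",
}
request = build_config_request("http://127.0.0.1:9050", sample["training_id"], sample)
print(request.get_method(), request.full_url)
```

The request can then be sent with urllib.request.urlopen(request, timeout=10) once the FL Repository and FL Training Collector are reachable.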

Developer guide 

TBD

Authors 

The AI Local Execution service is a continuation of research conducted within the Horizon 2020 ASSIST-IoT project.

Systems Research Institute, Polish Academy of Sciences, Warsaw

License 

The AI Local Execution is released under the Apache 2.0 license (available at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)), as we have internally concluded that we are not “offering the functionality of MongoDB, or modified versions of MongoDB, to third parties as a service”. However, potential future commercial adopters should be aware that our project uses MongoDB in order to be able to accurately determine the license most applicable to their projects.