FL Training Collector

Introduction 

FL Training Collector encapsulates the functionalities of a federated learning (FL) server by synchronizing the training with multiple FL Local Execution instances (clients) and storing the results of the training (final weights of the model and relevant metrics) in the FL Repository (database).

Features 

Place in architecture 

FL Training Collector is a component of AI Task Controller and one of the auxiliary AI services that can be deployed using general aerOS mechanisms.

User guide 

Interaction with FL Training Collector takes place through its REST API, which is described in the section Configuration options.

Prerequisites 

FL Repository must be running and properly initialized with any models or transformations that AI Local Execution might need.

Installation 

FL Training Collector should be deployed as part of AI Task Controller to support the execution of AI tasks along with AI Local Execution. The service exposes a REST API to allow communication with other AI services and with external parties.

FL Training Collector can be run using docker-compose or deployed on a Kubernetes cluster with a dedicated Helm chart.

Running using Docker (locally)

The following command can be used in the terminal to build a new Docker image and start the container:

docker-compose -f docker-compose.yml up --force-recreate --build -d

Once the container is built and running, its state can be checked with the command: docker ps. For one instance of FL Training Collector the output should look like this:

CONTAINER ID	IMAGE	…	STATUS	PORTS	NAMES
8c4744c648c0	aeros/fl_training_collector	…	Up 5 minutes	0.0.0.0:8000->8000/tcp, 0.0.0.0:8080->8080/tcp	fl-training-collector-trainingmain-1

The Swagger documentation of the REST API should be visible under the URL http://127.0.0.1:8000/docs (if the default port configuration is preserved).
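The reachability of the documentation page can also be checked from a script. The following is a minimal sketch using only the Python standard library; the helper names docs_url and docs_reachable are hypothetical and not part of FL Training Collector, and the default host and port simply mirror the default configuration mentioned above.

```python
import urllib.error
import urllib.request


def docs_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the Swagger UI address (defaults mirror the default port setup)."""
    return f"http://{host}:{port}/docs"


def docs_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET on the URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False
```

Calling docs_reachable(docs_url()) after docker-compose has started the container should return True once the service is up.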

Note: When running using Docker, make sure that all containers (FL Local Execution, FL Repository, FL Training Collector) that will be used to run the FL task are in the same network.

The following commands can be used:

docker inspect -f '{{range $key, $value := .NetworkSettings.Networks}}{{$key}} {{end}}' [CONTAINER_ID] - check network of a given container

docker network inspect -f '{{range .Containers}}{{.Name}} {{end}}' [NETWORK e.g. appv0_default] - check all containers in a given network

docker network connect [NETWORK] [CONTAINER_ID] - add a given container to a given network

Deployment on Kubernetes

The FL Training Collector has been developed with the assumption that it will be deployed on Kubernetes with a dedicated Helm chart. To do so, run helm install <deployment name> trainingcollector. To verify that it has been configured properly, check that values such as REPOSITORY_ADDRESS (the address under which the FL Repository can be found in the Kubernetes cluster) and CONTROLLER_ADDRESS in the training-collector-configmap have been set correctly. By default, the chart also uses the host’s ports 30800 and 30808 as Node Ports.

The API with Swagger documentation is available at http://127.0.0.1:30800/docs.

Configuration options 

POST /job/config/{training_id}/ Starts a training with the configuration received in JSON format.

{
  "strategy": "string",
  "model_name": "string",
  "model_version": "string",
  "adapt_config": "string",
  "server_conf": {
    "num_rounds": 0,
    "round_timeout": 0
  },
  "strategy_conf": {
    "fraction_fit": 0,
    "fraction_evaluate": 0,
    "min_fit_clients": 0,
    "min_evaluate_clients": 0,
    "min_available_clients": 0,
    "accept_failures": true,
    "server_learning_rate": 0,
    "server_momentum": 0,
    "min_completion_rate_fit": 0,
    "min_completion_rate_evaluate": 0,
    "eta": 0,
    "eta_l": 0,
    "beta_1": 0,
    "beta_2": 0,
    "tau": 0,
    "q_param": 0,
    "qffl_learning_rate": 0
  },
  "client_conf": [
    {
      "config_id": "string",
      "batch_size": 32,
      "steps_per_epoch": 3,
      "epochs": 0,
      "learning_rate": 0.05
    }
  ],
  "privacy-mechanisms": {},
  "configuration_id": 0,
  "stopping_flag": false,
  "stopping_target": {
    "metric": 0
  }
}

The definitions:

strategy The name under which the strategy is stored in the FL Repository.
model_name The name of the model that should be used in the training (as described in the FL Repository).
model_version The version of the model that should be used in the training (as described in the FL Repository).
adapt_config The parameter indicating whether the learning rate should be adapted on the FL Training Collector. It is currently preferable to control this aspect through FL strategies.
server_conf The configuration designated to be used by the underlying [Flower](https://github.com/adap/flower) server.
num_rounds The number of rounds the FL training should run for.
round_timeout Whether the FL server should wait for the results of all of its configured clients or stop waiting after a timeout. Compatible with the server timeout in Flower.
strategy_conf The parameters that allow for the flexible configuration of one of the three aggregation strategies offered out-of-the-box (other strategies can also be added to the FL Repository and run, but their strategy-specific parameters should already be preconfigured). More about the meaning of the specific fields can be found in the Flower documentation.
client_conf The basic parameters that can be sent from the FL server to configure the client. Currently not used in the three default strategies, but preserved for use by custom strategies if applicable.
configuration_id The id of the configuration under which the results of this training will be stored.
stopping_flag This flag indicates whether the training should stop when one of the aggregated evaluation metrics reaches a specific value.
stopping_target The target values of the aggregated metrics. If any of them is surpassed, the whole training process stops gracefully (the results of the training process are saved in the FL Repository beforehand).
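The interplay of stopping_flag and stopping_target can be illustrated with a short sketch. This is not the actual implementation of FL Training Collector: the function name should_stop is hypothetical, and "surpassed" is assumed here to mean strictly greater than the target value.

```python
from typing import Mapping


def should_stop(
    aggregated_metrics: Mapping[str, float],
    stopping_flag: bool,
    stopping_target: Mapping[str, float],
) -> bool:
    """Return True if any aggregated metric exceeds its configured target.

    Assumption: "surpassed" is read as strictly greater than the target.
    Metrics named in stopping_target but missing from the aggregated
    results are ignored.
    """
    if not stopping_flag:
        # Early stopping is disabled, so the targets are never consulted.
        return False
    return any(
        name in aggregated_metrics and aggregated_metrics[name] > target
        for name, target in stopping_target.items()
    )
```

Under these assumptions, a target of {"accuracy": 0.25} with stopping_flag set to true would end the training after the first round whose aggregated accuracy is above 0.25.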

A sample test configuration can be seen here:

{
  "strategy": "avg",
  "model_name": "md_keras",
  "model_version": "v1",
  "adapt_config": "custom",
  "server_conf": {
    "num_rounds": 3
  },
  "strategy_conf": {
    "min_fit_clients": "1",
    "min_available_clients": "1",
    "min_evaluate_clients": "1"
  },
  "privacy-mechanisms": {},
  "client_conf": [
    {
      "config_id": "min_effort",
      "batch_size": "64",
      "steps_per_epoch": "32",
      "epochs": "5",
      "learning_rate": "0.001"
    }
  ],
  "configuration_id": "1",
  "stopping_flag": true,
  "stopping_target": {
    "accuracy": 0.25
  }
}
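A configuration such as the one above can be submitted from a script. The sketch below uses only the Python standard library; the helper names build_config_request and start_training are hypothetical, the base URL assumes the default local Docker setup, and the format of the response body is not assumed (only the HTTP status code is returned).

```python
import json
import urllib.request


def build_config_request(
    base_url: str, training_id: str, config: dict
) -> urllib.request.Request:
    """Prepare a POST /job/config/{training_id}/ request with a JSON body."""
    url = f"{base_url.rstrip('/')}/job/config/{training_id}/"
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_training(
    base_url: str, training_id: str, config: dict, timeout: float = 10.0
) -> int:
    """Send the configuration and return the HTTP status code of the reply."""
    request = build_config_request(base_url, training_id, config)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, with the sample configuration loaded into a dictionary named config, start_training("http://127.0.0.1:8000", "1", config) would send it for a training with the illustrative id 1.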

GET /job/status/{training_id} Returns information about the status of the job. The status may show that the training is INACTIVE, WAITING, TRAINING, INTERRUPTED or FINISHED. Information about the round number may also be specified if appropriate.
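A client can poll this endpoint until the training ends. The sketch below is illustrative only: the helper names are hypothetical, the response is assumed to be a JSON object with a "status" field (the actual response schema should be checked in the Swagger documentation), and only INTERRUPTED and FINISHED are treated as terminal states.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: these two statuses mean that no further progress will happen.
TERMINAL_STATUSES = {"INTERRUPTED", "FINISHED"}


def fetch_status(base_url: str, training_id: str, timeout: float = 10.0) -> str:
    """Call GET /job/status/{training_id} and extract the status string.

    Assumption: the response body is a JSON object with a "status" field.
    """
    url = f"{base_url.rstrip('/')}/job/status/{training_id}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read())["status"]


def wait_until_done(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Poll until a terminal status is seen or max_polls calls were made.

    Returns the last status observed.
    """
    status = get_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
        status = get_status()
    return status
```

Passing the status source as a callable, for instance lambda: fetch_status("http://127.0.0.1:8000", "1"), keeps the polling loop independent of the assumed response format.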

POST /job/stop Stops (ungracefully) a given job.