#####################
FL Training Collector
#####################

.. contents::
   :local:
   :depth: 1

Introduction
============
FL Training Collector encapsulates the functionality of a federated learning (FL) server: it synchronizes training across multiple FL Local Execution instances (clients) and stores the results of the training (the final model weights and relevant metrics) in the FL Repository (database).

Features
========

Place in architecture
=====================
FL Training Collector is a component of AI Task Controller and one of the auxiliary AI services that can be deployed using general aerOS mechanisms.

User guide
==========
Interaction with FL Training Collector is done through a REST API, which is described in the Configuration options section.

Prerequisites
=============
The FL Repository must be running and properly initialized with any models or transformations that AI Local Execution might need.

Installation
============
FL Training Collector should be deployed as part of AI Task Controller to support the execution of AI tasks along with AI Local Execution. The service exposes a REST API for communication with other AI services and with external parties. FL Training Collector can be run using docker-compose or deployed on a Kubernetes cluster with a dedicated Helm chart.

Running using Docker (locally)
------------------------------
The following command can be used in the terminal to build a new Docker image and start the container:

.. code-block:: bash

   docker-compose -f docker-compose.yml up --force-recreate --build -d

Once the container is built and running, check it with the command ``docker ps``. For one instance of AI Local Execution the output should look like this:

+--------------+-------------------------------+-----+--------------+------------------------------------------------+--------------------------------------+
| CONTAINER ID | IMAGE                         | ... | STATUS       | PORTS                                          | NAMES                                |
+==============+===============================+=====+==============+================================================+======================================+
| 8c4744c648c0 | aeros/fl\_training\_collector | ... | Up 5 minutes | 0.0.0.0:8000->8000/tcp, 0.0.0.0:8080->8080/tcp | fl-training-collector-trainingmain-1 |
+--------------+-------------------------------+-----+--------------+------------------------------------------------+--------------------------------------+

The Swagger documentation of the REST API should be visible at ``http://127.0.0.1:8000/docs`` (if the default port configuration is preserved).

Note 1: When running using Docker, make sure that all containers that are to be used to run the FL task (FL Local Execution, FL Repository, FL Training Collector) are in the same network. The following commands can be used:

- ``docker inspect -f '{{range $key, $value := .NetworkSettings.Networks}}{{$key}} {{end}}' [CONTAINER_ID]`` - check the network of a given container
- ``docker network inspect -f '{{range .Containers}}{{.Name}} {{end}}' [NETWORK e.g. appv0_default]`` - check all containers in a given network
- ``docker network connect [NETWORK] [CONTAINER_ID]`` - add a given container to a given network

Deployment on Kubernetes
------------------------
The FL Training Collector has been developed with the assumption that it will be deployed on Kubernetes with a dedicated Helm chart. To do so, run ``helm install trainingcollector``. To make sure that it has been configured properly, check that values such as ``REPOSITORY_ADDRESS`` (the address under which the FL Repository can be found in the Kubernetes cluster) or ``CONTROLLER_ADDRESS`` in the ``training-collector-configmap`` have been set correctly. By default, the chart also uses the host's ports 30800 and 30808 as NodePorts. The API with Swagger documentation is available at ``http://127.0.0.1:30800/docs``.
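After either deployment, the Swagger endpoint can be probed from a script as a quick smoke test. The following is only a minimal sketch: it assumes the default ports mentioned above (8000 for Docker, 30800 for the Kubernetes NodePort) and a service reachable on ``127.0.0.1``. The helper names ``docs_url`` and ``is_reachable`` are hypothetical and are not part of FL Training Collector.

```python
import urllib.error
import urllib.request

# Default ports taken from the deployment sections above (assumed unchanged).
DEFAULT_PORTS = {"docker": 8000, "kubernetes": 30800}


def docs_url(deployment: str, host: str = "127.0.0.1") -> str:
    """Build the Swagger URL for a deployment type ("docker" or "kubernetes")."""
    try:
        port = DEFAULT_PORTS[deployment]
    except KeyError:
        raise ValueError(f"unknown deployment type: {deployment!r}") from None
    return f"http://{host}:{port}/docs"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET on the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP error.
        return False
```

For example, ``is_reachable(docs_url("kubernetes"))`` would report whether the NodePort deployment answers on its documentation page. A ``False`` result only says that the page did not respond, so container networking and port mappings still need to be inspected separately.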
Configuration options
=====================

**POST /job/config/{training_id}/**

Starts a training with the configuration received in JSON format.

.. code-block:: json

   {
     "strategy": "string",
     "model_name": "string",
     "model_version": "string",
     "adapt_config": "string",
     "server_conf": {
       "num_rounds": 0,
       "round_timeout": 0
     },
     "strategy_conf": {
       "fraction_fit": 0,
       "fraction_evaluate": 0,
       "min_fit_clients": 0,
       "min_evaluate_clients": 0,
       "min_available_clients": 0,
       "accept_failures": true,
       "server_learning_rate": 0,
       "server_momentum": 0,
       "min_completion_rate_fit": 0,
       "min_completion_rate_evaluate": 0,
       "eta": 0,
       "eta_l": 0,
       "beta_1": 0,
       "beta_2": 0,
       "tau": 0,
       "q_param": 0,
       "qffl_learning_rate": 0
     },
     "client_conf": [
       {
         "config_id": "string",
         "batch_size": 32,
         "steps_per_epoch": 3,
         "epochs": 0,
         "learning_rate": 0.05
       }
     ],
     "privacy-mechanisms": {},
     "configuration_id": 0,
     "stopping_flag": false,
     "stopping_target": {
       "metric": 0
     }
   }

The definitions:

- **strategy** The name under which the strategy is stored in the FL Repository.
- **model_name** The name of the model that should be used in the training (as described in the FL Repository).
- **model_version** The version of the model that should be used in the training (as described in the FL Repository).
- **adapt_config** The parameter indicating whether the learning rate should be adapted on the FL Training Collector. It is currently preferable to change this aspect through the use of FL strategies.
- **server_conf** The configuration designated to be used by the underlying `Flower <https://github.com/adap/flower>`__ server.

  - **num_rounds** The number of rounds the FL training should run for.
  - **round_timeout** Determines whether the FL server waits for the results of all of its configured clients or stops waiting after a timeout. Compatible with the server timeout in Flower.

- **strategy_conf** The parameters that allow for flexible configuration of one of the three aggregation strategies offered out-of-the-box. Other strategies can also be added to the FL Repository and run, but their strategy-specific parameters should already be preconfigured. More about the meaning of the specific fields can be found in the Flower documentation.
- **client_conf** The basic parameters that can be sent from the FL server to configure the client. Currently not used in the three default strategies, but preserved so that custom strategies can use them if applicable.
- **configuration_id** The id of the configuration under which the results of this training will be stored.
- **stopping_flag** This flag indicates whether the training should stop when one of the aggregated evaluation metrics reaches a specific value.
- **stopping_target** The values of the aggregated metrics that, if any one of them is surpassed, cause the whole training process to stop gracefully (the results of the training process are saved in the FL Repository beforehand).

A sample test configuration can be seen here:

.. code-block:: json

   {
     "strategy": "avg",
     "model_name": "md_keras",
     "model_version": "v1",
     "adapt_config": "custom",
     "server_conf": {
       "num_rounds": 3
     },
     "strategy_conf": {
       "min_fit_clients": "1",
       "min_available_clients": "1",
       "min_evaluate_clients": "1"
     },
     "privacy-mechanisms": {},
     "client_conf": [
       {
         "config_id": "min_effort",
         "batch_size": "64",
         "steps_per_epoch": "32",
         "epochs": "5",
         "learning_rate": "0.001"
       }
     ],
     "configuration_id": "1",
     "stopping_flag": true,
     "stopping_target": {"accuracy": 0.25}
   }

**GET /job/status/{training_id}**

Returns information about the status of the job. The status may show that the training is ``INACTIVE``, ``WAITING``, ``TRAINING``, ``INTERRUPTED`` or ``FINISHED``. Information about the round number may also be included if appropriate.

**POST /job/stop**

Stops (ungracefully) a given job.
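The endpoints above can be driven from a small client. The following is a minimal sketch rather than the project's own client: it assumes the service listens on ``http://127.0.0.1:8000`` (the default Docker port), builds a payload shaped like the sample configuration, and uses only the Python standard library. All function names and the example ``training_id`` are hypothetical, numeric fields follow the types of the schema rather than the quoted strings of the sample, and the status response is returned as raw text because its exact format is not described here.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: default Docker port


def build_config(model_name, model_version, num_rounds, stopping_target=None):
    """Return a configuration payload shaped like the sample configuration."""
    return {
        "strategy": "avg",
        "model_name": model_name,
        "model_version": model_version,
        "adapt_config": "custom",
        "server_conf": {"num_rounds": num_rounds},
        "strategy_conf": {
            "min_fit_clients": 1,
            "min_available_clients": 1,
            "min_evaluate_clients": 1,
        },
        "privacy-mechanisms": {},
        "client_conf": [
            {
                "config_id": "min_effort",
                "batch_size": 64,
                "steps_per_epoch": 32,
                "epochs": 5,
                "learning_rate": 0.001,
            }
        ],
        "configuration_id": 1,
        # Early stopping is enabled only when a target is supplied.
        "stopping_flag": stopping_target is not None,
        "stopping_target": stopping_target or {},
    }


def config_url(training_id, base_url=BASE_URL):
    """URL of POST /job/config/{training_id}/ (note the trailing slash)."""
    return f"{base_url}/job/config/{training_id}/"


def status_url(training_id, base_url=BASE_URL):
    """URL of GET /job/status/{training_id}."""
    return f"{base_url}/job/status/{training_id}"


def start_training(training_id, config, base_url=BASE_URL, timeout=10.0):
    """POST the configuration as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        config_url(training_id, base_url),
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def get_status(training_id, base_url=BASE_URL, timeout=10.0):
    """GET the job status and return the response body as text."""
    with urllib.request.urlopen(status_url(training_id, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A typical session would call ``start_training(1, build_config("md_keras", "v1", 3, {"accuracy": 0.25}))`` and then call ``get_status(1)`` periodically until the returned text reports ``FINISHED`` or ``INTERRUPTED``. The exact request and response schemas should be confirmed against the Swagger documentation of the running service.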
Developer guide
===============

Authors
=======
The FL Training Collector service is a continuation of research conducted within the Horizon 2020 ASSIST-IoT project.

Systems Research Institute, Polish Academy of Sciences, Warsaw

License
=======
The FL Training Collector is released under the Apache 2.0 license. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0

Notice (dependencies)
=====================