Managing a Cluster of 100 Edge Devices

En route to eliminating the need for the cloud in AI training and retraining. Part 2 of 3.

Live view of our cluster of edge devices, running a distributed, collaborative training session.

If you haven’t read the first post in this series, we suggest reading it first to get the most out of this article. It described our process for building a cluster of 100 edge devices in order to train deep-learning and machine-learning models in a distributed manner, without requiring the cloud, while achieving near-perfect accuracy. This post (the second of three) focuses on how to manage this type of system, from device management to deployment.

This trial consists of a rack of 100 edge devices — Intel NUC single-board computers (SBCs). Our goal was to train a deep-learning model on top of these 100 edge devices. To that end, we needed to create a smooth pipeline and simple infrastructure. A key guiding principle was to ensure that we had full programmatic control over the rack and its devices — from the cluster’s initial boot to ongoing, day-to-day maintenance.

The main components are:

VMware (a global leader in the virtualization market) provides ESXi, a leading enterprise-class, bare-metal (type-1) hypervisor for deploying and serving virtual machines.

It stands out for its ability to run multiple virtual machines (VMs) on a single physical server, and its free edition runs on our PowerEdge T430 server to enable central management of all the VMs.

Note: The minor limitations of the free edition (such as the lack of automatic backups) are not critical for our scenario.

MAAS (Metal as a Service) is designed to facilitate and automate the deployment and dynamic provisioning of hyperscale computing environments, such as big data workloads and cloud services.

We used MAAS as a server-provisioning application to remotely deploy operating systems (OSs) on bare-metal hardware, using PXE as the deployment mechanism.

MAAS can remotely deploy OSs to multiple targets and provides central management of all of them. It serves as the DHCP and DNS server for the network, supports customizable OS startup scripts (such as cloud-init), and can discover other devices on the network.

These capabilities enabled us to create a customized OS image (Ubuntu 18.04 with chef-client) and deploy it remotely to our 100 edge devices (Intel NUCs). MAAS stores the OS images that are deployed to the edge devices over the LAN (PXE boot with DHCP) and serves as the DHCP and DNS server for the cluster. The MAAS region controller and rack controller roles run on the same machine, in a VM on the ESXi host.
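For illustration, the same deployment flow can also be driven from the MAAS 2.0 REST API. The sketch below is a minimal example rather than our production tooling: the MAAS host, API key and distro series are placeholders, and it assumes the standard MAAS OAuth1 (PLAINTEXT) authentication with an API key in the usual consumer:token:secret form.

```python
# Minimal sketch: deploying enlisted machines through the MAAS 2.0 REST API.
# Host, API key and distro series below are illustrative placeholders.
import requests
from requests_oauthlib import OAuth1

MAAS_URL = "http://maas.local:5240/MAAS/api/2.0"   # hypothetical MAAS host
API_KEY = "CONSUMER_KEY:TOKEN_KEY:TOKEN_SECRET"    # copied from the MAAS UI

consumer_key, token_key, token_secret = API_KEY.split(":")
auth = OAuth1(consumer_key, client_secret="",
              resource_owner_key=token_key,
              resource_owner_secret=token_secret,
              signature_method="PLAINTEXT")

# List the machines MAAS has enlisted (our 100 NUCs would appear here).
machines = requests.get(f"{MAAS_URL}/machines/", auth=auth).json()

for machine in machines:
    if machine["status_name"] == "Ready":
        # Deploy the customized Ubuntu 18.04 image to each Ready machine.
        requests.post(f"{MAAS_URL}/machines/{machine['system_id']}/",
                      params={"op": "deploy"},
                      data={"distro_series": "bionic"},
                      auth=auth)
```

The same operations are also available interactively through the MAAS web UI and the maas CLI.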

Even though MAAS is slightly more complex to manage in some scenarios and its documentation is lacking in some areas, its benefits far outweigh its disadvantages.

Chef is a great configuration management tool that enables the deployment, management and control of customized configurations for multiple applications running on multiple targets, across different operating systems.

We use Chef to write configuration definitions (called recipes) and deploy them to multiple targets. Chef provides central management of all targets, high-level customization options and configuration synchronization across all relevant machines. In our setup, Chef is responsible for configuring all the applications on all the edge devices.

The Chef server runs as a VM on top of the ESXi host. The Chef client is installed on each device to receive the latest recipes from the Chef server.

In general, Chef is more challenging to set up, because some knowledge of the Ruby language is required to achieve a high level of customization. In addition, a failed run can leave a customization script only partially applied, and the free version is limited in the number of agents and features it supports. Although not perfect, Chef’s benefits far outweigh its disadvantages.

We use Docker Trusted Registry (DTR) as a registry to hold the Docker images that are deployed to multiple targets.

Hosting the registry locally keeps communication between the Docker clients and the image server inside the cluster, which reduces network consumption when new application versions are downloaded as Docker images.

The Docker Registry is installed as a VM on the ESXi host and hosts the latest Docker images of the edge-device applications. The Docker Engine is installed on the edge OS so that the Edgify Device Manager can run the Edgify Training Worker on the same device.
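As a rough illustration (not our actual Device Manager code), pulling the latest application image from the in-cluster registry with the Python Docker SDK looks roughly like this; the registry address and image name below are hypothetical:

```python
# Minimal sketch: an edge device pulling the latest application image from the
# local registry over the LAN, using the Python Docker SDK ("docker" on PyPI).
import docker

REGISTRY = "registry.cluster.local:5000"        # hypothetical in-cluster registry
IMAGE = f"{REGISTRY}/edgify/training-worker"    # hypothetical image name

client = docker.from_env()

# The pull goes to the in-cluster registry VM, so no traffic leaves the LAN.
image = client.images.pull(IMAGE, tag="latest")
print(image.id, image.tags)
```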

A slight disadvantage of DTR is that it requires periodic cleanup of old images, and automation scripts are needed so that devices automatically consume the latest images.
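A cleanup script along these lines is one possible approach — a minimal sketch using the Python Docker SDK, where the image name and the keep-the-two-newest policy are purely illustrative:

```python
# Minimal sketch of the cleanup automation mentioned above: keep only the most
# recent local copies of an image and prune dangling layers left by old pulls.
import docker

client = docker.from_env()
IMAGE = "registry.cluster.local:5000/edgify/training-worker"   # hypothetical

# Sort this image's local copies from newest to oldest and drop the older ones.
images = sorted(client.images.list(name=IMAGE),
                key=lambda img: img.attrs["Created"], reverse=True)
for old in images[2:]:
    client.images.remove(old.id, force=True)

# Remove untagged (dangling) layers left behind by previous pulls.
client.images.prune(filters={"dangling": True})
```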

By default, the edge devices (Intel NUCs in our case) boot from a local (hard disk) drive, which makes installing an OS on 100 edge devices quite tedious, as they must be installed one by one.

The PXE Boot option (boot from LAN) enables the deployment of an OS from a PXE Boot server (MAAS in our case) via the network.

To use this option, the boot priority must be reordered in the BIOS of each device so that network boot appears first. The change takes effect after saving and rebooting the device. After this change, the next time the device powers on, it gets an IP address from the DHCP server (MAAS in our case) and pulls and installs the required OS.

Deployment Procedure

Our deployment procedure consisted of the following steps –

  1. Preparing a VMware ESXi server on the PowerEdge T430.
  2. Preparing the MAAS, Chef server and Docker Registry VMs on top of the ESXi host.
  3. Enrolling each newly connected edge device through MAAS and deploying the Ubuntu OS image created beforehand.
  4. Installing the Chef client on each running edge device so that it receives the recipes from the Chef server.
  5. Triggering a Training Worker container, through the Docker Engine, on each edge device selected for training.

After the infrastructure is finally ready, it can be used to deploy our core product — the Edgify Training and Prediction System.

Our system comprises three primary components that are installed on the edge devices –

  • Device Manager — The management container that is responsible for triggering operations on the edge device that originate from our dashboard, such as starting a training session or deploying a prediction model.
  • Training Worker — The training container that is responsible for running an actual training process. The Training Worker is configured and triggered by the Device Manager (described above).
  • Predictor — The prediction container that is responsible for using the trained model to make real-time predictions. The Predictor is also triggered by the Device Manager (described above).

All these components are implemented as Docker containers, which provide a variety of benefits, such as simple builds and deployments, isolation from the host OS and so on.
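To make the flow concrete, the sketch below shows how a manager process might trigger a Training Worker and then a Predictor through the Docker Engine. It illustrates the pattern rather than the actual Edgify implementation; the image names, environment variables, volume and port are all assumptions.

```python
# Illustrative sketch of the trigger flow described above, not the actual
# Edgify implementation: a manager process starts a Training Worker, waits for
# it to finish, then starts a Predictor that serves the resulting model.
import docker

client = docker.from_env()
REGISTRY = "registry.cluster.local:5000"    # hypothetical local registry

# Start a training run; the worker writes its trained model to a shared volume.
worker = client.containers.run(
    f"{REGISTRY}/edgify/training-worker:latest",
    detach=True,
    environment={"CLUSTER_ROLE": "worker", "EPOCHS": "10"},      # hypothetical
    volumes={"edgify-models": {"bind": "/models", "mode": "rw"}},
)
worker.wait()   # block until training completes

# Serve real-time predictions from the freshly trained model.
client.containers.run(
    f"{REGISTRY}/edgify/predictor:latest",
    detach=True,
    ports={"8080/tcp": 8080},               # hypothetical prediction endpoint
    volumes={"edgify-models": {"bind": "/models", "mode": "ro"}},
)
```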

Each time we release a new container version, we simply update the Chef environment with the latest version. The next time the Chef client runs on an edge device, it pulls the latest Docker image and runs it instead of the previous one.
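A minimal sketch of that version-bump step might look like the following, assuming the image tag is stored as a Chef environment attribute (the attribute name, file layout and environment name are hypothetical) and that the chef-repo is managed with knife:

```python
# Minimal sketch: bump the Training Worker image tag in a Chef environment file
# and upload it so that the next chef-client run picks up the new version.
import json
import subprocess
import sys

ENV_FILE = "environments/production.json"   # hypothetical chef-repo layout
new_tag = sys.argv[1]                        # e.g. "1.4.2"

with open(ENV_FILE) as f:
    env = json.load(f)

# Point every edge device at the new Training Worker image tag.
env.setdefault("default_attributes", {})["training_worker_tag"] = new_tag

with open(ENV_FILE, "w") as f:
    json.dump(env, f, indent=2)

# Upload the updated environment; the next chef-client run on each device
# pulls the matching image from the local registry and restarts the container.
subprocess.run(["knife", "environment", "from", "file", ENV_FILE], check=True)
```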

This approach enables the deployment and testing of multiple versions per development cycle without requiring the tedious task of updating each of the 100 devices individually.

The infrastructure is ready for the training stack

The following diagram illustrates our cluster deployment structure.

Deep Learning is one of the most significant aspects of Artificial Intelligence (AI). As connected devices proliferate, the amount of data collected from the world is growing exponentially, and Deep Learning continues to push companies toward solutions that offer better hardware acceleration and optimization for edge devices.

After extensive research, and as a result of our expert team’s decades of combined experience, we chose the most effective components for building, managing, controlling and monitoring a rack of 100 devices. This post has described how we set up the infrastructure to leverage a rack of 100 edge devices in order to collaboratively train, manage, deploy and run deep-learning models. The goal, of course, is to carry these results into the real world, where 100 may be only the starting point!

In our next post in this three-part series, we will cover the results of training our model on various types of edge devices (and benchmark them on an Intel stack).

Written by: Timor Kalerman and Nadav Tal-Israel

This series of articles about our real-world, SBC-based edge-device cluster was written by the Edgify DevOps team. Feel free to follow their efforts on our Facebook and LinkedIn pages.

Who we are

Edgify.ai has been researching distributed edge training for four years. We are building a platform (framework) that enables the training and deployment of machine-learning models directly on edge devices, such as smartphones, IoT devices, connected cars, healthcare equipment, smart dishwashers and more. We are committed to revolutionizing the privacy, information security, latency and costs associated with AI.

To report errors or issues with this post, contact us at nadav.israel@edgify.ai.

If you are interested in applying such approaches, contact us. We’d love to chat.

