Aether has been evolving rapidly since the launch of the pilot network in late 2019. This initial network was a simple platform consisting of just two clusters, one in the cloud and the other at the ONF office in Menlo Park, and a single application, OMEC. Building the entire software platform was a manual effort, but it was manageable given the network size.
In 2020 Aether expanded in size and scope with the addition of multiple Aether Connected Edges (ACEs), new components such as ROC, SD-Fabric and SD-RAN, and a larger development team. With this growth, the process of applying software updates to the Aether pilot network became more complex, more error prone and hence much slower. Each ACE took anywhere from two days to a full week to build, test and deploy. For example, when deploying applications, errors such as passing incorrect cluster configuration files or executing commands on unintended clusters became common. In addition, the pilot network often had failures after software upgrades were installed, caused by software bugs, configuration errors or broken version dependencies between Aether software components. As a result, both the operations and development teams had to spend significant time and effort manually resolving software update issues, taking time away from other tasks. It quickly became clear that this manual approach to software testing and updating was neither scalable nor efficient.
To address these challenges and accelerate reliable distribution of software updates, ONF invested in the development of an automated system to build, test and deliver most Aether components, from the infrastructure to applications. In addition, a standardized list of ACE equipment was created to simplify edge site equipment acquisition and installation and to reduce the test matrix. The combination of software installation automation and standardized ACE equipment resulted in immediate and significant time savings and reliability improvements. As a demonstration, once the automated system was in place, ONF smoothly and rapidly deployed 8 ACEs for the Pronto Project and an additional 6 ACEs at various enterprise sites. Each site took under a day (excluding hardware installation and bootstrap) to become fully functional. Additionally, once all testing was complete, software updates that included many new features were deployed to the core and all ACE sites of the pilot network in a matter of minutes, significantly increasing reliability and stability.
In this blog, we provide insights into the vision and current status of the Aether Continuous Integration/Continuous Delivery (CI/CD) system and the resulting benefits experienced in initial ACE deployment and ongoing operations. The benefits include a significant reduction in the time required to bring new ACE sites online and the ability to distribute software updates to multiple ACE sites concurrently and accurately. In addition, with CI/CD in place, software updates can be verified and reliably distributed, reducing the cycle time from development to deployment. We will also share some of the lessons learned in our journey so far.
Aether CI/CD/CD Vision
The current CI/CD pipeline is a method to deliver software rapidly and reliably by introducing automation and improved workflows. Successful CI/CD minimizes manual errors, accelerates feedback loops, and enables development teams to deliver software updates more frequently. To date, ONF has deployed CI/CD, with longer-term plans to add Continuous Deployment. The result would be a CI/CD/CD process in which software is automatically released from the repository to the production network, automating the last remaining manual step.
Current Aether CI/CD Pipeline
The diagram below shows Aether’s CI/CD pipeline. Code changes are deployed to the pilot network only after successfully passing all stages shown in the diagram. Each stage is configured to run a set of tests, and the code needs to pass all tests to move forward from one stage to the next. Throughout the pipeline, if an error is identified in any step, an alert and feedback are sent to the relevant team so that the issue can be addressed. The code is then resubmitted and the CI cycle restarts from the beginning. The last step in the pipeline is the deployment of validated changes to the pilot network, which at this time is manually initiated by the operations team.
Figure 1. Aether CI/CD Pipeline Overview
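The stage-gating behavior described above can be sketched in a few lines of Python. This is only an illustration and not the Aether implementation: the `Stage` and `PipelineResult` types, the `run_pipeline` function and the stage names in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    """A pipeline stage (hypothetical type): a name plus the tests it runs."""
    name: str
    tests: List[Callable[[str], bool]]  # each test takes a change ID, returns pass/fail


@dataclass
class PipelineResult:
    passed: bool
    failed_stage: Optional[str] = None  # None when every stage passed


def run_pipeline(change_id: str, stages: List[Stage],
                 notify: Callable[[str], None]) -> PipelineResult:
    """Run stages in order; stop and notify at the first stage with a failing test."""
    for stage in stages:
        # all() short-circuits, so the remaining tests in a stage are skipped
        # as soon as one of them fails.
        if not all(test(change_id) for test in stage.tests):
            notify(f"change {change_id} failed stage '{stage.name}'")
            return PipelineResult(passed=False, failed_stage=stage.name)
    return PipelineResult(passed=True)
```

For example, calling `run_pipeline("change-1", [Stage("pre-merge", [...]), Stage("staging", [...])], print)` only reaches the second stage if every pre-merge test returns `True`. A failed change never advances, which mirrors the rule that a resubmitted change starts the cycle again from the beginning.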
The first step in Aether’s CI process is for developers to submit code changes to a shared repository. The changes are validated by an automated process that builds the code and runs multiple tests. This enables bugs to be identified faster, improving software quality and reducing the cycle time between code submission and release.
Aether is composed of multiple open source software components that are hosted on GitHub and Gerrit, including SD-Core, SD-Fabric and SD-RAN. Although each project has its own CI system and uses different tools such as Jenkins, Travis CI, and CircleCI, the pipeline configurations are similar and can be divided into two parts, Pre-Merge and Post-Merge.
Figure 2. CI Workflow
The Pre-Merge Jobs include a set of tests, typically unit and integration tests, as well as file format and license linters. They are triggered when a developer uploads changes to a repository for review, and they leave a positive vote on the change if all tests pass. A change can merge only if both the Pre-Merge Jobs and the reviewers vote positively on it. Running more tests in this stage not only helps identify bugs faster but also gives the developer rapid feedback on each commit, which helps to reduce cycle time. In most Aether projects, Pre-Merge Jobs run in a containerized or virtualized environment, allowing multiple commits to be tested simultaneously and helping to reduce test times for each commit.
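As a rough sketch of this merge rule, the check below requires positive votes from both the automated jobs and the human reviewers. The `Change` type and `can_merge` function are hypothetical, and the exact policy is an assumption (every job must vote positively, at least one reviewer must approve, and any negative review blocks the merge). Real projects configure these rules in Gerrit or GitHub and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Change:
    """Votes collected on a change under review (hypothetical structure)."""
    ci_votes: Dict[str, int] = field(default_factory=dict)      # job name -> +1 or -1
    review_votes: Dict[str, int] = field(default_factory=dict)  # reviewer -> +1 or -1


def can_merge(change: Change) -> bool:
    """Allow a merge only when both CI jobs and reviewers have voted positively."""
    # Assumption: at least one Pre-Merge Job has reported, and all of them passed.
    ci_ok = bool(change.ci_votes) and all(v > 0 for v in change.ci_votes.values())
    # Assumption: at least one approval, and no reviewer has voted negatively.
    votes = list(change.review_votes.values())
    review_ok = any(v > 0 for v in votes) and not any(v < 0 for v in votes)
    return ci_ok and review_ok
```

Under these assumptions a change with passing jobs but no review, or with approvals but one failing linter, stays unmerged until the missing vote arrives.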
Once a new change is merged, a Post-Merge Job is triggered to make new artifacts available for the next stages in the CI/CD pipeline. As Aether is a Kubernetes-based platform, all Aether applications are expected to run as containers, so the Post-Merge Job includes building the application container image and publishing it to the image repository. For Helm chart repositories, it publishes a new Helm chart to the Helm registry.
The CD process automates pushing of code changes verified by the CI process to one or more environments where the changes are staged for final testing. At the successful conclusion of testing, the operations team manually deploys the software into the Aether pilot network by clicking a single button. The vision is to eventually automate this final step and add Continuous Deployment to the current CI/CD process wherein the software is released into the production network without human intervention.
Every stage in the Aether CD pipeline involves delivery automation and test automation. To achieve delivery automation, the CD process leverages the GitOps pattern: a Git repository serves as the source of truth for declarative infrastructure and applications, and an automated process ensures that the actual state of the infrastructure and applications converges towards the desired state declared in the repository. This is considered best practice for modern application development for many reasons, including increased productivity, security, and reliability.
There are two parties that can create pull requests for this repository, commonly referred to as the “environment repository”. One is the CI pipeline and the other is the operations team. The CI pipeline creates a pull request at the end of a Post-Merge Job to update the next environment in the pipeline so that a recently published application image or Helm chart can be tested there. The operations team can also make changes to the repository when adding new ACE sites or updating infrastructure or application configurations. Changes made to this repository trigger a Jenkins pipeline that rolls them out to the target environments. The last job in the pipeline reports to the operations team whether the deployment was successful, so that the team can immediately roll back to the previous software version if the upgrade failed.
Figure 3. CD Workflow
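The convergence idea behind this workflow can be illustrated with a minimal Python sketch. It is not the Jenkins pipeline itself: `plan`, `reconcile` and the representation of state as a mapping from application name to version are all hypothetical. The sketch also returns the previous state automatically when the health check fails, whereas in Aether the rollback decision is taken by the operations team.

```python
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, str]                # application name -> version (assumed model)
Action = Tuple[str, Optional[str]]    # ("install" | "upgrade" | "remove", version)


def plan(desired: State, actual: State) -> Dict[str, Action]:
    """Compute the actions that move the actual state to the desired state."""
    actions: Dict[str, Action] = {}
    for app, version in desired.items():
        if app not in actual:
            actions[app] = ("install", version)
        elif actual[app] != version:
            actions[app] = ("upgrade", version)
    for app in actual:
        if app not in desired:
            actions[app] = ("remove", None)
    return actions


def reconcile(desired: State, actual: State,
              healthy: Callable[[State], bool]) -> Tuple[State, bool]:
    """Apply the plan to a copy of the state; keep it only if the health check passes."""
    candidate = dict(actual)
    for app, (action, version) in plan(desired, actual).items():
        if action == "remove":
            candidate.pop(app)
        else:
            candidate[app] = version
    if healthy(candidate):
        return candidate, True
    # Health check failed: report failure and fall back to the previous state.
    return dict(actual), False
```

With `desired` read from the environment repository and `actual` read from a cluster, an empty result from `plan` means the environment has already converged and nothing needs to be rolled out.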
Infrastructure as Code (IaC) tools such as Terraform help keep the delivery pipeline simple, as they provide a consistent way to express the desired state of Aether infrastructure such as GCP, AWS and Rancher as code and to apply the changes needed to reach that state. In addition to Terraform, Ansible is used in the delivery pipeline to install software packages such as a VPN server and a router daemon on the ACE management server.
Tests in the CD stage include stress, regression, soak and integration tests. The goal is to make sure that no serious bugs remain and that the code is stable enough for deployment to the pilot network. Some tests are triggered by Jenkins jobs scheduled at specific times of day, whereas others, such as soak tests, run continuously for at least 3 days.
As the number of ACEs increased, we realized that scalability, controllability and visibility are also very important. Therefore, we are in the process of replacing the application delivery portion of the current implementation with Fleet, Rancher’s built-in CD system. Fleet describes itself as GitOps at scale and is designed to manage up to a million clusters with great visibility and control. We expect Fleet to help resolve some of the challenges we currently face.
Monitoring is also essential for successful delivery so that the operations team can immediately be alerted to any failures during the deployment or errors being produced in the software. In Aether, we use Prometheus extensively to monitor every part of the pilot network from node level to service level.
As Aether continues to evolve and expand, ONF is continuing to refine the CI/CD processes, tests and tools to improve the reliability of the software and the efficiency of its distribution. CI/CD has provided significant benefits by accelerating code development, testing and delivery while improving reliability. As we integrate Continuous Deployment into the pipeline, we expect to further increase operational efficiency and strengthen the resiliency of the Aether pilot network.