Scaling Our Private Portals with Open edX and Docker
Posted on July 18, 2018 by Luiz Aoqui
Ever since we launched, Cognitive Class has hit many milestones. From name changes (raise your hand if you remember DB2 University) to our 1,000,000th learner, we’ve been through a lot.
But in this post, I will focus on the milestones and evolution of the technical side of things, specifically how we went from a static infrastructure to a dynamic and scalable deployment of dozens of Open edX instances using Docker.
OPEN EDX 101
Open edX is the open source code behind edx.org. It is composed of several repositories, edx-platform being the main one. The official method of deploying an Open edX instance is by using the configuration repo which uses Ansible playbooks to automate the installation. This method requires access to a server where you run the Ansible playbook. Once everything is done you will have a brand new Open edX deployment at your disposal.
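As a rough sketch, a native installation with the configuration repo boils down to cloning it and running a playbook against the target server. The playbook name and variables below are illustrative and vary between Open edX releases:

```shell
# Hedged sketch of a native Open edX install; playbook names and
# variables differ between releases, so treat this as illustrative.
git clone https://github.com/edx/configuration.git
cd configuration/playbooks

# Run the playbook against the local machine
ansible-playbook -c local -i "localhost," edx_sandbox.yml \
  -e "EDXAPP_LMS_BASE=lms.example.com" \
  -e "EDXAPP_CMS_BASE=studio.example.com"
```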
This is how we have run cognitiveclass.ai, our public website, since we migrated from a Moodle deployment to Open edX in 2015. It has served us well: we are able to serve hundreds of concurrent learners across over 70 courses every day.
But this strategy didn’t come without its challenges:
- Open edX mainly targets Amazon's AWS services, while we run our infrastructure on IBM Cloud.
- Deploying a new instance requires creating a new virtual machine.
- Open edX reads configurations from JSON files stored in the server, and each instance must keep these files synchronized.
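To make the third point concrete: each instance reads files such as `lms.env.json` on start-up, and every server running that instance needs an identical copy. A hypothetical excerpt:

```json
{
  "PLATFORM_NAME": "Cognitive Class",
  "FEATURES": {
    "ENABLE_DISCUSSION_SERVICE": true
  }
}
```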
While we were able to overcome these in a large single deployment, they would be much harder to manage for our new offering, the Cognitive Class Private Portals.
COGNITIVE CLASS FOR BUSINESS
When presenting to other companies, we often hear the same question: “how can I make this content available to my employees?”. That was the main motivation behind our Private Portals offering.
A Private Portal represents a dedicated deployment created specifically for a client. From a technical perspective, this new offering would require us to spin up new deployments quickly and on-demand. Going back to the points highlighted earlier, numbers two and three are especially challenging as the number of deployments grows.
Creating and configuring a new VM for each deployment is a slow and costly process. And if a particular Portal outgrows its resources, we would have to find a way to scale it and manage its configuration across multiple VMs.
ENTER DOCKER
At the same time, we were experiencing a similar demand in our Virtual Labs infrastructure, where the use of hundreds of VMs was becoming unbearable. The team started to investigate and implement a solution based on Docker.
The main benefits of Docker for us were twofold:
- Increase server usage density;
- Isolate services' processes and files from each other.
These benefits are deeply related: since each container manages its own runtime and files we are able to easily run different pieces of software on the same server without them interfering with each other. We do so with a much lower overhead compared to VMs since Docker provides a lightweight isolation between them.
By increasing usage density, we are able to run thousands of containers on a handful of larger servers that can be pre-provisioned ahead of time, instead of having to manage thousands of smaller instances.
For our Private Portals offering this means that a new deployment can be ready to be used in minutes. The underlying infrastructure is already in place so we just need to start some containers, which is a much faster process.
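As a minimal illustration of both benefits, two containers can each listen on port 80 internally while sharing one host; the image and port choices here are arbitrary:

```shell
# Each container gets its own filesystem and network namespace,
# so both can bind port 80 internally without conflict.
docker run -d --name portal-a -p 8080:80 nginx
docker run -d --name portal-b -p 8081:80 nginx
```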
HERDING CONTAINERS WITH RANCHER
Docker in and of itself is a fantastic technology, but for a highly scalable, distributed production environment you need something on top of it to manage your containers' lifecycle. Here at Cognitive Class, we decided to use Rancher for this, since it allows us to abstract our infrastructure and focus on the application itself.
In a nutshell, Rancher organizes containers into services and services are grouped into stacks. Stacks are deployed to environments, and environments have hosts, which are the underlying servers where containers are eventually started. Rancher takes care of creating a private network across all the hosts so they can communicate securely with each other.
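In Rancher 1.x this hierarchy maps onto Compose files: a `docker-compose.yml` defines the services of a stack, and a `rancher-compose.yml` holds Rancher-specific settings such as scale. The names below are illustrative, not our actual configuration:

```yaml
# docker-compose.yml for a hypothetical "portal-acme" stack
version: '2'
services:
  lms:
    image: our-registry/edxapp:latest  # illustrative image name
  mysql:
    image: mysql:5.6
```

```yaml
# rancher-compose.yml: Rancher-specific settings for the same stack
lms:
  scale: 2
```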
GETTING EVERYTHING TOGETHER
Our Portals are organized in a micro-services architecture and grouped together in Rancher as a stack. Open edX is the main component and is itself broken into smaller services. On top of Open edX we have several other components that provide additional functionality to our offering. Overall, this is how things look in Rancher:
There is a lot going on here, so let’s break it down and quickly explain each piece:
- Open edX
  - `lms`: where students access course content
  - `cms`: used for authoring courses
  - `forum`: handles course discussions
  - `nginx`: serves static assets
  - `rabbitmq`: message queue system
- Add-ons
  - `glados`: admin user interface to control and customize the Portal
  - `companion-cube`: API that exposes extra functionalities of Open edX
  - `compete`: service to run data hackathons
  - `learner-support`: built-in learner ticket support system
  - `lp-certs`: issues certificates to students who complete multiple courses
- Support services
  - `cms-workers` and `lms-workers`: execute background tasks for `lms` and `cms`
  - `glados-worker`: executes background tasks for `glados`
  - `letsencrypt`: automatically manages SSL certificates using Let's Encrypt
  - `load-balancer`: routes traffic to services based on the request hostname
  - `mailer`: proxies SMTP requests to an external server, or sends emails itself otherwise
  - `ops`: group of containers used to run specific tasks
  - `rancher-cron`: starts containers following a cron-like schedule
- Data storage
  - `elasticsearch`
  - `memcached`
  - `mongo`
  - `mysql`
  - `redis`
The `ops` service behaves differently from the other ones, so let's dig a bit deeper into it:

Here we can see that there are several containers inside `ops` and that they are usually not running. Some containers, like `edxapp-migrations`, run when the Portal is deployed but are not expected to be started again except in special circumstances (such as when the database schema changes). Other containers, like `backup`, are started periodically by `rancher-cron` and stop once they are done.
In both cases, we can trigger a manual start by clicking the play button. This gives us the ability to easily run important operational tasks on demand, without having to SSH into specific servers and figure out which script to run.
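A scheduled service like `backup` can be sketched as follows, assuming a label-based schedule in the style of community rancher-cron implementations; the label name and schedule format depend on the implementation in use:

```yaml
backup:
  image: our-registry/portal-backup:latest  # illustrative image name
  labels:
    cron.schedule: "0 2 * * *"  # start every day at 02:00
```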
HANDLING FILES
One key aspect of Docker is that the file system is isolated per container. This means that, without proper care, you might lose important files if a container dies. The way to handle this situation is to use Docker volumes to mount local file system paths into the containers.
Moreover, when you have multiple hosts, it is best to have a shared data layer to avoid creating implicit scheduling dependencies between containers and servers. In other words, you want your containers to have access to the same files no matter which host they are running on.
In our infrastructure we use an IBM Cloud NFS drive that is mounted in the same path in all hosts. The NFS is responsible for storing any persistent data generated by the Portal, from database files to compiled static assets, such as images, CSS and JavaScript files.
Each Portal has its own directory in the NFS drive and the containers mount the directory of that specific Portal. So it’s impossible for one Portal to access the files of another one.
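With the NFS drive mounted at the same path on every host, this per-Portal isolation can be expressed as a plain bind volume; the paths below are illustrative:

```yaml
# Containers of the "acme" Portal only see that Portal's NFS directory
lms:
  image: our-registry/edxapp:latest
  volumes:
    - /mnt/nfs/portals/acme/edxapp:/edx/var/edxapp
```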
One of the most important files is `ansible_overrides.yml`. As we mentioned at the beginning of this post, Open edX is configured using JSON files that are read when the process starts, and the Ansible playbook generates these JSON files when it is executed.

To propagate changes made by Portal admins in `glados` to the `lms` and `cms` of Open edX, we mount `ansible_overrides.yml` into the containers. When something changes, `glados` writes the new values to this file, where `lms` and `cms` can read them.

We then restart the `lms` and `cms` containers, which are set to run the Ansible playbook and re-generate the JSON files on start-up. `ansible_overrides.yml` is passed to Ansible as a variables file, so any values declared there override the Open edX defaults.
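The override mechanism can be illustrated in a few lines of Python, as a simplified stand-in for what the Ansible playbook does: defaults are merged with the overrides file and dumped as JSON. The variable names and values here are hypothetical:

```python
import json

# Open edX defaults (illustrative subset; real defaults live in the playbooks)
defaults = {
    "PLATFORM_NAME": "Cognitive Class",
    "LMS_BASE": "lms.example.com",
}

# Values a Portal admin changed via glados, as stored in ansible_overrides.yml
overrides = {"PLATFORM_NAME": "Acme Corp Learning"}

# Overrides take precedence over defaults, mirroring how Ansible treats
# a variables file passed on top of role defaults
config = {**defaults, **overrides}

# The resulting JSON is what lms/cms would read on start-up
print(json.dumps(config, indent=2))
```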
By having this shared data layer, we don’t have to worry about containers being rescheduled to another host since we are sure Docker will be able to find the proper path and mount the required volumes into the containers.
CONCLUSION
By building on top of the lessons we learned as our platform evolved and by using the latest technologies available, we were able to build a fast, reliable and scalable solution to provide our students and clients a great learning experience.
We covered a lot in this post and I hope you were able to learn something new today. If you are interested in learning more about our Private Portals offering, fill out our application form and we will contact you.
Happy learning.
Tags: architecture, docker, IBM Cloud, private portals, rancher, scaling