Scaling Our Private Portals with Open edX and Docker
Ever since we launched, Cognitive Class has hit many milestones. From name changes (raise your hand if you remember DB2 University) to our 1,000,000th learner, we’ve been through a lot.
But in this post, I will focus on the milestones and evolution of the technical side of things, specifically how we went from a static infrastructure to a dynamic and scalable deployment of dozens of Open edX instances using Docker.
Open edX 101
Open edX is the open source code behind edx.org. It is composed of several repositories, edx-platform being the main one. The official method of deploying an Open edX instance is by using the configuration repo which uses Ansible playbooks to automate the installation. This method requires access to a server where you run the Ansible playbook. Once everything is done you will have a brand new Open edX deployment at your disposal.
This is how we run cognitiveclass.ai, our public website, since we migrated from a Moodle deployment to Open edX in 2015. It has served us well, as we are able to serve hundreds of concurrent learners over 70 courses every day.
But this strategy didn’t come without its challenges:
- Open edX mainly targets Amazon’s AWS services and we run our infrastructure on IBM Cloud.
- Deploying a new instance requires creating a new virtual machine.
- Open edX reads configurations from JSON files stored in the server, and each instance must keep these files synchronized.
While we were able to overcome these in a large single deployment, they would be much harder to manage for our new offering, the Cognitive Class Private Portals.
Cognitive Class for business
When presenting to other companies, we often hear the same question: “How can I make this content available to my employees?” That was the main motivation behind our Private Portals offer.
A Private Portal represents a dedicated deployment created specifically for a client. From a technical perspective, this new offering would require us to spin up new deployments quickly and on-demand. Going back to the points highlighted earlier, numbers two and three are especially challenging as the number of deployments grows.
Creating and configuring a new VM for each deployment is a slow and costly process. And if a particular Portal outgrows its resources, we would have to find a way to scale it and manage its configuration across multiple VMs.
At the same time, we were experiencing a similar demand in our Virtual Labs infrastructure, where the use of hundreds of VMs was becoming unbearable. The team started to investigate and implement a solution based on Docker.
The main benefits of Docker for us were twofold:
- Increase server usage density;
- Isolate services’ processes and files from each other.
These benefits are deeply related: since each container manages its own runtime and files, we are able to easily run different pieces of software on the same server without them interfering with each other. We do so with much lower overhead than VMs, since Docker provides lightweight isolation between containers.
By increasing usage density, we are able to run thousands of containers on a handful of larger servers that can be pre-provisioned ahead of time, instead of having to manage thousands of smaller instances.
For our Private Portals offering this means that a new deployment can be ready to be used in minutes. The underlying infrastructure is already in place so we just need to start some containers, which is a much faster process.
Herding containers with Rancher
Docker in and of itself is a fantastic technology but for a highly scalable distributed production environment, you need something on top of it to manage your containers’ lifecycle. Here at Cognitive Class, we decided to use Rancher for this, since it allows us to abstract our infrastructure and focus on the application itself.
In a nutshell, Rancher organizes containers into services and services are grouped into stacks. Stacks are deployed to environments, and environments have hosts, which are the underlying servers where containers are eventually started. Rancher takes care of creating a private network across all the hosts so they can communicate securely with each other.
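To make that hierarchy concrete, here is a minimal sketch of it as plain Python data classes. This is only an illustration of how the concepts nest, not Rancher’s actual API; the class names, the `containers_on` helper, and the `portal-acme` example data are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    host: str  # name of the host (server) the container was scheduled on


@dataclass
class Service:
    name: str
    containers: list[Container] = field(default_factory=list)


@dataclass
class Stack:
    name: str
    services: list[Service] = field(default_factory=list)


@dataclass
class Environment:
    name: str
    hosts: list[str] = field(default_factory=list)
    stacks: list[Stack] = field(default_factory=list)

    def containers_on(self, host: str) -> list[str]:
        """Return 'stack/service/container' paths for everything on a host."""
        return [
            f"{stack.name}/{service.name}/{container.name}"
            for stack in self.stacks
            for service in stack.services
            for container in service.containers
            if container.host == host
        ]


# Hypothetical example: one Portal stack spread over two hosts.
portal = Stack(
    name="portal-acme",
    services=[
        Service("lms", [Container("lms-1", "host-a"), Container("lms-2", "host-b")]),
        Service("cms", [Container("cms-1", "host-a")]),
    ],
)
env = Environment(name="production", hosts=["host-a", "host-b"], stacks=[portal])

print(env.containers_on("host-a"))
# → ['portal-acme/lms/lms-1', 'portal-acme/cms/cms-1']
```

The point of the model is that a service is not tied to a host: two containers of the same service can land on different servers and still belong to the same stack.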
Getting everything together
Our Portals are organized in a microservices architecture and grouped together in Rancher as a stack. Open edX is the main component and is itself broken into smaller services. On top of Open edX we have several other components that provide additional functionality to our offering. Overall, this is how things look in Rancher:
There is a lot going on here, so let’s break it down and quickly explain each piece:
- Open edX
  - lms: where students access course content
  - cms: used for authoring courses
  - forum: handles course discussions
  - nginx: serves static assets
  - rabbitmq: message queue system
  - glados: interface for admin users to control and customize the Portal
  - companion-cube: API that exposes extra functionality of Open edX
  - compete: service to run data hackathons
  - learner-support: built-in learner ticket support system
  - lp-certs: issues certificates to students who complete multiple courses
- Support services
  - lms-workers: executes background tasks for `lms` and `cms`
  - glados-worker: executes background tasks for `glados`
  - letsencrypt: automatically manages SSL certificates using Let’s Encrypt
  - load-balancer: routes traffic to services based on the request hostname
  - mailer: proxies SMTP requests to an external server, or sends emails itself otherwise
  - ops: group of containers used to run specific tasks
  - rancher-cron: starts containers following a cron-like schedule
- Data storage
The `ops` service behaves differently from the others, so let’s dig a bit deeper into it:
Here we can see that there are several containers inside `ops` and that they are usually not running. Some containers, like `edxapp-migrations`, run when the Portal is deployed but are not expected to be started again except in special circumstances (such as a database schema change). Other containers, like `backup`, are started by `rancher-cron` periodically and stop once they are done.
In both cases, we can trigger a manual start by clicking the play button. This lets us run important operational tasks on demand without having to SSH into specific servers and figure out which script to run.
One key aspect of Docker is that the file system is isolated per container. This means that, without proper care, you might lose important files if a container dies. The way to handle this situation is to use Docker volumes to mount local file system paths into the containers.
Moreover, when you have multiple hosts, it is best to have a shared data layer to avoid creating implicit scheduling dependencies between containers and servers. In other words, you want your containers to have access to the same files no matter which host they are running on.
Each Portal has its own directory on the shared NFS drive, and the containers mount only the directory of that specific Portal, so one Portal cannot access the files of another.
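As a rough sketch of the per-Portal mount, the snippet below builds the volume arguments you would pass to `docker run`, using the standard `-v HOST_PATH:CONTAINER_PATH` bind-mount syntax. The NFS mount point (`/mnt/nfs/portals`), the container path (`/data`), and the `portal_volume_args` helper are assumptions made for illustration, not our actual layout.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Assumed mount point of the shared NFS drive on every host (hypothetical).
NFS_ROOT = PurePosixPath("/mnt/nfs/portals")


def portal_volume_args(portal: str, container_path: str = "/data") -> list[str]:
    """Build the `docker run` volume arguments for a single Portal."""
    # Reject names that could escape the Portal's own directory.
    if not portal or "/" in portal or portal in {".", ".."}:
        raise ValueError(f"invalid portal name: {portal!r}")
    host_path = NFS_ROOT / portal
    # -v HOST_PATH:CONTAINER_PATH bind-mounts a host directory into the container.
    return ["-v", f"{host_path}:{container_path}"]


print(portal_volume_args("acme"))
# → ['-v', '/mnt/nfs/portals/acme:/data']
```

Because every host sees the same NFS path, the same arguments work no matter where the container is scheduled, and the name check keeps a Portal from pointing at a directory outside its own.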
One of the most important files is `ansible_overrides.yml`. As we mentioned at the beginning of this post, Open edX is configured using JSON files that are read when the process starts. The Ansible playbook generates these JSON files when executed.
To propagate changes made by Portal admins on `glados` to the `cms` of Open edX, we mount `ansible_overrides.yml` into the containers. When something changes, `glados` can write the new values into this file and `cms` can read them.
We then restart the `cms` containers, which are set to run the Ansible playbook and regenerate the JSON files on startup. `ansible_overrides.yml` is passed as a variables file to Ansible so that any values declared in it override the Open edX defaults.
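The override step can be pictured with a small sketch. It does not call Ansible or read YAML; it only mimics the idea of layering override values on top of defaults and emitting the JSON that a process would read on startup. The `render_config` helper and the keys and values below are hypothetical stand-ins, not real Open edX settings.

```python
import json


def render_config(defaults: dict, overrides: dict) -> str:
    """Return the JSON config produced by layering overrides on top of defaults."""
    # Later keys win, so any value present in overrides replaces the default.
    merged = {**defaults, **overrides}
    return json.dumps(merged, indent=2, sort_keys=True)


# Hypothetical values: stand-ins for the Open edX defaults and for what
# glados might write into ansible_overrides.yml.
defaults = {"platform_name": "Open edX", "default_language": "en"}
overrides = {"platform_name": "Acme Learning"}

print(render_config(defaults, overrides))
```

Keeping the overrides in one small file means the admin interface only ever writes the values that differ from the defaults, and everything else is regenerated from the playbook on the next restart.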
By having this shared data layer, we don’t have to worry about containers being rescheduled to another host since we are sure Docker will be able to find the proper path and mount the required volumes into the containers.
By building on top of the lessons we learned as our platform evolved and by using the latest technologies available, we were able to build a fast, reliable and scalable solution to provide our students and clients a great learning experience.
We covered a lot in this post and I hope you were able to learn something new today. If you are interested in learning more about our Private Portals offering, fill out our application form and we will contact you.