Graham Ballantyne – You’ve Installed Open Source Canvas, Now What?

Graham Ballantyne: All right. Good morning.
Welcome to InstructureCon. It’s awesome to be the first guy up, it means I get to
enjoy the rest of the conference. So I’m going to be telling you a story today, in
three acts, about how Simon Fraser University implemented and scaled up Open Source Canvas.
We’re going to talk about a few things. We’re going to look at the production start guide
and how you can go beyond it, how you can keep up with Instructure’s release cycle,
and some care and feeding tips to make sure your Canvas is nice and happy. Before we get started, this is going to be
a pretty high-level talk. We’ve only got 30 minutes, so we’re not going to go too deep into any
one topic. I’m going to be around at the Hack Night tonight. I’ve got two of my colleagues
here as well. So, if I can’t answer your question afterwards, or if there’s something
I don’t cover, please come and find us and we’re more than happy to share what we’ve
learned. This is also what worked for us at SFU. There’s
tons of different ways you can implement Canvas. This is what, in our experience,
we found worked, and hopefully you can take some of this back to your institution and
make it work for you. And it’s still evolving. We’re always finding new ways to tweak this
and make it even more efficient. So Simon Fraser, we’re in Vancouver, BC
and we’re a midsize comprehensive university. We’ve got about 30,000 students, 6500 faculty
and staff and we love Canvas, we’re big fans. We previously were a WebCT shop. We were a
WebCT shop for years and years and years, right back to when it started just down the road
at UBC. It was a Perl application and we’d been using it for a really long time. And
we got notice from Blackboard that said, we’re going to be end-of-lifing this thing that
you’ve been using for way too long, you need to get off of it. We’re going to turn
it off. So we started looking around and we did a whole RFP process, looked at all the
usuals. We looked at, you know, the ones you’d expect: the corporate ones like Blackboard
and D2L, open source ones like Moodle. And partway through that process, we heard about
this thing called Canvas. We hadn’t heard of it before so we spun it up and played with
it, we went oh this is pretty cool. And the rest of the selection group looked at it and
went this is pretty cool and a very rare thing happened in academia, we had unanimous support
for it from all, across all campus groups. Everyone wanted us to go with Canvas. So we
did. This was our implementation timeline. Pretty rapid. We went from 0 to 25,000 students
in 5 terms. So, right from selecting Canvas in the Fall of 2012, we jumped right into
a pilot phase the next term with a few credit courses, actual live courses, and then into a
coproduction phase with WebCT, so we were running two systems in parallel at that time.
And during that time we exceeded our peak WebCT usage. We had more courses in Canvas
than we’d ever had in WebCT before. So yeah, we blew away our WebCTs just in the first
couple of terms. We weren’t even in full production yet. The last term, spring 2014, which just ended in April,
was our full production term. WebCT was retired, shut off. We had just under or just
over 24,000 students, just under a thousand courses and just under a thousand instructors
on it. And we did it all with a really small technical team. This is us. There’s three
of us here, a few more back home keeping things running. Seven of us, including our team lead,
business analyst, developers and operations guys. We had some support from our shared infrastructure
groups for storage and database admin, and none of us work solely on Canvas. We’re
all working on at least one other major system whether it’s our e-mail system which is
also an open source product or our content management system for our web stuff. So we’re
all splitting our time between all of these different tasks. And so you might be asking, why did you go
Open Source? Why didn’t you just use the Cloud version? And for us it wasn’t cost.
We’re paying 7 full time employees to keep this thing running. The main reason for us
was a legal one. British Columbia has very strict privacy laws. They say we cannot compel
students to use non-Canadian hosted services. So that means anything on Amazon is out. Anything
on Google is out. We can’t compel them as part of their courseware to do it. They can
opt in if they give informed consent, but if they’re taking a credit course and the online
option’s the only way to do it, that’s compelling them to do it. So, we had to go
Open Source and run it ourselves. We’ve got a long history of doing that at SFU. We’ve
brought in lots of Open Source products. We do in-house development as well. So it was
kind of a natural fit for us. So, show of hands. Who here has ever installed
Open Source Canvas? Okay. Keep them up. Put it, okay. Who is running it in a production
environment? Yeah. Non-Instructure employees, non-SFU employees. Who is looking to do it?
Okay, good. So you’ve probably seen this production
start guide that Instructure has on the Github Wiki. It’s a very thorough document. It
will walk you through installing the Open Source version of Canvas in a production environment
and it’s great. We used it when we were going through the whole setup process. But
it makes some assumptions, assumptions that may not be accurate for your environment.
For example, it assumes you’re going to be using Ubuntu, specific versions of Ubuntu.
We’re not, we’re a Red Hat shop. All of our core Linux stuff is on Red Hat Enterprise
Linux. So we had to figure that part out. We kept detailed notes. We tried doing Ubuntu.
We found it was easier to get it running on Red Hat, probably because we just had the experience
with it. We had deep technical knowledge of Red Hat from our infrastructure guys. So
you’ve got to keep that in mind. The production start guide is not the be-all and end-all; you can go
beyond it and kind of tweak it to be what you want. The guide makes mention of having
multiple servers but that implementation is left up to you to figure out how you’re
going to do it. And you could install Canvas on one machine serving Rails, web stuff, database,
delayed jobs, [indiscernible] [00:06:12]. You don’t want to do that. You have no redundancy.
If something goes wrong, that’s it. Your one machine is gone. What are you going to
do? How do you take that guide and scale it up to what we call real production? First step is to own the code. The production
start guide gives you instructions for deploying from the Instructure GitHub repo. I’m going
to tell you not to do that. What we did is we created our own GitHub organization. So
we’re We forked that repository so now we have a complete copy of that repo.
There is no connection to the Instructure one. It’s our code. We can do whatever we
want with it. Step three. Step four, except it’s Open Source so there’s really no
profit in it. Why would you want to do this? Well, like I
said, you’ve got a copy of that code now, so if Instructure becomes evil, gets bought
by Blackboard or otherwise goes rogue, you’ve got that code. You can do what you want with
it. You could continue to run it. You’re not beholden to them with that repository
staying alive. And you could make modifications to the code if you want and not have to deal
with putting them back when Instructure updates their repository. And you can contribute code
back. One of the great things about Open Source is if you see something you want to change,
or you’ve got a feature that’s not there, you can easily build it in and try and submit
it back. Now if that interests you, I would encourage you to go check out my colleague Andrew
Leung’s session tomorrow. He’s going to be talking about how you can do that. We’ve
put a lot of pull requests back in to Instructure. Andrew’s probably had the most of them and
most of them have been accepted. So, check that one out tomorrow. So, back to separation of concerns and having
multiple servers. Like I said, you could do it all in one box. You probably don’t want to
do that. This is us. This is SFU’s Canvas infrastructure. So, we’ve got a load balancer
at the top. We’ve got what we call the app nodes and we’ve got 8 of them. They’re
in a pool on our load balancer and those are what serve user-facing requests. So when you
go to you’re hitting one of those 8 machines. The 7 next to it are what
we call the file nodes and those handle our file domains so and so
anytime you’re doing a file upload or download it’s hitting those boxes. We had them combined
at the start. We have some pretty file-intensive courses so we decided to split them. We saw
a better distribution of load between them. So the management nodes off to the side, there’s
2 of them, those are where we run our delayed jobs. They’re not in any pool on a load
balancer, not serving any traffic, so we’re doing delayed jobs to do our overnight enrollments,
CSV enrollments and things like that on those machines. This keeps the load off of the user-facing
machines. And we’ve got 5 hot spares just sitting there. They’re VMs. All of
these are VMs. They’re all configured identically. They have the same code. We’ve got 5 of
them just sitting there in a ready state and if we get a sudden spike in traffic we can
just dump them into any of the pools and that helps scale it out horizontally. We’ve got a
bunch of shared infrastructure underneath. So we’ve got shared storage. Canvas, out
of the box, uses S3 to host the assets. Again, we can’t use Amazon stuff so we’ve got
mount points from our NetApps on all the machines. We’re using PostgreSQL. If you’re
using MySQL and you haven’t heard, you should probably stop using MySQL for Canvas; they
are going to be taking that out real soon. We’ve been pretty happy with PostgreSQL from
the start. And then, Redis cache. Canvas makes extensive use of caching. It says it’s optional.
It really isn’t. If you want to find out that it’s not optional, turn it off in production
and see what happens. You’ll see your CPU start to go way up. That’s us. That’s
our infrastructure. So, we’ve built it up. It’s running and
like any software you probably want to keep it up to date. Instructure has a release cycle
and they are always working on the code. They’re always improving it and it’s a version-less
system. They’re not putting out this is the 1.0, this is the 2.0. They’re just pushing
code like a fire hose of code and you want to keep up to date with it. And to do that
you got to understand the Instructure release cycle. So Instructure pushes a new release
every 3 weeks on Saturday. And then they do incremental hot fix releases in between. They
put all the releases on GitHub. They tag them with a release tag. So this is the May 3rd
release and you’ll see that .12 at the end. Before it, there’s a whole bunch. After
it there’s a whole bunch more. So as they’re working on it, as they’re working on whatever
code is going into that release, they’re releasing it through their system and then
ultimately up to GitHub and it gets tagged like this. I haven’t found a way to actually figure out
which of these tags is the one they’re pushing to production on any given Saturday. I know
there’s some Instructure folks at the back. If you can help me with that, talk to me after
because right now we’re just kind of picking one when we do our merge. We just say, this
one looks good, let’s try that and see what happens. So, if you can help us out there
that would be awesome. So that’s Instructure. At SFU we wanted to keep up with it as well.
We wanted to be in sync as much as we could with Instructure, just so that when
we do upgrades, we’re not doing massive upgrades. We’re just doing little incremental
ones. We’re not making huge changes. So, a similar 3-week cycle. I couldn’t find an
animated GIF of our mascot. He’s a Scottie dog. But, so we have the same 3-week cycle.
We do ours on Friday and we stay one release behind the Cloud. We let the paying customers
be the guinea pigs. So when Instructure did their May 3rd release, on
the previous Friday we did the April 12th release. And then 3 weeks later, we’re doing
the May 3rd release. So we’re just one release behind. And so it’s a pretty crazy cycle.
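
The cadence described here can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes cloud releases land exactly every 21 days on a Saturday, which the talk describes but does not guarantee. The dates are the ones mentioned in the talk.

```python
from datetime import date, timedelta

# Assumption: cloud releases are exactly three weeks apart.
RELEASE_INTERVAL = timedelta(weeks=3)


def local_deploy_plan(cloud_release: date) -> tuple[date, date]:
    """Return (deploy_day, release_to_deploy) for one cloud release Saturday.

    Models the policy from the talk: deploy on the Friday just before the
    cloud release, and deploy the previous cloud release (one behind).
    """
    deploy_day = cloud_release - timedelta(days=1)
    release_to_deploy = cloud_release - RELEASE_INTERVAL
    return deploy_day, release_to_deploy


deploy_day, release = local_deploy_plan(date(2014, 5, 3))
print(deploy_day, release)
```

For the May 3rd cloud release this gives a Friday, May 2nd deploy of the April 12th release, which matches the deploy mentioned later in the talk.
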
I mean, some of our other systems we might touch once every year or two and they’re,
you know, major point release kind of things, so this is a new experience for us. So we
had to develop some workflows for doing it. The first thing we did was decide to
go with 3 branches in our GitHub repo. The first one we call Edge, that’s our bleeding
edge branch. It’s where SFU code and Canvas code meet for the first time and start to
get to know each other, and that’s where we pull in. When we pick that, we just reach
in the bag of release tags and pull one out and say this is the one we’re going with.
That’s where we put it in first. We do some technical testing there. And then we push
it into our develop branch and that’s where our developers do our development. We’re
doing stuff on Canvas. We’ve got our own plug-ins. We’re sometimes modifying core
code. So we do it on that develop branch. And then finally we’ve got the deploy branch
and that’s a clean branch. That has to be clean. It must be deployable to production at
any time. So if we have to spin up a new VM and push code to it, that branch is sort of our
master, it’s clean. And so, the process kind of goes a little bit like this for getting
stuff in. I guess at first we’ve got to figure out which release tag we’re going with.
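
Picking "the most recent tag for a given release" can be sketched as follows. The tag shape used here (`release/YYYY-MM-DD.NN`) is an assumption: the talk only confirms that tags end in a counter such as `.12`. The function name and sample tags are hypothetical.

```python
import re

# Assumed tag shape: "release/2014-05-03.12" (release date, then a counter).
# Only the trailing ".12" is confirmed by the talk; the rest is a guess.
TAG_PATTERN = re.compile(r"^release/(\d{4}-\d{2}-\d{2})\.(\d+)$")


def latest_tag_for_release(tags, release_date):
    """Return the tag with the highest counter for release_date, or None."""
    best = None
    best_counter = -1
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if not match or match.group(1) != release_date:
            continue  # not a release tag, or a different release
        counter = int(match.group(2))  # compare numerically, so .9 < .12
        if counter > best_counter:
            best, best_counter = tag, counter
    return best


sample_tags = [
    "release/2014-04-12.20",
    "release/2014-05-03.09",
    "release/2014-05-03.12",
    "release/2014-05-03.2",
    "stable",
]
print(latest_tag_for_release(sample_tags, "2014-05-03"))
```

The counter is compared as an integer rather than as text, because a plain string sort would rank `.2` above `.12`. This only finds the newest tag; it does not answer which tag actually went to production on a given Saturday.
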
So usually on the Monday after we do our release, we do our release on Friday, Monday I come
in and I just see which is the most recent one from the release we’re going with, I
grab that one. I pull it in and I start a merge and it goes kind of like this and it’s
really hard for you to see on this screen. There’s a huge block of red text at the
bottom of that screen. Those are merge conflicts. Those are merge conflicts in files we have
never touched. I can guarantee you I’ve never touched the Turkish language localization
file and this is something we’ve been trying to battle with for a while. So again, if there’s anyone
from Instructure who can give us a hand. We’re probably doing Git wrong because the default
state for using Git is to do it wrong. So, every Monday I’m always just going, yeah
this one’s theirs, this one’s theirs, no that one’s ours, this one’s theirs, this one’s
theirs. So, again, find me at the Hack Night if you can help me figure this one out. So once we’ve got that merge resolved and
it’s up on our Edge cluster, we start doing our technical test and it’s pretty quick.
It’s really just: does all our stuff still work? Do our mods work? Do our plug-ins work?
Did all that stuff come through cleanly? Did we have a bad merge? Did I accidentally clobber
one of our mods when I brought in an Instructure commit? After that we release it for functional
test. Those are our partners in the Teaching and Learning Centre and the Centre for Online and Distance
Education, and they’re looking to see: does all of their stuff, does all of the Instructure
stuff still work? Is everything working as documented? Do we need to change documentation?
Do we need to generate new documentation because of new features? How do these new
features work? And once we’re all satisfied, we’ve got a go, we push to production and then
we start the whole thing again the next week. So it’s a pretty rapid cycle and we’ve
tried to automate it as much as we possibly can just to keep some semblance of sanity.
We used a bunch of different tools and techniques for that. And I talked about GitHub earlier.
The concept of forks. So we have our fork and our developers have their own forks as
well. So, we’ve got the Instructure repo forked off to SFU and then I have
That’s a fork of the SFU repo and that’s where we do all our development. We do it
in those forks and then when, say, I’m working on a new feature or a bug fix,
once I’m satisfied it’s ready to go, I will do a pull request into the SFU repository
and then it goes into our admittedly fairly lightweight code review process. Our policy
is that any code we’re putting in has to get looked at by two other sets of eyes, so not
the person who wrote it, and they’ve got to give a thumbs-up in the GitHub product.
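
The review rule is simple enough to state as a predicate. A minimal sketch with hypothetical names; it does not call GitHub and is not how SFU enforces the policy, just the rule written down:

```python
def ready_for_production(author, approvers):
    """True when at least two people other than the author have approved."""
    independent = set(approvers) - {author}  # the author's own thumbs-up does not count
    return len(independent) >= 2


print(ready_for_production("author", ["reviewer_a", "reviewer_b"]))
print(ready_for_production("author", ["author", "reviewer_a"]))
```

Using a set means the same reviewer approving twice still counts once.
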
So two thumbs goes to production. And that’s a horribly blurry photo of what
it kind of looks like. This was a pull request and it’s going through the code review process
and, you know, we’ve got comments happening on individual lines of code: is this the
right way to do it? Can we do it a different way? I think this is a bug, whatever. And
it’s been working fairly well for us. Bamboo is our continuous delivery and deployment
system. It’s what we use to do our deploys. It handles a bunch of our different production
systems. Kind of looks like that. It just gives us one click access to doing a deploy.
So I can, I could right now if I was insane go to Bamboo and hit deploy to production
and it would just start spitting code out to all our production servers and it gives
us one place to look for the logs for all those deploys. If there are any failures they
will show up there and it will start alerting us of the failure. Capistrano is a remote server automation and
deployment tool. It’s written in Ruby. It gets used a lot for Ruby on Rails applications.
You can use it for anything though. We use it to deploy Node apps as well. And it just basically
runs code in parallel on many servers over SSH and it uses recipes to do it. So you define
a recipe or a task and you say, so this task is the load notifications task, run this
command on every server to do this thing. And Capistrano just runs these in order, in
parallel. So, Bamboo checks out the code and just says cap deploy and that
just starts the process of spitting out code to all the servers. And it gives us versioned
releases. So it deploys into a date-stamped directory and it just moves the symlink around.
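
The date-stamped directory plus symlink idea can be illustrated with a small Python sketch. This is not Capistrano's implementation; the `releases` and `current` names follow its usual layout, but the functions are hypothetical and the code assumes a POSIX filesystem where renaming over a symlink is atomic.

```python
import os
import tempfile
from datetime import datetime


def _point_current_at(base_dir, target):
    # Build the new link under a temporary name, then rename it over the
    # old one so that 'current' never disappears mid-deploy.
    tmp_link = os.path.join(base_dir, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, os.path.join(base_dir, "current"))


def deploy_release(base_dir, stamp=None):
    """Create releases/<timestamp> and point the 'current' symlink at it."""
    stamp = stamp or datetime.now().strftime("%Y%m%d%H%M%S")
    release_dir = os.path.join(base_dir, "releases", stamp)
    os.makedirs(release_dir)
    _point_current_at(base_dir, release_dir)
    return release_dir


def rollback(base_dir):
    """Re-point 'current' at the release before the one it points to now."""
    releases_dir = os.path.join(base_dir, "releases")
    releases = sorted(os.listdir(releases_dir))  # timestamps sort oldest first
    current = os.path.basename(os.path.realpath(os.path.join(base_dir, "current")))
    index = releases.index(current)
    if index == 0:
        raise RuntimeError("no earlier release to roll back to")
    previous = os.path.join(releases_dir, releases[index - 1])
    _point_current_at(base_dir, previous)
    return previous


with tempfile.TemporaryDirectory() as base:
    deploy_release(base, "20140412120000")
    deploy_release(base, "20140503120000")
    rollback(base)
    print(os.path.basename(os.path.realpath(os.path.join(base, "current"))))
```

Because old release directories are left in place, rolling back is only a matter of moving the link; nothing has to be re-copied.
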
So it gives us the ability to roll back to a previous release. So if something goes horribly
wrong, deploy fails, we can just roll it back right back where we started. Phusion Passenger. If you’ve installed Canvas
you’re probably familiar with it. It’s the app server recommended in the production
guide. It has an Open Source version and we were using that for quite some time. We ended up
going with the enterprise version for a few reasons. One of the big features
in the enterprise version is rolling restarts. So you’ve probably restarted Canvas before and
it takes a little while for it to start up. And when we’re doing updates, our previous
method of doing updates before we went to this was we were doing them at 10 o’clock
at night. We would put up a “we’re closed, go away” page and do the deploy, because it would
start up and would take a couple of minutes for it to come up and we didn’t want students
to be, like, hanging while that was happening. The enterprise version does a rolling restart.
So as it shuts down workers it starts up new ones and there’s really no downtime visible
to the student. This has enabled us to do deploys at noon on a Friday which is awesome
because all of the staff are in the office. Finally, this is our junk drawer. /usr/local/canvas
is a mount point we have on all our machines, it’s the same on every machine in any cluster
whether it’s a test cluster, or production, or staging cluster and it’s just really,
it’s a place for us to stick stuff we need on all those machines including our configuration
files. Config files are one thing you don’t want to have in your GitHub repo because they
probably have credentials in them. You don’t want your database credentials sitting on
a public repository somewhere where someone could get a hold of them. So, we put them
all in here. They all have a prefix on the filename that says it’s the production one
or the test one or staging one, and when the machine boots up it runs a script. It looks at
its hostname, determines, fine, I’m a production machine, I need to go grab all the production
config files and put them in the right place. That script runs during a deploy as well.
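
A boot-time script of this kind might look roughly like the sketch below. Everything specific in it is an assumption: the talk does not give the hostname scheme, the exact prefix format, or the file names, so `ENVIRONMENTS`, the `<env>-<name>` pattern and the sample hostname are all hypothetical.

```python
import os
import shutil
import tempfile

# Hypothetical hostname markers; the talk does not give the naming scheme.
ENVIRONMENTS = ("production", "staging", "test")


def environment_for(hostname):
    """Guess the environment from a marker embedded in the hostname."""
    for env in ENVIRONMENTS:
        if env in hostname:
            return env
    raise ValueError(f"cannot determine environment for {hostname!r}")


def install_configs(shared_dir, config_dir, hostname):
    """Copy '<env>-<name>' files from shared_dir into config_dir as '<name>'."""
    prefix = environment_for(hostname) + "-"
    installed = []
    for filename in sorted(os.listdir(shared_dir)):
        if filename.startswith(prefix):
            target = filename[len(prefix):]  # strip the environment prefix
            shutil.copyfile(os.path.join(shared_dir, filename),
                            os.path.join(config_dir, target))
            installed.append(target)
    return installed


with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as config:
    for name in ("production-database.yml", "staging-database.yml"):
        with open(os.path.join(shared, name), "w") as handle:
            handle.write("# placeholder, no real credentials\n")
    print(install_configs(shared, config, "canvas-production-app1"))
```

Keeping the prefixed files on the shared mount and out of the repository means the same code checkout can be deployed to any cluster, with credentials chosen at boot or deploy time.
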
There’s other utilities on there that give us the ability to run code on every node or
get stats on Passenger and things like that. Okay, so, the last part. You’ve got it installed.
You’re updating it on a fairly rapid cycle. You need to
keep this thing running. It’s a production system. Your students are using it, your faculty
are using it. So how do you keep your Canvas happy? This was a huge win for us when we
did Canvas. We have a lot of experience at SFU keeping large, complex apps running for
tens of thousands of people. We had zero Ruby on Rails experience. There’s one guy in
the office who wasn’t on the Canvas team who’d used Rails before, that was it. The
Canvas team had nothing. So, sticking with what we knew let us concentrate on what we didn’t
know, so that’s why we decided to stick with Red Hat instead of having to go to Ubuntu,
sticking with Apache instead of using Nginx. We have experience on all those things so
we could spend our time building up Rails experience, figuring out how Canvas works, figuring
out the Ruby ecosystem. Beware core mods. I mentioned Andrew’s session
earlier about being able to contribute code back to Canvas. One of the awesome things
about Open Source is you’ve got the code and you can modify it and you can make it
do whatever you want. One of the terrifying things about Open Source Software is you’ve
got the code and you can modify it and you can make it do whatever you want. Once you
do that, you own that Mod, it’s yours and if Instructure comes along and stomps all
over it, you’ve got to figure out what to do about it. So, before you go down the road
of touching core code, you want to ask yourself, can I do this with the API, can I write an
LTI to do it? Could I even use custom CSS or JavaScript or a vendor plug-in or whatever’s
replacing those in Rails 3 and 4? Sometimes you might be able to use one of those tools.
Sometimes you might actually just have to get in there and get your hands dirty in the
core code, and if you’re going to do that, it’s a really good idea to reach out to
Instructure, to the community, and say this is what I want to do, are you guys considering
this already, or would this be something you’d consider as a pull request? Because a lot
of times, they either are considering it or it’s something they would consider bringing
in because it may have a benefit to the greater community. And if they do that, it’s awesome
because then it’s their problem, it’s not your problem anymore. All right. Finally, keeping an eye on things.
So production system, you’ve got to monitor it. We use a bunch of tools for this. One
of our main ones is Xymon. It’s your, you know, bog-standard systems monitoring tool
looking at CPU and disks and things like that. And it has custom probes; we
have one that looks at the number of failed jobs in the delayed jobs queue and will alert
us if that gets above a certain level. It’s ugly. Fortunately we never really have to
look at the UI for it. We use it to get e-mail alerts and stuff so if a system goes down
we get paged. LogStash is a log indexing and searching tool.
Canvas puts out a lot of log output in a couple different places and there’s a ton of information
in there that can help you. It’s really hard to dig through that information. So LogStash
takes that. It will parse it. It can sort it out. You can give it custom queries and it will
pull out interesting metrics and that’s all we’re really using it for right now.
We’re making pretty light use of it, and mostly we’re taking the metrics we care about out of all those
logs and we’re feeding them into our Graphite and StatsD system. Graphite’s a system for
storing and rendering time series graphs and StatsD is just a really thin daemon that sits
on top of that but aggregates stuff and gives you some extra types of stats that Graphite
doesn’t natively support, and Canvas has built-in support for StatsD. There’s just a config
file you modify. You put in your StatsD server and you’re off to the races. All of these feed into dashboards. We are
addicted to dashboards at SFU. We love them. We got them everywhere. We use a framework
called Dashing, which is from the folks at Shopify, and it makes it fairly easy, and it’s
free, to create custom dashboards. This is our Canvas dashboard. It’s sped up a little
bit. If blinking lights bother you, you might want to turn away. So this is our main Canvas dashboard.
This is actually running during one of our deploys on May 2nd when I was deploying the
April 12th release, and this is why it’s starting to go red. So, we’ve got CPU
monitors and Passenger queue monitors for all production app nodes over on this side here.
We’ve got graphs for things like hits per hour for Rails and Apache.
We’ve got some stuff coming out of Google Analytics at the top, the active web users.
This was a really slow day. That’s coming out of Google’s real time API. And then
other things like just GitHub stats and our Bamboo stats. And these have been really useful for us.
We’ve got them everywhere. This one’s on a giant TV in our common area.
It’s right by where we all eat lunch, we all walk by it on our way to our offices, it’s
right outside the boss’s office, which is not a great thing sometimes, but it’s nice having
that visibility. So, when we’re just walking by we can say, oh, there’s four nodes and
they’re all yellow, something’s probably going wrong, we should take a look at it.
And most of our developers have them too. This is my desk so I’ve got it up on a spare
monitor to the side. So it’s just in the corner of my eye, so while I’m standing
there typing, I can see if something’s starting to go wrong. All right, last thing. Getting help. Canvas
is pretty complex. Rails is pretty complex and you’re probably going to run into some
problems and it’s good to know where to get help. These are the two main resources that
you want to be paying attention to. If you’re not on these already, get there. The Canvas
LMS channel on Freenode on IRC is awesome. There’s a wealth of experience in there.
There’s open source users, commercial users, technical people, non-technical people, it’s
great. And there’s Instructure engineers in there too and it’s a huge thank you to
all those Instructure folks who hang out there. It’s really nice to talk to someone who
actually wrote the code you’re having a problem with. Same thing for the mailing list.
They’re both really great resources. And if you’re going to ask for help, know
how to ask for help. This is, almost verbatim, a message that came in to the Google group
a month or two ago. We can’t help you. I’m sorry. If you’re going to ask, you know, give
us a little bit, give us something to work with. So know where your logs are. I mentioned
them earlier. There’s tons of logs. They can be in different places depending on where
you’ve installed them. Know where they are and look at them. If you’re getting errors,
watch the log while you’re doing the thing that’s generating the error; you’ll probably see
what’s happening. The /error_reports URL is also really good.
On Open Source you have access to this. On Cloud you don’t. And this is a nice interface
for looking at errors and it’s searchable. It’s server errors, it’s client errors;
if JavaScript throws an exception, that’s going to be in there. This is really great.
Okay. So, that’s the end of the story. Like I
said, this is what worked for SFU, you know. This is our Canvas. Many are like it but this
one is ours. Hopefully there’s some things you can take out of this, take it back to
your institution and work it in and hopefully you’ll be up here presenting next year on
what you’re doing to scale Open Source. And that’s it. That’s how you can get
a hold of me. I’m going to be at the Hack Night tonight. I think all of us from SFU
are so if you got any questions, come and find us, we’re more than happy to share
what we’ve learned in this. That’s it. Thanks. [Applause]

Female Speaker: We have one minute for questions. Does anyone have questions? If not, that’s totally cool but okay.

Audience: So all the nodes you showed were your current infrastructure, what did you start with?

Graham Ballantyne: So his question, if you couldn’t hear it, he wants to know the number of nodes we originally started with. So, yeah. We’ve got 22 nodes now, including
the hot spares. We started with three and one management node. So, one offline node
for delayed jobs and three user-facing nodes. We did some load testing initially. We found
we could hit 300 concurrent users on each node, so 900 total. Concurrent’s a little
weird in Rails because it’s stateless. You hit a URL, you get something
back, your session’s done. But that’s where we were finding we were hitting before we got
really high IO and then we just had to scale it up from there. That was during our pilot
phase. So, we’ve got lots of head room with that configuration.
It doesn’t spike up too
often. There are a couple of heavy courses that
do file uploads and downloads a lot and we’ll see spikes on that, but usually it’s very
low CPU, it’s very low memory. It kind of hums along. So, you can get away with less.
VMs are cheap, well, we have a large VMware infrastructure, so we just spun
a bunch up.
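
The numbers quoted in the talk can be checked with a little bookkeeping. The node counts and the 300-users-per-node figure come from the talk; the `promote_spares` helper is a hypothetical illustration of moving hot spares into a pool, which in reality would be a load balancer change rather than a dictionary update.

```python
# Node counts as described in the talk.
pools = {"app": 8, "file": 7, "management": 2, "hot_spare": 5}


def promote_spares(pools, target_pool, count):
    """Move `count` hot spares into `target_pool`, returning a new mapping."""
    if count > pools["hot_spare"]:
        raise ValueError("not enough hot spares")
    updated = dict(pools)  # leave the original mapping untouched
    updated["hot_spare"] -= count
    updated[target_pool] += count
    return updated


print(sum(pools.values()))                     # 22 nodes in total
print(3 * 300)                                 # pilot: 3 user-facing nodes x 300 users = 900
print(promote_spares(pools, "app", 2)["app"])  # 10 app nodes after adding 2 spares
```
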

