Post Mortem on Site Outage

Posted by Fiz

18 April 2014 at 10:09:03 MDT

Hey everyone.

We just wanted to update everyone on what occurred during the recent outage.

At around 7:30 PM EDT, the Weasyl main site, Redmine, and status page all went down, giving users a connection error. When we traced the issue back, we found that the RAID card on our main ESXi server had an ECC fault in its memory module, which safely brought the array offline. It took us a while to get back into ESXi because all VMs had been migrated to a single host as part of last week's otherwise seamless server migration.

Once we gained access to ESXi, we checked the status of the array and got the host back online. After a bit of work in ESXi to re-attach the array, we moved on to the VMs. When we powered them up, we found a memory config problem with the DB VM as well as a network config problem with our main app server. We were able to address these issues and get the site back up safely at around 10:30 PM EDT.

We’re not, however, just leaving things at that. We’re going to improve our ability to manage the VM cluster in case of another failure. We’ll also be looking into improving our network and memory config, and creating some redundancy across arrays for VM storage.

We apologize again for the outage and thank you all for your patience as we addressed the issue.