Positron outages this week (Feb. 19-21, 2008)


 
Avatar Robby Russell 94 posts

Hello all!

It seems that the server positron.planetargon.com has gone down three mornings in a row. It comes back up after a reboot each time. We’re reviewing log files to see if we can spot any unusual activity prior to the machine hanging. We’ll be posting updates on the matter in this thread. Feel free to post questions and/or comments.

We’re going to chase down a possible culprit (cron tasks scheduled by one user) and see whether having that user disable them resolves the problem.

Thank you for your patience as we work to identify the problem and stabilize the environment.

 
Avatar Robby Russell 94 posts

We’re going to temporarily disable the user’s cron scripts that ran prior to each server outage this week.

We’ll be conducting a test today to manually run these scripts, which could result in a reboot.

 
Avatar Alex Malinovich 44 posts

We believe that we’ve identified the problem that’s been causing the repeated crashes. We are prepared to roll out a fix for the issue, but this will require some downtime. Therefore, starting at 5 pm PST (GMT -8) today, we will be taking positron offline for up to 4 hours while we implement the solution. We are also going to implement a few changes that will let us deal with any future issues much more effectively and minimize the amount of downtime experienced in the event of an error.

We will be posting regular updates throughout the process including a notice as soon as the process is complete. We will do everything we can to make this happen as quickly as possible. We realize that this has been a great inconvenience to our customers and we are doing everything we can to ensure that issues like this do not happen in the future. Thank you very much for your continued patience and support and thank you for choosing Planet Argon.

 
Avatar Alex Malinovich 44 posts

Positron is being brought down now. We hope to have it back up within 4 hours. We’ll keep this page updated with the current progress.

 
Avatar Alex Malinovich 44 posts

Looks like our assessment of the error was correct. Starting the repair now.

 
Avatar Robby Russell 94 posts

Great detective work guys!

 
Avatar Alex Malinovich 44 posts

The repair is still going strong. Taking longer than expected but it’s all going as it should be. Will update again once it’s done.

 
Avatar Alex Malinovich 44 posts

The repair has completed. The system should be coming back up within the next 15 minutes or so. We apologize for the extended downtime but it was necessary to ensure that the system will continue to function as expected for all of our customers. If you experience any issues please contact us. Thank you again for all of your patience.

 
Avatar Alex Malinovich 44 posts

Ok, the system is back up. We have verified that there was no data loss. Everything appears to have come up just fine. As before, if you have any further problems please send in a support request and we will get it straightened out as soon as possible.

 
Avatar Alex Malinovich 44 posts

It seems that Thursday night’s fix did the trick. We’ll still be keeping a close eye on the machine to make sure that there are no further problems. For those who are interested in all of the gory technical details of what exactly happened, keep reading. :) (If you’re in a hurry, just scroll down to the numbered list.)

Positron is a Linux server, a platform well known as a very stable foundation. Positron is running a 2.6.18 Linux kernel and using the XFS filesystem. XFS is known to handle a large number of small files very efficiently, so it is often used for things like mail servers. It’s a very stable filesystem developed by SGI that has been around for quite some time and offers some fantastic features, such as online resizing (i.e., growing the filesystem while it is in use). While it has been, and remains, a great choice of filesystem for a hosting server, its history is not entirely spotless.

At some point during development of the 2.6.17 Linux kernel, a “small defect” managed to sneak into the XFS code. In certain situations, this defect can corrupt part of the filesystem. However, rather than do something simple like lose a few bytes of data in that portion of the filesystem, this error causes the entire filesystem to halt as soon as that particular section is read. It is certainly possible to lose some data as a result of the bug, but the real problem is that the filesystem is brought to a halt. This puts the server in a state where it can still execute code, but any disk IO hangs indefinitely. The machine isn’t hung in the traditional sense (an infinite loop); it is just waiting for IO that can never happen. Thankfully, this error was identified during the development of the 2.6.17 kernel, and was fixed by revision 7 (2.6.17-7).
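For anyone curious what "waiting for IO that can never happen" looks like from the inside, here is a minimal sketch, not the tooling we actually used. It assumes a Linux /proc filesystem and relies on the fact that processes blocked on IO show up in state D (uninterruptible sleep). The function names are made up for this illustration.

```python
import os

def parse_state(stat_line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')' rather than on spaces.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    command = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, command, state

def blocked_processes(proc_root="/proc"):
    """List (pid, command) for processes in state 'D' (uninterruptible sleep)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_state(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the line was malformed
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

A handful of processes passing through state D is normal on a busy machine. A growing list that never clears is the pattern to worry about.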

But, as stated earlier, Positron runs 2.6.18, at which point this bug was no longer an issue. So, obviously, hardware is the likely culprit, right? Well, yes… except for one problem…

As Positron is a hosting server, and as we take the responsibility for the safety of our customers’ data very seriously, the hard disks are set up in a RAID 1 configuration. If a disk ever fails, the other disk has a perfect copy of all the data and the system continues running without a problem. It will notify administrators that a disk has failed so that we can replace it, but there’s no service interruption. This includes anything from a few simple read errors, to a complete mechanical failure of the disk. The whole point of such a setup is ensuring that data is safe and that there is little to no downtime involved.

So getting back to our thrilling little mystery story here, we now have something of a paradox. The system is set up in such a way that any disk-related hardware failures can be dealt with without anyone other than the system administrators even realizing that there is a problem. But, needless to say, we weren’t the only ones to realize that there was a problem here. This was hint number 1. Not enough to crack the case, so to speak, but a step in the right direction.

Now, due to a marginally related issue, we just happened to be looking at some files on Positron. In particular, we were looking in /lib for some kernel modules. And what shows up? The kernel modules for 2.6.17. It turns out that the machine was originally set up with a 2.6.17 kernel (2.6.17-1, to be exact). So, fast forward to the recent problems, and it becomes apparent what happened:

  1. Positron is set up initially with kernel 2.6.17 and the XFS filesystem.
  2. It runs this version of the kernel for a little while before being upgraded to 2.6.18. Prior to the upgrade, however, a few portions of the filesystem get affected by this bug.
  3. Some time goes by, the affected sectors turn out to not be in frequently accessed portions of the filesystem, and everything seems just fine.
  4. As Positron is a hosting server, and new users are added regularly, all of whom add new data regularly, at some point the portion of the filesystem that experiences this error ends up holding a regularly accessed bit of data.
  5. Next thing you know, we’ve got outages at 4 am for a few days in a row.
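The clue in this story was a leftover module directory for an older kernel. A rough sketch of that check follows. It assumes the affected range is exactly what is described above (2.6.17 builds before revision 7) and that module directories live under /lib/modules, the usual location. The helper names and the version parsing are illustrative only.

```python
import re

# Assumption taken from the post: the defect is present in 2.6.17 builds
# before revision 7. The "2.6.17-N" revision notation follows the post.
AFFECTED_BASE = (2, 6, 17)
FIRST_FIXED_REVISION = 7

def parse_kernel(name):
    """Split a name such as '2.6.17-1' into ((2, 6, 17), 1)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)(?:[-.](\d+))?", name)
    if match is None:
        return None
    base = tuple(int(part) for part in match.groups()[:3])
    # A bare '2.6.17' with no revision is treated as revision 0.
    revision = int(match.group(4)) if match.group(4) else 0
    return base, revision

def was_affected(name):
    """True if the named kernel falls in the assumed affected range."""
    parsed = parse_kernel(name)
    if parsed is None:
        return False
    base, revision = parsed
    return base == AFFECTED_BASE and revision < FIRST_FIXED_REVISION

def suspicious_kernels(module_dirs):
    """Return the directory names that point at an affected kernel."""
    return sorted(name for name in module_dirs if was_affected(name))
```

On a live machine this could be called as suspicious_kernels(os.listdir("/lib/modules")). A hit only means the machine once had that kernel installed, not that the filesystem is damaged, but it is a good reason to schedule a filesystem check.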

Once we determined that this was, in fact, the issue, it was time to resolve it. There were a number of interesting technical solutions that we implemented in order to make this a bit easier (which I may get into in a future post). But the long and the short of it is, we were able to take the server offline, run an updated version of xfs_repair on the filesystem to find the errors, and then get the machine back up and running.
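As a hedged illustration of the repair step (the exact options we ran are not listed here), xfs_repair has a no-modify mode, -n, that scans and reports problems without changing anything. The sketch below wraps that in Python. The device path in the usage note is a placeholder, and the filesystem must be unmounted before xfs_repair will run against it.

```python
import subprocess

def repair_command(device, dry_run=True):
    """Build the xfs_repair command line for the given block device."""
    command = ["xfs_repair"]
    if dry_run:
        command.append("-n")  # no-modify mode: scan and report only
    command.append(device)
    return command

def check_filesystem(device):
    """Run a no-modify scan; return True if xfs_repair reports no corruption."""
    # In -n mode xfs_repair exits with status 1 when it detects corruption
    # and 0 when it does not.
    result = subprocess.run(repair_command(device, dry_run=True))
    return result.returncode == 0
```

For example, check_filesystem("/dev/sdX1") runs the scan against a placeholder device. A real repair would rerun the command without -n, ideally after a backup.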

With these errors fixed we’re confident that Positron will continue to work well for a long time to come. Once again, thank you to all of our customers for being so patient and understanding through this process. We understand that your time is valuable, and that your site staying up is extremely important, but your understanding of the difficulties we were faced with during this period made it much easier to focus on the problem at hand and to get a resolution applied as quickly as possible. Thank you again, and thank you for choosing Planet Argon.

 
Avatar thorny_sun 3 posts

thanks for the details—i was getting curious—and gotta say damn impressive detective work. it’s a good thing you’re fixing the boxes and not me.