Emergency Positron Maintenance -- 4/28
|
|
There will be emergency maintenance on Positron this afternoon, starting at 1:30 pm PDT (GMT -7) and running until 5:30 pm PDT. During this time we will take Positron offline, remove the drives, and image them onto another machine. If everything goes smoothly we should be able to bring Positron back online considerably earlier than 5:30 pm. We will be posting updates on this thread as the process goes on to give people an estimate of when things will be back up and running. If you have any questions or concerns about this process please feel free to post them here and we will answer them as quickly as possible. |
|
|
Positron just went down again. As it is less than 30 minutes before the scheduled downtime we will not be bringing it back up again. We will begin the migration in a few minutes and use the additional 30 minutes to try and get it back up and running as quickly as possible. We’ll be posting more updates as we have them. |
|
|
The migration has been started and is progressing smoothly. We’ve transferred roughly 30 GB of data so far, out of a total of about 120 GB. At the moment the transfer has slowed down considerably as user mailboxes are being copied. This causes an extremely large number of extremely small copies to be done which tends to slow things down. Once we’re back to sending nice big files again it should speed up considerably. More updates as we have them. |
|
|
Just a brief update on progress. We’re at about 40 GB of data transferred so far. The last few gigs have been extremely slow due to the aforementioned huge number of small files. We’re a little over halfway through our maintenance window, and about one third of the way through the copy. As long as the small files get done in the next 30 minutes or so, the rest of the data should be done on time once copying gets back up to full speed. If we do run over the time limit we will allow the copy to continue. After nearly 10 crashes in 3 days, bringing the machine back up in its current state would be a very bad idea, so we’re going to let it do what it needs to do and bring it up once it’s fully copied and stable. We’ll keep posting updates here as the copy progresses to give you more accurate estimates of when to expect the server to be back up. |
|
|
Can we get a status update? Will Positron be back up in the next hour? |
|
|
The copy is still going strong, but slow. We’re hoping that it will be completed within the next 2 hours or so. One of the things that is seriously slowing down the copy is old session files floating around, so we’re clearing those out as we find them while the copy is going. As always, more updates as we have them. |
|
|
We’re about halfway there, at the 60 GB mark. The number of old session files we’ve found so far is nothing short of staggering. I’ve deleted 500,000 of them so far, and more keep cropping up. Unfortunately, it’s a constant uphill battle: the best way to find them is to watch for the copy speed dropping drastically, but by that point the copy of those files has already started, so deleting them slows the copy down further (and some of those files still get copied over no matter what). Once again, I’m going to estimate roughly another 2 hours. If we could keep the full sustained transfer rate of 30 MB/sec that we get on large files, everything would be done very quickly, but every time a directory with small files is encountered, the effective transfer rate drops to between 400 and 2000 KB/sec. We do apologize for the extended downtime, but this is necessary: the repeated lockups over the last few days are not only inconvenient, but are putting us at ever greater risk of data loss, which we are trying to avoid at all costs. |
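To put the transfer rates quoted above in perspective, here is a rough back-of-the-envelope sketch in Python. It is only an illustration of the arithmetic, not the tooling used for the migration; the function name `transfer_hours` is made up for this example, and it assumes about 60 GB still to copy and 1 GB = 1024 MB.

```python
def transfer_hours(remaining_gb: float, rate_mb_per_sec: float) -> float:
    """Hours needed to copy remaining_gb at a sustained rate (1 GB = 1024 MB)."""
    return remaining_gb * 1024 / rate_mb_per_sec / 3600


# Assumption: roughly half of the ~120 GB total is still to be copied.
remaining_gb = 60

# Rates quoted in the update: 30 MB/sec on large files,
# 400 to 2000 KB/sec when a directory of small files is hit.
scenarios = [
    ("large files (30 MB/sec)", 30.0),
    ("small files, best case (2000 KB/sec)", 2000 / 1024),
    ("small files, worst case (400 KB/sec)", 400 / 1024),
]

for label, rate in scenarios:
    print(f"{label}: {transfer_hours(remaining_gb, rate):.1f} hours")
```

At the full rate the remaining data would take a little over half an hour; if the whole remainder moved at small-file speeds it would take somewhere between roughly 9 and 44 hours. The real copy is a mix of both, which is why the estimate lands in between.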
|
|
Yet another setback. This time, however, we might have found the cause of a large number of our problems. We’ve found one customer that had a directory with millions of session files. We don’t have an exact count as the directory listing was taking too long. At the time that I killed it we had gotten up to 5,000,000 files and still counting. The XFS filesystem (which Positron is using) is one of the few filesystems out there that can support millions of files per directory. But just because it has the ability to do so doesn’t mean that it’s good for the system. With this many files in one directory, doing a simple ‘ls’ on the directory during high system load could be enough to crash the filesystem. And if the root filesystem crashes while the system is running, the system goes with it. So the good news is that this problem has been isolated and we are now in the process of cleaning up this directory. The bad news is that we’ve had to completely stop the copy process for now, while we wait for the delete to run. Once the deletions are all done we can start the copy back up and hopefully have it complete relatively quickly. We will post another update once we have a better idea of time remaining. |
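For anyone wanting to check their own session directories, here is a minimal Python sketch of counting entries without building a full listing first. This is not the method used on Positron; `count_entries` and the example path are hypothetical. It relies on `os.scandir`, which yields entries one at a time rather than reading and sorting the whole directory the way a plain `ls` does.

```python
import os
from typing import Optional


def count_entries(path: str, limit: Optional[int] = None) -> int:
    """Count directory entries lazily, stopping early once `limit` is reached."""
    count = 0
    with os.scandir(path) as entries:
        for _ in entries:
            count += 1
            if limit is not None and count >= limit:
                break  # enough to know the directory is oversized
    return count


if __name__ == "__main__":
    # Hypothetical path; substitute your application's session directory.
    session_dir = "/tmp"
    n = count_entries(session_dir, limit=100_000)
    print(f"{session_dir}: at least {n} entries")
```

The `limit` argument is there so the check can bail out as soon as it is clear a directory is far too large, instead of walking millions of entries on an already loaded system.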
|
|
Everything is still progressing as planned, albeit quite slowly. We still plan to have Positron back up and running by Tuesday morning. I will be working on this through the night to make sure that the copy completes successfully and that Positron comes up as planned as quickly as possible. |
|
|
The file delete is still running. With 2,500,000 files remaining, and an average of 60 files per second, there’s a good 12 hours of deleting ahead. Obviously, we cannot keep the server down for another day, so it will be back up shortly. Some of you might recall that during one of the previous maintenance periods for Positron we had tried to repair the filesystem in the hopes that that might fix the problem. That repair did find a number of errors, but it was never actually able to complete. It would repeatedly error out with an out of memory condition. Having found the directory earlier today containing millions of files, this is now hardly a surprise. I just ran another repair on the filesystem and for the first time ever, the repair actually completed. More importantly, it actually found and fixed a number of outstanding errors. Since the full migration would currently require up to another 24 hours, the plan now is to bring Positron back up as-is (with a repaired filesystem) within the next hour. If the repair has caught all of the errors this will provide us with a few different benefits.
This does not mean that the problem is necessarily fixed. There could still be stability issues moving forward, but this puts us in a much better position to deal with those issues in the future. Most importantly, it allows us a predictable downtime window for doing a final migration. For those who are interested in a few more boring technical details: given the amount of time the delete has been running today, the average deletion speed, and the number of files remaining, we estimate that the offending directory originally contained over 8 million files. One effect of such a large number of files is that the directory entry within the filesystem is currently 186 MB. In other words, for any operation on any file within that directory, the system needs to read and process 186 MB of data. This includes creating files, removing files, or even just updating the timestamps on existing files. Once we get the last of the session files deleted we can recreate the directory from scratch, which will start it off at a nice manageable size of 6 bytes. Beyond that, it will probably grow to a few dozen kilobytes total. The hope here is that this will alleviate most or all of the stability issues we have been seeing to date. |
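The figures in this update can be sanity-checked with a little arithmetic. The Python sketch below is only an illustration; the function names are invented for the example, and it assumes 1 MB = 1024 × 1024 bytes and takes the "over 8 million" estimate as exactly 8 million.

```python
def delete_eta_hours(files_remaining: int, files_per_sec: float) -> float:
    """Hours left at a steady deletion rate."""
    return files_remaining / files_per_sec / 3600


def bytes_per_entry(dir_size_mb: float, file_count: int) -> float:
    """Average directory-entry overhead per file (1 MB = 1024 * 1024 bytes)."""
    return dir_size_mb * 1024 * 1024 / file_count


# Figures quoted in the update.
eta = delete_eta_hours(2_500_000, 60)
per_file = bytes_per_entry(186, 8_000_000)  # assumes exactly 8 million files

print(f"Delete ETA: {eta:.1f} hours")
print(f"Directory overhead: about {per_file:.0f} bytes per file")
```

This works out to a little under 12 hours of deleting, matching the "good 12 hours" above, and to roughly 24 bytes of directory data per file, which is about what a short session filename plus its bookkeeping would plausibly take.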
|
|
Positron is coming back up now. If you experience any problems with the server please let us know and we will look into it promptly. |
|
|
Hi, I keep getting: [this date] 503 Service Unavailable. “The service is not available. Please try again later.” |
|
|
Positron itself appears to be working fine. Which domain are you trying to access? |
|
|
hunterword.com (Customer ID: MORG-002) |
|
|
It looks like there’s a problem with your rc.local file. It’s not clearing out all of the old PID files, so mongrel didn’t start up properly. I manually cleared out the PID files and now mongrel starts just fine. (Though you seem to have an application error.) It would also probably be a good idea to update your rc.local file to make sure that you’re clearing out the appropriate PID files. |
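As a rough idea of what clearing out stale PID files can look like, here is a small Python sketch. It is not the customer’s actual rc.local (which is a shell script), and the directory path and the `*.pid` naming are assumptions for illustration only. It removes any PID file whose recorded process is no longer running.

```python
import os
from pathlib import Path
from typing import List


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def remove_stale_pid_files(pid_dir: str) -> List[str]:
    """Delete *.pid files that do not point at a live process."""
    removed = []
    for pid_file in Path(pid_dir).glob("*.pid"):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            pid = -1  # unreadable contents, treat as stale
        if pid <= 0 or not pid_is_running(pid):
            pid_file.unlink()
            removed.append(pid_file.name)
    return sorted(removed)


if __name__ == "__main__":
    # Hypothetical location; use wherever your app writes its PID files.
    print(remove_stale_pid_files("/tmp/myapp/log"))
```

At boot time every leftover PID file is stale by definition, so a startup script can usually just delete them all; the liveness check matters only if the cleanup might also run while the application is up.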
|
|
Thanks but sounds like Greek to me. Am contacting former student techie to help me “update rc.local file …” Thanks again. |
|
|
It’s up and running. I’m happy that I didn’t cause the problem (I’m trying to learn this stuff). Anyway, my former techie emailed me this: It looks like Planet Argon upgraded something on their end which caused a problem with the way that the image processing plug-in the site is using told the application framework to load the image processing library it relies on… all of which manifested itself via an extremely misleading error message pointing at a code file that had nothing to do with the actual problem. In other words, not something a non-programmer had much of a chance of fixing. I also fixed that problem with the pid files (I think; can’t tell for sure until the next time PA restarts the server) and got the app running with the newer application server PA is recommending now, which might make things a tiny bit faster. So, Thanks. |
|
|
Positron down on 4th of July? 9:10 a.m. EST. Just wondering. |
|
|
Well, can’t access http://positron.planetargon.com/vhcs2/ So, I have to assume that the Big P is down. On the 4th. Bummer |
|
|
The Big P went up on the 4th, sometime after my last post. So, the big question: my former student techie suggested that I introduce myself to programming. Any ideas from anyone on how to get started? I’ve got two book projects and one mini-documentary in the works but I’m ready to invest the time. I’m open to suggestions. Right now, I’m reading PA’s “Getting Started,” but in a few weeks, I’m going to need more. Thanks. |
|
|
@gwm_hunter:
Yeah, we had an outage on Positron for a few hours that morning. Our remote reboot wasn’t working on it so I had to head down to the colo facility to do a hard reset. It’s been back up and running since then. |