New Positron Problems
|
|
We had another outage on Positron this morning, starting around 3:30 am. We are putting a few temporary measures in place to capture as much information as possible about these outages and to keep any future outage down to a few minutes. This is going to involve the following:
We will be posting updates here as this process progresses. Thank you for your patience with these issues. We’re working hard to address them as quickly as possible. |
|
|
Ok, the syslog monitoring has been set up. If this machine misbehaves again we should be able to get useful debugging info. Steps 2 and 3 should be taken care of in the near future. |
|
|
It’s down now..
$ date
Wed Feb 27 19:39:29 PST 2008
$ ping positron.planetargon.com
PING positron.planetargon.com (198.145.115.23): 56 data bytes
^C
--- positron.planetargon.com ping statistics ---
12 packets transmitted, 0 packets received, 100% packet loss |
|
|
Thanks for the note Bryan. I rebooted it once earlier and it locked up again within a few minutes. I’m now running it in a recovery environment to test for any errors. I will try my best to get it back up quickly tonight. We will try to get a more permanent solution in place before this weekend. |
|
|
We’ve got a full plan for the server at this point. Later today, we will be moving positron to new hardware. There is going to be a little bit of downtime during this process, but we hope to have it all done in an hour or less. Once that move is done we’re going to be doing two different tests. Test one is just letting it run on new hardware and seeing if any problems come up. If there are any issues on the new hardware we’ll know that it’s a software issue. We’re also going to start a stress-test on the old hardware. We’re going to run a testing suite that will max out processor, memory, and disk usage for a number of hours. If there are any problems with the hardware this test should bring them out. We’ll be posting more updates here as we go along. |
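For anyone curious what a stress test along these lines can look like, here is a minimal Python sketch. It is not the testing suite we are running on the old hardware; the function names, sizes, and durations are all assumptions made up for illustration. The idea is simply to keep the processor busy, fill memory with a known pattern, and write data to disk, then verify that what comes back matches what went in.

```python
import hashlib
import os
import tempfile
import time


def burn_cpu(seconds):
    """Keep one core busy hashing for `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    data = b"x" * 4096
    count = 0
    while time.monotonic() < deadline:
        data = hashlib.sha256(data).digest()
        count += 1
    return count


def check_memory(megabytes):
    """Fill `megabytes` MiB with a known pattern and verify it reads back intact."""
    pattern = bytes(range(256)) * 4096  # exactly 1 MiB
    blocks = [bytearray(pattern) for _ in range(megabytes)]
    return all(block == pattern for block in blocks)


def check_disk(directory, megabytes):
    """Write `megabytes` MiB of random data, read it back, and compare checksums."""
    written = hashlib.sha256()
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as handle:
        path = handle.name
        for _ in range(megabytes):
            chunk = os.urandom(1024 * 1024)
            written.update(chunk)
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())
    try:
        read_back = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                read_back.update(chunk)
        return read_back.hexdigest() == written.hexdigest()
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Hypothetical small values; a real run would use far larger ones for hours.
    print("cpu iterations:", burn_cpu(5))
    print("memory ok:", check_memory(64))
    print("disk ok:", check_disk(tempfile.gettempdir(), 64))
```

A sketch like this only exercises one core, and the disk read will often be served from the operating system's cache rather than the platters, so a real test needs to run in parallel, use much more data than the machine has RAM, and run for hours.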
|
|
This process has been slowed down a bit. We’re having a hard time getting the data copied onto the new server over the network. We’ll be revisiting this on Monday and hopefully doing the migration to the new hardware then. In the meantime, we have put a number of shorter-term solutions in place since Wednesday to make sure that we are aware of outages more quickly and that we get more information in the case of an outage. We hope that we will have an uneventful weekend. As always, if you notice any outages, please let us know. |
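To give a rough idea of what noticing outages more quickly can mean in practice, here is a small Python sketch. It is only an illustration and not the monitoring we have actually deployed; the function names, the port, the timeout, and the alert threshold are all assumptions. It tries to open a TCP connection to the server and flags an outage only after several checks in a row have failed, so that a single dropped connection does not page anyone.

```python
import socket


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def should_alert(history, threshold=3):
    """Return True when the most recent `threshold` checks all failed.

    `history` is a list of booleans, oldest first, one per check.
    """
    recent = history[-threshold:]
    return len(recent) == threshold and not any(recent)


if __name__ == "__main__":
    # Hypothetical usage: port 22 is an assumed choice of service to probe.
    history = [is_reachable("positron.planetargon.com") for _ in range(3)]
    if should_alert(history):
        print("Server looks down: three consecutive checks failed")
```

In a real setup the checks would be spaced out over time and run from a machine outside the affected network, so that a local network problem is not mistaken for a server outage.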
|
|
This got pushed back slightly again, but we’re now planning to do the migration this afternoon. This should give us time to run some tests and get a permanent solution in place by next week. Unless of course it turns out that it was the hardware, in which case the move will end up being the permanent solution anyway. We’ll keep updates posted on here with more info about when the migration is starting, if/when you can expect any downtime, etc. If you have any questions or concerns about the move feel free to either post here or submit a support request. |
|
|
Ok, positron is going down in just a few minutes for the migration. We will be copying data directly from disk to disk, so it should be pretty quick. We’re hoping to have everything back up and running before 7 pm PST (2 hours from now). |
|
|
The positron migration has been aborted for the time being. We will try it again in the next few days, which will, unfortunately, end up causing more downtime. We understand that this is extremely frustrating to our customers. The only consolation we can offer is that we are experiencing the same frustration. Everyone involved would be much happier if these things just worked right. :) For those wondering about some of the more technical aspects of this process, the problem is relatively simple. Dell’s enterprise-class servers, with top-of-the-line RAID boards, will not let us put in a drive that already has data on it and use it as is. Any drive that is plugged into this server must be wiped before it can be used. This is obviously a problem when we’re trying to keep the customer data available. We apologize once again for the extended downtime and for the general instability of this server. We really are doing everything in our power to fix it, but we are just not having much luck at this point. We are going to re-evaluate our options tomorrow and try to find a way to move away from the current hardware with as few headaches as possible for our customers. |
|
|
After bringing Positron back up just now, it is looking more and more like there’s a definite hardware problem with something other than the disks: RAM, CPU, or potentially the board. We will be getting in touch with Dell tomorrow to find out what we can do regarding warranty support for this machine. If we can get a new machine without having to send this one back, we will just move the disks over to the new machine, bring it up, and see if the problems go away. If that doesn’t pan out, we’re going to start migrating users one by one off of Positron. Unfortunately, with this machine being one of the newest and most powerful of the bunch, there are a LOT of users to be moved across. If we do end up doing this migration, we will do our best to minimize any outages while we’re still running on the old hardware. We will also be getting in touch with customers to find the “best” time for a brief outage for one user at a time. During that window we will move just that user over to a new machine and verify that everything is working. We apologize again for all of the recent problems with this server. We have a long and bumpy road ahead of us, but we will do our best to provide the best service out there. You, our valued customers, chose us for a reason, and it is our goal to do everything in our power to make us the best solution for your hosting needs. Thank you all again for your continued patience through this process. |