Webmaster notes March 1, 2006
Webmaster's note: Behind the scenes, janisian.com is hosted on a "root server" at 1and1.com, running Fedora Core Linux. 1and1 has some fairly swift admin tools and generally has a good product in these root servers. Just don't have a problem on the weekend... like we did on October 27, 2006...

Day 1 - Friday:

"My website's down".

I had just come home and sat down for dinner. These words caught me with a forkful of mashed potatoes, so it took a few seconds of chewing before I was able to respond with "You mean you can't transfer into it, or...?"

"No, I can't see it when I click on the internet".

In this case, the client is my wife, and the website in question is sewblessed.net, a site on this server for her fledgling embroidery business. I am mostly hands off on that one, as my son and wife take care of its design and administration. I just handle the hosting, primarily. As my wife is more the artistic type and not really technical, sometimes symptoms are a bit misstated. I don't want to think about the physical impossibilities implied by the statement "Click on the internet"... I know what she meant to say. Excusing myself from the table, I go to one of our systems and type the URL.
An actual technician at 1and1 futzing with a server. Notice that there is no telephone to be seen. I hope he's not changing a hard drive...
Site not found. Hmmph. Fire up a command line and

ping sewblessed.net

The mind starts to race. As I said, that's a fairly small site, but it shares a server with some very important sites... and the most important one is...

ping janisian.com

Crap! Is it DNS?

ping 82.165.187.71

Several things can cause this symptom, all of them bad. My mind ticks them off, with "Failed hard drive" topping the list. Mentally I think about how long it has been since the cron job has run that dumps the /httpd directory off to backup... a while, but most of the files should be OK. The databases may be tricky to restore, but they always are.

I fire up PuTTY, my SSH client, log in to the serial console, attach to the serial server, and tunnel across via serial connection to the webserver. Log in as root... and get a perfectly normal root prompt. This is promising. Next thing, navigate to the /httpd directory. Everything is there. Good news. Maybe something just stopped. Fastest way to bring it back up is to hit the server with a reboot command:

shutdown -r now

Server goes down, comes back up. Symptoms unchanged. I can connect via serial, cannot connect via the IP address, server won't ping. I go with a tracert, and determine that IP connectivity will get me all the way to the datacenter, which is located in New York, but it fails at the last hop to the server. So, mentally I have the problem down to:

A bad NIC (Network Interface Card, aka an Ethernet card)

None of which I can fix from Nashville. First I drop Janis an e-mail to let her know that I am aware of the problem and am working it. Next I call the server support line at 1and1 and get "Rep #1". I have no inkling of what a twisted path I have just embarked on. We go through the ID verification process and then I go over the above steps:

Me: "So it looks like we've dropped IP connectivity somewhere. Can you have someone go over and plug that hunk of ethernet back in?"
(Note that at this point, I am thinking this is not a big deal... I've cleared many an Ethernet issue in my day...)

Rep #1: "Hold on... That's weird"

Me (alarmed): "What's weird??"

Webmaster's note: Statements like "That's Interesting", "That's Weird", or "How long since you ran a backup?" when uttered by technical folks should be considered warnings that things are not going as well as they should be. A full list of these warning signs can be found here

Rep #1: "It's going to take a few minutes to clear this issue. I will e-mail you with a status report".

Me: "OK, but keep in mind that the e-mail address you have on file resolves to this server and cannot be reached. Please send the status report to this alternate e-mail address..."

Rep #1: "OK (pause) It may take an hour to get this cleared up..."

Me: "Ok, understood".

Ok, down for an hour, but they are aware of the issue and are working it. Being down for one more hour is an annoyance, but the files are ok. Things could be much worse. I sit down to go over my remaining e-mail (having 4 of your main e-mail addresses down will diminish your e-mail intake, I found) and kill some time with a few rounds of "Ricochet". An hour goes by, and I receive an e-mail... and the bottom drops out:

Thank you for contacting us. I booted your server to the rescue system and it lost its hard drive. If you have any further questions please do not hesitate to contact us.

"WHADAPHUQ!" I exclaimed... "No, No, No!!". The implications of that e-mail were daunting. If they had installed a base image, I can restore most settings easily (thank God I host on Linux and not Windows), but I will need to double check every function that the server provides... and I have a full plate already in getting all of the changes in place for the holiday sale. I don't Don't DON'T need to be settling into a new image during the holiday crunch. But things are unclear... are they installing a base image, or pulling MY image from my backup area?
What does "Released" mean?? I shoot back a terse reply:

Will this process cause any sort of data loss on the server? Please advise

And wait. And wait. Turns out I will be waiting for that status e-mail for a very long time. I'm still waiting, actually. A few hours go by, with no response. Eventually I try to connect to the server via serial again, but where before I got root access easily, now I cannot connect at all. Nothing. I assume they are using an image management app like Altiris to handle this sort of thing, so re-deployment will take a few hours. So, might as well go to bed.

Day 2 - Saturday:

3:30 AM. I can't sleep. I can either lie awake in bed and worry about the server, or I can do something about it. As any sort of re-image should have completed by now, I go to my system and

ping 82.165.187.71

"CRAP!". I grab my headset for my cellphone, slap it on, and call the support line. After all, 1and1 offers 24/7 support, and that includes 3:30 AM. Rep #2 answers the phone. From the accent, Rep #2 did not grow up on this side of the Atlantic. However, that doesn't make a bad support agent, so long as he is competent... is he?

Me: "I am checking on the status of this trouble ticket".

Rep #2: "Oh yes. It looks like a bad hard drive. They re-imaged it last night"

Me: "So the re-image is complete?"

Rep #2: "Yes sir".

Well, the "Sir" bit is kind of nice.

Me: "Ok, Rep #2, there is a big problem here. I called in because the server did not have IP connectivity. I received an e-mail telling me the hard drive had failed. I don't think that that is accurate, because I could still navigate the files via the serial console. The hard drive was replaced and the problem is still the same. It looks like a bad NIC to me"

Rep #2: "Yes sir. The hard drive is bad because we could not boot into the recovery system."

Webmaster's note: The "recovery system" that is referred to is a second operating system loaded onto the server hard drive.
It is possible to boot into that OS if you really, really screw up the primary OS, and then go in and fix whatever you broke. Since Linux is completely controlled by small text files, it is possible to fix it in some pretty extreme circumstances.

Me: "That doesn't sound right. Again, I could still navigate my files until your technician did whatever he did to put it into recovery mode."

Rep #2: "Well, it could be a boot sector problem"

Me: (thinking) Crap, he's right. A failed boot sector would... wait a minute... "No, I don't think that was it... I was able to reboot it after I found the problem last night. With a bad boot sector, I couldn't have done that. And I could see my files last night. This really looks like an IP problem to me, not a hardware problem."

Rep #2: "Well, we can see if we can recover the files"

Me: (with voice getting lower and slower, which is how I control my temper) "I have a real problem with that. I called in for a simple IP connectivity issue, and you're telling me my hard drive has failed, when I have very strong reason to believe that not to be accurate. If the technicians replaced that hard drive and the issue did not clear, did they put the original drive back in?"

Rep #2: "It doesn't say here..."

Me: (Lower and slower yet) "Let me be very clear with you right now. If you guys have erased all of my data because of an IP connectivity issue, that is going to be a major problem. I have no indication that the hard drive is bad other than the recovery system not booting, and after the fix it apparently STILL won't boot. Therefore, the fix did not address the root cause of my issue, it was not a fix, and this server is still down."

Webmaster's note: Yes, I actually talk that way when talking to support reps. In a previous career path, I was a Level 2 technician for a major computer manufacturer, and did some time in the escalation queue, fixing issues that other technicians had not been able to resolve or had made worse.
When I go into "L2 mode" I start to REALLY sound like a geek.

Rep #2: "Uh..."

Me: "Is the original hard disk still around? Did the technicians retain it?"

Rep #2: "I don't know, sir"

Me: "The server was running until the attempt was made to put it into recovery. We need to reverse what has been done because we have not cleared the issue."

Rep #2: "Let me try to put the server into normal mode" ...clickety click, sound of keys being typed upon... "OK, it's coming up... huh... well, that's good news..."

Me: (Ears pricking up) "What?"

Rep #2: "It looks like they haven't changed the hard drive yet, the files are still there".

I open a command prompt and try it again:

ping 82.165.187.71

Me: "Well, I still can't hit the server, let me try this..." I hook up to the serial console, log in... and can see my files. Goodness. "Ok, my stuff is all there. Can you have them hold off on anything until I can dump the SQL databases out and back them up?"

Rep #2: "Yes sir"

Me: "Great, I'll call back in as soon as I have made sure all of my backups are current, then you all can do whatever to clear this issue".

We disconnect, and I start working on transferring files out via the command line. The server backups are via FTP to another server in the datacenter. I will confess right now, I very seldom have to use FTP from the command line, so I grab the "Linux Administrators Black Book" and flip to the cheat sheet for the FTP command. I turn back to my system, and my shell window has closed. Hmmph. Must have timed out. Bother. I reopen PuTTY, hook up to the serial console... Nothing. No login prompt for the server, no nothing. Completely no communication.

My mind goes over the possible scenarios, and the one that seems most likely is that some technician has shown up in the datacenter with the new hard drive, found that server running, and shut it down for the hardware swap. This would be a bad thing... particularly since I think it is completely unnecessary.
Back on the phone... through some telecommunications miracle, I actually get Rep #2 again...

Me: "Dude, I got kicked from the server. Are you sure you told engineering to hold off on changing that hard drive?"

Rep #2: "So, did you get your files backed up?"

I may not have mentioned it, but I suspect that English, or at least American English, is not a first language here.

Me: "No, that's the problem. I was connected, now I can't connect"

Rep #2: "Well..."

Me: "Look, I need to say this again. I could see files fine until *whatever* happened on your side. I don't think this is a hard drive issue, I think it is network connectivity. Can you get to the datacenter and have whoever is in that server stop what they are doing and check the IP connectivity, please?"

Rep #2: "Erm, no, I cannot"

Me: (Clouds starting to part... I am starting to suspect a basic problem here...) "Where are you located?"

Rep #2: "Sir?"

Me: (Slow and Deep) "Where is the call center that I am speaking to right now".

Rep #2: (hesitates) "London".

Ah, so. Call center in London, data center in New York. Logistically, that makes getting anything done a bit of a challenge.

Me: "I see. Ok. Listen. I need the hard drive that was in that server put back in, and then IP connectivity restored."

Rep #2: "Yes, sir"

Me: "And I need status updates on this. When can we get this done?"

Rep #2: "I don't know"

Me: "Look, that's not getting it. I logged this fault" ...looks at clock... 5AM... "12 hours ago, and was told it would be cleared in about an hour. 12 hours later it is still down. I need to emphasize that I have an artist out on tour right now and this server is crucial not just for the website but also e-mail connectivity. We have to get this cleared".

Rep #2: "Yes sir, I understand"

Me: "OK. So, how long do you project to clear this fault?"

Rep #2: "I don't know"

Me: (growl) "Will it be closer to 5 minutes or four hours?
I need to give a status report to the folks that pay for this thing..."

Rep #2: "Four hours"

I knew he would say that. 4 hours is enough time to completely take a server down to the motherboard and put it back up, if necessary. Hopefully they'll be able to get the hard drive back in and then get the real issue cleared. We disconnect. It's now after 5:00 in the morning, and I'm completely wired; might as well begin the day. I grab a shower, whip up some breakfast, and basically pace while the clock slowly turns towards 9:00. At 9:00, I try to ping the server.

ping 82.165.187.71

Connect via PuTTY, no connection. Basically no change from where we were at 5:00AM. I fire an e-mail out to Janis with what I have so far. And resign myself to wait. I have other work that needs to be done, but I quickly realize that without the website being live, a lot of it will have to be repeated once the server is back up... besides, I can't concentrate. About every 10 minutes I do a:

ping 82.165.187.71

Then sigh and continue my work. By noon I give up and walk outside. Out to the garage and fiddle with the bike a bit, the whole time mentally listing out the tasks I will have to get done to get the server back into full production. It's a long list, long enough that I decide I really need to avoid it. Back in the office, back on the phone.

Me: "Hi, I need the status on this trouble ticket"

Rep #3: "Let's see. It says here that there is a hardware problem, and it has been submitted to the systems administrator"

Me: "Yes, that's right. The trouble ticket was filed almost 24 hours ago, and the issue hasn't been cleared. I need to know the status now."

I relate the story above, going into detail about the lack of IP connectivity and the fact that I don't think we REALLY have a bad hard drive at all. And get nowhere...

Me: "Out of curiosity, where are you located?"
Rep #3: "The Philippines"

Ok, the Philippines... not London... so, not only am I calling a call center without a whole lot of ability to actually get something done, I am talking to an outsourced call center... which is both good and bad. Bad because the rep I have on the phone is not likely to be able to do much more than read from the logs and make reassuring noises, good because I am not at risk of getting my hopes up.

Me: "The Philippines are nice. Some friends of mine set up some call centers out there..."

Rep #3: "Yes sir. Anything else?"

Webmaster's note: Call center reps live by the clock. The measurement (or "Metric") is usually referred to as AHT, or "Average Handle Time". Chatty customers can kill your AHT. Rep #3 knew this. Her cadence and mannerism were all about getting me off the phone as quickly as she could... since she couldn't actually DO anything, that makes sense. On some planet, anyways.

Me: "Yes. I was told that this issue would be cleared almost 18 hours ago, and I can see no change in status, and you've got nothing for me at all. I have yet to receive a single status report, which I keep requesting. This server being down is creating a major issue for us. I need the service escalated on this".

Rep #3: "Yes sir, I understand. I can escalate the ticket".

She did NOT add "So now we can ignore you with a much greater sense of urgency"

Me: "Thank you. When can I expect an update on this?"

Rep #3: "I don't know"

I am detecting a pattern here, one that I am not terribly wild about. I try to hang on to the faint hope that "Escalate" means the same thing there that it does at the company I work for... an escalation is what gets a support case in front of second level support, someone who is more focused on resolving the issue and taking care of the customer. Well, some people believe in the Easter Bunny, too.
I disconnect, drop a status update to Janis (a pretty short one, really) and start organizing all of my files and laying out a plan to bring the server back up if the data is lost. Frustrated, I eventually turn in for the night, feeling like I have wasted most of a day and accomplished nothing.

Day 3 - Sunday:

3:30AM. The day starts out just as the previous day did:

ping 82.165.187.71

MMMmm. Yeah. Back on the phone, call into 1and1, talk to another rep in London.

Me: "I need a status on this trouble ticket. This thing is going on 36 hours now, and it is still not cleared. What is going on?"

Rep #4: "I show the ticket as closed at 2:00 this morning".

What?

Me: "That's not correct. The issue is still going on, the problem is not resolved".

Rep #4: "The hard drive was changed and the image restored at 2:00"

Me: "What image was restored? Were my files restored?"

Rep #4: "I think so. It looks like it..."

Me: "Well, the server will still not respond to a ping, and THAT is what I logged the ticket about, not about a failed hard drive. Are you SURE that the files were restored?"

Rep #4: "...erm...no..."

I hate having to repeat myself, but I again go over the chain of events, the long wait, the lack of status updates and, most of all, the very real damage to Janis and the label from the prolonged interruption of service. I'm up and pacing as I talk, and while the rep I am talking to is sympathetic, he is no more help than the last one. The call goes nowhere.

I disconnect and, due to the lack of any kind of status report from 1and1, formulate a theory as to what is actually going on... it goes like this: The actual failure is the NIC, but the NIC is "Down" on the motherboard. 1and1 is not sourcing the server from a top tier vendor, and the warranty level on the server will not get a replacement part there until "Next Business Day", which would mean Monday.
Webmaster's note: I have no information at all as far as the quality of 1and1's servers or the warranty level, but my theory fit the few facts that I had at this point.

So I resign myself that nothing will actually resolve until Monday. Not great, but accepting that may allow me to focus on something else. Yeah, right. I go over everything that has transpired, and start thinking about the last time I saw my files... when the rep in London restarted the server into "Normal mode". How did he do that? Maybe...

I check the admin panel from the 1and1 admin website, and look at the "Rescue mode" options. There are three choices... Rescue, Debian 3, Rescue, Debian 4... and "Normal Operation". What the hell, it can't hurt at this point. I open my SSH client to watch the command prompt, then toggle back to the "Rescue Mode" website... click "Normal operation"... and click the button. Then I wait. In the SSH window, in about a minute, a series of control characters goes by... and then... a login prompt.

"Holy crap" I exclaim... I log in as root, navigate to /home... and there are my folders. Either they didn't change the hard drive, or they migrated the data from the old one to the replacement. I don't care, really. I issue the command to connect FTP to the backup server... and fail to connect. That's right... I am having IP issues, so an FTP transfer of files is not going to be possible. A ping from the server to the IP address of the backup server confirms that our server cannot see the LAN that it is on. Good stuff.

After about a minute of working at the command prompt, my SSH window suddenly closes. I try to reconnect... and again get no response. I think I have it now... I go and restart into normal mode via the admin webpage, then reconnect via SSH. And again I can log in, and see the files. In my mind I have now absolutely confirmed that the issue is IP connectivity, and that once that is restored, everything else will shake out just fine.
I feel better, but also realize that one more problem is on the horizon... We are coming up on 48 hours that the server has been down, and at 48 hours any e-mail being sent to any of the domains on that server is going to die. That means that non-delivery messages will start going back to anyone who has sent an e-mail, which is one thing that you NEVER want to happen if you administrate e-mail. It just looks bad and makes your cell phone ring.

I call Janis and go over the options... I can flip the DNS information to another mail server I administrate, but that will lead to a longer service interruption while we wait for DNS to propagate and later propagate back. Had we known on Friday that we would be down until Monday it would have made sense to do that, but in the end the decision is made to stay put until Monday, and then go Thermonuclear if the issue is not resolved.

Day 4 - Monday:

It's Monday, so I have to go to my day job. With the daylight savings time thing, I get up extra early and go into the office, figuring to continue to fight the battle from there. I call into 1and1 and start the call the way the last four have gone. Unlike the previous reps, I will put in this rep's name, but only because of the very different way she handled the call.

Me: "So, just out of curiosity, where am I calling?"

Ebony: "Pennsylvania"

Pennsylvania? That's where the US headquarters are. That means that I am not talking to outsourcers, more than likely. I start, against reason, to hope...

Me: "OK, I need to get a status on this trouble ticket. I've been down since Friday, I have called multiple times, and really haven't gotten anywhere. What is the timeline to get this resolved?"

Ebony: "Let me see what I can find out. Can you hold on?"

I agree, and listen to some hold music... Led Zeppelin's "Going to California", which is cool... but then it rolls to Steve Miller's "Fly Like an Eagle". Gahh! Steve Miller is crap... well, at least it's not "Abracadabra"... now, THAT is a stupid song.
Still...

Ebony: "Hello, Sir? It looks like the ticket was closed Saturday night on this. I have re-opened it, but there are a few in front of you"

Me: "OK, Ebony, I'm not mad at you, but I need to escalate this. We've been down for close to 3 days now. At this point, I am going to have e-mail bouncing and this may start to hurt our search engine standings. My ticket should not have been closed, and I really have to make sure that my issue has priority. This should be considered Sev 1 at this point. I have to have a time line for resolution".

And Ebony, lovely Ebony, listened...

"Sir, I understand. When you were on hold, I told my Supervisor, Tom M, about this. He is calling the datacenter right now to find out why this has not been resolved. Can I have him call you back with a status?"

Of course, I still had not gotten a single one of the status reports that I had been requesting for the past three days, but I'll give this a shot. She sounds like she is telling the truth. At the very least, I now have a supervisor's name, which is not an easy thing to get.

An hour goes by, during which I am working on some other things. After an hour, I have not received a call, so I call back in, and try a different tack.

Me: "May I speak to Tom M, please?"

Rep: "Yes, hold on..."

?! Hold on? I wasn't expecting THAT...

Tom M: "Yes sir. I just got off the phone with the data center. It looks like we should have you back up in about 15 minutes. Turns out to be a routing problem in our datacenter"

!!!

Me: "I figured it was an IP thing..."

Tom M: "Yep. We actually tested your NIC, but it was ok. We have Port 80 open now, we have to do a few more changes to restore all connectivity, but we'll have it cleared in about 15 minutes. And listen, I am really sorry this got dragged out like this"

We chatted for a few minutes about support issues and shared some call center war stories. In the end, Tom was good enough to give me his e-mail address in case we have another weekend issue.
Hopefully, we'll never need it...

I called Janis and let her know that we were going back up, and to get ready to be deluged by e-mail. I check the domain and everything is again functioning correctly. As a last check, I pop into the message board control panel and look at the new user registrations. How nice! SexyBimboCam has signed up as a new member!

Yep, we're back...

John
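
Postscript sketch: the Friday-night triage in this story (ping the name, ping the raw IP, check the serial console, run a tracert) boils down to a small decision table. The Python below is only an illustration of that reasoning, not anything that was run that weekend. The class name, the field names, and the wording of the conclusions are all made up for the example, and the order of the checks is an assumption.

```python
from dataclasses import dataclass

# Conclusion labels; the wording is hypothetical, chosen for this sketch.
DNS = "DNS problem: the IP answers but the name does not"
HEALTHY = "network looks fine"
SERVER_DEAD = "server itself is down or hung"
DISK = "disk or filesystem trouble"
LAST_HOP = "last hop: NIC, cabling, or datacenter routing"
UPSTREAM = "network path fails before the datacenter"


@dataclass
class Observations:
    """What could be seen from home. All field names are hypothetical."""
    name_pings: bool                # ping by hostname answered
    ip_pings: bool                  # ping by raw IP address answered
    serial_login_ok: bool           # root prompt over the serial console
    files_present: bool             # web directory intact from the console
    trace_reaches_datacenter: bool  # tracert gets as far as the datacenter


def triage(obs: Observations) -> str:
    """Walk the checks in roughly the order they appear in the story."""
    if obs.ip_pings:
        # The address is reachable, so only name resolution can be at fault.
        return HEALTHY if obs.name_pings else DNS
    if not obs.serial_login_ok:
        return SERVER_DEAD
    if not obs.files_present:
        return DISK
    # Server is alive and the files are there, so it is a network problem.
    return LAST_HOP if obs.trace_reaches_datacenter else UPSTREAM


# The situation on Friday night: no ping, good console, files intact,
# and a trace that died at the last hop.
friday = Observations(False, False, True, True, True)
print(triage(friday))
```

With Friday's observations the sketch lands on the last-hop conclusion, which matches how things turned out: a good hard drive, a good NIC, and a routing problem in the datacenter.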
© 2006 John Leonardini. Used by permission. All Rights Reserved.