Thunderdome

Ian Lamont

Praise and anger over The Planet data center fire

Ian Lamont, 06.02.2008

An explosion at The Planet datacenter in Texas took out thousands of Web servers over the weekend, including those belonging to StatCounter, RepuMetrix, and LinkToMeet. No one was injured or killed, but some servers are still offline as the company struggles to make repairs. I read the accounts over the weekend, and received this message this morning when I logged onto StatCounter to check the traffic to my personal blogs:

[Image: StatCounter outage]


(Note that StatCounter has had several recent vendor issues, such as the PayPal fiasco described here)

Despite the fire and related outages, The Planet's quick response is being hailed by some customers, such as Joseph Fiore of RepuMetrix, who had this to say about The Planet:

"[I] believe that much of the reputation management drill was superbly held together by the online forum updates (rather than overloading their phone system), time oriented announcements, calm and courteous support people both online and over the phone, and phone messages in case you were out of the online loop."

However, other customers were not impressed. A writer who identified himself as Tom Tsillas, CEO of Tamion, posted a comment that claimed his company was losing "thousands of dollars" and the goodwill of more than one million members of LinkToMeet, a website Tamion operates. He also vowed to relocate his company's servers in the near future.

Were you affected by the outage? What's your take on The Planet's response or disaster recovery procedures? Leave your comments below.

Ian Lamont is the author of The Social Enterprise blog on TheStandard.com. Comment below, or email Ian at ian@thestandard.com. Follow Ian's updates on Twitter at http://twitter.com/ilamont


Comments

Our site, ThemeMyPhone.com, has been affected by this, and the updates and support have not been so great. I feel like we have been left in the dark on real informational updates. As of right now, our site is still down. We were thinking about moving our site to another server... but ThePlanet kept telling us we would be online soon.

We definitely will be moving our sites to another provider once our server comes back online and we can retrieve our latest backup.


We now have five servers hosted with The Planet. The two recently acquired are working fine, but the old three were in H1. Only one is online (being on the second floor), but it is still unusable because the DNS was on the other two.

We started with Rackshack, and since then we have had no serious issues with the servers, even after EV1 Servers was taken over by The Planet.

This outage has disrupted our own and our customers' services.

I have read a lot of what others have written. Some are facing very serious losses. The only thing I have to say is that the amount you spend on IT (servers etc.) should match the importance of IT to your business. If you cannot afford outages, then you should have multiple servers, geographically distributed. You should have online data backups and multiple DNS servers, so that an outage like this does not disrupt your business. If anyone is facing losses, they themselves are to blame. Just think for a moment. Many argue that this type of event could have been prevented, but can an earthquake or a falling comet be prevented? I thank God that the servers are OK with all the data in them. What would have happened if the servers had been destroyed? How long would it take to deploy new servers, and then create all the sites and restore data?

So, everyone should have their own Disaster Recovery Plan independent of the DR plans of their IT providers.


My comment on Hacker News regarding this outage:
http://news.ycombinator.com/item?id=206105

There is a very old saying: "If you want something done right, then do it yourself". While there is a time and place for various outsourced functions, when your business IS a website, it does not make sense to outsource the platform that keeps your site alive.

Multiple servers at multiple datacenters is the only way to even attempt to ensure maximum uptime. Trading a monthly charge on your credit card for some computing power in a datacenter you've likely never even seen in person is a recipe for disaster.


We started with them when they were Rackshack, then EV1, and then EV1 was bought by/merged with The Planet, and since then the whole service has gone down.

This is just the latest public incident of how badly this company operates.

They claim to have emailed customers but to date we have not heard anything.

The only solution seems to be to have better backups and another server elsewhere.

And also maybe now a better host as well.


This has been handled very poorly. I don't understand the "praise" for communication.

1. The first notification went out 6hrs after the incident.

2. The update messages didn't provide any real info - just "we're continuing to work on restoring power."

I found out about the plan to get the 2nd floor up and running on another site hours before the Planet communicated that. There were messages that "more important" customers were having their servers moved to the second floor to be brought back up sooner, as well as other message boards of "important customers" having their machines moved to other locations. Aren't all customers important?

3. The tech staff was virtually useless - precanned responses.

When I first learned of the issue, I asked about switching machines or changing DNS and the tech support person said I was taking a gamble as they'd be up by noon Sunday. I switched DNS as I didn't want my sites down any longer.

The next day when I asked again whether it was better to just get a new machine, they agreed, but the best part is their offer was for a more expensive machine with fewer features than the one I just got 2 weeks ago. They wouldn't even match the specs of the machine or the price - what kind of customer service is that?

Whoever out there thinks The Planet is handling this great is crazy.

At this point as it closes in on 2 days downtime with really no vote of confidence in being fully restored, they probably could have packed all the servers up and moved them to other locations during the same time period.

I too will dump these guys as soon as I get all the data off.

Anyone with recommendations on server solutions should post them to these forums.


I have a site on Photoblog.com

the site was running briefly this morning, and now it is down again. now the urls point to this spot:
www.petapixel.com and an "error 404: not found" message.

I would very much prefer that there would be a "site temporarily shut down due to server problems. please check back later".

I have a lot of visitors to my blog site. new visitors who get a 'not found' message are unlikely to try back.

P.S. - I just checked the site again, and now it leads to a 'go daddy' page.


Yeah, how dare The Planet not give us an update every five minutes stating "Still waiting on the Fire Marshal" or perhaps an update every time the temperature dropped while they were waiting for the AC to get fired up. They certainly weren't busy with other more important things like, oh say, structural damage to the building. Give me a break. My website was out all weekend too. I'm out money too, but hey, things happen. Good riddance to all the "soon to be former customers"; don't let the door hit you on the way out, whiners.


I agree with those who are saying that these guys are handling it properly. Yes, it means customer dissatisfaction, and we'll need to issue some refunds to our customers too, but I'm pretty sure that it could easily happen in any other datacenter of any other company, so it looks like The Planet just got unlucky (unless somebody has proof that they violated some regulations).

And I agree even more with those who are saying "if it's really mission-critical, don't have it in a single datacenter"; those who expect that a datacenter can provide a real guarantee against this kind of thing are just plain crazy (come on, when did you last read your SLA? And what does it say about remedies if service is unavailable for a while, apart from partial refunds of the service cost?)

Another question: could The Planet handle this significantly better after it happened? I don't think so. Moving 9000 servers elsewhere is not an option for obvious reasons, call centers obviously were overloaded (therefore precanned responses were unavoidable), more informative updates were probably possible but wouldn't really change much.

So I'm joining those praising ThePlanet (ok, not really praising but not blaming it either), and we will NOT move our servers out after this thing is resolved.


I am a customer of The Planet, however I was not affected as my machines are in one of their Dallas data centers. I have to say I am slightly bemused by the criticism of many regarding the incident at the H2 data center on Saturday. What happened was a freak accident, and if you think it can't happen in any other data center, you're probably quite mistaken.

The Planet has handled the incident superbly, and their affected customers are lucky that their machines are coming online as early as today! They appear to have gone to great lengths to keep down time to the minimum necessary.

If your servers are so critical to your business that you can't afford a catastrophic event such as this to take your machines down for a day or two, then you should seriously consider developing your own contingency plan, which probably involves server(s) in another data center, whether run by The Planet or some other 3rd party. Perhaps if you can't afford to do it right, you shouldn't bother at all? Who are you going to blame when an earthquake or some other localised catastrophic event causes data center downtime?


Our main website: http://www.digitalspyders.com/ was the only server affected with the servers we host at The Planet. We have been a customer since 2004 and have had a positive experience with them.

This incident happened at the best time possible in our case, as we are closed for business during the hours of the outage. It cost us a few hours of wasted time, not being able to access information for accounting purposes, and possibly a few dollars of lost sales. More important was the damage to our online reputation from being down: a lot of our PPC traffic, which we paid for, landed on a dead site until we shut the advertising off, which we did immediately once it was known we were offline.

Luckily all the web host servers with 1000's of our clients were in another unaffected building.

Today we will be reviewing our backup and mirroring policies and will probably diversify a bit more with some redundant setups using other data centers.

One thing I would really like to see is The Planet lower their pricing on much of their offerings. Things are getting a bit pricey, and after being sold on world class NOC and having this happen really is a huge confidence kick in the gut. Also finding out our most valuable servers are located in the old EV1 building is a disappointment, as I don't know if they are in spec with the original Planet buildings. I will request a hardware transfer soon.

How they handle compensation on the SLA will be the determining factor if they remain our No.1 DC. I would expect at least a free month or two of hosting out of this.

Brad Thompson
CEO
Digital Spyders Inc.


Remember the young fellows in the movie Titanic who luckily (or so they thought) managed to win tickets onto the ship? I now feel like I'm living their story, as just 8 days ago I ordered my first ever dedicated server from ThePlanet.com and spent the next few days moving my two main sites, www.mountainconnections.com and www.communicationsdepot.com, onto my new server. Then disaster hits and for the first time ever (in 3 years and 8 years respectively) we are down. Life's a trip at times!

Good luck to my fellow suffering site owners and I hope that when our sites are back up that the internet will smile kindly on us and send lots of quality traffic to make up for any losses.

Grant Burhans
President
www.mountainconnections.com
www.communicationsdepot.com


I'll preface this comment by saying I run a hosting company and lease servers from The Planet, and I'll be keeping my business there.

They showed great teamwork and communication this weekend.

Alternatively, this incident has separated the boys from the men in the shared hosting industry and has shown that many small hosting companies, like my competitors, aren't prepared for when disaster strikes the data center where they get their servers.

Sadly, many of them put their eggs in one basket/data center instead of exercising proper technique in creating a redundant system, and those are the ones crying 'I'm leaving The Planet because they let me down.' NO, you let yourself down by keeping everything in one data center. Some people even use The Planet's DNS for their servers when it has become so easy to simply run your own.

Lesson - Spread your servers throughout different data centers The Planet has and you'll be okay if a bomb nukes Houston.

Specifically to shared web hosts - If you can't do the responsible thing for your shared hosting customers, you're fooling them into a false sense of security and you should find another industry to do business in.

Sincerely,
A webhost who believes in being prepared for the worst, even if it means spending more money with my data center. Oh and placing blame on The Planet is a complete cop out.


Hello Ian and company,

I made a brief post on a popular webmaster forum called Digital Point regarding how GoStats was able to avoid any downtime during the disaster.
http://forums.digitalpoint.com/showthread.php?p=8000611

I hope that my advice is helpful to anyone who was affected by the outage.

Best Regards,

Richard
GoStats.com


Ian, thanks for the link love :)

I want to preface my comment by saying that we were extremely fortunate to have had a very low impact from this outage situation.

Fortunate in the sense that the main impact was on our development box, and the timing of the outage meant we were not able to further any of our development assignments during the weekend (usually a time devoted to catching up on development projects). I should also mention that this is R&D work specifically geared to keep us ahead of the curve in terms of providing leading brand and online reputation monitoring (ORM) solutions to the business community, and other than the benefit derived from keeping our solutions at the cutting edge, there was no direct revenue loss from any type of billable or custom assignment.

Since our blog entry, we have had some isolated DNS propagation issues stemming from The Planet's H1 outage. An email was sent early this morning to management and a phone call placed to the technical support department. Within 15 minutes I had a tech on the phone suggesting a workaround and the Manager on the other phone escalating the trouble ticket and working with the Tech to get the problem resolved.

I decided to share my experiences for one main reason - I was hoping it would inspire those who are experiencing difficulty from this outage in a way that would allow them to recognize that there is a great deal of tireless work that is going on behind the scenes to work things out for all Planet customers. Having spent a good portion of my career in IT, I have had my share of downtime - often at the worst time (I guess there never really is a good time for downtime).

This experience tops the list in terms of positive response by a vendor that led to a positive outcome. I should emphasize that on my list, there have been many examples of vendors that responded poorly (or negatively), and although a positive outcome somehow managed to take shape, the experience was far too frustrating to rationalize continuing to do business with them. I mention this because it is my opinion that The Planet's handling in the worst of circumstances deserves positive reputation points.

Right from the beginning, the CEO of the company actually publicly stated that he was sorry about the accident. The accident itself was rare, but actually hearing anyone say they are sorry PUBLICLY, in an age where it is far more popular to deflect culpability, is much rarer. As a veteran in the brand and ORM space, I cannot overlook the fact that The Planet accepted responsibility and took immediate action - they could just as easily have bypassed the need to deal with any customers' concerns in the face of chaos, or until things got sorted.

It is also highly commendable when a business has the foresight to understand reputation risk, and works at keeping customers' interests on some scale of priority, albeit relative to the cause of rallying hard in fending off the greater risk of a severely compromised business continuity matter such as one stemming from 9000 servers being without power.

I do hope for a quick resolution for those who are still experiencing problems. As so many others have already mentioned, this experience has led me to take a greater degree of personal responsibility in thinking more carefully about contingency planning.


Thanks for all of the thoughtful comments, everyone. In an era when evasion and silence are the standard communications practices at major companies, I value any firm which is responsive and up-front about crises like this. The Planet deserves praise for this.

That said, this incident has resulted in lost revenue and user anger for some customers, and that cannot be ignored. Yes, it's easy to say that firms should have redundant systems, but for many small Web firms, this is simply not possible owing to the extra work required, the additional costs, or the additional technical expertise needed to manage such a system.


Pardon me, but Ian, that's just a cop out. Like the gentleman above mentioned, if you are a web business, there is absolutely no excuse. If you are too small to afford redundancy, it is obvious that is an economical weakness in your business model - or you should buy insurance.


The Planet's response on this issue appears to be not only inept and upsetting but also a bit deceitful. Yes, they responded efficiently and gave the affected customers a continuous information feed on their forum page, but they haven't appeared to make any attempt to appease their customers' grief over an issue that should not be occurring in the first place at a hosting company which claims to be one of the largest in the world and has over 500 employees. The only effort to compensate the customers suffering over this issue seems to be that they will "calculate" downtime and not charge those affected for that time. Is this an insult? I would move my own business elsewhere, and quickly. Does anyone have any advice on where to move?


Give me a call at Lunarpages.com - 7145218150 - 2808

Stephen


Hosting.com is quickly becoming one of the most reliable and successful hosting sources in the country. I would get a quote from them first. They have a great product as well as an amazing team of innovative people on staff to assist in your every need. Has anyone else been pleased with Hosting.com?


We moved 16 servers out of the facility on Sunday to another ISP and had our lawyers send The Planet's lawyers a letter indicating that they won't be receiving any more money from us and if they don't like it we are more than happy to go to court.


"Is this an insult?" - No, it's not an insult, it's SLA.

On advice where to move - we've had such bad experiences with several companies that I wouldn't risk moving to anybody except the top guys like RackSpace or DataPipe; it will cost you though.
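
To put the SLA remedy discussed in this thread in numbers: not charging for downtime amounts to a pro-rata credit. The sketch below is only an illustration. The function name, the $200 monthly fee, and the 30-day month are assumptions; the 48 hours reflects the roughly two days of downtime some commenters report. The Planet's actual SLA terms are not given here.

```python
def prorated_credit(monthly_fee: float, downtime_hours: float,
                    hours_in_month: float = 30 * 24) -> float:
    """Credit equal to the share of the billing month the service was down."""
    if monthly_fee < 0 or downtime_hours < 0 or hours_in_month <= 0:
        raise ValueError("fee and downtime must be non-negative, month length positive")
    # Never credit more than the full monthly fee.
    fraction_down = min(downtime_hours / hours_in_month, 1.0)
    return round(monthly_fee * fraction_down, 2)


if __name__ == "__main__":
    # Hypothetical figures: a $200/month server that was down for 48 hours.
    print(prorated_credit(200.0, 48))
```

With these assumed figures the credit comes to about $13, which is small next to the lost revenue some customers describe, and may explain why the same remedy reads as an insult to one commenter and as a normal SLA term to another.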


About Hosting.com: I took a quick look and didn't like the "Request a Quote" there, as it's usually a sign of exorbitant pricing. Any idea what their pricing is for the cheaper hardware (NOT virtual) server segment, and what you can get for it?


To Richard from GoStats: thanks for the suggestions, but one key question was not covered: do you have any database, and if so, how did you manage to synchronize databases between the servers? Or do you just not care about loss of data (which the vast majority of sites do care about)?


To Affected customer: Yes, a database is used, but no data was lost because the GoStats technology used to collect the data is inherently fault tolerant. Since we rely on collecting all data uninterrupted, it's important to not simply use the default database configuration and setup. It's a strategic advantage that we planned for from the beginning. The details of this setup are a trade secret.
There is a less technical press release here: http://gostats.com/press/data-center-explosion/

Sorry I can't give you the "how"; but if you are interested in some consulting, perhaps we could work something out.


To Richard from GoStats: so basically the answer is "we have some magic technology", and you're proposing to consult about it, which means that the whole article was written not to help others but to promote your own consulting business. A pity.

Actually, if some restrictions are acceptable (for example, if the requirement that nodes be absolutely coherent is dropped, which you most likely did, as absolute coherency isn't important for stats), it's not that difficult to build such a thing (for example, using simple database log roll-forwarding). I was just curious whether you had come up with something really interesting (like an all-coherent multi-node database in active-active mode without incurring intra-node latencies ;-) ).
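
For readers unfamiliar with the log roll-forward idea mentioned above, here is a minimal sketch. It assumes a primary server that records every write in an ordered log and a replica in another datacenter that periodically receives that log and applies whatever it has not yet seen. The `LogEntry` and `Replica` names and the sample keys are hypothetical, and this is not GoStats' actual (undisclosed) setup.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass(frozen=True)
class LogEntry:
    seq: int    # position of the write in the primary's log, increasing over time
    key: str
    value: int


@dataclass
class Replica:
    data: Dict[str, int] = field(default_factory=dict)
    applied_seq: int = 0  # highest log position already applied here

    def roll_forward(self, log: Iterable[LogEntry]) -> int:
        """Apply, in log order, every entry newer than applied_seq.

        Returns the number of entries applied.
        """
        applied = 0
        for entry in sorted(log, key=lambda e: e.seq):
            if entry.seq <= self.applied_seq:
                continue  # already applied, so re-shipping the same log is harmless
            self.data[entry.key] = entry.value
            self.applied_seq = entry.seq
            applied += 1
        return applied


if __name__ == "__main__":
    primary_log = [
        LogEntry(1, "hits:/", 10),
        LogEntry(2, "hits:/about", 3),
        LogEntry(3, "hits:/", 11),
    ]
    replica = Replica()
    replica.roll_forward(primary_log[:2])  # first shipment: entries 1 and 2
    replica.roll_forward(primary_log)      # later shipment: only entry 3 is new
    print(replica.data, replica.applied_seq)
```

The replica always lags the primary by whatever has not been shipped yet, which is the relaxed coherency the comment refers to: after a failover, the most recent writes may be missing until the primary's log can be recovered.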


To Affected customer: The purpose of my article was to explain how to mitigate a datacenter disaster in general with special regard to DNS support. Everyone has different inner data technology needs and that's not the focus of the article. I wasn't clear if you were asking for help, just curious, or merely skeptical.

Ah, a good database discussion - exceeds the scope of this forum, but email me through gostats if you want to talk more or meet in a relevant forum.
-Yes, you could call it "magic technology" ;)


To Richard from GoStats:
"The purpose of my article was to explain how to mitigate a datacenter disaster in general with special regard to DNS support." - it looks to me that it can be written in a simple phrase "use 'DNS failover' from DNSMadeEasy", that's about it (as nginx and so on aren't really related to DNS resilience); am I right? (it's a good suggestion BTW, thanks).

"or meet in a relevant forum" - could be interesting, but do you know any such forum which doesn't need registration (I'm way too lazy to go through all that registration pages :-) )?


To Affected customer: Using nginx or any suitable http proxy is an important part of the failover solution. DNS alone will only get you so far. Nginx helps keep the servers "sane" and manages the requests sent to the remaining servers. Using an httpd proxy is the first step in managing the inner data of anyone's site.

As for a relevant forum with no registration, I don't know of any off hand. You could send me an email and we can talk more until I find one. (or you could try setting up "autocomplete" in your browser)
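
The "DNS failover" idea discussed in this exchange can be sketched roughly as follows: a monitor checks each server in priority order and decides which address the DNS record should point to. This is only an illustration of the decision step. The function names and IP addresses are hypothetical, and publishing the chosen address depends on the DNS provider, so it is left out rather than guessed at.

```python
import socket
from typing import Callable, Optional, Sequence


def tcp_alive(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_record(
    candidates: Sequence[str],
    is_alive: Callable[[str], bool] = tcp_alive,
) -> Optional[str]:
    """Return the first healthy address in priority order, or None if all are down."""
    for address in candidates:
        if is_alive(address):
            return address
    return None


if __name__ == "__main__":
    # Hypothetical addresses taken from the documentation-only ranges.
    servers = ["192.0.2.10", "198.51.100.20"]

    # Simulate the primary datacenter being dark instead of touching the network.
    def primary_down(address: str) -> bool:
        return address != "192.0.2.10"

    print(pick_record(servers, is_alive=primary_down))
```

Because resolvers cache answers for the record's TTL, a scheme like this only helps if the TTL is short, and the monitor and the authoritative DNS servers have to sit outside the datacenter that failed.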


To Richard from GoStats: it's still unclear to me how nginx (or any other reverse proxy) can possibly help in case of datacenter-wide disaster (as nginx in failed datacenter will fail itself, and nginx in working datacenter won't be able to help); can you elaborate on it a bit?


To Affected customer: Yes, nginx is crucial in managing which servers or centers to send requests to. If one server or center goes dark, nginx can rely on the static cache or send the remaining data requests to other servers that are still up. A simple Apache setup will not be able to handle down servers in the datagroup.
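
The behaviour described here (try another server when one is dark, fall back to a cached copy when none responds) can be sketched in a few lines. This is a plain-Python illustration of the general reverse-proxy idea, not nginx configuration and not GoStats' setup. The `proxy_request` function, the upstream names, and the fake fetcher are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence


def proxy_request(
    path: str,
    upstreams: Sequence[str],
    fetch: Callable[[str, str], str],
    static_cache: Dict[str, str],
) -> Optional[str]:
    """Try each upstream in order; fall back to the cached copy if all fail."""
    for upstream in upstreams:
        try:
            body = fetch(upstream, path)
        except ConnectionError:
            continue  # this server or datacenter is dark, try the next one
        static_cache[path] = body  # remember the last good response
        return body
    return static_cache.get(path)  # may be None if the path was never cached


if __name__ == "__main__":
    def fake_fetch(upstream: str, path: str) -> str:
        # Simulated backends: the Houston server is unreachable.
        if upstream == "houston":
            raise ConnectionError("no route to host")
        return f"{upstream} served {path}"

    cache: Dict[str, str] = {}
    print(proxy_request("/index.html", ["houston", "dallas"], fake_fetch, cache))
    # With every upstream down, the cached copy is returned instead.
    print(proxy_request("/index.html", ["houston"], fake_fetch, cache))
```

As the question earlier in the thread points out, this only helps if the proxy itself runs somewhere that survived the outage, which is why it is usually combined with DNS-level failover.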


Update: One of our contributors, developer Larry Borsato, is working on a piece that discusses the redundancy issue. He was directly impacted by The Planet outage. It should be published within a week or two. You can see a list of his earlier posts here.


The Planet told everyone to go to the forums, but when you ask questions you get no answers, and if you ask what some feel are the wrong questions, one of The Planet's managers calls customers trolls. I posted telling this manager it was sad to see a company manager calling customers trolls; anyone who has spent any time on the Internet knows that calling someone a troll is getting close to using the "N" word with someone.

After I posted that it was sad, he banned me from posting any more comments. I have yet to receive any contact from his bosses (I sent copies of the comment to all of them), so I can only assume that they are aware or feel the same way about the customer.

Folks, systems fail; it is just how it is. But to then have the company call its customers trolls for asking about info related to the service that company is being paid to provide is just too much for me. I have moved my server, and I know many who will be gone as well once they get the rest of their data from their servers. I feel The Planet should know that for some of these folks the last straw was your manager calling the customers trolls.

In years to come, when things have calmed down, what will be remembered about this whole incident is how, at a time of crisis, the customers were called TROLLS by management.


We've been with them in various incarnations (Rackshack, EV1, The Planet) for over 10 years. This explosion was the first outage we have experienced of any kind. That's 100% up time for more than 10 years.

We got email from them very quickly after the incident and have been satisfied with the information provided in the updates. One of our servers suffered disk damage from one of the generator glitches and their staff worked two days round the clock running manual filesystem checks to get it back.

We will not be moving anywhere else. Most other hosting providers have far more problems just in their normal day-to-day ops. It took an explosion to cause trouble here, and they worked hard to recover from it.

