A Play-by-Play Retelling of Yesterday’s Downtime

Thread Started By Second Life


Hi everyone! :)

As many Residents saw, we had a pretty rough day on the Grid yesterday. I wanted to take a few minutes and explain what happened. All of the times in this blog post are going to be in Pacific Time, aka SLT.

Shortly after 10:30am, the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.

A few minutes before 11:30am we started the process of restoring all services to the Grid. When we enabled logins, we used our usual method: turning on about half of the servers at once. Normally this works pretty well as a throttle, but in this case we were well into a very busy part of the day. Demand to log in was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.

Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.

We tried again at roughly 12:30pm, enabling a third of the login hosts at a time, but this was also too much. We had to stop that attempt and shut down all logins again around 1:00pm.

On our third attempt, which started once the system cooled down again, we took it really slowly, and brought up each login host one at a time. This worked, and everything was back to normal around 2:30pm.
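For readers who like to see the idea in code, here is a rough sketch of the staged approach described above: try a batch size, back off if the database is overwhelmed, and retry with smaller batches. This is only an illustration, not the tooling actually used. The function names, the 12-host count, and the toy overload check are all assumptions made up for the example.

```python
from typing import Callable, List, Optional, Sequence


def staged_enable(hosts: Sequence[str], batch_size: int,
                  is_overloaded: Callable[[Sequence[str]], bool]) -> bool:
    """Enable hosts batch by batch; return False as soon as a batch overloads."""
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        # Hypothetical check: did turning on this batch swamp the database?
        if is_overloaded(batch):
            return False
    return True


def restore_logins(hosts: Sequence[str], batch_sizes: Sequence[int],
                   is_overloaded: Callable[[Sequence[str]], bool]) -> Optional[int]:
    """Try each batch size in order; return the first that works, else None."""
    for size in batch_sizes:
        if staged_enable(hosts, size, is_overloaded):
            return size
        # A real procedure would close logins here and wait for a cool-off.
    return None


# Assumed numbers: 12 login hosts, tried half, then a third, then one at a time.
hosts = [f"login{i}" for i in range(12)]
# Toy model: the database copes with at most 2 hosts coming online together.
print(restore_logins(hosts, [6, 4, 1], lambda batch: len(batch) > 2))
```

In this toy model the first two batch sizes fail and only the one-at-a-time pass succeeds, which mirrors the three attempts in the timeline.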

My team is trying to figure out why we had to turn the login servers back on much more slowly than in the past. We’re still not sure. It’s a pretty interesting challenge, and solving hard problems is part of the fun of running Second Life. :)

Voice services also went down around this time, but for a completely unrelated reason. It was just bad luck and timing.

We did have one bright spot! Our status blog handled the load of thousands of Residents checking it all at once much better than before. We know it wasn’t perfect, but it showed much improvement over the last central database failure, and we’ll keep getting better.

My team takes the stability of Second Life very seriously, and we’re sorry about this outage. We now have a new challenging problem to solve, and we’re on it.

April Linden


