Tag Archives: Communication

Thinking outside the Box in Disaster Recovery


I had an interesting ‘aha’ moment a while back.

I was trying to take a well-earned day off from work, visit my mom on her birthday, and in general, do a little decompressing.

I’d recently taken her portrait (a profession from a former life) and my goal was to give it to her and later take her to an appointment and then lunch.  We were between the appointment and lunch when my phone buzzed.

It was one of my Faithful Minions from the land of Eva Perón, who was in a bit of a bind.

Seems that a server had a database that had gone corrupt.

No, not the Chicago style “vote early, vote often” corruption.

Database level corruption.

It seems that this server had been giving us some trouble with backups (it didn’t want to back certain databases up, depending on what node of the cluster it was on) – and the team responsible for fixing that had been working on it for a few days – with limited success.

Aaaand another one of my Faithful Minions, from the land of Chicken Curry, had tried several things, one of which included detaching the database.

Which worked.

It’s just that once you detach it, you likely want to reattach it.

But trying to reattach a database that’s been munged to some degree is a bit of an issue, as SQL tries to recover the database and chew through the transaction log file prior to bringing it online.

That’s as it should be.

Problem is, this db had some level of corruption that would cause an issue when it got to 71% recovered. Stack dump, the whole nine yards.

It was, in a word (or several) a perfect example of a disaster recovery exercise. Yay! I’d always wanted to have one of my very own…

(well, not really, and not on my day off, and…)

Hmm…

I gave my Faithful Minion from the land of Perón some instructions while sitting in the car in front of the restaurant Mom and I were going to have lunch at, and then tried to enjoy birthday lunch with her, but still had my mind being pulled dripping out of the clam chowder and back to the database while Minion worked on his Google Fu a bit to try to figure the error out. Mom and I finished up lunch and I took her home, where I logged in to see what Minion and I could accomplish.

So the thing is, we use DPM (part of Microsoft’s System Center suite of products) to do backups. That’s Microsoft’s Data Protection Manager, a program – no – a system that really takes some getting used to. There will be another post in this series about it – but one of the big things you have to get used to as a DBA is that you don’t do backups anymore…

You don’t use Maintenance Plans.

You don’t use Ola Hallengren’s or any of the other amazing tools out there to handle backups.

You really don’t do backups any more.

DPM does.

No backup jobs on your servers…

No backup drives…

No…

Backups…

DPM handles them all semi-invisibly…

It is – well… weird.

It took me a long time to wrap my head around it.

But it works.

And that’s where this situation kind of got a little strange.  (Well, strange*r*)

See, DPM creates Recovery Points (its version of a backup) and it will stage the backup on disk either locally on the SQL box or in the DPM system where you have MDF and LDF files created with some sort of wizardry before it’s shipped off to tape.

So, we poked and prodded, and did all sorts of technical things, until my next set of Minions from The Pearl of the Orient came online – and we tried to find the backup of the database from DPM.  This took a good while.

Longer than we were expecting, actually.

While they were looking and I was troubleshooting, much time had passed in the land of rain (a bit south of Seattle) and lunch was long, long past.  Mom made some dinner while I worked on my laptop, sitting in my dad’s old chair, and I said goodbye to and thanked a very tired Minion in the land of Perón.  Meanwhile, Minions in the Pearl of the Orient, and Minions (and a Minion in Training) in the land of Curry were all trying to help, trying to troubleshoot, and in the case of Minion in Training, trying to learn by soaking up individual droplets of knowledge from a fire hose of a real live DR exercise.

Which is where it got interesting…

See, with DPM, you can choose a recovery point you want to restore to, and what it will do is simply replace the MDF and LDF files on your SQL box with the ones it’s reconstructed in all its magic.

The good news about that? Once you’ve learned how DPM works, anyone can pretty much restore anything that’s been backed up. That means Exchange servers, file servers, SQL servers, you name it…

It is (and I hesitate to use this word at all) kind of cool.

You need a restore?

All you have to do is open up DPM, find the database you need…

<click>

<typety type>

<find the backup that happened just prior to whatever blew up.>

<click> <click>

<click>

<pick the recovery point you want>

<click… scrollllllllll…. Scrollscrollscroll <click> <click>

Then you wait till DPM says it’s ready.

And tadaa…

That is…

If DPM had backed that database up.

But for that particular database…

On that particular node…

Of that particular cluster…

That database had not been backed up.

By DPM.

For… Let’s just say it was out of SLA, shall we?

It was truly a perfect storm.

The team *had* been working on it – it just hadn’t gotten actually fixed yet, and the crack that this database had slipped through grew wider and wider as it became clear to us what had happened.

It’s the kind of storm any one of us could run into, and this time it was me.

We looked high and low for a DPM backup set, and found one that was supposed to be on tape.  But, given that this was Thursday where I was and Friday where various minions and co-minions were, the earliest we could get the tape would be by Monday…

More likely Tuesday…

On a scale of one to bad, we were well past bad.  We’d passed Go, had not collected our $200.00 – and…

Well, it was bad.

Several days’ worth of waiting on top of the existing time lost wasn’t really going to cut it.

Then, Co-Minion found that there was a DPM backup on disk – but a day older than the one on tape.

I could have that one in two hours (the time it took to copy the mdf/ldf files over and then put them in the right location.)

Hmmmm. A backup in the hand is worth two in the – um… Tape library?

So we copied it over.

And I attached it…

And now we had a database on the box that worked.

Yay, system was up and running.

But…

it wasn’t the database I wanted.

Like any of you out there, I wanted the backup from just before the database had blown up, and because of all the troubleshooting that had happened before I got involved, some of which actually made the situation quite a bit more challenging, it turned out I couldn’t do anything with the files I had… so reluctantly, I had to go with the older files, simply because I could get them quicker.

And the next day, while trying to fix another issue on that cluster, I got to talking to the guy who runs the SAN and the VMs, and I explained what all we’d gone through over the course of getting that database back, and he said, “Wait, when did you need it for?”

I told him.

“Oh, I’ve got copies on the SAN of those drives – we could mount it if you need – 12 hours ago, 24 hours ago, whatever…”

He… You…

What?

I realized what he was telling me was that if I’d contacted him (or, frankly, known to contact him at that time of night) – I could have had a much more recent backup sooner instead of spending so many hours trying to fix the existing, corrupt database.

What it meant was that I was thinking inside the various boxes I was dealing with…

Trying to get a *SQL* backup…

Trying to get a *DPM* backup…

When what I needed, frankly, was bits on disk.

I needed MDF and LDF files.

Not only could DPM get me MDF and LDF files, but so could the SAN.

And the SAN backups had them.

I just didn’t know it.

By the time I found out – the old database we’d attached had had data written into it and had been up for a good while. Merging the two would have been far, far more of an issue than it was already (I’d had experience with that once before as an end user).

So where does this leave us?

It means that if you’ve got a SAN as your storage mechanism that’s backing itself up, you might find yourself thinking outside the box you’re used to thinking of when trying to restore/recover a database.  Go buy the person or people running your SAN a cup of decent coffee and just chat with them, and see if this kind of stuff is possible in your shop.

They might be able to help you out (and save you substantial amounts of time) in ways you quite literally couldn’t have imagined.
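
To make that idea concrete, here is a minimal sketch in Python (not anything we actually ran) of how you might weigh every place that can hand you MDF and LDF files, not just the backup tool you’re used to. The class, the function, and all of the numbers are assumptions for illustration: the two hours for the on-disk DPM copy and the 12-hour-old SAN copy echo the story above, while the tape timings and the one-hour mount time are guesses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoverySource:
    """One place that could hand you usable MDF/LDF files (hypothetical model)."""
    name: str
    data_age_hours: float   # how stale the copy is, i.e. how much data you lose
    hours_to_obtain: float  # how long until the files are on the SQL box


def pick_source(sources: Iterable[RecoverySource],
                max_wait_hours: float) -> Optional[RecoverySource]:
    """Return the freshest copy that can be in place within max_wait_hours."""
    reachable = [s for s in sources if s.hours_to_obtain <= max_wait_hours]
    if not reachable:
        return None
    # Freshest data wins; ties go to whichever copy arrives sooner.
    return min(reachable, key=lambda s: (s.data_age_hours, s.hours_to_obtain))


# Illustrative numbers only, loosely based on the story above.
options = [
    RecoverySource("DPM recovery point on tape", data_age_hours=24, hours_to_obtain=96),
    RecoverySource("DPM recovery point on disk", data_age_hours=48, hours_to_obtain=2),
    RecoverySource("SAN snapshot", data_age_hours=12, hours_to_obtain=1),
]

best = pick_source(options, max_wait_hours=8)
print(best.name if best else "nothing reachable in time")
```

Take the SAN entry out of the list and the same call falls back to the older DPM copy on disk, which is the choice I ended up making because I didn’t know the third option existed.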

In the end, I learned a lot…

My team learned a lot.

My mom learned a lot.

I will be making some changes based on what we learned, with the goal of being able to have some more resiliency in both the hardware resources we have as well as my faithful Minions who worked hard at trying to fix things.

Ultimately, one of the takeaways I got from this was something simple, but also something very profound.

Make sure the stuff you’re doing in your day job continues to work, so you can actually have a day off to spend with your loved ones when you need it.

Oh – and the portrait of mom? She loved it.

It’s here:

The portrait of Mom I wanted to deliver to her.

The portrait of Mom I delivered to her.

 

Take care out there, folks…

Tom

 


Posted on April 1, 2016 in Uncategorized

 


Life Lessons from SQLSaturdayABQ – Including one from Bugs Bunny


SQLSaturdayABQ

So it’s been a couple of weeks – but SQL Saturday ABQ has finally simmered long enough for me to write about it. I learned so much about so many things down there, and am tremendously grateful for the opportunity to share not only a meal and some learning with professional colleagues, but also reconnect with old friends and make new ones.

The trip down was great – I found out that putting my phone into Airplane Mode seemed to put some pretty cool Airplanes into pictures of an already gorgeous landscape.

I learned that Albuquerque is at 5,000 feet, and for someone used to living at sea level, I learned to appreciate the simple things in life, like, say, air.

On our second day there, the wonderful friend we were staying with took us up to the Sandia Peak Tramway. She and my wife enjoyed the gift shop at the lower elevations while I, still getting used to the 5,000 foot elevation, went up another 5,000 feet on the tram.

No, no oxygen masks fell out of the sky, but I could definitely feel all 10,378 feet of altitude there.

After I got back down, we explored some more, stopped at some lower elevations, and got a little perspective on the mountains from a little further north.

We stopped in a little town called Bernalillo and saw the Coronado Historic Site (above), and could hear the Rio Grande River gurgling down below. It was so different from what we have here in Washington – one could easily say it’s ugly and brown, but that would be missing the point – it’s got a beauty all its own, and needs to be looked at with different eyes.

As for the sessions with SQL Saturday itself…

I learned things about PowerShell in the session from Mike Fal ( b | t ) that made me want to get my hands dirty and try to find problems we’re struggling with in our environments that could be solved with a little PowerShell, and he proved that yes, you can indeed type in a demo, and then promptly demo’d why not to do it. :-) I loved the examples he gave, and the fact that he stuck with “learn the concepts, don’t freak about the code” – in large part because, just like in Field of Dreams (if you build it, they will come), with PowerShell, if you learn the concepts, the code will follow. You just have to understand what you want to do first – and that happens with the concepts. I’ve learned that if you give someone a problem first, and then give them a pile of tools, they’ll figure stuff out, and you’ll see creative juices flowing as they start thinking about new ways to solve old problems. It’s kind of fun to watch, and more fun to be part of.

I didn’t take any pictures of Jason Horner’s (b|t) session, but I enjoyed his presentation very much, which was full of demos and examples of databases much bigger than the ones I handle. As I recall, there were 4 (FOUR) MCMs in the room… We were walking on hallowed ground there – and just that level of conversation, questions, and knowledge was fun to be around. I look forward to seeing more of his presentations, and applying what I learned there.

John Morehouse’s ( b | t ) session on social media had an interesting cross section of people in it – some quite experienced who are used to it, but always looking for new things to learn, and some who were absolutely new to the game and had never, ever used it.

John talked about how it can benefit you professionally, how easy it is to blur the lines between personal and professional, and how to do your best to keep them separate if needed. He made a very valid point that no matter where you are, you’re an ambassador for your company – so “think first, then post” along with the idea that once you hit send, it’s out there. Know your company’s policies on social media. That’s a huge thing. Even if you think it’s a private message, the wrong screenshots in the wrong places can be embarrassing for a long time. We also had a live demo of Twitter, and how to do everything from getting a question answered with #sqlhelp, to getting a job via social media.

And I learned a lot – I got better at doing my presentation on “Life Lessons in Communication” the second time around – and finally got the butterflies in my stomach to at least fly in formation. I learned a lot from my audience, and had something totally off script pop into my head during the presentation – with the simple sentence of, “Are you solving the problem? Or managing it?” I realized I learned as much from my own presentation and audience as they did from me. Oh – and lessons are everywhere. You don’t even have to look hard. You just have to pay attention.

It was a lot of fun.

Many, many thanks to Meredith ( b | t ) and crew for getting everything together, and for the absolutely wonderful speakers’ dinner the night before. (What was that smoky salsa-y stuff on the chips? That was amazing!)

What came next was a time that can only be described as a slice of heaven.

Any of you in the IT industry know that the whole work/life balance thing is something that has to be managed very, very deliberately.  The cost of not managing it can be ridiculously high.

And so, for the next few days, I was able to spend that precious thing called time chatting with my wife and our friend, getting incredible amounts of fuzz therapy from two wonderful dogs, and just spending time away from the computer.  I allowed myself the time to absorb some of the lessons I’d learned at SQLSaturdayABQ, and then, as I watched some of the hot air balloons drift by, I realized that not all of the lessons I learned there had to do with SQL… a lot simply had to do with life. Some of them are still simmering, but all of them will end up in a story sooner or later.

Again – thanks to…

…all who attended the presentation (Jason had said I couldn’t start until he was there – I had the pre-presentation butterflies and was quietly hoping for him to be late – but he was there in the front row when I got there – so there was no backing out at all :-) –

…to Avanade for giving me the time off to go do this, and

…to our wonderful friend for sharing her lovely home and hospitality with us, and last but most certainly not least

…to my family for their patience and faith in me as I worked through it.

There are many more lessons out there to be learned, and as I find them, I’ll do my best to share them.

Oh, here’s just one:

You know how Bugs Bunny always says, “I knew I should have taken a left turn at Albuquerque!“?

I now know why he couldn’t do that.  🙂

Take care out there, folks…

Tom

 

Posted on March 2, 2015 in Uncategorized

 


Communication, Snapshots, and Chickens (no, really)


I saw a little note the other day about “snapshots” – which reminded me of a situation we had at work a while back.

It seems that one of the things that’s really helpful at work is being on the same page as your colleagues/coworkers.

And the way you do that is by communicating and understanding each other – and the fact that we often use different words to mean the same thing – or sometimes, we use the same words to mean different things, can present a problem.

Allow me to explain – and of course, I’ll do it with a very non-technical story…

A number of years ago, my mom was at a church social event where they had an icebreaker kind of activity. Everyone was handed a card with the name of a farm animal on it, and they had to make the sound of that animal so that all of the like ‘animals’ could find each other and gather together in groups.

Mom’s group was chickens.

Chickens are chickens, right?

Because chickens – or – roosters anyway, go cock-a-doodle-doo, right?

Well… If you grew up in America, roosters go cock-a-doodle-doo.

So there were a bunch of middle aged ladies, walking around this room, flapping their arms and sounding just like a barnyard. (Okay, I just checked with her – they weren’t flapping their arms, but that image is too fun to let go… so with apologies to Mom, I’ll let that burn in for a moment… :-) )

There were cows, horses, pigs – and chickens, well, roosters.

There were cock-a-doodle-doo roosters, and then there was this one gickerigeek rooster, and –

Waitaminute…

What the heck was a gickerigeek rooster?

Well, it turns out that if you’re a chicken (well, rooster) in Germany, you go ‘gickerigeek’.

You don’t go ‘cock-a-doodle-doo’.

And because of that, all of the animals who were looking for each other, found each other.

Except this one little forlorn German chicken (well, rooster), running around the room, flapping her arms (okay, not really), making the most plaintive ‘gickerigeek’ you’ve ever heard.  Come to think of it, it’s likely the only ‘gickerigeek’ you’ve ever heard.  But the thing was, as accurate as this sound was in describing a rooster’s sound, it was a sound that no one recognized.

Eventually this got someone’s attention – and suddenly there was this entire barnyard full of little old ladies interrogating a very accurate, very fun loving, and yet, very stubborn little old lady (my mom).

They asked her what was up with this whole gickerigeek thing, and the truth came out.  It became clear to them that there are different words to mean the same thing, it just depends on where you come from, and from there on out, they knew that the sound that a rooster made in the morning was heard as ‘gickerigeek’ by some, and as ‘cock-a-doodle-doo’ by others.

I ran into something like this at work the other day, where the same word was being used to mean two radically different things.  There was this rather heavy duty discussion about snapshot backups and databases – and my take was that they were absolutely not a valid backup solution… I’m thinking of it from a SQL perspective.

The fellow I was talking to was the guy who runs our SAN – and he was thinking of the word ‘snapshot’ from a totally different perspective, that of the SAN itself. Used that way, the way he was doing it, it was indeed a valid backup solution.  Not what I would have liked, but valid nonetheless.

Problem was, we were both hearing the same sound, but those working on the hardware end of things were essentially talking English, and thinking ‘cock-a-doodle-doo’, while I, working in SQL, was hearing it in German, thinking that sound only meant ‘gickerigeek’.

Interestingly, it turns out that we were both right, but it took the digital equivalent of me running around the room, flapping my arms going ‘gickerigeek’ for quite some time before we were able to clear it up.

The end result was that everyone learned why, when we were doing our hourly full snapshot backups through the SAN, and while everything else looked right, the transaction log kept getting full.  The thought was that the log file shouldn’t grow, but it did, to the point of filling up a good percentage of the drive.  Some config changes, a lot of learning and understanding, and we were ready to go, problem solved.
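
For the SQL folks wondering how you’d spot this, here’s a rough sketch in Python. It assumes the usual cause, which the story above doesn’t spell out: a database in the FULL (or BULK_LOGGED) recovery model only frees up log space after a log backup, and a storage-level snapshot isn’t one. The function and the sample rows are hypothetical; the column names are the ones in SQL Server’s sys.databases catalog view.

```python
from typing import Iterable, List, Tuple

# Each row mirrors the result of:
#   SELECT name, recovery_model_desc, log_reuse_wait_desc FROM sys.databases
Row = Tuple[str, str, str]


def databases_waiting_on_log_backup(rows: Iterable[Row]) -> List[str]:
    """Return the databases whose log can't be reused until a log backup runs."""
    return [
        name
        for name, recovery_model, log_reuse_wait in rows
        if recovery_model in ("FULL", "BULK_LOGGED")
        and log_reuse_wait == "LOG_BACKUP"
    ]


# Hypothetical sample rows, not data from the incident described here.
sample_rows = [
    ("Sales", "FULL", "LOG_BACKUP"),
    ("Staging", "SIMPLE", "NOTHING"),
    ("Archive", "FULL", "NOTHING"),
]

flagged = databases_waiting_on_log_backup(sample_rows)
print(flagged)
```

Anything that shows up in that list will keep growing its log no matter how many snapshots the SAN takes, until something either backs up the log or the recovery model changes.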

Moral of the story? Well, just like the rooster my mom was hearing, it was so important to discover and understand that the “rooster” we were hearing really didn’t sound the same to everyone.  All the technical smarts in the world won’t solve your problems if you can’t get on the same page, and eventually we did. It was good to have it cleared up, and it saved us hundreds of gigs of drive space as a result.

H/T to David Klee (T|B) for the spark to write this, which included a link to the technical explanation in more detail.

 

Posted on January 9, 2014 in Uncategorized

 
