I had an interesting ‘aha’ moment a while back.
I was trying to take a well-earned day off from work, visit my mom on her birthday, and in general, do a little decompressing.
I’d recently taken her portrait (a profession from a former life) and my goal was to give it to her and later take her to an appointment and then lunch. We were between the appointment and lunch when my phone buzzed.
It was one of my Faithful Minions from the land of Eva Perón, who was in a bit of a bind.
Seems that a server had a database that had gone corrupt.
No, not the Chicago style “vote early, vote often” corruption.
Database level corruption.
It seems that this server had been giving us some trouble with backups (it didn’t want to back certain databases up, depending on what node of the cluster it was on) – and the team responsible for fixing that had been working on it for a few days – with limited success.
Aaaand another one of my Faithful Minions, from the land of Chicken Curry, had tried several things, one of which included detaching the database.
It’s just that once you detach it, you likely want to reattach it.
But trying to reattach a database that’s been munged to some degree is a bit of an issue, as SQL tries to recover the database and chew through the transaction log file prior to bringing it online.
That’s as it should be.
Problem is, this db had some level of corruption that would cause an issue when it got to 71% recovered. Stack dump, the whole nine yards.
It was, in a word (or several), a perfect example of a disaster recovery exercise. Yay! I’d always wanted to have one of my very own…
(well, not really, and not on my day off, and…)
I gave my Faithful Minion from the land of Perón some instructions while sitting in the car in front of the restaurant where Mom and I were going to have lunch. Then I tried to enjoy birthday lunch with her, but my mind kept getting pulled, dripping, out of the clam chowder and back to the database while Minion worked his Google Fu to try to figure the error out. Mom and I finished up lunch and I took her home, where I logged in to see what Minion and I could accomplish.
So the thing is, we use DPM (part of Microsoft’s System Center suite of products) to do backups. That’s Microsoft’s Data Protection Manager, a program – no – a system that really takes some getting used to. There will be another post in this series about it – but one of the big things you have to get used to as a DBA is that you don’t do backups anymore…
You don’t use Maintenance Plans.
You really don’t do backups any more.
No backup jobs on your servers…
No backup drives…
DPM handles them all semi-invisibly…
It is – well… weird.
It took me a long time to wrap my head around it.
But it works.
And that’s where this situation kind of got a little strange. (Well, strange*r*)
See, DPM creates Recovery Points (its version of a backup). It stages each backup on disk, either locally on the SQL box or in the DPM system, where MDF and LDF files are created with some sort of wizardry before the backup is shipped off to tape.
So, we poked and prodded, and did all sorts of technical things, until my next set of Minions from The Pearl of the Orient came online – and we tried to find the backup of the database from DPM. This took a good while.
Longer than we were expecting, actually.
While they were looking and I was troubleshooting, much time had passed in the land of rain (a bit south of Seattle) and lunch was long, long past. Mom made some dinner while I worked on my laptop, sitting in my dad’s old chair, and I said goodbye to and thanked a very tired Minion in the land of Perón. Meanwhile, Minions in the Pearl of the Orient, and Minions (and a Minion in Training) in the land of Curry, were all trying to help, trying to troubleshoot, and in the case of Minion in Training, trying to learn by soaking up individual droplets of knowledge from a fire hose of a real live DR exercise.
Which is where it got interesting…
See, with DPM, you can choose a recovery point you want to restore to, and what it will do is simply replace the MDF and LDF files on your SQL box with the ones it’s reconstructed in all its magic.
The good news about that? Once you’ve learned how DPM works, anyone can pretty much restore anything that’s been backed up. That means Exchange servers, file servers, SQL servers, you name it…
It is (and I hesitate to use this word at all) kind of cool.
You need a restore?
All you have to do is open up DPM, find the database you need…
<find the backup that happened just prior to whatever blew up.>
<pick the recovery point you want>
<click… scrollllllllll…. Scrollscrollscroll> <click> <click>
Then you wait till DPM says it’s ready.
If DPM had backed that database up.
But for that particular database…
On that particular node…
Of that particular cluster…
That database had not been backed up.
For… Let’s just say it was out of SLA, shall we?
It was truly a perfect storm.
The team *had* been working on it – it just hadn’t gotten actually fixed yet, and the crack that this database had slipped through grew wider and wider as it became clear to us what had happened.
It’s the kind of storm any one of us could run into, and this time it was me.
We looked high and low for a DPM backup set, and found one that was supposed to be on tape. But, given that this was Thursday where I was and Friday where various minions and co-minions were, the earliest we could get the tape would be by Monday…
More likely Tuesday…
On a scale of one to bad, we were well past bad. We’d passed Go, had not collected our $200.00 – and…
Well, it was bad.
Several days’ worth of waiting on top of the existing time lost wasn’t really going to cut it.
Then, Co-Minion found that there was a DPM backup on disk – but a day older than the one on tape.
I could have that one in two hours (the time it took to copy the mdf/ldf files over and then put them in the right location.)
Hmmmm. A backup in the hand is worth two in the – um… Tape library?
So we copied it over.
And I attached it…
And now we had a database on the box that worked.
Yay, system was up and running.
But it wasn’t the database I wanted.
Like any of you out there, I wanted the backup from just before the database had blown up, and because of all the troubleshooting that had happened before I got involved, some of which actually made the situation quite a bit more challenging, it turned out I couldn’t do anything with the files I had… so reluctantly, I had to go with the older files, simply because I could get them quicker.
And the next day, while trying to fix another issue on that cluster, I got to talking to the guy who runs the SAN and the VMs, and I explained what all we’d gone through over the course of getting that database back, and he said, “Wait, when did you need it for?”
I told him.
“Oh, I’ve got copies on the SAN of those drives – we could mount it if you need – 12 hours ago, 24 hours ago, whatever…”
I realized what he was telling me was that if I’d contacted him (or, frankly, known to contact him at that time of night) – I could have had a much more recent backup sooner instead of spending so many hours trying to fix the existing, corrupt database.
What it meant was that I was thinking inside the various boxes I was dealing with…
Trying to get a *SQL* backup…
Trying to get a *DPM* backup…
When what I needed, frankly, was bits on disk.
I needed MDF and LDF files.
Not only could DPM get me MDF and LDF files, but so could the SAN.
And the SAN backups had them.
I just didn’t know it.
By the time I found out, the old database we’d attached had had data written into it and had been up for a good while. Merging the two would have been far, far more of an issue than it was already (I’d had experience with that once before as an end user).
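For anyone who hasn’t gone the “bits on disk” route by hand: once the MDF and LDF files are sitting where you want them, attaching them comes down to a single `CREATE DATABASE ... FOR ATTACH` statement. Here is a minimal Python sketch that only builds that statement as text. The function name, the database name, and the paths are all made up for illustration, and it doesn’t connect to or change anything on a server.

```python
def build_attach_statement(db_name: str, mdf_path: str, ldf_path: str) -> str:
    """Build a T-SQL CREATE DATABASE ... FOR ATTACH statement as a string.

    Hypothetical helper for illustration: it only formats text and never
    talks to SQL Server.
    """

    def quote_ident(name: str) -> str:
        # Bracket-quote an identifier, doubling any closing bracket.
        return "[" + name.replace("]", "]]") + "]"

    def quote_literal(value: str) -> str:
        # Single-quote a string literal, doubling any embedded quote.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"CREATE DATABASE {quote_ident(db_name)}\n"
        f"ON (FILENAME = {quote_literal(mdf_path)}),\n"
        f"   (FILENAME = {quote_literal(ldf_path)})\n"
        f"FOR ATTACH;"
    )


if __name__ == "__main__":
    # Example values are invented; substitute your own names and paths.
    print(build_attach_statement(
        "Sales",
        r"D:\Data\Sales.mdf",
        r"L:\Logs\Sales_log.ldf",
    ))
```

You would run the resulting statement from whatever tool you normally use. SQL Server still goes through recovery on those files as it brings the database online, so the files need to be a consistent pair.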
So where does this leave us?
It means that if you’ve got a SAN as your storage mechanism that’s backing itself up, you might find yourself thinking outside the box you’re used to thinking in when trying to restore/recover a database. Go buy the person or people running your SAN a cup of decent coffee and just chat with them, and see if this kind of stuff is possible in your shop.
They might be able to help you out (and save you substantial amounts of time) in ways you quite literally couldn’t have imagined.
In the end, I learned a lot…
My team learned a lot.
My mom learned a lot.
I will be making some changes based on what we learned, with the goal of having some more resiliency in both the hardware resources we have and my faithful Minions who worked hard at trying to fix things.
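One example of the kind of change that could help here, as a rough sketch only: a small check that shouts when a database’s newest recovery point is missing or older than the SLA allows. It assumes you can get “database name, time of last recovery point” out of your backup tool somehow; how to pull that out of DPM isn’t shown, and the function name, database names, and dates below are invented.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def find_stale_databases(
    last_recovery_points: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return the databases whose newest recovery point is missing or too old.

    Hypothetical helper: last_recovery_points maps a database name to the
    time of its most recent recovery point, or None if there isn't one.
    """
    stale = []
    for name, taken_at in sorted(last_recovery_points.items()):
        # No recovery point at all counts as out of SLA.
        if taken_at is None or now - taken_at > max_age:
            stale.append(name)
    return stale


if __name__ == "__main__":
    # Sample data is made up for illustration.
    now = datetime(2024, 1, 4, 12, 0)
    points = {
        "Orders": datetime(2024, 1, 4, 2, 0),     # 10 hours old: fine
        "Billing": datetime(2023, 12, 30, 2, 0),  # days old: stale
        "Archive": None,                          # never backed up: stale
    }
    print(find_stale_databases(points, now, timedelta(hours=24)))
```

The point is less the code than the habit: something other than a tired human should notice when a database has quietly dropped out of the backup rotation.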
Ultimately, one of the takeaways I got from this was something simple, but also something very profound.
Make sure the stuff you’re doing in your day job continues to work, so you can actually have a day off to spend with your loved ones when you need it.
Oh – and the portrait of mom? She loved it.
Take care out there, folks…