
DPM… and “Why are my SQL Transaction Logs filling up?” (the answer… really)


Here at Avanade we run a lot of pre-release software from Microsoft so we can work the bugs out, so to speak, and deliver better solutions to our customers.

One of the software packages that we use is the System Center suite – including, in this case, the Data Protection Manager (DPM) component.  It’s been a released, production product for some time now, and we run all our backups with it.

The thing that is totally weird for me as a DBA is that all of the backup jobs I had on the boxes got disabled, and DPM took over. It was like driving a car with an automatic transmission for the first time. “What do you mean I don’t have to do backups?”

If you have things set up properly (which is a deeper subject than this blog post is meant for) – it’s close to hands off – which, come to think of it, is still weird.

So, over the time we’ve had it, I’ve found that there are several things to be aware of in its implementation.  This isn’t an all-inclusive list of surprises – just a couple of the things I found out after researching them myself and finding a lot of bad information out there.

The information below is from my own experience, from conversations with Microsoft, and from research I’ve done.  That said, my goal is to help keep you out of the DPM weeds by helping you

  1. Understand how DPM handles SQL backups in the various recovery models you can have (full, simple, etc.)
  2. Understand where it needs drive space (this can be an absolutely evil ‘gotcha’ if you’re not careful).
  3. Recognize the edge cases.
  4. Understand how items 2 and 3 can intertwine, get you into deep doo-doo, and what you want to do to stay out of it.

So, ready?

Oh – assumptions:

You’ve got DPM installed, and for the most part, configured.  It’s working, but you have transaction log drives filling up on some of your servers, and it’s not really clear why.

Wanna know why?

Here’s the answer:

It’s because the UI is very unclear, because the documentation is unclear (there was a hint of it on page 83), and because the things that would be obvious to a DBA simply aren’t mentioned.

So, having said that – let’s explain a little.

After a few years of running it – and we flog it to within an inch of its life – I’ve come to, if not make friends with it, then at least give it grudging respect.

But you have to know what you’re doing.

So first off: Settings.

You organize your backups in DPM in what are called protection groups.

Think of it as a glorified schedule, with lots of properties you can adjust.

Now when you’re creating the protection group, you can create it to back up file servers, Exchange servers, and SQL servers.  We’ll talk about SQL servers only here.

So when you back up a SQL box, you might have some databases (let’s say the system ones: Master, Model, and MSDB) in the simple recovery model, and the user databases in full recovery.

What you’d do is create two protection groups.

One for the system databases, for simple recovery.

And one for the user databases, for full recovery.
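If you want a quick inventory before you start sorting databases into protection groups, plain SQL Server catalog metadata will tell you which recovery model each database uses – nothing DPM-specific here, just a minimal sketch:

--Which recovery model is each database using?
--(Plain catalog metadata - nothing DPM-specific.)
SELECT      name
          , recovery_model_desc
FROM        sys.databases
ORDER BY    recovery_model_desc
          , name;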

And this is where the first gotcha comes into play.

See, when you create that protection group, you’re going to come across a tab in the creation of it that gives you this choice… It’s like the matrix… Blue pill? Red pill?

And, when we were setting it up, we scoured the documentation to try to figure out how to set it up for databases that were in full recovery mode as well as doing mirroring.

And we kept running into problems with full transaction logs.

It turned out I wasn’t alone in this.

I googled the problem…

I binged the problem…

Heck, I even duckduckgo’ed the problem.

Everyone, it seemed, had the same question.

And the answers were surprisingly varied.

And most, honestly, were wrong.  (Note: that’s not meant to sound arrogant – this was a tough one to crack, so a lot of folks were coming up with creative ways to try to work around the issue.)

Some folks were doing what we’d done initially *just* to keep the systems running. (manual backup, flip to simple recovery, shrink the logs, flip back to full recovery) – yes, we absolutely shredded the recovery chain, just shredded it – but the data in that application was very, very transient, so keeping it up and functioning was more important than keeping the data forever.

So while we were frantically – okay, diligently – searching for answers and managing the problem, we were also looking for a cure, because there was no possible way this could be an enterprise-level application if it was behaving so badly… Right? There had to be some mistake, some setting we (and everyone else in those searches above) weren’t seeing.  It finally ended up in a support call.

My suspicion was that the transaction logs weren’t being backed up at all, even though that’s what we thought we were setting.

I’d been around software enough to know that clicking on an insignificant little button could have catastrophic results if it’s not the button you meant to push.

And this was one of them.

See, the databases that were in full recovery (namely those in that app that had the mirroring) were our problem children.  Databases in simple recovery weren’t.

It made me wonder why.

And one day, I was on the phone about another DPM issue, (for another post) and I asked the question, “So what exactly is happening if I click on this button versus that one? Because my suspicion is that the tlog is not getting backed up at all.”

And then I asked the more crucial question for me, and likely for you who are reading this:  “What code is being executed behind these two options?”

And the fellow over at Microsoft researched it for me and came back with this:

“In DPM, when we setup synchronization for SQL protection, those Syncs are log backups. DPM achieves that by running the following SQL Query

BACKUP LOG [<SQL_DB>] TO DISK = N'<Database_location_folder>\DPM_SQL_PROTECT\<Server_Name>\<Instance>\<DB_Name>_log.ldf\Backup\Current.log'

Recovery Points are Express Full backups. If we setup a group to run sync just before recovery points, DPM will create a snapshot from the replica and then create a Full Database backup (sync).

BACKUP DATABASE [<SQL_DB>] TO VIRTUAL_DEVICE='<device_name>'

WITH SNAPSHOT,BUFFERCOUNT=1,BLOCKSIZE=1024

In this case we will never have log backup thus no log truncation should be expected.”

What does this mean?

It means that if you have a database in full recovery, you will want to put it in a protection group that is set to schedule the backup every X minutes/hours like this:

In DPM, click on “Protection” tab (lower left), then find the protection group.

Right click on it and choose ‘Modify’ as below.

[Screenshot: right-clicking the protection group and choosing ‘Modify’]

Expand the protection group and pick the server you’re trying to set up backups for there – you’ll do some configuring, and you’ll click next a few times, but below is the deceptively simple thing you have to watch…  This dialogue box below – which will have the protection group name up at the very top (where I’ve got ‘GROUP NAME HERE’ stenciled in) can bite you in the heinie if you’re not careful.  So given what I’ve written, and from looking at this and reading what I wrote above – can you tell whether this is backing up databases in full or simple recovery mode?

[Screenshot: the protection group dialogue, with synchronization frequency set to every 1 hour(s)]

See how it’s backing up every 1 hour(s) up there?

That means the code it’s running in the background is this:

BACKUP LOG [<SQL_DB>] TO DISK = N'<Database_location_folder>\DPM_SQL_PROTECT\<Server_Name>\<Instance>\<DB_Name>_log.ldf\Backup\Current.log'

We’ll get into more detail in a bit, but this means you won’t have full transaction logs.  This is the setting you want for the protection group that backs up your databases in full recovery mode (including the ones that are mirrored or in Availability Groups).

The other option you have is to back up “Just before a recovery point” – which, if you’re thinking in terms of SQL and transaction logs, really doesn’t make a lot of sense.  We went through the documentation at one point, and I think we were right around 83 pages in before it gave any indication of what it *might* be doing here – and even then it wasn’t clear.  But now we know.

So what you’d want in this protection group is a bunch of databases in full recovery mode.  You might want to create different protection groups for different servers, or different schedules – that’s all up to you.  The crux is: if it’s a database in full recovery mode, this is how you want to set it up, by backing up every X minutes/hours.  Making sense?
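One way to sanity-check this from the SQL side (my own habit – not anything the DPM documentation suggests) is to ask SQL Server what each log is waiting on before it can truncate.  If a full recovery database is sitting in a properly configured group, it shouldn’t stay stuck on LOG_BACKUP for long:

--What is each transaction log waiting on before it can clear?
--LOG_BACKUP here means 'waiting on a log backup' - i.e., a DPM sync.
SELECT      name
          , recovery_model_desc
          , log_reuse_wait_desc
FROM        sys.databases
WHERE       log_reuse_wait_desc <> 'NOTHING';

--And how full is each log right now?
DBCC SQLPERF(LOGSPACE);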

Okay, let’s take a look at the other option…

[Screenshot: the protection group dialogue, set to synchronize ‘Just before a recovery point’]

If you have a database in simple recovery, you’ll want to put it in a protection group that does backups just before the recovery point – and that’s what the screenshot above shows.  When you click on that radio button, the code it runs in the background (if you’re backing up SQL databases) is this:

BACKUP DATABASE [<SQL_DB>] TO VIRTUAL_DEVICE='<device_name>'

WITH SNAPSHOT,BUFFERCOUNT=1,BLOCKSIZE=1024

And you should be set.

You can change the frequency of the express full backups by clicking on the ‘modify’ button in the dialogue above, and you’ll have quite a few options there.

Understand, you have several different buckets to put your databases in.

  1. Simple recovery (see above)
  2. Full recovery (see above)
  3. Whatever frequency you need for your systems (from above)
  4. Whatever schedule you need for your systems (from above)

Believe it or not, that’s it.

Put the right things in the right place, and DPM is close to a ‘set it and forget it’ kind of a deal.

However…

…there are some Gotchas and some fine print.  This is stuff I’ve found, and your mileage may vary – but just be aware of the below:

  • If you put a db that’s in simple recovery into the protection group meant for databases in full recovery, you’ll likely get errors, with DPM complaining that it can’t back up the log of a database that’s in simple recovery mode.  Since you manually put that db (in simple recovery mode) into that protection group (configured to back up databases in full recovery mode), it’s your job to take it out and put it in the right protection group.  That will make the alerts go away.
  • If you put a db that’s in full recovery mode into the protection group meant for simple, you’ll fill up your transaction logs, fill up your disks, your backups will fail, and you may, depending on a few factors, hork up your database pretty badly… (This is what most people complain about.)  Since you (or someone on your team) likely put the db in the wrong protection group, putting it in the right one is the first thing to do – that will solve the disk space issue.  Having enough space on your log drives is critical at this point, because DPM will start to make copies of your transaction logs as part of its backup process and will need the room (as in, half of your transaction log drive).  More details below.
  • I’ve found a couple of Corollaries to go with this:
    • Corollary 1: DPM creates a folder on your transaction log drive called “DPM_SQL_PROTECT” – it stores copies of the transaction logs in there.  Those are the ‘backups’.
      • You have a choice between compressing backups and encrypting them…
      • If you encrypt them, they’re full sized, even if they’re empty.
      • So if you have transaction logs filling up 50% of your t-log drive – guess what’s filling up the other half?  (DPM’s t-log backups.)  That DPM_SQL_PROTECT folder is a staging folder and is sometimes full, sometimes not (making it devilishly hard to monitor), but you need to be aware that if that folder fills up half the drive, you’re running very close to disaster – and that’s when you have to start getting creative in your problem solving (see the examples below).
    • Corollary 2: DPM can be configured to put the DPM_SQL_PROTECT folder on a separate drive, which may suit your needs – a topic for a separate post – but if you run your transaction log drives pretty full and have cheaper storage available, it might be an option to consider.  We don’t have ours architected that way, so it’s an option I’ve not tried.
  • Examples of things that can go very wrong (like the disaster mentioned above)
    • If you are working with a clustered SQL server and are getting alerts because your log file can’t grow, chances are it’s because your transaction log drive (or mountpoint) is full – full of both transaction logs and DPM’s staged backups of those logs.  To fix this, you will need to either:
      • Extend the drive / make it bigger (assuming that’s an option for you) and then restart your DPM backups.
        • Note: DPM will likely want to run a ‘validation’ at this point, which will take some time.  My recommendation is to let it, but there’s a huge “it depends” associated with this one.  Sometimes, depending on how long things have been broken before you were able to get to them, you might find yourself essentially taking that database out of DPM’s protection group and starting over.  That breaks the recovery chain, but it can be faster than letting DPM validate what it thinks your latest backup is against your existing one (where you likely broke the recovery chain with the manual log backups)… Like I said, it depends.
      • (Not advised, but if you’ve run out of options) back up the database once, then back up the log manually and repeatedly (via SQL, not DPM, because you’re trying to empty the drive that DPM has filled up) until you can shrink the transaction log enough that DPM has room on the drive to actually stage a copy of the log file for backup.  There’s a rough sketch of this sequence just after this list.
        • Once you’ve done that, remember, you’ll have fixed one issue but created another, namely, your recovery chain ends in DPM where you started doing manual SQL backups.  Now you have backups in DPM, and a full + a bunch of log backups in SQL.  Make sure you have a full set of backups in case things go wrong.
    • You’re working with a mirrored server, or a server that’s part of an availability group, and the databases are in the wrong protection group (simple instead of full recovery)… You’ve got transaction logs filling up from the replication involved, and transaction logs filling up because they’re not being backed up… It gets ugly.  I ran into this (and wrote about our resolution to it here) – we had issues with log file growth, high numbers of virtual log files, and an availability group with multiple geographically dispersed secondaries… It was, um… “fun…” <ahem>
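Since I mentioned that break-glass sequence above, here’s roughly what it looks like in T-SQL.  This is a sketch only – the database name, file paths, logical log file name, and target size are all hypothetical – and remember: every manual backup here takes the recovery chain out of DPM’s hands.

--EMERGENCY ONLY: manually clear and shrink a log that DPM can no longer stage.
--Database name, paths, logical file name, and sizes are hypothetical.
BACKUP DATABASE [YourDb] TO DISK = N'E:\Emergency\YourDb_full.bak';

--Repeat (with a new file name each time) until enough log space frees up:
BACKUP LOG [YourDb] TO DISK = N'E:\Emergency\YourDb_log_01.trn';

--Then shrink the physical log file back to a sane size (target in MB):
USE [YourDb];
DBCC SHRINKFILE (N'YourDb_log', 8192);

Once DPM can stage its copies again, get the database back into the right protection group and take a fresh recovery point – and hang on to that manual full-plus-log chain until you’re sure you don’t need it.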

So…  All this to say something very, very simple: pick the right protection group, know what’s going on behind DPM’s curtains, and honestly, you should be good.  If you understand which radio button to select when you’re configuring the protection group, you as a DBA are about 90% of the way there.  Make sure your transaction log drives are twice as big as you think they should be (or configure DPM to store its staged backups elsewhere), because chances are you’ll be using half of each drive for the logs themselves, and the other half for temporary storage of the backups of those logs.

Know what your protection groups will do for you… Know the gotchas, and DPM will, strangely enough, be your friend…

Take care out there – and keep your electrons happy.

 

Posted on June 21, 2016

 


Thinking outside the Box in Disaster Recovery


I had an interesting ‘aha’ moment a while back.

I was trying to take a well-earned day off from work, visit my mom on her birthday, and in general, do a little decompressing.

I’d recently taken her portrait (a profession from a former life) and my goal was to give it to her and later take her to an appointment and then lunch.  We were between the appointment and lunch when my phone buzzed.

It was one of my Faithful Minions from the land of Eva Perón, who was in a bit of a bind.

Seems that a server had a database that had gone corrupt.

No, not the Chicago style “vote early, vote often” corruption.

Database level corruption.

It seems that this server had been giving us some trouble with backups (it didn’t want to back certain databases up, depending on what node of the cluster it was on) – and the team responsible for fixing that had been working on it for a few days – with limited success.

Aaaand another one of my Faithful Minions, from the land of Chicken Curry, had tried several things, one of which included detaching the database.

Which worked.

It’s just that once you detach it, you likely want to reattach it.

But trying to reattach a database that’s been munged to some degree is a bit of an issue, as SQL tries to recover the database and chew through the transaction log file prior to bringing it online.

That’s as it should be.
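(For anyone who hasn’t had the pleasure: an attach is just a CREATE DATABASE pointed at the existing files, and there’s no switch to skip that recovery step.  The database name and paths below are hypothetical, of course – a minimal sketch:)

--Reattaching a database.  SQL Server will run recovery on the log
--before bringing the database online - you can't opt out of that.
--Database name and file paths are hypothetical examples.
CREATE DATABASE [YourDb]
ON (FILENAME = N'D:\Data\YourDb.mdf'),
   (FILENAME = N'L:\Logs\YourDb_log.ldf')
FOR ATTACH;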

Problem is, this db had some level of corruption that would cause an issue when it got to 71% recovered. Stack dump, the whole nine yards.

It was, in a word (or several), a perfect example of a disaster recovery exercise. Yay! I’d always wanted to have one of my very own…

(well, not really, and not on my day off, and…)

Hmm…

I gave my Faithful Minion from the land of Perón some instructions while sitting in the car in front of the restaurant Mom and I were going to have lunch at, and then tried to enjoy birthday lunch with her, but still had my mind being pulled dripping out of the clam chowder and back to the database while Minion worked on his Google Fu a bit to try to figure the error out. Mom and I finished up lunch and I took her home, where I logged in to see what Minion and I could accomplish.

So the thing is, we use DPM (part of Microsoft’s System Center suite of products) to do backups. That’s Microsoft’s Data Protection Manager, a program – no – a system that really takes some getting used to. There will be another post in this series about it – but one of the big things you have to get used to as a DBA is that you don’t do backups anymore…

You don’t use Maintenance Plans.

You don’t use Ola Hallengren’s or any of the other amazing tools out there to handle backups.

You really don’t do backups any more.

DPM does.

No backup jobs on your servers…

No backup drives…

No…

Backups…

DPM handles them all semi-invisibly…

It is – well… weird.

It took me a long time to wrap my head around it.

But it works.

And that’s where this situation kind of got a little strange.  (Well, strange*r*)

See, DPM creates Recovery Points (its version of a backup) and it will stage the backup on disk either locally on the SQL box or in the DPM system where you have MDF and LDF files created with some sort of wizardry before it’s shipped off to tape.

So, we poked and prodded, and did all sorts of technical things, until my next set of Minions from The Pearl of the Orient came online – and we tried to find the backup of the database from DPM.  This took a good while.

Longer than we were expecting, actually.

While they were looking and I was troubleshooting, much time had passed in the land of rain (a bit south of Seattle), and lunch was long, long past.  Mom made some dinner while I worked on my laptop, sitting in my dad’s old chair, and I said goodbye to (and thanked) a very tired Minion in the land of Perón.  Meanwhile, Minions in the Pearl of the Orient and Minions (and a Minion in Training) in the land of Curry were all trying to help, trying to troubleshoot, and – in the case of Minion in Training – trying to learn by soaking up individual droplets of knowledge from a fire hose of a real live DR exercise.

Which is where it got interesting…

See, with DPM, you can choose a recovery point you want to restore to, and what it will do is simply replace the MDF and LDF files on your SQL box with the ones it’s reconstructed in all its magic.

The good news about that? Once you’ve learned how DPM works, anyone can pretty much restore anything that’s been backed up. That means Exchange servers, file servers, SQL servers, you name it…

It is (and I hesitate to use this word at all) kind of cool.

You need a restore?

All you have to do is open up DPM, find the database you need…

<click>

<typety type>

<find the backup that happened just prior to whatever blew up.>

<click> <click>

<click>

<pick the recovery point you want>

<click> <scrollllllllll… scrollscrollscroll> <click> <click>

Then you wait till DPM says it’s ready.

And tadaa…

That is…

If DPM had backed that database up.

But for that particular database…

On that particular node…

Of that particular cluster…

That database had not been backed up.

By DPM.

For… Let’s just say it was out of SLA, shall we?

It was truly a perfect storm.

The team *had* been working on it – it just hadn’t gotten actually fixed yet, and the crack that this database had slipped through grew wider and wider as it became clear to us what had happened.

It’s the kind of storm any one of us could run into, and this time it was me.

We looked high and low for a DPM backup set, and found one that was supposed to be on tape.  But, given that this was Thursday where I was and Friday where various minions and co-minions were, the earliest we could get the tape would be Monday…

More likely Tuesday…

On a scale of one to bad, we were well past bad.  We’d passed Go, had not collected our $200.00 – and…

Well, it was bad.

Several days’ worth of waiting on top of the existing time lost wasn’t really going to cut it.

Then, Co-Minion found that there was a DPM backup on disk – but a day older than the one on tape.

I could have that one in two hours (the time it took to copy the mdf/ldf files over and then put them in the right location.)

Hmmmm. A backup in the hand is worth two in the – um… Tape library?

So we copied it over.

And I attached it…

And now we had a database on the box that worked.

Yay, system was up and running.

But…

it wasn’t the database I wanted.

Like any of you out there, I wanted the backup from just before the database had blown up.  But because of all the troubleshooting that had happened before I got involved – some of which actually made the situation quite a bit more challenging – it turned out I couldn’t do anything with the files I had… so, reluctantly, I had to go with the older files, simply because I could get them quicker.

And the next day, while trying to fix another issue on that cluster, I got to talking to the guy who runs the SAN and the VM’s, and I explained what all we’d gone through over the course of getting that database back, and he said, “Wait, when did you need it for?”

I told him.

“Oh, I’ve got copies on the SAN of those drives – we could mount it if you need – 12 hours ago, 24 hours ago, whatever…”

He… You…

What?

I realized what he was telling me was that if I’d contacted him (or, frankly, known to contact him at that time of night) – I could have had a much more recent backup sooner instead of spending so many hours trying to fix the existing, corrupt database.

What it meant, was I was thinking inside the various boxes I was dealing with…

Trying to get a *SQL* backup…

Trying to get a *DPM* backup…

When what I needed, frankly, was bits on disk.

I needed MDF and LDF files.

Not only could DPM get me MDF and LDF files, but so could the SAN.

And the SAN backups had them.

I just didn’t know it.

By the time I found out, the old database we’d attached had had data written into it and had been up for a good while.  Merging the two would have been far, far more of an issue than it was already (I’d had experience with that once before, as an end user).

So where does this leave us?

It means that if you’ve got a SAN as your storage mechanism that’s backing itself up, you might find yourself thinking outside the box you’re used to when trying to restore or recover a database.  Go buy the person or people running your SAN a cup of decent coffee and just chat with them, and see if this kind of stuff is possible in your shop.

They might be able to help you out (and save you substantial amounts of time) in ways you quite literally couldn’t have imagined.

In the end, I learned a lot…

My team learned a lot.

My mom learned a lot.

I will be making some changes based on what we learned, with the goal of being able to have some more resiliency in both the hardware resources we have as well as my faithful Minions who worked hard at trying to fix things.

Ultimately, one of the takeaways I got from this was something simple, but also something very profound.

Make sure the stuff you’re doing in your day job continues to work, so you can actually have a day off to spend with your loved ones when you need it.

Oh – and the portrait of mom? She loved it.

It’s here:

[Photo: the portrait of Mom I delivered to her.]

 

Take care out there, folks…

Tom

 

 

Posted on April 1, 2016

 


Adventures in Fixing VLF issues on a database in an Availability Group, backed up with DPM, with multiple secondaries


Hey all,

We had a server failover recently during a demonstration to one of the C-level folks, and as you might imagine, it was brought to my attention. The reason for the failover wasn’t something I could control, but the length of time the relevant database took to come up was something I could – so when it finally came up, I checked the errorlog and noticed this little note:

Database <dbname> has more than 1000 virtual log files which is excessive. Too many virtual log files can cause long startup and backup times. Consider shrinking the log and using a different growth increment to reduce the number of virtual log files.

Hmm…

More than 1000, huh?

Slow recovery from failover, huh?

Hmm…

Where had I seen this before?  Oh yeah – some years back, when we were doing datacenter maintenance, and the *one* SharePoint database that had all our (ahem) “How to bring up a datacenter” documentation in it took half an hour to come back.  I remember now!

So if you don’t know – having a metric buttload (note – that’s a real term, go look it up) of virtual log files means that every time your server fires up a database, it has to replay the transaction log to get everything up to date, and if the transaction log’s got a bunch of virtual log files, it’s like trying to run data backwards through a shredder… it takes time, and that’s the issue.
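(If you just want a quick, by-hand look at a single database before running anything fancier – the full multi-database script is at the end of this post – the raw command is simply this, with a placeholder database name:)

--One row per virtual log file; the row count is your VLF count.
DBCC LOGINFO('YourDb');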

I remembered Michelle Ufford’s (blog | twitter) post that I’d found back then about counting virtual log files, and tried it on this server now that it was up.

And got this nice little message:

Msg 213, Level 16, State 7, Line 1

Column name or number of supplied values does not match table definition.

Back to “Hmmm…”

Well, I knew dbcc loginfo(dbname) from her script was inserting into a table that the script created, and it had worked before.  It had been a while since I’d run it – so I figured maybe there was some change in what it returned based on the SQL version.  So I decided to see what it was inserting…

Well hey – the original script was missing a field (first column above)

So I added it (RecoveryUnitID – the first one below)

…and it worked fine.

Turns out SQL 2008R2 (where the original script worked) returns different fields than 2012 and 2014 (where it didn’t).

I figured I didn’t want to find out which version of the script to use every time I needed to run it on a server, so I told the script to figure that out by itself, and then run the appropriate hunk of code (example below)

So what you see in that code is a combination of this:

https://msdn.microsoft.com/en-us/library/ms174396.aspx from Microsoft to check the version number,

and this

http://sqlfool.com/2010/06/check-vlf-counts/ from Michelle Ufford.

A couple of things to understand here:

  1. This code won’t fix the problem (read Kimberly’s posts below for that)
    1. http://www.sqlskills.com/blogs/kimberly/transaction-log-vlfs-too-many-or-too-few/
    2. http://www.sqlskills.com/blogs/kimberly/8-steps-to-better-transaction-log-throughput/
  2. It will, however, show you that a problem exists.
  3. …that you have to fix and manage (Back to Kimberly’s posts)

You’ll want to do some fixing that involves:

  1. Lots of log backups and log file shrinking, then log growing (see Kimberly’s posts for more info on this).
  2. Adjusting the auto-growth setting from the 1 MB/10% default to something significantly larger (meanwhile, back in Kimberly’s posts… Seriously, check them out).
  3. Oh, and make those adjustments on the Model database so that every new database you create has decent growth patterns instead of the 1 MB/10% bit.
  4. Tadaa – you should find much faster failovers on your clustered systems (assuming you’re still using them and not Availability Groups).  A rough sketch of steps 1-3 follows.
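Here’s a minimal sketch of that shrink-regrow-fix-autogrowth cycle.  The database name, logical log file name, and sizes are hypothetical – read Kimberly’s posts above before picking real numbers for your workload:

--A rough sketch of the fix cycle.  Names and sizes below are hypothetical.
USE [YourDb];

--1. After a log backup (or DPM recovery point) has cleared the log,
--   shrink the log file (target size in MB):
DBCC SHRINKFILE (N'YourDb_log', 1024);

--2. Grow it back out in large chunks so you end up with a small number
--   of reasonably sized VLFs:
ALTER DATABASE [YourDb] MODIFY FILE (NAME = N'YourDb_log', SIZE = 8192MB);

--3. Fix the autogrowth increment so the problem doesn't creep back:
ALTER DATABASE [YourDb] MODIFY FILE (NAME = N'YourDb_log', FILEGROWTH = 512MB);

--...and make the same FILEGROWTH change on [model] so new databases
--inherit sane growth settings.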

So… that made it look easy, right? Well, let me give you the Paul Harvey Version, “The Rest of the Story.”

What I learned, and what complicated things a bit…

My system was an Availability group with two nodes running synchronously, and two nodes in two different datacenters running asynchronously.

They’re backed up by Microsoft’s Data Protection Manager (DPM), which is configured to back up a secondary node.

So my plan was to back up the secondary – only when I did, it told me that the last recovery point (what a backup’s called in DPM) had something to do with Marty McFly, namely that it had been backed up in the future – about 8 hours into the future.

Hmm… Okay, that told me the secondary it was backing up was one of the asynchronous ones, and it was either in the land of fish and chips, or the land of beer and pretzels, either way – it was a weird way to think about trying to preserve the recovery chain.

I created recovery points through DPM, which did it on one of those secondaries, which, strangely, helped to clear the log on the primary.

Then I shrunk the log, in stages, on the primary, which then replicated out to the various secondaries locally and across the pond.

I was able to get the VLF count down from well north of 33,000 to just over 600.  The log wouldn’t shrink any more than that, which told me there was a transaction hanging on toward the end of one of the virtual log files – and frankly, for right then, that was okay.

To fix it further, I almost entertained the idea of flipping the recovery model from full to simple, clearing the log file completely, then flipping it back to full, but that would have totally broken my recovery chain, and with three secondaries, I had no desire to try to fix all the fun that would have created. (Translation: Rebuilding the availability group – Uh… No.)

I then grew the log file to an appropriate size for that database, then increased the growth rate on the database files to something far more in line with what Kimberly mentioned above.

All of this was done on the primary, and the changes I made propagated out nicely, so I only had to make them in one spot.

Oh, while I was researching this…

(I’d done this before – reducing VLF counts – but never on an Availability Group, much less one with multiple secondaries in different time zones, where DPM is backing up one of the secondaries and it’s not obvious which one.)

…I noticed other people had clearly run into this same issue (reducing VLF counts) – and had not only entertained the idea of automating VLF reduction, but taken the idea out drinking, dancing, and somehow ended up with a receipt for a cheap motel room – and then put it all into a stored procedure.

That ran…

In a job…

On a schedule…

Um.

I decided that breaking the recovery chain was bad enough, but shredding it was so much worse – so this part will not be automated.

At all.

Oh – and here’s the code. It’s Michelle’s, with a couple of tweaks from me and one from Microsoft.

--Code to determine how many virtual log files there are in a database.
--modified from the original code here: http://sqlfool.com/2010/06/check-vlf-counts/
--to check for multiple versions of SQL because of schema changes between versions
--if any of the results are > 1000, do multiple log backups (or DPM Recovery Points)
--to clear out the transaction log, then shrink the tlog to get rid of
--the extra virtual log 'fragments'.
--As always - test this in your test environments first.  Your mileage may vary. Some 
--settling of contents may have occurred during shipping.  You know the drill folks.
--Best reference out there for managing this:
-- http://www.sqlskills.com/blogs/kimberly/transaction-log-vlfs-too-many-or-too-few/
-- http://www.sqlskills.com/blogs/kimberly/8-steps-to-better-transaction-log-throughput/
--the code below handles SQL 2008R2, 2012, and 2014. If you're running something other than that,
--just run a dbcc loginfo(YourDatabaseName) and make sure the #stage table has the same
--columns in it
--if it doesn't, add a section below.
DECLARE          @sqlversion SYSNAME
SET              @sqlversion =(SELECT CONVERT(SYSNAME,SERVERPROPERTY('ProductVersion')))
PRINT            @sqlversion
-- SQL 2008R2

IF @sqlversion LIKE '10.50.%'
     BEGIN
          CREATE TABLE #stage
                                  (
                                     FileID        INT
                                   , FileSize      BIGINT
                                   , StartOffset   BIGINT
                                   , FSeqNo        BIGINT
                                   , [Status]      BIGINT
                                   , Parity        BIGINT
                                   , CreateLSN     NUMERIC(38)
                                  );

          CREATE TABLE #results
                                  (
                                     Database_Name SYSNAME
                                   , VLF_count     INT
                                  );
          EXEC sp_msforeachdb  N'Use ?;
                                Insert Into #stage
                                Exec sp_executeSQL N''DBCC LogInfo(?)'';
                                Insert Into #results
                                Select DB_Name(), Count(*)
                                From #stage;
                                Truncate Table #stage;'
         SELECT         *
         FROM           #results
         ORDER BY       VLF_count DESC
         DROP TABLE     #stage
         DROP TABLE     #results
     END
     --sql 2012/2014 - note - new temp table names because sql thinks old ones are there.
IF @sqlversion >= '11'
     BEGIN
          CREATE TABLE #stage2
                                  (
                                     RecoveryUnitID INT
                                   , FileID         INT
                                   , FileSize       BIGINT
                                   , StartOffset    BIGINT
                                   , FSeqNo         BIGINT
                                   , [Status]       BIGINT
                                   , Parity         BIGINT
                                   , CreateLSN      NUMERIC(38)
                                   );
          CREATE TABLE #results2
                                  (
                                   Database_Name SYSNAME
                                   , VLF_count INT
                                   );
          EXEC        sp_msforeachdb N'Use ?;
                                     Insert Into #stage2
                                     Exec sp_executeSQL N''DBCC LogInfo(?)'';
                                     Insert Into #results2
                                     Select DB_Name(), Count(*)
                                     From #stage2;
                                     Truncate Table #stage2;'
          SELECT         *
          FROM           #results2
          ORDER BY       VLF_count DESC
          DROP TABLE     #stage2
          DROP TABLE     #results2
     END

Thanks to the #sqlhelp folks for their advice, suggestions, good-natured harassment, and ribbing:

Allen McGuire (blog | twitter)
Argenis Fernandez (blog | twitter)
Tara Kizer @tarakizer (blog | twitter)
Andre Batista @klunkySQL (blog | twitter)

and of course, the ones whose reference material I was able to use to put this together:

Michelle Ufford (blog | twitter)
Kimberly Tripp (blog | twitter)
Paul Randal (blog | twitter)

<outtakes>
When I told Argenis about the DPM involvement and secondaries in multiple time zones, his response: 🙂

 

Posted on January 5, 2016

 


Paul and the Butterfly Effect


So now that most of you reading this have come back home from either PASS Summit or MVP Summit and are trying to get your heads around being back home and/or in the office again – grab a cup of coffee and allow me to share a story with you…

(for those of you who have no idea what I’m talking about – imagine a bunch of introverted computer geeks getting together, eating together, drinking various liquids together, learning together, and quite often having an amazingly good time while they’re at it)

This story started out as a response to a note Paul Randal (blog | twitter) wrote just after Summit almost a year ago, and for any of you who’ve read my stories, you’ll be familiar with the phrase, “And it got me thinking” – usually about halfway down the story.  But this time it starts at the beginning.  See, Paul wrote about this thing called “the butterfly effect,” and how making one small change in something could make major changes later on…

The links above are Paul’s, the links below will likely reference my own stories, which you’ll understand as you read along.

Paul asked for feedback on what he wrote about the butterflies, and I started writing, and thinking, and thinking, and writing.

And it went something like this:

Gosh, Paul,

It’s hard to figure out where the first butterfly flapped its wings in this story…

And please forgive me – it turned out longer than I expected, but you’ll understand when you get to the end.

I was working at Microsoft (1996-2000), and over time had sent out stories about my then 7 year old son (like this one) – and one day, I got a surprisingly snarky reply-all response back from a fellow I’d considered to be a friend.  We’d worked together (he in England, me in Redmond) – and while I wanted to fire back with all the self-righteousness in the world, how he, a young, single man, didn’t understand the joy that comes from having the privilege of having children, I backed off.

I checked with a trusted friend for advice, and then suggested to him, kindly, that in time, he would indeed understand the joy one can have experiencing a story like the one above, but he was not yet old enough to understand that.  I wished him well, and told him that I hoped that someday he would be able to experience that privilege.  I made sure not to burn any bridges, because his comment was based on youth and inexperience, and I truly valued his friendship.

Fast forward 10 years.

We’d both grown, we’d both lived, but most importantly, we’d stayed friends.

He’d moved to the US, worked for several companies, including the one I work for now, and is now back at Microsoft, based out of Chicago.

Some time back, he heard I was looking for work, and within a day, had written a letter of recommendation based on years of trust, years of both personal and professional friendship, and in part due to that – I had an opportunity to interview and was subsequently hired.

I had/got/made/took opportunities to speak (I did sessions down at SSWUG.org for a few years) and then got involved in the SQL community, starting, as I recall, with Chris Shaw (blog | twitter ).  He was one of the speakers at one of the SSWUG sessions at the time (I didn’t know about PASS then), and I was just being me, flipping him crap, and he flipped it back.  We laughed, then he asked if I wanted to speak.  I couldn’t imagine him asking that, but he did, and the conversation went like this:

“Have you ever done any public speaking?”

“Only Eulogies.”

<crickets>

“No really.  One was for my dad, and one was for a friend in my cancer survivor’s support group.”

I couldn’t imagine anything more stressful than doing a eulogy for your own dad, or for a friend who had died of the same disease you were fighting, too, so I figured I’d give it a shot.  How hard could it be, right?

Little did I know – but I worked hard, made some presentations, and I did it for three years. I met wonderful people there (Chris Shaw & Wendy Pastrick  ( blog | twitter ) were with me in the studio one of those times) and met others through the sessions.

[Photo: Wendy Pastrick, me, and Chris Shaw]

…and Paul, I had a ball doing it.

This year I spoke at the SQLSaturday in Portland just before Summit – on communication, and how important it is, and how hard it is to do well. (I have a Bachelor’s degree in Communication and a Master’s in Visual Communication (photojournalism) and am still learning)

And it got me thinking, as so many stories do…

Had I gotten mad at John (the friend in Chicago) and burned that bridge way back in 1997 when I sent that original emailed story out (I think it says 1998 in the one on the blog) – then I wouldn’t have had him as a resource to get the job I have today.

Had I not gotten that job, I wouldn’t have had the speaking opportunity in Tucson.

…nor would I have gone to Summit.

Or had the opportunities to speak.

Or made the friends I’ve made.

I wouldn’t have realized there were other people just as lonely as I was out there who worked in their own little cubicle, being one of very few people in their companies doing what they do.

I wouldn’t have learned about #sqlhelp, and #sqlfamily, and summer camp for geeks (Summit)

I wouldn’t have learned that just by tweeting something with DBCC in it, in short order you could get an answer from the guy who wrote it.

As a result, my mind has been in a state of continuous bogglement (if that’s a word) for the last 7 years.

(you realize this list could go on for a good long while.)

But I did learn – and I know a little about these things now.

I know there are folks out there who will help, who will encourage, and who will cheer me on should I need it.

Just today, I found myself on the cheering/encouraging end of that equation (one member of the SQL community came home from Summit to one less family member in the house)

I don’t know if, in doing that, I was a butterfly helping someone in their own life journey. I don’t know…

And I don’t think any of those things (that family member excluded) would have happened had I gotten mad at John those many years ago.

Yes, I do think about that.

Oh, speaking of John… He and his wife now have two boys – and he understands.

Take care Paul – thanks for making me think – that was fun.

Tom

PS: Not to overload you, but to take the butterfly back even further – almost 100 years ago, there was this little piece of Russian shrapnel…

…if it hadn’t hit where it did, you wouldn’t be reading any of this.

And I sent it…

And just now – it got me thinking some more…

Some of the butterflies in life are good ones.  Some are bad ones.

All of them got you to where you are today.

So I’ll end this one, uncharacteristically, with a question: as you think back, what are your butterflies? What got you to where you are today?

Take care out there, folks,

Tom

 

Posted on November 6, 2015

 


My 10,000th tweet – and #sqlfamily


I’ve been trying to figure out what kind of a profound thing I could write for this milestone tweet… I’ve been thinking about it for days now and trying to edit it down to 140 characters – and – well, I’m cheating a bit and putting a link in here.

The thing that strikes me in the 4 years I’ve been tweeting is this thing we call #sqlfamily.

I don’t remember the whole history about it – but it seemed that one of the places it spawned was Summit a few years ago – where several thousand introverted geeks get together who’ve been typing at each other for years – and some of us were kind of surprised to realize, over time, that there were real people on the other end of those keyboards we’d been tweeting with.

People with… with meat on them, who you can laugh with, share a coffee, a drink, or a meal with, and see the sparkle in their eyes as you realize over the years you’ve known this person, but not *in* person.

You realize that #sqlfamily isn’t just folks who are passionate about SQL – though that’s a huge part of it.  You learn that those in #sqlfamily are folks who have lives outside of work, who have families, and hopes, and dreams, and fears, and….

…and we’re more than people who chase electrons around the planet, we’re people, who’ve become more than just colleagues, we’ve become friends.

…who, when we’re the only ones who have a clue what goes on in our databases at work and have no one to talk to at home about work things (you know “the look”) – find that there’s someone out there in #sqlfamily who actually gets it…

That’s #sqlfamily.

Professionally, some of us have been around long enough to have some gray hair, and have made enough mistakes that we can share a bit of the wisdom that comes from that.

Some are quite young and have the energy and enthusiasm for new things that those with experience can learn from.

Personally, we encourage each other in times of trial, whether that’s personal or professional.  We spontaneously raise money to help those of us who have fallen, or raise money for a cause, not because we’re looking for glory, but because it’s the right thing to do.

We cheer each other on, whether it’s in some level of self-improvement, or trying to get healthier, or a new life event…

We support each other offline through conversations no one else will ever hear through some of life’s darkest moments

And we flip each other unending amounts of crap – virtual pats on the back, inside jokes and “did you remember when…” moments.

I’ve noticed several businesses trying to use the #sqlfamily hashtag for marketing – and that’s not what it’s for.  I had to clarify that with one of them during one of those dark times to get them to understand it, and some did.

#sqlfamily is more than people we work with, it’s more than friends we see a couple of times a year.

#sqlfamily – to me… It’s sacred.

And I’m proud, and honored, to be part of it.

 

Posted on July 23, 2015

 


Keeping the Electrons Happy


So I work as a database administrator, which is often something that confounds people who aren’t in IT… if I try to explain to them what I do, their eyes glaze over like a Krispy Kreme donut, and I realize I have to back up and introduce things in a little simpler way, so I usually start off with the sentence:

“I piss off electrons all over the world.”

And that gets them laughing, then I can tell them what I really do.

And the thing is, I work for a consulting firm, but I rarely, if ever, see the front end of the applications we run, or work on or build. I see domains, and servers, and databases.

My life at work consists of solving problems, pissing off electrons, and trying not to do the same to people.

For the most part, this is a good set of priorities to have.

As a DBA, you do your best to keep from making mistakes, right? You want your users happy, you want your electrons just slightly irritated, and you want your data to be as perfect as it can be.

It’s hard to see what all this looks like on the front end until you actually see it, and then it’s like being a doctor and seeing symptoms, and intuitively understanding what those symptoms mean.

And, surprise surprise, I have a story about that.

So some years back I had to go to the doc to go over some test results.

When I checked in, they asked all sorts of other questions, like, “Have you ever gone under another name? Have you ever lived at this address?”  And on and on and on…

I’d been at this place before, they already had this information in their system. There was no reason for them to be asking me these questions.

So I turned the tables, and asked them why they were asking me all these questions.

It turns out someone was in their system with my birthdate and my SSAN…

Hmmm…

So the good thing about this?

They noticed it.

The bad thing?

Well – let me just take you through it, and see if you can figure it out…

The receptionist made a telephone call to fix the problem.

I was ushered back to see the doc, the nurse asked what medicines I was on. This was unusual since it was in their system, but she took all my records and entered them into the system. Sometime later, the doc came in and wanted to compare my records, current to past, and so on, but there was nothing of my medical records after 2007, meaning there were several years of records missing.

What do you think happened at this point? Take a guess, then keep reading.

I put two and two together, and because of the strange series of questions I’d been asked initially, I’d paid attention, and had made a note of the security guy’s name and number, and asked the doc if he wanted me to call, because, “It’s what I do”.

He, still trying to figure out what was going on with the keyboard, said, “Sure.”

I picked up my cell phone, called the number I’d written down and asked to talk to the guy whose name I’d written down below that.

His response: “Oh, I was trying to fix that and must have pushed the wrong button by mistake.”

I just let the silence hang in the air for a little bit while he realized what he’d said and who he’d said it to…

…and then I took off my patient hat, and sitting there in the doctor’s office, put on my DBA hat, cracked my knuckles a bit, and told him to let me talk to the person who was fixing the problem because by golly it was GOING to get fixed, and get fixed right then.

I was connected to the person who was fixing the issue, and explained that I was (ahem) SITTING IN THE DOCTOR’S OFFICE WHEN MY RECORDS DISAPPEARED, and would she kindly make them bloody well reappear.

Now.

She said she could merge the records, said it would take till the end of the day.

Have you ever had to project an air of calmness when it was the last thing you wanted to do?

Yeah… that’s where I was.

I tried to explain to her that I was still (ahem) SITTING IN THE DOCTOR’S OFFICE, WITH THE DOCTOR, WAITING TO GO OVER RESULTS, and would appreciate it if she could fix it long before the end of the day – like before I left the office – since we were about to make some pretty significant decisions based on those results.

She said she’d work on it, and I gave her my cell number and told her to call me when she was done.

In a triumph of the anti-nerds, the doc got out a huge pad of paper and a marker and drew on it to show us what he remembered of the information he’d seen before it was deleted, and was most of the way through that when my phone buzzed.

It was the gal from IT who had worked on un-futzing the records.

She said it was fixed. She’d merged the records (her words).

The doc logged into the system again, and all my records were there.

Both he and my wife were impressed.

So we got those things taken care of.  I went in later and found that, in the results of the merge, my meds showed up twice (once from the original records, once from when the nurse re-entered them after they were deleted).

So we (well, they) had to fix that.

I had a little chat with the IT folks at the hospital after that to get them to understand that there were folks out here who knew what was going on in there and what it meant.

And it needed to be fixed.

Because, I realized, as much as I tell people I just piss off electrons, those electrons mean something to someone.

And they have to be accurate.

And it got me thinking…

See, while I was pretty annoyed that the problem existed (it’s not that the whole situation wasn’t already stressful enough), it was an amazing learning experience, to step out of the role of being the patient and tell the doctor who was about to try to fix me, that he needed to let me do my work, in his office, so I could fix his problems, so he could see what he needed to see in order to fix mine.

Yeah, I had to look at that paragraph a time or two myself.

I wondered how long it would have taken if that had happened to someone else right then, because seriously, who else would have been in a position to know what had happened and what it meant?

I’m still a little baffled by it all, and I think that’s okay.

I started looking around at the projects the company I work for does (I work for Avanade, a consulting company), and it always astonishes me what all kinds of things we’re involved in – the stuff I rarely see, but affects people’s lives in ways that are big and small, from the drinks you might get on a flight across the country to buying a motorcycle to, even figuring stuff out at the doctor’s office.

And it makes me realize that I don’t want to piss off the electrons at all.

I want to keep them very, very happy.

 

Posted on April 20, 2015

 


Life Lessons from SQLSaturdayABQ – Including one from Bugs Bunny


[Image: SQLSaturday ABQ]

So it’s been a couple of weeks – but SQL Saturday ABQ has finally simmered long enough for me to write about it. I learned so much about so many things down there, and am tremendously grateful for the opportunity to share not only a meal and some learning with professional colleagues, but also reconnect with old friends and make new ones.

The trip down was great – I found out that putting my phone into Airplane Mode seemed to put some pretty cool Airplanes into pictures of an already gorgeous landscape.

I learned that Albuquerque is at 5,000 feet, and for someone used to living at sea level, I learned to appreciate the simple things in life, like, say, air.

On our second day there, the wonderful friend we were staying with took us up to the Sandia Peak Tramway.  She and my wife enjoyed the gift shop at the lower elevations while I, still getting used to the 5,000-foot elevation, went up another 5,000 feet on the tram.

No, no oxygen masks fell out of the sky, but I could definitely feel all 10,378 feet of altitude there.

After I got back down, we explored some more, stopped at some lower elevations, and got a little perspective on the mountains from a little further north.

We stopped in a little town called Bernalillo and saw the Coronado Historic Site, and could hear the Rio Grande gurgling down below.  It was so different from what we have here in Washington – one could easily say it’s ugly and brown, but that would be missing the point.  It’s got a beauty all its own, and needs to be looked at with different eyes.

As for the sessions with SQL Saturday itself…

I learned things about PowerShell in the session from Mike Fal ( b | t ) that made me want to get my hands dirty and try to find problems we’re struggling with in our environments that could be solved with a little PowerShell, and he proved that yes, you can indeed type in a demo, and then promptly demo’d why not to do it. 🙂 I loved the examples he gave, and the fact that he stuck with “learn the concepts, don’t freak about the code” – in large part because, just like in Field of Dreams (if you build it, they will come), with PowerShell, if you learn the concepts, the code will follow. You just have to understand what you want to do first – and that happens with the concepts. I’ve learned that if you give someone a problem first, and then give them a pile of tools, they’ll figure stuff out, and you’ll see creative juices flowing as they start thinking about new ways to solve old problems. It’s kind of fun to watch, and more fun to be part of.

I didn’t take any pictures of Jason Horner’s (b|t) session, but I enjoyed his presentation very much, which was full of demos and examples of databases much bigger than the ones I handle. As I recall, there were 4 (FOUR) MCM’s in the room… We were walking on hallowed ground there – and just that level of conversation, questions, and knowledge was fun to be around. I look forward to seeing more of his presentations, and applying what I learned there.

John Morehouse’s ( b | t ) session on social media had an interesting cross section of people in it – some quite experienced who are used to it, but always looking for new things to learn, and some who were absolutely new to the game and had never, ever used it.

John talked about how it can benefit you professionally, how easy it is to blur the lines between personal and professional, and how to do your best to keep them separate if needed. He made a very valid point that no matter where you are, you’re an ambassador for your company – so “think first, then post” along with the idea that once you hit send, it’s out there. Know your company’s policies on social media. That’s a huge thing. Even if you think it’s a private message, the wrong screenshots in the wrong places can be embarrassing for a long time. We also had a live demo of twitter, and how to do everything from getting a question answered with #sqlhelp, to getting a job via social media.

And I learned a lot – I got better doing my presentation on “Life Lessons in Communication” for the second time – and finally getting the butterflies in my stomach to at least fly in formation. I learned a lot from my audience, and had something totally off script pop into my head during the presentation – with the simple sentence of, “Are you solving the problem? Or managing it?” I realized I learned as much from my own presentation and audience as they did from me. Oh – and lessons are everywhere. You don’t even have to look hard. You just have to pay attention.

It was a lot of fun.

Many, many thanks to Meredith ( b | t ) and crew for getting everything together, and for the absolutely wonderful speaker’s dinner the night before. (what was that smokey salsa-y stuff on the chips? that was amazing!)

What came next was a time that can only be described as a slice of heaven.

Any of you in the IT industry know that the whole work/life balance thing is something that has to be managed very, very deliberately.  The cost of not managing it can be ridiculously high.

And so, for the next few days, I was able to spend that precious thing called time chatting with my wife and our friend, getting incredible amounts of fuzz therapy from two wonderful dogs, and just spending time away from the computer.  I allowed myself the time to absorb some of the lessons I’d learned at SQLSaturdayABQ, and then, as I watched some of the hot air balloons drift by, I realized that not all of the lessons I learned there had to do with SQL… a lot simply had to do with life. Some of them are still simmering, but all of them will end up in a story sooner or later.

Again – thanks to…

…all who attended the presentation (Jason had said I couldn’t start until he was there – I had the pre-presentation butterflies and was quietly hoping for him to be late – but he was there in the front row when I got there – so there was no backing out at all :-) –

…to Avanade for giving me the time off to go do this, and

…to our wonderful friend for sharing her lovely home and hospitality with us, and last but most certainly not least

…to my family for their patience and faith in me as I worked through it.

There are many more lessons out there to be learned, and as I find them, I’ll do my best to share them.

Oh, here’s just one:

You know how Bugs Bunny always says, “I knew I should have taken a left turn at Albuquerque!“?

I now know why he couldn’t do that. 🙂

Take care out there, folks…

Tom

 

Posted on March 2, 2015

 
