RSS

DPM… and “Why are my SQL Transaction Logs filling up?” (the answer… really)

21 Jun

Here at Avanade we run a lot of pre-release software from Microsoft, so we can work the bugs out, so to speak, and thus be able to get better solutions to the customers we’ve got.

One of the software packages that we use is the System Center Suite – including, in this case, the Data Protection Manager part.  It’s now in production, released product for some time, and we now run all our backups with it.

The thing that is totally weird for me as a DBA is that all of the backup jobs I had on the boxes got disabled, and DPM took over. It was like driving a car with an automatic transmission for the first time. “What do you mean I don’t have to do backups?”

If you have things set up properly (which is a deeper subject than this blog post is meant for) – it’s close to hands off – which, come to think of it, is still weird.

So, over the time we’ve had it – I’ve found that there are several things to be aware of in the implementation of it.  This is not an all-inclusive list of surprises, this is just a couple of the things that I found out – after researching them myself and finding a lot of bad information out there.

The information below is from my own experience, from conversations with Microsoft, and from research I’ve done.  That said, my goal is to help keep you out of the DPM weeds by helping you

  1. Understand how DPM handles SQL backups in the varying recovery models you can have (full, simple, etc.)
  2. Understand where it needs drive space. (this can be an absolutely evil ‘gotcha’ if you’re not careful)
  3. Edge cases.
  4. How items 2 and 3 can intertwine and get you into deep doo-doo and what you want to do to stay out of it.

So, ready?

Oh – assumptions:

You’ve got DPM installed, and for the most part, configured.  It’s working, but you have transaction log drives filling up on some of your servers, and it’s not really clear why.

Wanna know why?

Here’s the answer:

It’s because the UI is very unclear, because the documentation is unclear, (there was a hint of it on page 83) and because the things that would be obvious to a DBA simply aren’t mentioned.

So, having said that – let’s explain a little.

After a few years of running it, and we flog it to within an inch of its life, I’ve come to – if not make friends with it, then at least I’ll give it grudging respect.

But you have to know what you’re doing.

So first off: Settings.

You organize your backups in DPM in what are called protection groups.

Think of it as a glorified schedule, with lots of properties you can adjust.

Now when you’re creating the protection group, you can create it to back up file servers, Exchange servers, and SQL servers.  We’ll talk about SQL servers only here.

So when you back up a SQL box, you might have some databases (let’s say the system ones, Master, Model, and MSDB) in simple recovery mode, and the user databases in full recovery.

What you’d do is create two protection groups.

One for the system databases, for simple recovery.

And one for the user databases, for full recovery.

And this is where the first gotcha comes into play.

See, when you create that protection group, you’re going to come across a tab in the creation of it that gives you this choice… It’s like the matrix… Blue pill? Red pill?

And, when we were setting it up, we scoured the documentation to try to figure out how to set it up for databases that were in full recovery mode as well as doing mirroring.

And we kept running into problems with full transaction logs.

It turned out I wasn’t alone in this.

I googled the problem…

I binged the problem..

Heck, I even duckduckgo’ed the problem.

Everyone, it seemed, had the same question.

And the answers were surprisingly varied.

And most, honestly, were wrong.  (Note: that’s not to sound arrogant, this was a tough one to crack, so a lot of folks were coming up with creative ways to try to work around this issue)

Some folks were doing what we’d done initially *just* to keep the systems running. (manual backup, flip to simple recovery, shrink the logs, flip back to full recovery) – yes, we absolutely shredded the recovery chain, just shredded it – but the data in that application was very, very transient, so keeping it up and functioning was more important than keeping the data forever.

So while we were frantically – okay, diligently – searching for answers, managing the problem, we were also looking for a cure to the problem, because there was no possible way this could be an enterprise level application if it was behaving so badly… Right? There had to be some mistake, some setting we (and everyone else in those searches above) weren’t seeing, and it finally ended up in a support call.

My suspicion was that the transaction logs weren’t being backed up at all, even though that’s what we thought we were setting.

I’d been around software enough to know that clicking on an insignificant little button could wreak catastrophic results if that’s not the button you meant to push.

And this was one of them.

See, the databases that were in full recovery (namely those in that app that had the mirroring) were our problem children.  Databases in simple recovery weren’t.

It made me wonder why.

And one day, I was on the phone about another DPM issue, (for another post) and I asked the question, “So what exactly is happening if I click on this button versus that one? Because my suspicion is that the tlog is not getting backed up at all.”

And then I asked the more crucial question for me, and likely for you who are reading this:  “What code is being executed behind these two options?”

And the fellow over at Microsoft researched it for me and came back with this:

“In DPM, when we setup synchronization for SQL protection, those Syncs are log backups. DPM achieves that by running the following SQL Query

BACKUP LOG [<SQL_DB>] TO DISK = N'<Database_location_folder>\DPM_SQL_PROTECT\<Server_Name>\<Instance>\<DB_Name>_log.ldf\Backup\Current.log'

Recovery Points are Express Full backups. If we setup a group to run sync just before recovery points, DPM will create a snapshot from the replica and then create a Full Database backup (sync).

BACKUP DATABASE [<SQL_DB>] TO VIRTUAL_DEVICE='<device_name>'

WITH SNAPSHOT,BUFFERCOUNT=1,BLOCKSIZE=1024

In this case we will never have log backup thus no log truncation should be expected.”

What does this mean?

It means that if you have a database in full recovery, you will want to put it in a protection group that is set to schedule the backup every X minutes/hours like this:

In DPM, click on “Protection” tab (lower left), then find the protection group.

Right click on it and choose ‘Modify’ as below.

GEEQL_DPM_Modify_Protection_Group

Expand the protection group and pick the server you’re trying to set up backups for there – you’ll do some configuring, and you’ll click next a few times, but below is the deceptively simple thing you have to watch…  This dialogue box below – which will have the protection group name up at the very top (where I’ve got ‘GROUP NAME HERE’ stenciled in) can bite you in the heinie if you’re not careful.  So given what I’ve written, and from looking at this and reading what I wrote above – can you tell whether this is backing up databases in full or simple recovery mode?

GEEQL_DPM_Backup_Full_Recovery

See how it’s backing up every 1 hour(s) up there?

That means the code it’s running in the background is this:

BACKUP LOG [<SQL_DB>] TO DISK = N'<Database_location_folder>\DPM_SQL_PROTECT\<Server_Name>\<Instance>\<DB_Name>_log.ldf\Backup\Current.log'

We’ll get into more detail in a bit, but this means you won’t have full transaction logs.  This is the setting you want for the protection group you’ve got to backup your databases in full recovery mode (and the ones that are mirrored or in Availability Groups). The other option you have is to back up “Just before a recovery point” – which, if you’re thinking in terms of SQL and transaction logs, really doesn’t make a lot of sense.  We went through the documentation at one point, and I think we were right around 83 pages in before it gave an indication of what it *might* be doing here – but even so it wasn’t clear, but now we know.  So what you’d want to have in this protection group would be a bunch of databases in full recovery mode.  You might want to create different protection groups for different servers, or different schedules, that’s all up to you… The crux is, if it’s a database in full recovery mode, this is how you want to set it up, by backing up every X minutes/hours… Making sense?

Okay, let’s take a look at the other option…

GEEQL_DPM_Backup_Simple_Recovery

If you have a database in simple recovery, you’ll want to put it in a protection group that does backups just before the recovery point.  And that’s what the screenshot above does.    When you click on that radio button, the code it runs in the background if you’re backing up SQL databases, is this:

BACKUP DATABASE [<SQL_DB>] TO VIRTUAL_DEVICE='<device_name>'

WITH SNAPSHOT,BUFFERCOUNT=1,BLOCKSIZE=1024

And you should be set.

You can change the frequency of the express full backups by clicking on the ‘modify’ button in the dialogue above, and you’ll have quite a few options there.

Understand, you have several different buckets to put your databases in.

  1. Simple recovery (see above)
  2. Full recovery (see above)
  3. Whatever frequency you need for your systems (from above)
  4. Whatever schedule you need for your systems (from above)

Believe it or not, that’s it.

Put the right things in the right place, and DPM is close to a ‘set it and forget it’ kind of a deal.

However…

…there are some Gotchas and some fine print.  This is stuff I’ve found, and your mileage may vary – but just be aware of the below:

  • If you put a db that’s in simple recovery into the protection group meant for databases that are in full recovery, you’ll likely get errors with DPM complaining that it can’t backup the log of a database that’s in simple recovery mode. Since you manually put that db (in simple recovery mode) into that protection group (that’s configured to back up databases in full recovery mode), it will be your job to get it out and put it in the right protection group.  That will make the alerts go away.
  • If you put a db that’s in full recovery mode into the protection group meant for simple, you’ll fill up your transaction logs, fill up your disks, and your backups will fail, and you may, depending on a few factors, hork up your database pretty bad… (this is what most people complain about, and that will solve the disk space issue). And, since you (or someone on your team) likely put the db in the wrong protection group, putting it in the right protection group will be the first thing to do… Having enough space on your log drives is critical at this point – because DPM will start to make copies of your transaction logs as part of its backup process, and will need the room (as in, half of your transaction log drive).  More details below.
  • I’ve found a couple of Corollaries to go with this:
    • Corollary 1: DPM creates a folder on your transaction log drive called “DPM_PROTECT” -it stores copies of the transaction logs in there.  Those are the ‘backups’.
      • You have a choice between compressing backups and encrypting them…
      • If you encrypt them, they’re full sized, even if they’re empty.
      • So if you have transaction logs filling up 50% of your t-log drive – guess what’s filling up the other half?  (DPM’s t-log backups).  That DPM_PROTECT folder is a staging folder and is sometimes full, sometimes not (making it devilishly hard to monitor for), but you need to be aware that if that folder fills up half the drive, you’re running very close to disaster, and that’s when you have to start getting creative in your problem solving (see ‘examples’ below)
    • Corollary 2: DPM can be configured to put the DPM_PROTECT folder on a separate drive, which may suit your needs, and is a topic that will have to be discussed in a separate post, but if you run your transaction log drives pretty full, and have cheaper storage available, this might be an option for you to consider.  We don’t have ours architected that way, so it’s an option I’ve not tried.
  • Examples of things that can go very wrong (like the disaster mentioned above)
    • If you are working with a clustered SQL server and are getting alerts because your log file can’t grow, chances are it’s because your transaction log drive (or mountpoint) is full, and it will be full of both transaction logs, and DPM’s staged backups of the transaction logs.To fix this, you will either need to
      • Extend the drive/make it bigger  (assuming that’s an option for you) and then restart your DPM backups.
        • Note: DPM will likely want to run a ‘validation’ at this point, which will take some time.  My recommendation is to let it do that, but there’s a huge “it depends” associated with this one.  Sometimes – depending on how long things have been broken before you were able to get to them, you might find yourself essentially taking that database out of DPM’s protection group and starting over.  It breaks the recovery chain, but can be faster than letting DPM do its validation of what it thinks your latest backup is compared to your existing one (where you likely broke the recovery chain with the manual log backups)… Like I said, it depends..
      • (not advised, but if you’ve run out of options) backup the database once, and backup the log manually repeatedly (via SQL, not DPM, because you’re trying to empty the drive that DPM has filled up) until you can shrink the transaction log so you have space on the drive for DPM to actually stage a copy of the log file for backup.
        • Once you’ve done that, remember, you’ll have fixed one issue but created another, namely, your recovery chain ends in DPM where you started doing manual SQL backups.  Now you have backups in DPM, and a full + a bunch of log backups in SQL.  Make sure you have a full set of backups in case things go wrong.
    • You’re working with a mirrored server or a server that’s part of an availability group and the databases are in the wrong protection group (simple instead of full recovery)…. You’ve got transaction logs filling up from both the replication that’s involved in this, and transaction logs filling up because they’re not being backed up… It gets ugly.  I ran into this (and wrote about our resolution to it here) where we had issues with log file growth, high numbers of virtual log files, and an availability group with multiple geographically dispersed secondaries… It was, um… “fun…” <ahem>

So…  All this to say something very very simple: Pick the right recovery group, know what’s going on behind the curtains that are in front of what DPM is doing behind the scenes, and honestly, you should be good.  If you understand what radio button to select when you’re configuring the protection group, you as a DBA are about 90% of the way there.  Make sure your transaction log file drives are twice as big as you think they should be (or configure DPM to store them elsewhere), because chances are, you’ll be using half of the transaction log drives for the logs themselves, and the other half for temporary storage of the backups of those transaction logs.

Know what your protection groups will do for you… Know the gotchas, and DPM will, strangely enough, be your friend…

Take care out there – and keep your electrons happy.

Advertisements
 
11 Comments

Posted by on June 21, 2016 in Uncategorized

 

Tags: , , , , ,

11 responses to “DPM… and “Why are my SQL Transaction Logs filling up?” (the answer… really)

  1. Randy

    June 27, 2016 at 10:26 am

    Thanks for the post. We have both Veeam and DPM. Years ago we had a T-SQL diff backup issue traced to Veeam using a full VSS snapshot. SQL server responded to the full snapshot with full database backup to a virtual device. We asked for the Veeam option to do a VSS copy-only snapshot – which was added. I think the same option is present in DPM, in disguise. I think the fundamental problem is that there is one backup chain each database and there’s another for the server. DPM or Veeam can have two backup groups to handle the two types of databases. A non-copy VSS snapshot is fine for the simple databases. What I have to wonder about is when a full VSS snapshot is applied. Seems like a full VSS snapshot would be required for DPM/Veeam system differentials – maybe not. It is as clear as mud. If we use system differentials in Veeam or DPM, does that mean we should no longer use T-SQL backups?

    Maybe SQL Server needs to respond to a VSS full snapshot with a either a normal or copy-only full database backup that depends upon the database recovery setting?

     
    • tomroush

      June 27, 2016 at 4:36 pm

      Wow – that sounds like a doozy…

      Right after I started working at one place I saw that they had a backup running – but they couldn’t run backups every day because the backup process still had a hold on the backup file while the next part of the process tried to do a backup to tape and then delete the existing file that had been backed up to that tape…. All good, right? Backups were failing every other day – turns out that either the SQL backup wasn’t done, or the tape backup wasn’t done – and one program or the other wouldn’t let go of the file. The workaround they’d figured out was to have two sets of backup jobs running on alternating days. What this meant was there was a time (several hours) where there were actually two backups running at the same time. (Luckily we didn’t have to do a restore during that time). I’ve never had to do any kind of a test to restore what was actually being backed up to which file, and what level of confusion happens with the transaction logs. So – I’m glad I didn’t have to deal with that – and understood a bit where you were coming from – even though it was a bit of a different animal…

       
  2. Peter Resele

    September 1, 2016 at 3:14 am

    Thanks for the interesting article… confirming that a lot of people are having the problem… but, sorry, your answer doesn’t work at least for our situation.
    We have a protection group that uses your first config – Sync every 15 minutes, which is the default – and we put a lot of databases into this protection group, most of which are in FULL RECOVERY mode.

    The weird thing is that, while most of these databases are NOT log backuped (only the Express full backup will be executed, only one recovery point will be created – easy to observe in the UI), *some* of these database are ok! We can see it that every 15 minutes a Sync is happening, and their log file will be truncated on the server, and everything works like it should… but those are just two out of 15 databases and I have no idea what is different about them (except, that they are very small). All of them are in the same protection group (i.e. with the same settings).

    For me, it is a BUG, and quite a horrible bug for a product that is in its fifth generation and touted so much by the M company.

     
    • tomroush

      September 1, 2016 at 5:03 pm

      Hey Peter – thanks for writing.
      Wow = that does sound frustrating. So two of the databases are… Wait – try that again for me…
      Databases getting synched successfully every 15 minutes:
      Databases in full recovery but somehow not getting synched as expected:
      And you can see this in the logs and the UI?
      The only thing I’ve seen that’s close to that is we have a small utility database on each server that’s okay, in full recovery, but is backed up weekly because it doesn’t change much at all – but… Yeah, that sounds like a potential bug… Have you tried analyzing what happens during backups with either a sql trace or extended events? That could tell you a lot about what’s actually going on behind the scenes… I’m curious as to what you find out.

      Take care,

      Tom

       
  3. peterresele

    September 5, 2016 at 2:53 am

    I believe we solved it, something that a lot of guys out there seem to have speculated for a long time.

    You need to change the database to SIMPLE mode, and then back to FULL – voila, it works!

    What most likely happened here:

    At some point in time (a couple of years ago, on a different SQL Version or update), we had the databases in question in SIMPLE mode. Then we changed the database to FULL, which – and this is pure speculation! – probably because of a bug in the old SQL Server version – did not change all the database properties as it should have. Everything worked, but only DPM, when it looked at this databases, wrongly concluded they would be SIMPLE – and created a protection strategy for SIMPLE databases (we noticed it that, when we created a new DPM protection group and added one of the databases in question, it wouldn’t even offer us the 15-minutes sync points – even though the database was in FULL mode).

    Now, after changing it to SIMPLE and back to FULL, suddenly it offers the sync points.

    The answer is – allow me to phrase it like this – typical for Microsoft.

    – They, in a hurry, introduced a bug
    – They failed in properly testing older versions, so did not catch it; it was silently fixed later
    – Now, they won’t be able to reproduce it in current software versions (that’s why they couldn’t answer on it)

    We identify a quite serious bug in **their** products by spending *two days* in stupidly clicking around, trying to exclude all the possible other causes (we looked at all the database properties,…)

    That wouldn’t have happened to me on OS-X, sorry. (I have been happily using OS-X on my desktop for the last 10 years, switching after 15 years of lots of similar experiences on Windows.)

     
    • tomroush

      September 5, 2016 at 6:23 am

      Wow. Thank you so much for sharing this! Congrats on figuring it out.

       
  4. raphael

    November 28, 2016 at 12:05 pm

    Quick question here, I’m currently backing up both kinds of db’s (Full & Simple) in the same protection group with the following settings; Sync just before a recovery point, and Express Full backup x4 times during the day

    Aren’t the Express Full backups running synchronizations themselves before the recovery point is created? and if that’s the case then these settings should satisfy for a hybrid (both full & simple) protection group, right?

     
    • tomroush

      November 28, 2016 at 1:20 pm

      Valid question – it’s what we thought, too – which got us into trouble, and is essentially why I wrote the post. Take a look at how full your t-logs are.
      dbcc sqlperf(logspace)
      should do it for you – and take a look at whether the ones in full recovery have full-er transaction logs. Depending on your usage, my suspicion is that they do.
      If I were setting it up, I’d set up at least two protection groups – one for full recovery db’s, and one for simple recovery db’s. You might consider doing three groups, like this:
      1. system db’s (back these up, say, weekly)
      2. User DB’s (full)
      3. User DB’s (simple)

      But you’ll have to do some analysis on them to find out the activity level on them and which ones need it.
      Bottom line: if you’ve got a db in full recovery, you’ll want the checkbox/radio button checked to back it up as mentioned in the post.

      Good luck with it!

       
  5. jeffl

    February 6, 2017 at 6:40 am

    Hi Tom,

    I’ve read through this article three times, and I’m afraid I’m still missing something. The part I seem to be getting hung up on is the bit where you say that when this code executes as part of an hourly sync:

    BACKUP LOG [] …

    You then state, “you won’t have full transaction logs.” Why on earth not? If the SQL statement that is being executed is BACKUP LOG, isn’t that the T-SQL used for explicitly backing up a log file? The way I’m reading it, a sync is performed every hour; that means, every hour the BACKUP LOG gets executed, which–according to Microsoft’s documentation–produces a transaction log backup which by definition should truncate the log chain. All of my databases are in FULL recovery mode because they’re all in AlwaysOn Availability Groups.

    What, oh what, am I missing?

    –Jeff

     
  6. tomroush

    February 6, 2017 at 9:53 am

    Hey Jeff –
    The kicker is what’s happening behind the scenes… to quote from the post – here are the two relevant settings:

    And one day, I was on the phone about another DPM issue, (for another post) and I asked the question, “So what exactly is happening if I click on this button versus that one? Because my suspicion is that the tlog is not getting backed up at all.”

    And then I asked the more crucial question for me, and likely for you who are reading this: “What code is being executed behind these two options?”

    And the fellow over at Microsoft researched it for me and came back with this:

    “In DPM, when we setup synchronization for SQL protection, those Syncs are log backups. DPM achieves that by running the following SQL Query

    BACKUP LOG [] TO DISK = N’\DPM_SQL_PROTECT\\\_log.ldf\Backup\Current.log’

    Recovery Points are Express Full backups. If we setup a group to run sync just before recovery points, DPM will create a snapshot from the replica and then create a Full Database backup (sync).

    BACKUP DATABASE [] TO VIRTUAL_DEVICE=”
    WITH SNAPSHOT,BUFFERCOUNT=1,BLOCKSIZE=1024

    In this case we will never have log backup thus no log truncation should be expected.”

    What does this mean?

    It means that if you have a database in full recovery, you will want to put it in a protection group that is set to schedule the backup every X minutes/hours…

    so…
    if you have it backing up every – you’ll get transaction log backups and your AG’s will be happy.
    if you have it backing up just before a recovery point, it won’t, and they won’t (because it’ll be doing that ‘with snapshot’ thing in the background and not backing up your transaction logs)

    Would have been nice if they’d made it clearer – and I understand the confusion – it took us quite a while to figure it out ourselves… I do hope this helps you though.

     
  7. tomroush

    February 6, 2017 at 9:57 am

    edit: the UI chopped off some stuff – this line: if you have it backing up every – you’ll get transaction log backups and your AG’s will be happy.
    should read as follows: if you have it backing up every (insert unit of time here) – you’ll get transaction log backups and your AG’s will be happy.

     

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: