
Adventures in Fixing VLF issues on a database in an Availability Group, backed up with DPM, with multiple secondaries

Hey all,

We had a server failover recently during a demonstration to one of the C-level folks, and as you might imagine, it was brought to my attention. The reason for the failover wasn’t something I could control, but the length of time the relevant database took to come up was something I could – so when it finally came up, I noticed the errorlog with this little note:

Database <dbname> has more than 1000 virtual log files which is excessive. Too many virtual log files can cause long startup and backup times. Consider shrinking the log and using a different growth increment to reduce the number of virtual log files.


More than 1000, huh?

Slow recovery from failover, huh?


Where had I seen this before?  Oh yeah, some years back, when we were doing datacenter maintenance, and the *one* SharePoint database that had all our (ahem) “How to bring up a datacenter” documentation in it took half an hour to come back – I remember now!

So if you don’t know – having a metric buttload (note – that’s a real term, go look it up) of virtual log files means that every time your server brings a database online, it has to replay the transaction log to get everything up to date, and if that log is chopped into thousands of virtual log files, it’s like trying to run data backwards through a shredder… it takes time, and that’s the issue.
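If you want a quick look at a single database before running the full script below, DBCC LOGINFO shows you the log's innards directly (the database name is a placeholder – point it at one of yours):

```sql
-- Each row DBCC LOGINFO returns describes one virtual log file,
-- so the row count = the VLF count for that database.
DBCC LOGINFO ('YourDatabaseName');
```

If that comes back with more than about a thousand rows, you've found your slow-recovery culprit.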

I remembered Michelle Ufford’s (blog | twitter) post that I’d found back then about counting virtual log files and tried it on this server now that it was up.

And got this nice little message:

Msg 213, Level 16, State 7, Line 1

Column name or number of supplied values does not match table definition.

Back to “Hmmm…”

Well, I knew dbcc loginfo(dbname) from her script was inserting into a table that the script created, and it had worked before. It had been a while since I’d run it – so I figured maybe there was some change in what it returned based on the SQL version. So I decided to see what it was inserting…

Well hey – the original script’s staging table was missing a field.

So I added it (RecoveryUnitID – the first column in the newer versions’ output)

…and it worked fine.

Turns out SQL 2008R2 (where the original script worked) returns different fields than 2012 and 2014 (where it didn’t).

I figured I didn’t want to find out which version of the script to use every time I needed to run it on a server, so I told the script to figure that out by itself, and then run the appropriate hunk of code (example below)

So what you see in that code is a combination of two things: Microsoft’s ProductVersion check to determine the SQL Server version, and Michelle Ufford’s VLF-counting script.

A couple of things to understand here:

  1. This code won’t fix the problem (read Kimberly’s posts below for that)
  2. It will, however, show you that a problem exists.
  3. …that you have to fix and manage (Back to Kimberly’s posts)

You’ll want to do some fixing that involves:

  1. Lots of log backups and log file shrinking, then log growing (See Kimberly’s posts for more info on this)
  2. Adjusting the auto-growth setting from the 1 MB/10% default to something significantly larger (Meanwhile, back in Kimberly’s posts… Seriously, check them out)
  3. Oh, and make those adjustments on the Model database, so that every new database you create has decent growth patterns instead of the 1 MB/10% bit
  4. Tadaa – you should find much faster failovers on your clustered systems (assuming you’re still using them and not Availability Groups)
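For step 1, the general shape looks something like this – a sketch only, with placeholder database, file, and path names, and the details (how far to shrink, how many passes) are exactly what Kimberly’s posts cover:

```sql
-- Illustrative only - database name, logical log file name, and path are placeholders.
-- 1. Back up the log so the inactive VLFs can be marked reusable...
BACKUP LOG YourDatabase TO DISK = N'X:\Backups\YourDatabase_log.trn';

-- 2. ...then shrink the log file to release the now-empty VLFs.
--    You may need several backup/shrink passes if active VLFs
--    are sitting at the end of the file.
USE YourDatabase;
DBCC SHRINKFILE (YourDatabase_log, 1024);  -- target size in MB
```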

So… that made it look easy, right? Well, let me give you the Paul Harvey Version, “The Rest of the Story.”

What I learned, and what complicated things a bit…

My system was an Availability group with two nodes running synchronously, and two nodes in two different datacenters running asynchronously.

They’re backed up by Microsoft’s Data Protection Manager (DPM), which is configured to back up a secondary node.

So my plan was to back up the secondary – only when I did, it told me that the last recovery point (what a backup’s called in DPM) had something to do with Marty McFly, namely that it had been backed up in the future – about 8 hours into the future.

Hmm… Okay, that told me the secondary it was backing up was one of the asynchronous ones, and it was either in the land of fish and chips, or the land of beer and pretzels, either way – it was a weird way to think about trying to preserve the recovery chain.
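If you land in the same spot and need to work out which replica the backups favor, SQL 2012 and up has a function for checking the AG’s backup preference from any replica. (Caveat: this reflects the availability group’s configured backup preference – DPM does its own replica selection, so treat the result as a strong hint rather than gospel. The database name is a placeholder.)

```sql
-- Returns 1 if the replica you run this on is the preferred backup
-- replica for this database, per the AG's backup preference settings.
SELECT sys.fn_hadr_backup_is_preferred_replica('YourDatabaseName')
       AS is_preferred_backup_replica;
```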

I created recovery points through DPM, which ran them on one of those secondaries – and, strangely, that still helped to clear the log on the primary.

Then I shrunk the log, in stages, on the primary, which then replicated out to the various secondaries locally and across the pond.

I was able to get the VLF count down from well north of 33,000 to just over 600. The log wouldn’t shrink any more than that, which told me there was a transaction hanging on toward the end of one of the virtual log files, and frankly, for right then, that was okay.

To fix it further, I almost entertained the idea of flipping the recovery model from full to simple, clearing the log file completely, then flipping it back to full, but that would have totally broken my recovery chain, and with three secondaries, I had no desire to try to fix all the fun that would have created. (Translation: Rebuilding the availability group – Uh… No.)

I then grew the log file to an appropriate size for that database, then increased the growth rate on the database files to something far more in line with what Kimberly mentioned above.
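That regrow-and-fix-the-growth-rate part looks roughly like this – sizes and logical file names here are placeholders, so pick values that fit your workload per Kimberly’s guidance:

```sql
-- Illustrative only - logical file names and sizes are placeholders.
-- Regrow the log to a sensible size in one (or a few) large increments -
-- large single growths create far fewer VLFs than many small ones...
ALTER DATABASE YourDatabase
    MODIFY FILE (NAME = YourDatabase_log, SIZE = 8192MB);

-- ...and set a fixed, reasonably large autogrowth instead of 1 MB / 10%.
ALTER DATABASE YourDatabase
    MODIFY FILE (NAME = YourDatabase_log, FILEGROWTH = 512MB);

-- Do the same on model so every new database inherits sane growth settings.
ALTER DATABASE model
    MODIFY FILE (NAME = modellog, FILEGROWTH = 512MB);
```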

All of this was done on the primary, and the changes I made propagated out nicely, so I only had to make them in one spot.

Oh, while I was researching this…

(I’d done this before – reducing VLF counts, but never on an Availability Group, much less one with multiple secondaries in different time zones where DPM is backing up one of the secondaries, where it’s not obvious which one of the secondaries is the one being backed up)

…I noticed other people had clearly run into this same issue (reducing VLF counts) – and had not only entertained the idea of automating VLF reduction, but taken the idea out drinking, dancing, and somehow ended up with a receipt for a cheap motel room – and then put it all into a stored procedure.

That ran…

In a job…

On a schedule…


I decided that breaking the recovery chain was bad enough, but shredding it was so much worse – so this part will not be automated.

At all.

Oh – and here’s the code. It’s Michelle’s, with a couple of tweaks from me and one from Microsoft.

--Code to determine how many virtual log files there are in each database.
--Modified from Michelle Ufford's original VLF-count script to handle
--multiple versions of SQL, because of schema changes between versions.
--If any of the results are > 1000, do multiple log backups (or DPM recovery points)
--to clear out the transaction log, plus shrink the tlog to get rid of
--the extra virtual log 'fragments'.
--As always - test this in your test environments first. Your mileage may vary. Some
--settling of contents may have occurred during shipping. You know the drill, folks.
--Best reference out there for managing this: Kimberly Tripp's posts (linked below).
--The code below handles SQL 2008R2, 2012, and 2014. If you're running something
--other than that, just run a DBCC LOGINFO(YourDatabaseName) and make sure the
--#stage table has the same columns in it; if it doesn't, add a section below.
DECLARE @sqlversion SYSNAME;
SET     @sqlversion = CONVERT(SYSNAME, SERVERPROPERTY('ProductVersion'));
PRINT   @sqlversion;

-- SQL 2008R2
IF @sqlversion LIKE '10.50.%'
BEGIN
          CREATE TABLE #stage
          (
                       FileID        INT
                     , FileSize      BIGINT
                     , StartOffset   BIGINT
                     , FSeqNo        BIGINT
                     , [Status]      BIGINT
                     , Parity        BIGINT
                     , CreateLSN     NUMERIC(38)
          );
          CREATE TABLE #results
          (
                       Database_Name SYSNAME
                     , VLF_count     INT
          );
          EXEC sp_msforeachdb N'Use [?];
                               Insert Into #stage
                               Exec sp_executesql N''DBCC LogInfo([?])'';
                               Insert Into #results
                               Select DB_Name(), Count(*)
                               From #stage;
                               Truncate Table #stage;';
          SELECT     *
          FROM       #results
          ORDER BY   VLF_count DESC;
          DROP TABLE #stage;
          DROP TABLE #results;
END

-- SQL 2012/2014 - note the new temp table names: SQL parses the whole batch
-- up front, so reusing #stage here would trip over the earlier definition.
IF @sqlversion LIKE '11.%' OR @sqlversion LIKE '12.%'
BEGIN
          CREATE TABLE #stage2
          (
                       RecoveryUnitID INT
                     , FileID         INT
                     , FileSize       BIGINT
                     , StartOffset    BIGINT
                     , FSeqNo         BIGINT
                     , [Status]       BIGINT
                     , Parity         BIGINT
                     , CreateLSN      NUMERIC(38)
          );
          CREATE TABLE #results2
          (
                       Database_Name SYSNAME
                     , VLF_count     INT
          );
          EXEC sp_msforeachdb N'Use [?];
                               Insert Into #stage2
                               Exec sp_executesql N''DBCC LogInfo([?])'';
                               Insert Into #results2
                               Select DB_Name(), Count(*)
                               From #stage2;
                               Truncate Table #stage2;';
          SELECT     *
          FROM       #results2
          ORDER BY   VLF_count DESC;
          DROP TABLE #stage2;
          DROP TABLE #results2;
END

Thanks to the #sqlhelp folks for their advice, suggestions, good-natured harassment, and ribbing:

Allen McGuire (blog | twitter)
Argenis Fernandez (blog | twitter)
Tara Kizer @tarakizer (blog | twitter)
Andre Batista @klunkySQL (blog | twitter)

and of course, the ones whose reference material I was able to use to put this together:

Michelle Ufford (blog | twitter)
Kimberly Tripp (blog | twitter)
Paul Randal (blog | twitter)

When I told Argenis about the DPM involvement and secondaries in multiple timezones, his response 🙂


Posted by on January 5, 2016 in Uncategorized

