We had a server failover recently during a demonstration to one of the C-level folks, and as you might imagine, it was brought to my attention. The reason for the failover wasn’t something I could control, but the length of time the relevant database took to come up was – so when it finally did, I found this little note in the errorlog:
Database <dbname> has more than 1000 virtual log files which is excessive. Too many virtual log files can cause long startup and backup times. Consider shrinking the log and using a different growth increment to reduce the number of virtual log files.
More than 1000, huh?
Slow recovery from failover, huh?
Where had I seen this before? Oh yeah, some years back, when we were doing datacenter maintenance, and the *one* SharePoint database that held all our (ahem) “How to bring up a datacenter” documentation took half an hour to come back – I remember now!
So if you don’t know – having a metric buttload (note – that’s a real term, go look it up) of virtual log files means that every time your server brings a database online, it has to work through the transaction log to run recovery and get everything consistent. If that log is carved into thousands of virtual log files, it’s like trying to run data backwards through a shredder… it takes time, and that’s the issue.
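If you’ve never looked at VLFs before, the raw command for peeking at them is DBCC LOGINFO – undocumented, so treat the output shape as subject to change (which, as you’ll see in a second, is exactly the point). The database name here is a placeholder:

```sql
-- One row per virtual log file in the named database's log.
-- Undocumented command: the column list varies between SQL versions.
DBCC LOGINFO ('YourDatabaseName');
```

The row count is the VLF count – anything north of 1000 is what that errorlog message is complaining about.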
So I pulled out the VLF-checking script I’d used before, ran it, and got this nice little message:
Msg 213, Level 16, State 7, Line 1
Column name or number of supplied values does not match table definition.
Back to “Hmmm…”
Well, I knew the dbcc loginfo(dbname) call in the script (Michelle Ufford’s – link below) was inserting into a table that the script created, and it had worked before. It had been a while since I’d run it – so I figured maybe what it returned had changed with the SQL version. So I decided to see what it was inserting…
Well hey – the original script was missing a field: RecoveryUnitID.
So I added it to the staging table the script builds…
…and it worked fine.
Turns out SQL 2008R2 (where the original script worked) returns different fields than 2012 and 2014 (where it didn’t).
I figured I didn’t want to find out which version of the script to use every time I needed to run it on a server, so I told the script to figure that out by itself, and then run the appropriate hunk of code (example below)
So what you see in that code is a combination of two things:
- https://msdn.microsoft.com/en-us/library/ms174396.aspx from Microsoft, to check the version number, and
- http://sqlfool.com/2010/06/check-vlf-counts/ from Michelle Ufford.
A couple of things to understand here:
- This code won’t fix the problem (read Kimberly’s posts below for that)
- It will, however, show you that a problem exists.
- …that you have to fix and manage (Back to Kimberly’s posts)
You’ll want to do some fixing that involves:
- Lots of log backups and log file shrinking, then log growing (See Kimberly’s posts for more info on this)
- Adjusting the auto-growth setting from the 1 MB/10% default to something significantly larger (meanwhile, back in Kimberly’s posts… seriously, check them out)
- Oh, and make those adjustments on the Model database, so that every new database you create has decent growth patterns instead of the 1 MB/10% bit
- Tadaa – you should find much faster failovers on your clustered systems (assuming you’re still using them and not Availability groups)
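A minimal sketch of that fix cycle, for reference – every name, path, and size here is a placeholder you’d swap for your own, and you’ll want Kimberly’s posts open in another tab before running any of it in production:

```sql
-- 1. Back up the log so inactive VLFs become reusable
BACKUP LOG [YourDatabaseName]
TO DISK = N'X:\Backups\YourDatabaseName_log.trn';

-- 2. Shrink the log file (logical file name, target size in MB).
--    You may need to repeat steps 1 and 2 a few times as the
--    active portion of the log wraps back toward the front.
USE [YourDatabaseName];
DBCC SHRINKFILE (N'YourDatabaseName_log', 1024);

-- 3. Fix the default on model so every NEW database starts out
--    with a sane growth increment instead of the tiny default
ALTER DATABASE model
MODIFY FILE (NAME = N'modellog', FILEGROWTH = 256MB);
```

After the shrink you grow the log back out to its working size in a few large chunks – more on that below in the war story.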
So… that made it look easy, right? Well, let me give you the Paul Harvey Version, “The Rest of the Story.”
What I learned, and what complicated things a bit…
My system was an Availability group with two nodes running synchronously, and two nodes in two different datacenters running asynchronously.
They’re backed up by Microsoft’s Data Protection Manager (DPM) that is configured to back up the secondary node.
So my plan was to back up the secondary – only when I did, DPM told me that the last recovery point (what a backup’s called in DPM) had something to do with Marty McFly – namely, that it had been created in the future, about 8 hours into the future.
Hmm… okay, that told me the secondary it was backing up was one of the asynchronous ones, either in the land of fish and chips or the land of beer and pretzels – either way, it made preserving the recovery chain a weird thing to reason about.
I created recovery points through DPM, which took them on one of those secondaries and which – strangely enough – helped clear the log on the primary.
Then I shrunk the log, in stages, on the primary, which then replicated out to the various secondaries locally and across the pond.
I was able to get the VLF count down from well north of 33,000 to just over 600. The log wouldn’t shrink any more than that, which told me there was a transaction hanging on toward the end of one of the virtual log files, and frankly, for right then, that was okay.
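The reason a log stops shrinking, by the way, is visible in that same DBCC LOGINFO output – a shrink can only cut the file back to the last active VLF:

```sql
-- Status = 2 means the VLF is active (still in use by the log);
-- Status = 0 means reusable. SHRINKFILE can't release space past the
-- last Status = 2 row, which is why log backups (to clear VLFs) usually
-- have to come before each shrink attempt.
USE [YourDatabaseName];  -- placeholder database name
DBCC LOGINFO;            -- no argument = current database
```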
To fix it further, I almost entertained the idea of flipping the recovery model from full to simple, clearing the log file completely, then flipping it back to full, but that would have totally broken my recovery chain, and with three secondaries, I had no desire to try to fix all the fun that would have created. (Translation: Rebuilding the availability group – Uh… No.)
I then grew the log file to an appropriate size for that database, then increased the growth rate on the database files to something far more in line with what Kimberly mentioned above.
All of this was done on the primary, and the changes I made propagated out nicely, so I only had to make them in one spot.
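The regrow step looks roughly like this – the file name and sizes are placeholders, and you’d pick a size that fits the database’s actual log usage. On an Availability Group you run it on the primary only, and the change flows out to the secondaries:

```sql
-- Grow the log back out in one (or a few) large chunks; big growths
-- keep the resulting VLF count low.
ALTER DATABASE [YourDatabaseName]
MODIFY FILE (NAME = N'YourDatabaseName_log', SIZE = 16384MB);

-- And set a fixed, generous autogrowth so it never creeps up
-- in thousands of tiny increments again.
ALTER DATABASE [YourDatabaseName]
MODIFY FILE (NAME = N'YourDatabaseName_log', FILEGROWTH = 1024MB);
```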
Oh, while I was researching this…
(I’d done this before – reducing VLF counts, but never on an Availability Group, much less one with multiple secondaries in different time zones where DPM is backing up one of the secondaries, where it’s not obvious which one of the secondaries is the one being backed up)
…I noticed other people had clearly run into this same issue (reducing VLF counts) – and had not only entertained the idea of automating VLF reduction, but taken the idea out drinking, dancing, and somehow ended up with a receipt for a cheap motel room – and then put it all into a stored procedure.
In a job…
On a schedule…
I decided that breaking the recovery chain was bad enough, but shredding it was so much worse – so this part will not be automated.
Oh – and here’s the code. It’s Michelle’s, with a couple of tweaks from me and one from Microsoft.
--Code to determine how many virtual log files there are in each database.
--Modified from the original code here: http://sqlfool.com/2010/06/check-vlf-counts/
--to check for multiple versions of SQL, because of schema changes between versions.
--If any of the results are > 1000, do multiple log backups (or DPM recovery points)
--to clear out the transaction log, plus shrink the tlog to get rid of
--the extra virtual log 'fragments'.
--As always - test this in your test environments first. Your mileage may vary. Some
--settling of contents may have occurred during shipping. You know the drill, folks.
--Best references out there for managing this:
-- http://www.sqlskills.com/blogs/kimberly/transaction-log-vlfs-too-many-or-too-few/
-- http://www.sqlskills.com/blogs/kimberly/8-steps-to-better-transaction-log-throughput/
--The code below handles SQL 2008R2, 2012, and 2014. If you're running something else,
--run a dbcc loginfo(YourDatabaseName) and make sure the #stage table has the same
--columns in it; if it doesn't, add a section below.

DECLARE @sqlversion SYSNAME
SET @sqlversion = (SELECT CONVERT(SYSNAME, SERVERPROPERTY('ProductVersion')))
PRINT @sqlversion

-- SQL 2008R2
IF @sqlversion LIKE '10.50.%'
BEGIN
    CREATE TABLE #stage
    (
        FileID      INT
      , FileSize    BIGINT
      , StartOffset BIGINT
      , FSeqNo      BIGINT
      , [Status]    BIGINT
      , Parity      BIGINT
      , CreateLSN   NUMERIC(38)
    );

    CREATE TABLE #results
    (
        Database_Name SYSNAME
      , VLF_count     INT
    );

    EXEC sp_msforeachdb N'Use [?];
        Insert Into #stage
        Exec sp_executeSQL N''DBCC LogInfo([?])'';
        Insert Into #results
        Select DB_Name(), Count(*) From #stage;
        Truncate Table #stage;'

    SELECT * FROM #results ORDER BY VLF_count DESC

    DROP TABLE #stage
    DROP TABLE #results
END

--SQL 2012/2014 - note: new temp table names, because SQL thinks the old ones are there.
--Comparing the major version as a number (not a string) so SQL 2005's '9.x' doesn't sneak in.
IF CAST(PARSENAME(@sqlversion, 4) AS INT) >= 11
BEGIN
    CREATE TABLE #stage2
    (
        RecoveryUnitID INT
      , FileID      INT
      , FileSize    BIGINT
      , StartOffset BIGINT
      , FSeqNo      BIGINT
      , [Status]    BIGINT
      , Parity      BIGINT
      , CreateLSN   NUMERIC(38)
    );

    CREATE TABLE #results2
    (
        Database_Name SYSNAME
      , VLF_count     INT
    );

    EXEC sp_msforeachdb N'Use [?];
        Insert Into #stage2
        Exec sp_executeSQL N''DBCC LogInfo([?])'';
        Insert Into #results2
        Select DB_Name(), Count(*) From #stage2;
        Truncate Table #stage2;'

    SELECT * FROM #results2 ORDER BY VLF_count DESC

    DROP TABLE #stage2
    DROP TABLE #results2
END
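For what it’s worth: if you land on SQL Server 2016 SP2 or newer, my understanding is there’s a supported DMV, sys.dm_db_log_info, that makes the temp-table juggling unnecessary, so the whole check collapses to something like:

```sql
-- SQL Server 2016 SP2+ only: one row per VLF, per (online) database.
SELECT d.[name] AS Database_Name, COUNT(*) AS VLF_count
FROM sys.databases AS d
CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
GROUP BY d.[name]
ORDER BY VLF_count DESC;
```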
Thanks to the #sqlhelp folks for their advice, suggestions, good-natured harassment, and ribbing – and of course to the ones whose reference material I was able to use to put this together.
When I told Argenis about the DPM involvement and secondaries in multiple timezones, this was his response: 🙂