Synchronizing Read / Write SQL Agent Jobs Across AlwaysOn Replicas

Intro

AlwaysOn availability groups are a powerful enhancement to SQLServer HA, and offer many improvements over older database synchronization techniques like mirroring and log shipping. However, there is also additional complexity involved in implementing this technology, and one of the stickiest areas for database administrators is having to deal with non-contained objects, or objects that exist at the instance level, but not the database level (logins vs users, as an example).

For new development, contained vs. uncontained objects can be accounted for and considered during application design. But what about those existing mission critical legacy applications that we want to benefit from AlwaysOn? Those are, undoubtedly, the instances that would benefit from automated failover and read / write workload balancing the most! Well, fortunately these older implementations can be modified to work seamlessly with AlwaysOn, but the instances they run on will most likely need some intelligent customization. Here’s one situation you might encounter: legacy applications using SQL Agent jobs to update data on a schedule.

Overview

In SQL Server AlwaysOn, all system level objects are non-synchronized, including SQL Agent. The assumption is that all applications need only database level objects to function, and that system level objects are only required for maintenance and administration of the instance. I had a client with a legacy mission critical application that they wanted to implement with AlwaysOn. This application relied and several SQL agent jobs to update data on a scheduled basis in order to maintain application integrity. Less then optimal, I know, but that’s what you get with legacy. The blocker with this implementation is that the jobs would fail if executed on a secondary replica, because the primary is the only write enabled replica.

So, how do we customize AlwaysOn replicas to ensure that these agent jobs are always and only running on the primary replica?

Solution

The solution I implemented is composed of four parts: A system function, a user stored procedure, a SQL Agent job, and a SQL Agent alert. The key to this implementation is ensuring that all of the jobs you need to manage are maintained in their own user defined category(s).

System Function

Microsoft provides us with a handy-dandy little system function: sys.fn_hadr_is_primary_replica. This function accepts one parameter, @dbname sysname, the name of a synchronized database, and “returns 1 if the current instance is primary replica; otherwise returns 0.” This is what we’ll use in our procedure to determine whether to enable or disable the read/write SQL Agent jobs.

Stored Procedure

We can now create a stored procedure to enable / disable SQL Agent jobs in a user defined category based on, coincidentally, the same bit value that sys.fn_hadr_is_primary_replica returns:

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO


/* =============================================
-- Author:		Jared Zagelbaum, jaredzagelbaum.wordpress.com
-- Follow me on twitter: @JaredZagelbaum
-- Create date: 4/23/2015
-- Description:	Enables Read / Write jobs on primary replica after failover
-- Can be executed from failover alert and / or recurring schedule
-- =============================================
*/
CREATE PROCEDURE [dbo].[JobChange_AGFailover]  
(
@databasename sysname
,@jobcategory sysname
)
AS

BEGIN
	
  SET NOCOUNT ON;

  DECLARE @jobID UNIQUEIDENTIFIER  --variable for job_id 
  DECLARE @AG_enable tinyint
 
  select @AG_enable = sys.fn_hadr_is_primary_replica (@databasename)
   
  DECLARE jobidCursor CURSOR FOR   --used for cursor allocation 
 
  SELECT j.job_id 
  FROM msdb.dbo.sysjobs j
  INNER JOIN msdb.dbo.syscategories c
  ON j.category_id = c.category_id
  where c.name = @jobcategory

--update jobs
  OPEN jobidCursor
  FETCH NEXT FROM jobidCursor INTO @jobID
  WHILE @@Fetch_Status = 0     
  BEGIN                
        EXEC msdb.dbo.sp_update_job @job_id=@jobID, @enabled = @AG_enable                 
		  FETCH Next FROM jobidCursor INTO @jobID    
		   END 

  CLOSE jobidCursor
  DEALLOCATE jobidCursor

END


GO


 SQL Agent Job

Having fun yet!? Now we can add this s proc to a job. Here’s a script that will create the job for you using the much preferred CmdExec method (assuming you create [dbo].[JobChange_AGFailover] in master). Make sure you update the supplied values where noted.

USE [msdb]
GO

/****** Object:  Job [AG Failover Replica Changes]    Script Date: 5/5/2015 2:17:49 PM ******/
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0

IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

END

DECLARE @jobId BINARY(16)
EXEC @ReturnCode =  msdb.dbo.sp_add_job @job_name=N'AG Failover Replica Changes', 
		@enabled=1, 
		@notify_level_eventlog=2, 
		@notify_level_email=2, 
		@notify_level_netsend=0, 
		@notify_level_page=0, 
		@delete_level=0, 
		@description=N'Enables / Disables read-write jobs against replicas when failover role change occurs', 
		@category_name=N'[Uncategorized (Local)]', 
		@owner_login_name=N'sa',						-- change if necessary per your environment
		@notify_email_operator_name=N'DBAdmins',		-- change to your default operator
		@job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object:  Step [Exec JobChange_AGFailover]    Script Date: 5/5/2015 2:17:49 PM ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Exec JobChange_AGFailover', 
		@step_id=1, 
		@cmdexec_success_code=0, 
		@on_success_action=1, 
		@on_success_step_id=0, 
		@on_fail_action=2, 
		@on_fail_step_id=0, 
		@retry_attempts=0, 
		@retry_interval=0, 
		@os_run_priority=0, @subsystem=N'CmdExec', 
		@command=N'sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d master -Q "EXECUTE [dbo].[JobChange_AGFailover] @Databasename = ''Databasename'', @jobcategory = ''jobcategory'' " -b', --supply your own databasename and jobcategory parameter values
		@output_file_name=N'$(ESCAPE_SQUOTE(SQLLOGDIR))\AGFailover_$(ESCAPE_SQUOTE(JOBID))_$(ESCAPE_SQUOTE(STEPID))_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM)).txt', 
		@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:

GO


 SQL Agent Alert

Finally, lets create an alert to trigger this job whenever a failover occurs. The alert should be raised based on Error number 1480. Here’s a script for it:

USE [msdb]
GO

EXEC msdb.dbo.sp_add_alert @name=N'AG Role Change', 
		@message_id=1480, 
		@severity=0, 
		@enabled=1, 
		@delay_between_responses=0, 
		@include_event_description_in=1, 
		@category_name=N'[Uncategorized]', 
		@job_name=N'AG Failover Replica Changes'
GO

 Result

You should now have an instance of each of these objects on all of your replicas:

Stored Procedure:

sproc

SQL Agent Job:

Job

SQL Agent Alert:

Alert

Alright, we now have an automated way of enabling and disabling read / write SQL Agent jobs based on whether the replica role is primary or secondary. The alert will be raised whenever a failover occurs. If, for some reason the alert does not trigger successfully (I’ve not encountered this), you can always start the job manually on the affected replica as well.

The final result that we want is that SQL jobs that edit data are disabled on secondary replicas and enabled only on the primary by job category. Multiple jobs and alerts can be added for different databases and/ or categories. You can view the enabled property value easily under Job Activity Monitor in SQL Server Agent.

Hope this helps!

Advertisements

11 thoughts on “Synchronizing Read / Write SQL Agent Jobs Across AlwaysOn Replicas

  1. sys.fn_hadr_is_primary_replica is only found in MSSQL 2014. Doesn’t exist in 2012, which means one must deduce programmatically via TSQL/PowerShell.

    Like

  2. Nice article. We considered something like this when moving to SQL 2012 and using AGs for HA. Didn’t get quite as far as you have.

    One problem with this which you don’t really address–maybe your situation is very static–is the need to synchronize the definitions/schedules/etc of the jobs between various AG replicas.

    After pursing this for a while we changed gears, and moved all of our AG-related jobs to a new 2-node FCI instance that we installed on the same underlying servers that our AGs run on (at no additional cost). This FCI has only one msdb to worry about, so no need to synch jobs between servers. Job steps that formerly acted on the local server were modified to connect to an AG listener. By connecting to the listener we’re guaranteed to run job against primary AG replica.

    Another advantage of using an FCI for this is that you can store the SSIS catalog here. MS says we’re not allowed to add this to an AG, and if you think “well I’m pretty sure I can” just know that before we were told this was a no-no, we added it to our AG. Later when we had a service pack install take down one of our servers, MS told us “SSISDB is a system DB, cannot go in an AG.” We also have a little bit of replication, and this FCI has proven a good place for the distribution DB too.

    In the end, this alternate FCI-based approach was very simple to implement and has worked fine for about 3 yrs so far. I would wholeheartedly recommend it over the approach you’ve described (sorry!).

    Like

  3. Thanks, Mike. It’s nice but “new 2-node FCI instance” means what? How disaster will be taken care in this new node? Can you please provide more detail. because I am too in same requirement.

    Like

    • While our small FCI provides HA, it does not provide an elegant DR solution. We’ve rolled our own backup/copy/restore approach but it’s nothing special. I think you’ve pointed out the weakest part of the solution. Also, this FCI idea only works for us because our storage is all SAN-based anyway–the SAN was already there.

      Going forward, if we move away from SANs, we’re considering using a single-node small VM for this same purpose (SQLAgent jobs, replication distributor, SSISDB and SSRS–things that you can’t put in an AG). This SQL instance will have to be licensed separately (where the FCI has been essentially free), but I think we can get by with an inexpensive license and then we’ll be able to rely on our VM admin tools to provide us with acceptable HA and DR capability.

      Like

  4. Thanks a lot Mike and Jared.

    We have a scenario where DBA is manually disabling the few jobs based on business requirements.
    But DBA is doing it only in one node. So how does these job status will be synch to the second node?
    Does DBA need to manually update the jobs in second node? Or do I need to maintain a separate table in availability group for maintain the status of jobs (disabled/enabled) then read it in second node during fail over?.

    Second question is, How the encryption key of SSRS behave in fail over? Do I need to import it in second node after fail over?

    All these scenario’s are real time and seems to be manual! So kindly guide and help the community.

    Like

    • The script that’s provided in this blog should handle the job status issue. As for reporting services on AlwaysOn, I’m trying to recall, but believe the keys are not different across the nodes, just unique to db. HOWEVER, the SSRS service needs to be restarted after failover in my experience. Something that could also be automated. If you need a highly available SSRS instance and have the licenses for a separate install, then FCI would probably be more appropriate for SSIS / SSRS.

      Like

  5. When a failover event is triggered, the stored procedure seems to be disabling all the Agent jobs on the server. The proc doesn’t seem to be aware of the agent jobs that are specific to the database. What if there are other jobs that belong to other databases and what about generic maintenance jobs on the server for all the databases?

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s