Understanding SLA calculations and why your data might not be up to date

Understand how SLA records in ServiceNow are updated and recalculated, and what issues this can cause when reporting on SLAs.

Have you ever viewed an SLA record, only to have the values for elapsed percentage and time suddenly increase by multiple units? Or have you ever refreshed a list of SLAs without seeing the elapsed time change at all, even though time has passed since you last refreshed the list?

This is because calculated values in SLA records, like elapsed time and elapsed percentage, are not calculated on-the-fly. In fact, some SLA records only have their values updated once every 5 days (!). Depending on how big your backlog of open tasks is, and what you measure in terms of SLA metrics, this could have a big impact on your KPIs.

So how are SLAs updated, and can it be changed?

SLA recalculation is triggered by three things, apart from business rules triggered by actual changes that affect the SLAs (like changing an incident priority or resolving a task):

  • A unique scheduled job for each SLA
  • A business rule named “Calc SLAs on Display” on the task table
  • Scheduled calculation jobs

Of course, when an SLA is completed or cancelled, a final calculation is done and the SLA no longer needs to be updated, so the information in this post only applies to open tasks with SLAs in progress.

The unique scheduled job for the SLA

Each time an SLA is created, a job is created and scheduled to run at the breach time of the SLA. This scheduled job recalculates the SLA when the breach time is reached. If an SLA is changed or added, this job is rescheduled. The job runs once and is then removed. This means that an SLA is always updated and recalculated at the time it breaches.

The on-display business rule

The business rule “Calc SLAs on Display” runs when a record on the task table, or on a table that extends task, is displayed. This means that each time someone opens an incident in a form view, all related SLAs are recalculated. Note that the business rule does not run on display of the actual task_sla record, so opening the SLA record itself does not update it.

The business rule has the following scripted conditions:

gs.getProperty("glide.sla.calculate_on_display") === "true" && gs.getProperty("com.snc.sla.run_old_sla_engine") !== "true" && !current.isNewRecord() && GlideStringUtil.notNil(current.getUniqueValue())

What the conditions are evaluating is:

Condition part | Explanation
glide.sla.calculate_on_display | System property that has to be true for the BR to run. Out-of-the-box this value is set to true.
com.snc.sla.run_old_sla_engine | System property that is false out-of-the-box. If true, your instance processes SLAs using the old 2010 engine.
!current.isNewRecord() | Evaluates whether the current form is an unsaved record that has not yet been inserted into the database. If that is the case, the BR won’t run, as there won’t be any SLAs created to calculate.
GlideStringUtil.notNil(current.getUniqueValue()) | Checks that the current record has a unique, non-empty sys_id.


The business rule runs the following script:

(function executeRule(current, previous /*null when async*/) {
	// if this Task has unprocessed records in the "sla_async_queue" then do not call SLACalculatorNG
	if (new SLAAsyncQueue().isTaskQueued(current.getUniqueValue()))
		return;
	
	var task_sla = new GlideRecord("task_sla");
	task_sla.addQuery("task", current.sys_id);
	task_sla.addActiveQuery();
	task_sla.addQuery('stage','!=','paused');
	task_sla.query();
	while (task_sla.next()) {
		//Disable running of workflow for recalculation of sla.
		task_sla.setWorkflow(false);

		if (gs.getProperty("com.snc.sla.engine.version", "2010") === "2011")
			SLACalculatorNG.calculateSLA(task_sla);
		else {
			var slac = new SLACalculator();
			slac.calcAnSLA(task_sla);
		}
	}

})(current, previous);

First the script checks whether the task has unprocessed records in the SLA async queue (sla_async_queue), in which case it exits without calculating anything. This would only be the case if your SLA engine is set to process SLAs asynchronously.

if (new SLAAsyncQueue().isTaskQueued(current.getUniqueValue()))
	return;

The BR then queries the task_sla table for all records related to the task which are not inactive or paused:

var task_sla = new GlideRecord("task_sla");
task_sla.addQuery("task", current.sys_id);
task_sla.addActiveQuery();
task_sla.addQuery('stage','!=','paused');
task_sla.query();

If any SLA records are found, the script goes through each of them and calls the appropriate calculation method based on which SLA engine your instance is running (most likely 2011). The task_sla.setWorkflow(false) part prevents any workflow or other business rules from being triggered by the update:

while (task_sla.next()) {
	//Disable running of workflow for recalculation of sla.
	task_sla.setWorkflow(false);

	if (gs.getProperty("com.snc.sla.engine.version", "2010") === "2011")
		SLACalculatorNG.calculateSLA(task_sla);
	else {
		var slac = new SLACalculator();
		slac.calcAnSLA(task_sla);
	}
}

Here is an example of the business rule running to recalculate the SLAs on display of the task, in this case an incident. Notice how the actual elapsed percentage jumps from 202 to 206.

The scheduled jobs

But what if a record isn’t displayed, does that mean the SLA is never recalculated? The answer is no, thanks to scheduled jobs. There are multiple jobs running at different time intervals to recalculate the SLAs. Understanding how these work can be important for making sure your metrics are correct.

To view the jobs, navigate to System Scheduler > Scheduled Jobs and search for jobs with a name starting with “SLA Update”.

Each job calculates a subset of active SLAs based on how much time is left before the breach time. Let’s look at “SLA Update (breach within 1 hour)”:

We can see that this job is scheduled to run every 10 minutes. It calls the calculateSLArange method of the SLACalculatorNG object with two parameters, start and end. The start is a date and time 10 minutes into the future, and the end is 60 minutes into the future. This means the job recalculates all Task SLA records with a breach time between 10 and 60 minutes from now.

Let’s look at the method being called in the scheduled job:
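
To make the window concrete, here is a minimal plain-JavaScript sketch of how such a start and end could be derived from the current time. The names breachWindow and isInWindow are hypothetical and only used for illustration; the real job script relies on ServiceNow's own date-time API. The strict comparisons mirror the '>' and '<' queries used by calculateSLArange.

```javascript
// Hypothetical helper: compute the breach-time window a job covers.
// startMinutes/endMinutes are offsets from "now" (10 and 60 for the
// "breach within 1 hour" job).
function breachWindow(now, startMinutes, endMinutes) {
	var MS_PER_MINUTE = 60 * 1000;
	return {
		start: new Date(now.getTime() + startMinutes * MS_PER_MINUTE),
		end: new Date(now.getTime() + endMinutes * MS_PER_MINUTE)
	};
}

// A Task SLA is picked up when its breach time falls strictly inside the window.
function isInWindow(breachTime, win) {
	return breachTime > win.start && breachTime < win.end;
}

var now = new Date("2024-01-01T12:00:00Z");
var win = breachWindow(now, 10, 60);
console.log(win.start.toISOString()); // 2024-01-01T12:10:00.000Z
console.log(win.end.toISOString());   // 2024-01-01T13:00:00.000Z
console.log(isInWindow(new Date("2024-01-01T12:30:00Z"), win)); // true
console.log(isInWindow(new Date("2024-01-01T12:05:00Z"), win)); // false
```

An SLA breaching in 5 minutes falls outside this window, so this particular job leaves it alone.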

SLACalculatorNG.calculateSLArange = function(start, end) {
	var lu = new GSLog(SLACalculatorNG.prototype.SLA_DEBUG, 'SLACalculatorNG');
	if (gs.getProperty(SLACalculatorNG.prototype.SLA_DATABASE_LOG, "db") === "node")
		lu.disableDatabaseLogs();
	lu.includeTimestamp();
	lu.logInfo('calculateSLArange: starting');

	// Array to hold the sys_id of each Task SLA we want to calculate
	var taskSlaIds = [];

	// Query for all task slas that are active, not paused, are under the max percentage, and have less time left than specified
	var maxPercent = gs.getProperty(SLACalculatorNG.prototype.SLA_CALC_PERCENTAGE, '');

	var taskSlaGr = new GlideRecord('task_sla');
	taskSlaGr.addActiveQuery();
	if (start)
		taskSlaGr.addQuery('planned_end_time', '>', start);
	if (end)
		taskSlaGr.addQuery('planned_end_time', '<', end);
	taskSlaGr.addNullQuery('pause_time');
	if (maxPercent != '')
		taskSlaGr.addQuery('percentage', '<=', maxPercent).addOrCondition('percentage', '');
	taskSlaGr.query();
	while (taskSlaGr.next())
		taskSlaIds.push('' + taskSlaGr.sys_id);

	lu.logInfo('calculateSLArange: ' + taskSlaIds.length + ' Task SLA records found to update');

	var sc = SLACalculatorNG.newSLACalculator();

	taskSlaGr = new GlideRecord("task_sla");
	for (var i = 0; i < taskSlaIds.length; i++)
		if (taskSlaGr.get(taskSlaIds[i])) {
			// if this Task has records in the "sla_async_queue" then do not process the calculation script for this Task SLA
			if (new SLAAsyncQueue().isTaskQueued(taskSlaGr.getValue("task")))
				continue;

			var oldLogLevel = sc.lu.getLevel();
			// if enable logging has been checked on the SLA definition up the log level to "debug"
			if (taskSlaGr.sla.enable_logging) {
				lu.setLevel(GSLog.DEBUG);
				sc.lu.setLevel(GSLog.DEBUG);
			}

			if (taskSlaGr.pause_time || !taskSlaGr.active) {
				lu.logInfo("calculateSLArange: Task SLA with sys_id " + taskSlaGr.getUniqueValue() + " has been paused or has become inactive since we started - skipping");
				continue;
			}
			sc.loadTaskSLA(taskSlaGr);
			sc.calcTaskSLAs();
			sc.updateTaskSLAs();

			lu.setLevel(oldLogLevel);
			sc.lu.setLevel(oldLogLevel);
		}

	lu.logInfo('calculateSLArange: finished');
};

The method queries the task_sla table for all active SLAs with no value in the “pause_time” field (meaning they aren’t in a paused state) and a breach time between the start and end parameters. It then iterates through the matching records, skipping any whose task is queued for asynchronous processing, and calculates and updates each SLA using the calcTaskSLAs and updateTaskSLAs methods.

What’s important to note is that by default, these scheduled jobs stop updating the SLA once it has surpassed a certain value for “actual elapsed percentage”. That value is defined by a system setting, “Percentage at which scheduled jobs stop refreshing Task SLA timings”, which by default is 1000.

Note that this only applies to the scheduled jobs, which use the calculateSLArange method. The on-display business rule described earlier calls SLACalculatorNG.calculateSLA (or calcAnSLA on the 2010 engine), and its script does not check the maximum value, so it recalculates the SLA regardless.
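
As a rough illustration of that selection logic, the query conditions can be modelled in plain JavaScript. This is only a sketch: the sla argument is a hypothetical plain object standing in for a task_sla record, and the function name is made up for this example.

```javascript
// Plain-JavaScript model of the record selection in calculateSLArange.
// "sla" is a hypothetical plain object standing in for a task_sla record.
function isPickedUpByScheduledJob(sla, start, end, maxPercent) {
	if (!sla.active) return false;
	if (start && !(sla.planned_end_time > start)) return false;
	if (end && !(sla.planned_end_time < end)) return false;
	if (sla.pause_time) return false; // paused SLAs are skipped

	// The percentage cutoff only applies when the property has a value
	if (maxPercent !== '') {
		var hasPercentage = sla.percentage !== '' && sla.percentage !== null && sla.percentage !== undefined;
		if (hasPercentage && Number(sla.percentage) > Number(maxPercent)) return false;
	}
	return true;
}

var now = new Date("2024-06-01T00:00:00Z");
var yearAgo = new Date("2023-06-01T00:00:00Z");
var breached = { active: true, planned_end_time: new Date("2024-05-01T00:00:00Z"), pause_time: null, percentage: 800 };
var farOver = { active: true, planned_end_time: new Date("2024-05-01T00:00:00Z"), pause_time: null, percentage: 1200 };

console.log(isPickedUpByScheduledJob(breached, yearAgo, now, '1000')); // true
console.log(isPickedUpByScheduledJob(farOver, yearAgo, now, '1000'));  // false
console.log(isPickedUpByScheduledJob(farOver, yearAgo, now, ''));      // true
```

The last call shows that an empty property value disables the cutoff entirely, which matches the `if (maxPercent != '')` check in the method.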

In total there are six scheduled jobs, each with their own range:

Job name | Range of SLAs covered by job | Runs every
SLA update (already breached) | Breach time between 1 year ago and now | 1 day
SLA update (breach after 30 days) | Breach time between 30 days from now and 1 year from now | 5 days
SLA update (breach within 10 min) | Breach time between 1 minute from now and 10 minutes from now | 1 minute
SLA update (breach within 1 hour) | Breach time between 10 minutes from now and 60 minutes from now | 10 minutes
SLA update (breach within 1 day) | Breach time between 1 hour from now and 24 hours from now | 1 hour
SLA update (breach within 30 days) | Breach time between 1 day from now and 30 days from now | 1 day

Notice that the ranges are contiguous rather than overlapping, even though the job names suggest otherwise. For example, the “breach within 1 hour” job doesn’t look for SLAs breaching within 10 minutes, as those are already covered by the “breach within 10 min” job.
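
The ranges in the table can also be written down as data, which makes it easy to look up which job is responsible for a given SLA and how stale its values may get. The following is a plain-JavaScript sketch, not ServiceNow code. The names SLA_UPDATE_JOBS and jobForBreach are hypothetical, and treating “1 year” as 365 days is an assumption.

```javascript
// Ranges and run intervals from the table, expressed in minutes.
// Negative values mean the breach time is in the past.
var MINUTES_PER_DAY = 24 * 60;
var SLA_UPDATE_JOBS = [
	{ name: "SLA update (already breached)", from: -365 * MINUTES_PER_DAY, to: 0, runsEvery: MINUTES_PER_DAY },
	{ name: "SLA update (breach within 10 min)", from: 1, to: 10, runsEvery: 1 },
	{ name: "SLA update (breach within 1 hour)", from: 10, to: 60, runsEvery: 10 },
	{ name: "SLA update (breach within 1 day)", from: 60, to: MINUTES_PER_DAY, runsEvery: 60 },
	{ name: "SLA update (breach within 30 days)", from: MINUTES_PER_DAY, to: 30 * MINUTES_PER_DAY, runsEvery: MINUTES_PER_DAY },
	{ name: "SLA update (breach after 30 days)", from: 30 * MINUTES_PER_DAY, to: 365 * MINUTES_PER_DAY, runsEvery: 5 * MINUTES_PER_DAY }
];

// Returns the job whose range contains the given minutes-until-breach,
// or null when no scheduled job covers it.
function jobForBreach(minutesUntilBreach) {
	for (var i = 0; i < SLA_UPDATE_JOBS.length; i++) {
		var job = SLA_UPDATE_JOBS[i];
		if (minutesUntilBreach > job.from && minutesUntilBreach < job.to)
			return job;
	}
	return null;
}

console.log(jobForBreach(45).name); // SLA update (breach within 1 hour)
console.log(jobForBreach(45 * MINUTES_PER_DAY).runsEvery / MINUTES_PER_DAY); // 5
console.log(jobForBreach(-400 * MINUTES_PER_DAY)); // null
```

The runsEvery value is effectively the worst-case age of the stored values for an SLA that nobody opens, and a null result means none of the scheduled jobs will touch the record.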

What does this mean for me when reporting?

First of all, realize that running queries against the task_sla table, or a database view like incident_sla, does not constitute a display of the actual record. This means the business rule to recalculate SLA records is not triggered. If no one opens the task, we are relying on the scheduled jobs to keep the SLA values updated until the task is closed.

  • If an SLA record is more than 30 days away from breaching, and no one has opened the task form, the elapsed percentage and elapsed time fields might be up to 5 days out of date.
  • If an SLA’s actual elapsed percentage has surpassed 1000%, and no one opens the task form, the SLA values will stop updating.
  • If an SLA is due to breach within a day, the calculated values are only updated once an hour unless someone opens the task.
  • If an SLA breached more than one year ago, it stops being updated by the scheduled jobs.
  • If an SLA has breached and no one opens the form, it is only updated once per day.

If we are just reporting on metrics such as “Number of open incidents with a breached SLA”, or “% of incidents resolved within SLA”, the limitations above will likely not matter. An SLA will always be shown as breached if the breach time has passed, thanks to the unique scheduled job created for that specific SLA record. It also won’t affect metrics like “average elapsed SLA percentage in resolved incidents”. That metric is looking at completed SLAs, which are updated when the related task is closed or resolved.

But the limitations could matter if you are looking at metrics like “Average Elapsed SLA Percentage in Open Incidents”, or using a bucket group-based breakdown for elapsed SLA percentage or elapsed time for open records.
Maybe you want to look specifically at the number of tasks with an elapsed SLA percentage of 200% or more?

It can also affect your flows or workflows if you are waiting for SLA-based conditions that become true after the SLA has breached. Maybe you are sending reminders to managers when 150% of the SLA has elapsed, and again when 200% has elapsed. What could happen is that both of those reminders are sent at the same time, when the “SLA update (already breached)” job runs. If you have conditions waiting for more than 1000% of the SLA to elapse, those may never trigger at all.
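
The reminder scenario can be illustrated with a small plain-JavaScript simulation. Everything in it is an assumption made for the example: an 8-hour SLA on a 24x7 schedule, reminders that fire as soon as the stored percentage reaches a threshold, and recalculation only at the listed times (at breach, then at one daily job run).

```javascript
// Hypothetical simulation: the stored percentage only changes at the given
// recalculation times (hours since the SLA started).
function simulateReminders(slaHours, recalcTimesInHours, thresholds) {
	var fired = {}; // threshold -> hour at which the reminder was sent
	var log = [];
	for (var i = 0; i < recalcTimesInHours.length; i++) {
		var hour = recalcTimesInHours[i];
		var storedPercentage = (hour / slaHours) * 100;
		for (var j = 0; j < thresholds.length; j++) {
			var threshold = thresholds[j];
			if (storedPercentage >= threshold && fired[threshold] === undefined) {
				fired[threshold] = hour;
				log.push({ threshold: threshold, sentAtHour: hour });
			}
		}
	}
	return log;
}

// Recalculated at breach (hour 8) and then by a daily job run at hour 24
var log = simulateReminders(8, [8, 24], [150, 200]);
console.log(JSON.stringify(log));
// [{"threshold":150,"sentAtHour":24},{"threshold":200,"sentAtHour":24}]
```

With continuous calculation the reminders would have gone out at hours 12 and 16. Here both go out at hour 24, because that is the first time the stored percentage is seen above either threshold.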

Can I change the behavior of SLA calculations?

Yes, you can, if you feel your use cases depend on it.

A simple thing to change is the system property that sets the maximum elapsed percentage up to which the scheduled jobs keep recalculating an SLA. As stated above, it is set to 1000 out of the box. If you know you have a lot of old tasks exceeding this, it could be a good idea to raise it.
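
If you prefer to script the change, a helper along these lines could be run from a background script. This is a sketch under assumptions: raiseMaxPercentage is a hypothetical name, gs is passed in as a parameter so the logic can be tried outside ServiceNow, and the property name in the commented call is my understanding of what the setting is called, so verify it in your own instance first.

```javascript
// Sketch of a background script helper. "gs" is passed in so the function
// can also be exercised outside ServiceNow with a stand-in object.
function raiseMaxPercentage(gs, propertyName, newMax) {
	var current = '' + gs.getProperty(propertyName, '');
	// Only ever raise the limit; an empty value already means "no limit"
	if (current === '' || Number(current) >= newMax)
		return current;
	gs.setProperty(propertyName, String(newMax));
	return String(newMax);
}

// In a ServiceNow background script this might be called as follows.
// The property name is an assumption; confirm it under the SLA properties
// of your own instance before running anything like this.
// raiseMaxPercentage(gs, 'glide.sla.calculation.percentage', 5000);
```

The helper never lowers the value and leaves an empty property alone, since calculateSLArange only applies the cutoff when the property has a value.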

You could also increase the frequency of some of the scheduled jobs. Maybe you want breached SLAs to update more than once a day, which could be accomplished by changing the frequency of the “SLA update (already breached)” job.
Or maybe you want SLAs breaching in 30 days or more to be recalculated more than once every 5 days; in that case, change the frequency of the “SLA update (breach after 30 days)” job.

If you want list views of SLAs to always be up-to-date, you could implement a business rule with the options “before” and “query” which updates all active and unpaused SLAs before a database query for these is processed.

Also, note that the SLACalculatorNG script include has a method called “calculateAll”, which will recalculate all active and unpaused SLAs with an elapsed percentage less than the maximum value in the system property.

Be aware of the potential performance hit any change could have on your instance, which largely depends on how many active SLAs you have.
