image8

SQL Query monitoring in Orion SAM – best practice

Postat av Thomas Löfstrand on 27. juli 2022

As a SolarWinds Thwack MVP and a MS SQL DBA since many years, combining SolarWinds Orion with SQL is close to hand. Especially SAM, Server and Application Monitoring, and the “SQL User Experience Monitor Component” has helped me in several cases. With this component you can query a monitored SQL Instance with almost any query you want, as long as you format the output from the query in the correct way.

 

The question, the query, must be formulated so that the answer, the output, is a number. For example:

  • How many errors have there been in the last 5 minutes?
  • How many backups has failed in the last 12 hours?
  • How many databases has NOT a valid backup in the last 12 hours?
  • How long is the queue?
  • How old is the oldest post in the queue?
  • How many new orders have we received in the last 10 min?

 

As you see, the questions can vary a lot. Not just looking at how many errors or failures we have, but also look at how long a queue is or how many new posts we have in the last x minutes. By knowing what is normal we can alert if something is not normal, meaning something in the data flow chain is not working as expected.

 

Configure the SQL User Experience Component

When you add the SQL User Experience Monitor Component to a SAM Template, you are presented with below form. I will go through a few of them, to get info on all fields please go to solarwinds.com.

 

image-1

Description

I like to add a good description of what this component does and what to do when there is an alert triggered here. I do this on all components. In 6 months, you will have forgotten, and the people receiving the alert will most likely not know what this is all about, so I send this text in the alert email. Could also be a link to a KB.

 

Credentials for Monitoring

An account that has read-permission on the database.

 

Port and SQL Server Instance

If you use a SQL default instance you don’t have to do anything. If you have a named instance and SQL Browser service available, just fill in the Instance. If that does not work and the SQL Instance listen on a non-standard port, put in the Port Number by changing Port Type from “Use Default Port” to “Use Static Port” and add the port number.

image-2

 

Initial Catalog

What database the query should be ran against. You cannot use “USE [DBName];” in the query.

 

Windows Authentication first, then SQL authentication

Check the box if the credentials you are using are a Windows/active directory account. It will still work but you will get clutter in the log files. It’s just wrong.

 

Orion Requirements on the SQL Query

The output from the query must follow some rules for Orion to accept it.

  • The output must be a number in the first cell
  • The second cell can be a string
  • No lines beside the first line will be used

image3

 

With above output we give a “0” to Orion as the value we trigger alerts on and create graphs on. The Message “Optional Message” is as it says optional, but I like to use it to provide extra info on what is wrong. The number value MUST be provided. This gives that queries that sometimes return nothing – null, maybe if no rows are found, will not be valid and must be handled.

 

Demo SQL Table

In below examples I will use the Orion database itself for the queries.

 

Simple example

First example, “how many nodes are down”?

SELECT

COUNT(N.Nodeid) AS [Statistic]

FROM Nodes AS N

WHERE N.[Status]=2

image4

 

Another simple example, “how many ‘application down events’ have we had in the last hour”

SELECT

COUNT(E.Eventid) AS [Statistic]

FROM Events AS E

WHERE

E.EventTime>Dateadd(HOUR,-1,Getdate()) — Events last hour

AND E.EventType=505 – Event type Application down

 

We can also add a simple [Message] to the query to utilize that extra field we have:

SELECT

COUNT(E.Eventid) AS [Statistic]

,CAST(COUNT(E.Eventid) AS varchar(2)) + ’ down events’ AS [Message]

FROM Events AS E

WHERE

E.EventTime>Dateadd(HOUR,-1,Getdate()) — Events last hour

AND E.EventType=505 – Event type Application down

image5

If we want to know what application that was down, can we just add the [E].[Message] column as Message? We then need to GROUP BY [E].[Message] but then we will get one row per message and above we said we can only have one row. Now it’s starting to get complicated – next chapter!

Advanced example
In my experience there are two big difficulties when writing a bit more advanced queries for the SQL component:

Getting the message from several rows into one row
Getting a 0, zero, when there are no issues

To combine the message column from several rows into one single cell I use “FOR XML PATH”:

SELECT

N.Caption + ’, ’

FROM Nodes AS N

WHERE N.[Status]=2

FOR XML PATH (”)

image6

To combine it all in a bit more complex query but with a reusable structure, we are going to look at a common example. “How many SQL agent jobs whose name start with “xxx” has failed in the last “yy” hours”?

To get order and structure in the query I use a CTE – Common Table Expression. CTE is a way of using several “subqueries” to get out parts of the result in several steps, where subsequent steps can get data from previous steps. You will see how they are used soon or read more here for example: learnsql.com

First, we take out a list of what SQL agent jobs that has failed:

— Variable Declarations

Declare @TimeCheck AS Datetime

Declare @SearchFor AS nvarchar(20)

— Initialize Variables

set @TimeCheck = CONVERT(datetime,DATEADD(HOUR,-24,GETDATE())) — How long back to search for failed jobs

Set @SearchFor=’OffLoad_’ –Job Name string to search for

SELECT

Count(h.instance_id) AS S1

,j.[name] AS [JobName]

FROM dbo.sysjobhistory AS h

INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id

WHERE

CAST

( CONVERT(VARCHAR, h.run_date)

+ ’ ’ + STUFF(STUFF(RIGHT(’000000′

+ CONVERT(VARCHAR,h.run_time),6),5,0,’:’),3,0,’:’)

AS DATETIME

)> @TimeCheck AND

h.run_status=0 AND –0=Failed

h.step_id=0 AND — 0=Whole Job outcome

j.[name] LIKE @SearchFor + ’%’

GROUP BY J.[name]

The result looks like this:
image7

Then we want to get all the message text into one row instead of several, using XML Path again:

SELECT

’”’ + TheSource.JobName +’”’ + ’ failed ’ + CAST(TheSource.[S1] AS varchar(2)) + ’ times ’ + ’, ’

FROM TheSource

FOR XML PATH(”)

 

“FROM TheSource”?? That is the name of the first CTE, you will see when you get the whole query. But now we have the message part as one field with the help of a CTE.

 

Next, I will count how many times jobs has failed. Maybe an unnecessary step but it gives an inner peace of mind with a nice, structured order.

SELECT

SUM(TheSource.S1)

FROM TheSource

 

Finally, we combine all the data in the answer, the output:

SELECT

ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic

,TheMessage.MessageText AS Message

FROM TheMessage

 

The ISNULL-part is to send a 0, zero, if there are no issues found. Remember, we always have to send a number. Otherwise, SAM thinks the query is broken.

 

Combining all the parts we get this query that will give you how many SQL agents named “OffLoad_*” that has failed last 24 hours:

 

— Variable Declarations

DECLARE @Count INT

Declare @TimeCheck AS Datetime

Declare @SearchFor AS nvarchar(20)

 

— Initialize Variables

SET @Count = 0

set @TimeCheck = CONVERT(datetime,DATEADD(HOUR,-24,GETDATE())) — How long back to search for failed jobs

Set @SearchFor=’OffLoad_’ –Job Name string to search for

 

 

;WITH TheSource AS — The source query from here ….

(

SELECT

Count(h.instance_id) AS S1

,j.[name] AS [JobName]

FROM dbo.sysjobhistory AS h

INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id

WHERE

CAST

(   CONVERT(VARCHAR, h.run_date)

+ ’ ’ + STUFF(STUFF(RIGHT(’000000′

+ CONVERT(VARCHAR,h.run_time),6),5,0,’:’),3,0,’:’)

AS DATETIME

)> @TimeCheck    AND

h.run_status=0 AND –0=Failed

h.step_id=0 AND — 0=Whole Job outcome

j.[name] LIKE @SearchFor + ’%’

GROUP BY J.[name]   — … to here

),

TheMessage (MessageText) AS

(

SELECT

’”’ + TheSource.JobName +’”’ + ’ failed ’ + CAST(TheSource.[S1] AS varchar(2)) + ’ times ’ + ’, ’   — Format the output message

FROM TheSource

FOR XML PATH(”)

),

Counting (count1) AS

(

SELECT

SUM(TheSource.S1)

FROM TheSource

)

SELECT

ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic

,TheMessage.MessageText AS Message

FROM TheMessage