Tuesday, August 23, 2011

Query timeout expired error in ASP application


Timeouts are common in any enterprise-level application. With large databases and many concurrent users hitting the application, I am sure most of you have encountered timeouts at some point.

There can be many reasons for timeouts; the prominent ones are blocking and high CPU/memory usage. Recently I had to work on an application that was getting frequent timeout errors. The error is shown here:
             

On enquiring about the details of the application, I was informed that it was an old ASP application with its database running on SQL Server 2005. A particular report threw the timeout error whenever it was accessed. The report executed a specific stored procedure, taking a date range as its parameter.
Like any DBA with some knowledge of application programming and administration, my first thought was to recommend increasing the connection/command timeout value in the connection string or in IIS. But then a question flashed: “Is this the best way to handle this issue?”

The answer was no. Ideally, I should first monitor the instance for blocking. If there is no blocking, check the CPU and RAM utilization, run a Profiler trace to capture the statements consuming high CPU, tune them, and check whether the timeouts persist. If the issue still persists, ask the network admin to start a netmon trace to check whether data packets are being lost on the network.
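Before anything else, a quick check for blocking; a minimal sketch, assuming SQL Server 2005 or later (sp_who2 works just as well):

SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;  -- any rows here mean a process is being blocked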

On checking the instance for blocking (blocked processes), I couldn’t find anything. Another application I was closely working with had the same timeout issue, but there it was predominantly due to processes blocking each other. On this instance, however, no process was blocked. The next suspect was CPU or memory usage.

To capture the statements consuming high CPU, I decided to start a trace using Profiler. Whenever you work on an application timeout issue, make sure the Attention event is selected in Profiler. This event fires whenever the application gets a timeout, so you can check which T-SQL statement was executed just before it (see the sketch after this list). The events selected in Profiler were:
Attention
User Error Message
RPC:Starting
RPC:Completed
SP:Starting
SP:Completed
SP:StmtStarting
SP:StmtCompleted
SQL:BatchStarting
SQL:BatchCompleted
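If the trace is saved to a file, you can pull out the Attention events (event class 16) afterwards with a query like this; a sketch only, and the file path here is hypothetical:

SELECT StartTime, SPID, EventClass, TextData
FROM sys.fn_trace_gettable('C:\Traces\timeout_trace.trc', DEFAULT)
WHERE EventClass = 16  -- Attention fires when the client times out
ORDER BY StartTime;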

On running the trace, I found that the CPU time, reads, and duration for the stored procedure were indeed high, and I was able to locate the exact T-SQL statement that was throwing the timeout.

As shown in the screenshot, an UPDATE statement was causing the issue. Its reads were high and it was consuming a lot of CPU cycles. The update was executed inside a WHILE loop, and the same trace file had the reads and CPU details of the previously completed update statements.
The next step was to check the execution plan of this update statement and create appropriate indexes. I found a scan on a table with 4,313,995 records and suggested creating a nonclustered index on it so that the table scan would be replaced by a seek. Once the index was created, the timeout errors were resolved. On checking the trace again, I found that the CPU time was drastically reduced.
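For illustration, the fix looked something like the statement below; the table and column names are hypothetical stand-ins for the real ones taken from the execution plan:

CREATE NONCLUSTERED INDEX IX_ReportData_ReportDate
    ON dbo.ReportData (ReportDate)  -- the column in the UPDATE's WHERE clause
    INCLUDE (Status);               -- cover the statement so the scan becomes a seek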




This is just one of the reasons for timeout errors. There are many other factors, some of which I explained above (blocking, network issues, etc.). Application developers often assume these timeouts are database problems, but in my experience they are usually caused by bad T-SQL code or poor database/application design.


Monday, June 27, 2011

Nested Loop vs Hash Match

Last week I was called in to help a team facing a performance issue on one of their production database instances. This was their statement:
Our application performance has been good overall for the past 2 weeks; however, today we noticed that this one single SP is taking 50 secs to execute when the same took 2 secs yesterday.


The issue was with a single stored procedure. The first thing to check was whether there was any blocking in the database instance. No, nothing! I couldn’t find any blocked processes in the instance.


Since the issue was with a single stored procedure, I started performing the steps every DBA would. I took the list of tables the stored procedure depends on and checked their fragmentation details. No, the tables were not fragmented.
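Fragmentation can be checked per table with a DMV query like this; a sketch for SQL Server 2005 and later, with a hypothetical table name:

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id;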


OK, then maybe there was an issue with the statistics on these tables. Perhaps the statistics were out of date. But when I checked, the statistics were fine.
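The last-updated date of each statistic is easy to check; again a sketch, with a hypothetical table name:

SELECT name, STATS_DATE(object_id, stats_id) AS last_updated
FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.MyTable');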


At this point I decided a trace was needed to capture the execution plans of the queries. On analyzing the execution plan, I found that a particular SELECT statement joining huge tables was using a nested loop join. I remembered the lines from a book by Grant Fritchey: nested loop joins are appropriate for small tables, while for large tables a hash match join is usually the better choice.



Whoops, is this the problem? Is the optimizer creating a wrong plan? If so, why? Do I need to give join hints here and see if the optimizer generates a better plan? While I was thinking about all of this, a thought suddenly flashed across my mind: why not flush the plan cache and see? I used the command DBCC FLUSHPROCINDB to flush the procedure cache for that particular database.
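For reference, it is invoked as shown below; keep in mind that DBCC FLUSHPROCINDB is undocumented, and the database name here is hypothetical:

DECLARE @dbid int;
SET @dbid = DB_ID('MyAppDb');
DBCC FLUSHPROCINDB(@dbid);  -- clears cached plans for this database only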



Once the plan cache was flushed, the stored procedure behaved as expected and responded fast. I then captured a trace with the execution plan once again to check whether the nested loop join operator had changed. As expected, the query was now using a hash match join and not a nested loop.

Updating the statistics would probably also have resolved the issue, but the tables were huge and that would have taken more than 20 minutes. I could also have marked the procedure for recompilation using the sp_recompile stored procedure, but at the time flushing the procedure cache looked like the better option. The changed execution plan is given below.
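For reference, those two alternatives look like this; the procedure and table names are hypothetical:

EXEC sp_recompile N'dbo.MyReportProc';           -- recompiles the proc on its next execution
UPDATE STATISTICS dbo.MyBigTable WITH FULLSCAN;  -- accurate, but slow on huge tables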

   
A nested loop join works better when the tables being joined are small. As long as you have a small data set being joined, this is the best join you can expect to see.

Hash match joins are often very efficient on large data sets, especially if one table is much smaller than the other. They work well on tables that are not sorted on the columns used in the join condition. Be aware that a hash match can sometimes also be an indication of a missing WHERE clause or of a missing index.
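If you want to compare the two physical joins yourself, you can force each one with a join hint; a sketch with hypothetical tables. Hints override the optimizer, so use them for testing only:

SELECT o.OrderID, c.Name
FROM dbo.Orders AS o
INNER HASH JOIN dbo.Customers AS c  -- force a hash match
    ON c.CustomerID = o.CustomerID;

SELECT o.OrderID, c.Name
FROM dbo.Orders AS o
INNER LOOP JOIN dbo.Customers AS c  -- force nested loops
    ON c.CustomerID = o.CustomerID;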

This KB article has the details on how the three join operators normally work:
http://support.microsoft.com/kb/197297


Thursday, May 5, 2011

Disabled jobs run automatically in SQL Server 2005/2008.

A common myth is that there is a bug in SQL Server 2005/2008 whereby a disabled job will keep running unless its schedule is also disabled. I thought the same way until today, when I decided to test it.

The first step was to create a test SQL Server Agent job that executes sp_who2 in its step, scheduled to run daily, every minute. As expected, the job ran every minute; a sketch of how it was set up follows.
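For reference, the job can be created with the documented msdb procedures; a minimal sketch (the schedule name is my own choice):

EXEC msdb.dbo.sp_add_job @job_name = N'Test_SQLBugJob';
EXEC msdb.dbo.sp_add_jobstep @job_name = N'Test_SQLBugJob',
     @step_name = N'run sp_who2', @subsystem = N'TSQL', @command = N'EXEC sp_who2;';
EXEC msdb.dbo.sp_add_jobschedule @job_name = N'Test_SQLBugJob',
     @name = N'EveryMinute', @freq_type = 4, @freq_interval = 1,  -- daily
     @freq_subday_type = 4, @freq_subday_interval = 1;            -- every 1 minute
EXEC msdb.dbo.sp_add_jobserver @job_name = N'Test_SQLBugJob';     -- target the local server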

Next I decided to update the status of the job directly in the sysjobs table in msdb. I disabled it using this query:

UPDATE msdb.dbo.sysjobs SET enabled = 0 WHERE name = 'Test_SQLBugJob';

The job was disabled, as shown in the screenshot.

But when I checked the job history, I found that the job was still executing.

The next thing to try was updating the sysjobactivity table, since it has a next_scheduled_run_date column; if I could set it to NULL, perhaps the job would not execute. The query given below was executed:
UPDATE msdb.dbo.sysjobactivity SET next_scheduled_run_date = NULL WHERE job_id = (SELECT job_id FROM msdb.dbo.sysjobs WHERE name = 'Test_SQLBugJob');

But the job still executed as per its schedule. Even though next_scheduled_run_date was NULL, the job kept running, and each time it executed it updated the next_scheduled_run_date column with a new value.

I decided to give SQL Server Management Studio a try. When the job was disabled through Management Studio, everything was fine and the job never ran. So what commands does Management Studio execute?

I started a Profiler trace and captured the commands. I found that Management Studio executes the stored procedure sp_update_job. This stored procedure in turn executes another stored procedure, sp_sqlagent_notify, which in turn calls an extended stored procedure, xp_sqlagent_notify, located in the SQL Server library file xpstar.dll.
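So the supported way to disable a job from T-SQL is simply:

EXEC msdb.dbo.sp_update_job @job_name = N'Test_SQLBugJob', @enabled = 0;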


SQL Server Agent caches job data in memory. Unless sp_sqlagent_notify is executed, that cache is not refreshed. When you use SSMS, the command is executed and SQL Server Agent refreshes its cache with the new setting.
One of my friends pointed out that the bug is logged on Microsoft Connect:
https://connect.microsoft.com/SQLServer/feedback/details/552707/job-disabled-and-schedule-enabled#details

Note that the suggested workaround is to restart the SQL Server Agent, which again just refreshes the cache. So the point here is: if you want to disable a SQL Server Agent job, never run a plain T-SQL UPDATE statement. Either use SSMS or execute the stored procedure sp_update_job.