We recently received feedback that the performance of one of our major applications had degraded over time and needed to be improved. The application is an ASP.NET MVC 4 web application with dependencies on several WCF services, each backed by a SQL database. In other words, the typical enterprise Microsoft web development stack.
In this blog post we will review the steps to measure, diagnose and improve performance.
The key to diagnosing performance issues is having metrics data, and lots of it, from different sources if possible. Without data you are just guessing; sometimes with data you are still guessing, but at least the guess is more informed. The more data you have, the more likely you are to correlate an issue with a metric, which could lead to a root-cause fix.
We use Google Analytics (GA) as the first measure of performance from the application users' perspective. We measure AJAX load timings via custom user timings, and page load times are measured by default. GA is a free service and is great if you want to get started quickly and easily, though some configuration will likely be required to get it reporting correctly.
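As a rough sketch of what one of those custom user timings looks like, here is client-side code that reports an AJAX call's duration to GA via the classic analytics.js `timing` hit. The `reportAjaxTiming` helper and the 'AJAX'/'load-orders' names are illustrative, not taken from our actual application:

```javascript
// Sketch: report how long an AJAX call took to Google Analytics as a
// custom user timing. Assumes the standard analytics.js `ga` command
// queue has been loaded on the page; if it hasn't, we skip the send
// but still return the measured duration.
function reportAjaxTiming(timingCategory, timingVar, startMs) {
  var durationMs = Math.round(Date.now() - startMs);
  if (typeof ga === 'function') {
    // 'timing' hit type: category, variable name, value in milliseconds
    ga('send', 'timing', timingCategory, timingVar, durationMs);
  }
  return durationMs;
}

// Usage: note the start time before issuing the request,
// then report once the response handler runs.
var start = Date.now();
// ... AJAX request completes ...
var elapsed = reportAjaxTiming('AJAX', 'load-orders', start);
```

These timings then show up under Behavior > Site Speed > User Timings in GA, alongside the default page load measurements.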
We use MiniProfiler to see live profiling metrics for our ASP.NET MVC sites, and we also store the metrics in a database for later comparison.
For server-level and app-level metrics we have LogicMonitor, which provides CPU, memory, IO, and many other metrics. These can help identify resource and host issues in the environment.
Lastly, we have written automated performance/load tests using Visual Studio Web and Load Tests. These allow us to stress-test the application and see where and when it runs into problems.
These are just some of the options available; there are many more, including AppDynamics, Microsoft's Application Insights, Zabbix, and Datadog. The point is that you need multiple sources of telemetry data to help pinpoint the root cause of your performance issues.
Armed with a slew of metrics, we started digging into the data to find correlations between top-level performance and potential root causes.
During this time we also noticed a significant bump in page load time in production. Notice that the trend in the last third of the graph is double what it had been previously.
This is obviously concerning; we are trying to improve performance, not make it worse. So this became priority number one.
Note: you can ignore the one-day spikes in the graph above. This is a known weekend issue where the performance metrics are skewed because there is little to no traffic.
Next we compared the metrics from the production environment to those from the QA environment to see if they matched; matching metrics would indicate an application issue affecting multiple environments. In this case, QA did not mimic the PROD performance metrics, which suggested the problem was environmental rather than an application issue.
So next we checked the environment resources in LogicMonitor to see if we were capping out on CPU, memory, or disk space. In this particular case everything looked okay: no capping or big jumps correlated with the GA metrics.
Those were the resources for our web servers. Moving down the stack, we checked our service servers, and from a resource perspective they all looked okay as well.
Continuing down the stack, we started investigating the SQL database, and with the help of our DBOps team we found a significant spike in deadlocks and lock timeouts. The timing of this jump correlated with the GA metrics to within 30 minutes, which was a strong indicator that the two were related.
We also checked the SQL database in the QA environment, but no significant locking was occurring there, again suggesting an environmental issue.
Looking further, we also found a deployment to this database in the same 30-minute window in which the jump occurred, which further strengthened the theory that the database was to blame. So the deployment triggered the change, and the root cause had to be environmental because we only saw the issue in production.
We enlisted DBOps to help us look for an environmental difference between the QA and PROD SQL Servers. In the end they found a couple of indexes that had been excluded from an index maintenance job in the PROD environment only. Bingo!
This explained why we only saw the issue in PROD. It also supported the deployment being the trigger: the database deployment included index changes, which further fragmented the unmaintained indexes and thus slowed queries.
Before you can effectively diagnose and fix issues, you have to have metrics captured from all aspects of your system(s). If you don't have these today, stop and add them so you are prepared when issues arise. Another key benefit is that you can use these metrics to identify issues sooner. If our customers are the ones telling us there is a problem, we have really failed; we should identify issues before users do.
When issues arise, walk down the application/system stack from the interface on down, verifying each layer until you find the root cause, and enlist key experts (DBOps, ITOps, etc.) whenever possible.
What do you think? Share your experiences, ideas and feedback in the comments below to continue the conversation.