Incident Summary
We have an SSIS job server in Azure Japan named ADV-SQLJOB-P1F9. It runs a job that synchronizes Advantage data between different parts of the APJ process. The job stopped running: SQL Server was reporting as up but was non-responsive. As a result, certain data fell out of sync until the job was recovered and had processed the backlog of data waiting to be transferred.
Leadup
The Azure Japan temporary drive where TempDB is located for this server became unavailable for a period of time.
Fault
SQL Server did not automatically recover when the temporary drive became available again, and it remained non-responsive until the instance was restarted. In the meantime, the SSIS job on this server could not run, so data was not updated.
Impact
The incident occurred between 10/17 and 10/19. During that window, data that is normally synchronized to various parts of the APJ application was not synchronized.
Detection
The issue was detected by the product owner of the SSIS job, who noticed that it had not been running successfully. Detection could be improved by adding monitoring on the SQL side (completed) and on the Advantage side, as a secondary check that expected data is arriving.
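The secondary check described above amounts to a freshness test: alert when the last successful sync is older than some allowed age. The following is a minimal sketch of that idea, not the monitoring that was actually implemented. The function names, the two-hour threshold, and the way the last success time is obtained are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold: how long the sync may go without a success
# before an alert is raised. The real value depends on the job schedule.
MAX_SYNC_AGE = timedelta(hours=2)


def sync_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_SYNC_AGE,
) -> bool:
    """Return True if the job has never succeeded or last succeeded too long ago."""
    if last_success is None:
        return True
    return now - last_success > max_age


def check_sync(last_success: Optional[datetime], now: Optional[datetime] = None) -> str:
    """Return a status line that an alerting system could act on."""
    now = now or datetime.now(timezone.utc)
    if sync_is_stale(last_success, now):
        return f"ALERT: no successful sync since {last_success}"
    return "OK: sync is current"
```

For a check like this to catch a repeat of this incident, it would need to run somewhere other than ADV-SQLJOB-P1F9 and read the last arrival time from the receiving side, so that it still fires when the job server itself is non-responsive.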
Response
The Data Engineering, DBA, APJ, and Command Center teams all assisted in the response to the incident. Once the issue was detected and resources were requested, there were no delays and the turnaround was very fast.
Recovery
To resolve the issue, the SQL instance was restarted, which recovered TempDB and allowed the job to run successfully. The server was then fully restarted to confirm that it could recover from a general failure of this kind, and this was successful. The job then processed the backlog of data.
Timeline
Root Cause
As far as we were able to determine, the root cause was the Azure temporary drive becoming unavailable to SQL Server for a period of time, a condition from which SQL Server was unable to recover automatically. The issue was fixed with an instance restart, and the fix was confirmed as durable with a full server restart. Monitoring has been added on the DB side, with plans to add coverage from the Advantage app side as well, so that monitoring is independent of the server instance itself.
Recurrence
This issue has not occurred before.
Lessons Learned
This incident underscored the importance of this SSIS process and made it clear that additional monitoring and alerting are warranted. We were also able to confirm that the process is resilient enough to survive an outage with no data loss.
Corrective Actions
Additional monitoring has been implemented on the database side by the DBA team, and additional monitoring will be implemented on the APJ side by the Advantage team.