APJ Advantage Data Issue

Incident Report for 14 West Product

Postmortem

Incident Summary

We have an SSIS job server in Azure Japan named ADV-SQLJOB-P1F9. It runs a job that synchronizes Advantage data between different parts of the APJ process. The job stopped running because the SQL Server instance, although reporting as up, was non-responsive. As a result, certain data fell out of sync until the job was recovered and rerun and the backlog of data was processed.

Leadup

The Azure Japan temporary drive where TempDB is located for this server became unavailable for a period of time.

Fault

When the temporary drive became available again, SQL Server did not recover automatically; it stayed in a failed state until the instance was restarted. In the meantime, the SSIS job on this server could not run, so data was not updated.

Impact

This occurred between 10/17 and 10/19. During that period, data that is normally synchronized to various parts of the APJ application was not synchronized.

Detection

This issue was detected when the product owner of the SSIS job noticed that it had not been running successfully. Detection could be improved by adding monitoring on the SQL side (completed) and on the Advantage side, as a secondary check that expected data is arriving.
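As an illustration of the kind of SQL-side check described above, the following is a minimal sketch of a staleness test for a scheduled job. The function name, the six-hour threshold, and the example timestamps are assumptions made for illustration; the report does not describe how the actual monitoring was implemented.

```python
from datetime import datetime, timedelta
from typing import Optional


def job_is_stale(last_success: Optional[datetime],
                 now: datetime,
                 max_age: timedelta) -> bool:
    """Return True when the job has never succeeded or its last
    successful run is older than max_age (hypothetical threshold)."""
    if last_success is None:
        return True
    return now - last_success > max_age


# Hypothetical values: the last success time is assumed, not taken
# from the report, which only gives an approximate outage start.
last_success = datetime(2020, 10, 17, 0, 0)
now = datetime(2020, 10, 19, 10, 10)
if job_is_stale(last_success, now, max_age=timedelta(hours=6)):
    print("ALERT: sync job has not succeeded within the expected window")
```

A check like this only helps if it runs somewhere other than the server being monitored, since an unresponsive instance cannot report on itself.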

Response

The Data Engineering, DBA, APJ, and Command Center teams all assisted in the response to the incident. Once the issue was detected and resources were requested, there were no delays, and the SQL Server instance was recovered within about an hour and a half of the job failure being noted.

Recovery

To resolve the issue, the SQL instance was restarted, which recovered TempDB, and the job was then able to run successfully. The server was also fully restarted to confirm that it could handle a general failure of this kind, and this was successful. The job then processed the backlog of data.

Timeline

10/17 – Approximate beginning of the outage
10/19 10:10 am – SSIS job failure noted
10/19 11:38 am – SQL server recovered, SSIS job re-run
10/19 12:11 pm – Issue fully resolved
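
For reference, the intervals implied by the timeline can be computed as in the short sketch below. The year is taken from the posting dates, and the times are treated as being in a single, unspecified time zone, which is an assumption.

```python
from datetime import datetime

# Timestamps from the timeline; year assumed from the posting dates.
noted = datetime(2020, 10, 19, 10, 10)      # SSIS job failure noted
recovered = datetime(2020, 10, 19, 11, 38)  # SQL server recovered
resolved = datetime(2020, 10, 19, 12, 11)   # Issue fully resolved


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


print(minutes_between(noted, recovered))  # 88
print(minutes_between(noted, resolved))   # 121
```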

Root Cause

As far as we were able to determine, the root cause was the Azure temporary drive becoming unavailable to SQL Server for a period of time, which SQL Server was unable to recover from automatically. The issue was fixed with an instance restart, and the fix was confirmed as durable with a full server restart. Additional monitoring has been added on the DB side, with plans to add coverage from the Advantage app side as well, so that monitoring is independent of the server instance itself.

Recurrence

This issue has not occurred before.

Lessons Learned

This incident underscored the importance of this SSIS process and made it clear that additional monitoring and alerting are warranted. We were also able to confirm that the process is resilient enough to survive an outage with no data loss.

Corrective Actions

Additional monitoring has been implemented on the database side by the DBA team, and additional monitoring will be implemented on the APJ side by the Advantage team.
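
One possible shape for the planned Advantage-side check is sketched below: given per-day counts of records received, it lists the days in a date range on which fewer records than expected arrived. The function name, the `min_rows` threshold, and the sample counts are hypothetical and are not taken from the actual system.

```python
from datetime import date, timedelta
from typing import Dict, List


def days_missing_data(counts_by_day: Dict[date, int],
                      start: date,
                      end: date,
                      min_rows: int = 1) -> List[date]:
    """Return each day in [start, end] whose record count is below
    min_rows. Days absent from counts_by_day count as zero."""
    missing = []
    day = start
    while day <= end:
        if counts_by_day.get(day, 0) < min_rows:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Hypothetical counts: data arrives on 10/16 and 10/20 only.
counts = {date(2020, 10, 16): 120, date(2020, 10, 20): 95}
gaps = days_missing_data(counts, date(2020, 10, 16), date(2020, 10, 20))
print(gaps)  # the three days 10/17 through 10/19
```

Because this check looks only at what arrived in Advantage, it does not depend on the job server or the SQL instance being responsive.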

Posted Nov 03, 2020 - 11:37 EST

Resolved

The issue preventing data from being updated in the APJ instance of Advantage has now been resolved and all missing signups, unsubscribes, bounces, and spam reports have been added to Advantage. This issue was due to a problem with the server that processes this data and updates Advantage.

BUSINESS IMPACT

You should now be able to see all previously missing data, and new incoming data should also be updating in Advantage. We will be conducting a full root cause analysis and postmortem to determine exactly what allowed this to happen and how we can avoid a recurrence.

For updates in real-time, you can check here at https://14west.statuspage.io/.

If you have any questions, concerns, or comments for the team, please email globalcommandcenter@14west.us.

Thank you,
Global Command Center

Posted Oct 20, 2020 - 16:42 EDT

Update

We have detected an issue preventing some data from being updated in the APJ instance of Advantage. At this time, some signups, unsubscribes, bounces, and spam reports are not being reflected in the APJ instance of Advantage. This is not affecting any other business. We have successfully repaired the process and today's data is currently in Advantage. We are currently working to add the rest of the missing data that was not added between Oct. 17 and Oct. 19 and will provide another update when we have more information or the issue is resolved.

For updates in real-time, you can check here at https://14west.statuspage.io/.

If you have any questions, concerns, or comments for the team, please email globalcommandcenter@14west.us.

Thank you,
Global Command Center

Posted Oct 20, 2020 - 12:21 EDT

Investigating

We are currently investigating reports of Advantage not being updated with recent data for the APJ instance of Advantage only. Other affiliates are not impacted by this issue. We are working to resolve the issue as quickly as possible and will provide an update when we have more information.

If you have any questions please contact us at GlobalCommandCenter@14West.US.

Thank you,
Global Command Center

Posted Oct 20, 2020 - 11:34 EDT

This incident affected: Advantage.