Incident Summary
Affiliates noticed that customers whose email addresses soft-bounced on delivery were incorrectly marked as inactive when they should have remained active. We traced the cause to several queries in the BMETA application, deployed by the Spine team, whose daily job first ran on October 8, 2020. The BMETA queries were fixed to exclude soft bounces, the BMETA job was redeployed, and a BI resource manually corrected the affected records.
Leadup
The issue began on October 8, 2020, with the deployment of the new BMETA process by the Spine team.
Fault
The Snowflake queries against Blueshift bounce data in BMETA incorrectly selected soft bounce records. This caused the job to issue updates to Middleware which incorrectly marked customer records as inactive after a soft bounce occurred when delivering mail to their email address. The queries needed to be fixed to omit records associated with soft bounces.
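The intended selection rule can be illustrated with a minimal Python sketch. The real BMETA logic lives in Snowflake queries that are not reproduced in this report, so the `BounceEvent` type, its `bounce_type` field, and the "hard"/"soft" values below are hypothetical stand-ins rather than the actual Blueshift schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BounceEvent:
    """Hypothetical shape of one bounce record; not the real Blueshift schema."""
    email: str
    bounce_type: str  # assumed values: "hard" or "soft"


def emails_to_deactivate(events):
    """Return the sorted, de-duplicated emails that had a hard bounce.

    Soft bounces are skipped on purpose: they are temporary delivery
    failures and should not cause a customer to be marked inactive.
    """
    return sorted(
        {e.email for e in events if e.bounce_type.strip().lower() == "hard"}
    )
```

Under these assumptions, a record with `bounce_type` of "soft" never reaches the update sent to Middleware, which is the behavior the fixed queries are meant to have.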
Impact
The impact lasted about a week, from the job's first run on October 8, 2020, until the issue was identified on October 15, 2020. Several affiliates were affected because IP warming was under way at the same time, which produced larger-than-normal volumes of soft bounces against otherwise valid customer email addresses.
Detection
The issue was raised through a Support ticket submitted by an affiliate who had noticed very high bounce rates in SendGrid. Detection could have been faster with more detailed monitoring of the process. Metrics were not available when the process was deployed to production, but they are now being added.
Response
The Spine and BI teams responded to the incident. The BI team identified the customer records that had been incorrectly deactivated because of soft bounces and marked them as active again.
The Spine team identified the problems with the BMETA queries and deployed a fix within 24 hours. This was done despite the BMETA DEV environment being down because the OpenShift DEV on-prem cluster it was hosted on became inaccessible for about 1 day. BI also ran into issues due to members of the Advantage team debugging an issue with Advantage PROD.
Recovery
Functionality was restored at about 1:00 PM EDT on October 16, 2020, when the fixed BMETA job was deployed to the PROD environment in OpenShift. The job then triggered and performed the hard bounce updates for the previous business day’s Blueshift mailing data. Soft bounces were correctly left unaffected, and the issue was considered resolved.
Timeline
3 PM Thursday (Oct 15) – Developers and others are called into an all-hands meeting after the problem is identified
4 PM Thursday – Spine team developer fixes broken BMETA queries during all-hands meeting, confirms query correctness with BI resources
9 AM Friday – IT is able to fix issues preventing developers from signing into OpenShift DEV on-prem cluster, allowing BMETA DEV to be redeployed with fix and tested
10:30 AM Friday – BI resource identifies and fixes records incorrectly marked as inactive in Advantage due to soft bounces
1 PM Friday – Spine team developer deploys BMETA to PROD environment and oversees execution of daily job, confirming query changes work in PROD
2:28 PM Friday – Communication sent out that issue was resolved
10:30 AM Saturday – Testing: Spine team developer and Newshift resource compare Friday's per-affiliate hard bounce counts in Snowflake against the hard bounce counts in the Blueshift campaign report to confirm they match
10:30 AM Sunday – Further testing: Spine team developer and Newshift resource repeat the comparison for Saturday's per-affiliate hard bounce counts
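The count comparison used in the weekend testing could be automated along the lines of the sketch below. This is only an illustration: the function name, the affiliate keys, and the idea of holding both sets of counts as in-memory dictionaries are assumptions, and the step of pulling counts from Snowflake and the Blueshift campaign report is not shown.

```python
def count_mismatches(snowflake_counts, report_counts):
    """Compare per-affiliate hard bounce counts from two sources.

    Both arguments map an affiliate identifier to a hard bounce count.
    An affiliate missing from one source is treated as having a count of 0.
    Returns {affiliate: (snowflake_count, report_count)} for every
    affiliate whose counts differ; an empty dict means the sources agree.
    """
    affiliates = set(snowflake_counts) | set(report_counts)
    mismatches = {}
    for affiliate in sorted(affiliates):
        in_snowflake = snowflake_counts.get(affiliate, 0)
        in_report = report_counts.get(affiliate, 0)
        if in_snowflake != in_report:
            mismatches[affiliate] = (in_snowflake, in_report)
    return mismatches
```

An empty result corresponds to the outcome the testers were looking for: the hard bounce counts in Snowflake match those in the campaign report for every affiliate.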
Root Cause
The root cause was traced back to several queries in the BMETA application deployed by the Spine team which first executed its daily job on October 8, 2020. The BMETA queries incorrectly selected soft bounce records. The queries were fixed to omit soft bounce records.
Recurrence
The issue has not occurred before with BMETA, because the process was first deployed on October 8, 2020.
Lessons Learned
The Spine team needs more complete requirements: the distinction between hard and soft bounces did not appear in the requirements and was not raised by anyone. More testing of the BMETA queries could have been performed, but only to a point. The defect was in the business logic, which included soft bounces; without knowing that soft bounce records had to be excluded, testing could not have caught it.
The Spine team's fix was timely but could have been same-day had IT’s OpenShift DEV on-prem cluster been available for testing BMETA DEV before deployment to PROD. Instead, the Spine team had to wait for the DEV cluster to come back up before it could test the fix in DEV and deploy to PROD.
Corrective Actions
The query has already been fixed. To help prevent a recurrence, the MarTech team is looking into defining a threshold number of hard bounces such that, if that many occur in one day, the relevant people are alerted and can investigate the data. This alert would not affect the execution of the BMETA job, which runs once per day and processes 100% of the previous day’s records. The Spine team is also working to add deeper monitoring to the BMETA job.
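A threshold check of this kind might look like the following sketch. The threshold has not yet been defined, so the value passed in, the per-affiliate grouping, and the function name are all assumptions made for illustration; how the alert would actually be delivered is left out.

```python
def affiliates_over_threshold(daily_hard_bounces, threshold):
    """Return the affiliates whose daily hard bounce count meets the threshold.

    daily_hard_bounces maps an affiliate identifier to that day's hard
    bounce count. threshold is a hypothetical, not-yet-agreed limit.
    The check is read-only: it reports affiliates to investigate and does
    not change or block the daily BMETA job.
    """
    if threshold < 1:
        raise ValueError("threshold must be a positive number")
    return sorted(
        affiliate
        for affiliate, count in daily_hard_bounces.items()
        if count >= threshold
    )
```

Grouping by affiliate is one possible design choice here: an unusual spike for a single affiliate, such as one caused by IP warming, would stand out even when the overall daily total looks normal.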