Customers Showing as Bounced on Advantage Lists
Incident Report for 14 West Product
Postmortem

Incident Summary

Affiliates noticed that customer records were being marked as inactive after delivery attempts to their email addresses soft bounced, when those records should have remained active. We traced the cause to several queries in the BMETA application, deployed by the Spine team, which first executed its daily job on October 8, 2020. The BMETA queries were fixed to exclude soft bounces, the BMETA job was redeployed, and the incorrect records were manually fixed by a BI resource.

Leadup

The issue began on October 8, 2020, with the deployment of the new BMETA process by the Spine team.

Fault

The Snowflake queries against Blueshift bounce data in BMETA incorrectly selected soft bounce records. This caused the job to issue updates to Middleware which incorrectly marked customer records as inactive after a soft bounce occurred when delivering mail to their email address. The queries needed to be fixed to omit records associated with soft bounces.
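
As an illustration only, the corrected selection logic can be sketched as a filter that keeps hard bounces and ignores soft ones. This is not the BMETA code or its Snowflake SQL: the BounceEvent type, the bounce_type field, and its "hard"/"soft" values are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BounceEvent:
    """Hypothetical shape of one bounce record; not the real Blueshift schema."""
    email: str
    bounce_type: str  # assumed values: "hard" or "soft"


def emails_to_deactivate(events):
    """Return the addresses that should be marked inactive.

    Only hard bounces qualify. Soft bounces are skipped, which is the
    condition the original queries were missing.
    """
    return sorted({e.email for e in events if e.bounce_type == "hard"})


events = [
    BounceEvent("a@example.com", "hard"),
    BounceEvent("b@example.com", "soft"),  # must stay active
    BounceEvent("a@example.com", "hard"),  # duplicate, collapsed by the set
]
print(emails_to_deactivate(events))
```

With the filter removed, b@example.com would also be returned, which mirrors the behavior described in this section.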

Impact

The impact lasted about a week, from the first BMETA job run on October 8, 2020, until the issue was identified on October 15, 2020. Several affiliates were impacted because IP warming was under way at the same time, which resulted in larger than normal volumes of soft bounces against otherwise valid customer email addresses.

Detection

The issue was raised through a Support ticket submitted by an affiliate who noticed very high bounce rates in SendGrid. Detection could have been improved by monitoring the process in greater detail; metrics were not available when the process was deployed to production but are in the process of being added.

Response

The Spine team and the BI teams responded to the incident. The BI team identified the affected records and marked the affected soft bounce customer records as active once again.

The Spine team identified the problems with the BMETA queries and deployed a fix within 24 hours. This was done despite the BMETA DEV environment being down because the OpenShift DEV on-prem cluster it was hosted on became inaccessible for about 1 day. BI also ran into issues due to members of the Advantage team debugging an issue with Advantage PROD.

Recovery

Functionality was restored at about 1:00 PM EDT on October 16, 2020, when the fixed BMETA job was deployed to the PROD environment in OpenShift, triggered, and performed the hard bounce updates for the previous business day’s Blueshift mailing data. Soft bounces were correctly left unaffected, and the issue was considered resolved.

Timeline

3 PM Thursday (Oct 15) – Developers and others were called into an all-hands meeting after the problem was identified

4 PM Thursday – Spine team developer fixes broken BMETA queries during all-hands meeting, confirms query correctness with BI resources

9 AM Friday – IT is able to fix issues preventing developers from signing into OpenShift DEV on-prem cluster, allowing BMETA DEV to be redeployed with fix and tested

10:30 AM Friday – BI resource identifies and fixes records incorrectly marked as inactive in Advantage due to soft bounces

1 PM Friday – Spine team developer deploys BMETA to PROD environment and oversees execution of daily job, confirming query changes work in PROD

2:28 PM Friday – Communication sent out that issue was resolved

10:30 AM Saturday – Testing - Spine team developer and Newshift resource collaborate to obtain and compare counts of hard bounces per affiliate in Snowflake which occurred on Friday to ensure they match counts of hard bounces reported by Blueshift campaign report

10:30 AM Sunday – Further Testing - Spine team developer and Newshift resource collaborate to obtain and compare counts of hard bounces per affiliate in Snowflake which occurred on Saturday to ensure they match counts of hard bounces reported by Blueshift campaign report
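
The count comparison performed during the weekend testing could look roughly like the sketch below. It assumes the per-affiliate hard bounce counts from Snowflake and from the Blueshift campaign report have already been loaded into plain dictionaries; the function name and the affiliate codes are hypothetical.

```python
def count_mismatches(snowflake_counts, blueshift_counts):
    """Compare per-affiliate hard bounce counts from two sources.

    Returns {affiliate: (snowflake_count, blueshift_count)} for every
    affiliate whose counts differ. An affiliate missing from one source
    is treated as having a count of 0 there.
    """
    affiliates = set(snowflake_counts) | set(blueshift_counts)
    mismatches = {}
    for affiliate in sorted(affiliates):
        sf = snowflake_counts.get(affiliate, 0)
        bs = blueshift_counts.get(affiliate, 0)
        if sf != bs:
            mismatches[affiliate] = (sf, bs)
    return mismatches


# Made-up numbers and affiliate codes, for illustration only.
snowflake = {"AFF1": 120, "AFF2": 45, "AFF3": 7}
blueshift = {"AFF1": 120, "AFF2": 44}
print(count_mismatches(snowflake, blueshift))
```

An empty result means the two sources agree for every affiliate on that day.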

Root Cause

The root cause was traced back to several queries in the BMETA application deployed by the Spine team which first executed its daily job on October 8, 2020. The BMETA queries incorrectly selected soft bounce records. The queries were fixed to omit soft bounce records.

Recurrence

The issue has not happened before with BMETA, as the process had only been deployed on October 8, 2020.

Lessons Learned

The Spine team needs better requirements from others since the distinction between hard and soft bounces was not present or mentioned by anyone in the requirements. Further testing could have been performed of the BMETA queries to ensure correctness, though only to a point – the issue was with business logic including soft bounces. Without knowing not to include soft bounce records, testing could not have caught the issue.

The fix on the Spine team side was timely but could have been same-day had IT’s OpenShift DEV on-prem cluster been available for testing BMETA DEV prior to deployment to PROD. Instead, the Spine team had to wait until the DEV cluster came back up to test the fixes in DEV before deploying to PROD.

Corrective actions

Corrective actions have already been taken to fix the query. To ensure it won’t happen again, the MarTech team is looking into defining a threshold number of hard bounces such that, if enough of them occurred in one day, relevant folks would be alerted and be able to investigate the data. This alerting would not affect the execution of the BMETA job, which runs once per day and processes 100% of the previous day’s records. The Spine team is also working to add deeper monitoring to the BMETA job.
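
A minimal sketch of the kind of threshold check under consideration is shown below. The threshold has not yet been defined, so the value used here is a placeholder, and the function name and affiliate codes are hypothetical. The check only reports which affiliates to look at; it does not block or change the daily job.

```python
def affiliates_over_threshold(hard_bounce_counts, threshold):
    """Return affiliates whose daily hard bounce count meets or exceeds the threshold.

    hard_bounce_counts maps an affiliate code to its hard bounce count for
    one day. The result is sorted so alerts are stable from run to run.
    """
    return sorted(
        affiliate
        for affiliate, count in hard_bounce_counts.items()
        if count >= threshold
    )


# Placeholder threshold and made-up counts, for illustration only.
DAILY_HARD_BOUNCE_THRESHOLD = 500
daily_counts = {"AFF1": 120, "AFF2": 2300, "AFF3": 500}

flagged = affiliates_over_threshold(daily_counts, DAILY_HARD_BOUNCE_THRESHOLD)
if flagged:
    print("Investigate hard bounce volume for:", ", ".join(flagged))
```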

Posted Oct 30, 2020 - 09:35 EDT

Resolved
The issue causing customers to be incorrectly flagged as “Bounced” has now been resolved. This was caused by a bug in the process that updates Advantage with mailing events. The team has implemented a fix for that bug to prevent the issue going forward and has repaired all incorrect data.

BUSINESS IMPACT

All customer records should now be accurately reflecting their status. We will be conducting a full root cause analysis and postmortem to ensure that this issue does not happen again.

For updates in real-time, you can check here at https://14west.statuspage.io/.

If you have any questions, concerns, or comments for the team, please email globalcommandcenter@14west.us.

Thank you,
Global Command Center
Posted Oct 16, 2020 - 14:38 EDT
Update
We have begun the process of correcting the affected records in Advantage and expect to have all data repaired by 1 PM ET. Our next step is to implement the permanent solution to this issue to prevent additional incorrect data from getting entered into the system. This solution will process the incorrect data that comes into Advantage after 10:00 AM ET today.

BUSINESS IMPACT

At this time, customer records that were incorrectly flagged as “Bounced” prior to 10:00 AM ET today are being fixed. No customers on paid lists were incorrectly flagged. Customers who have been incorrectly marked as “Bounced” after 10:00 AM ET today will be corrected by the permanent solution being implemented later today and should be corrected by the end of the day. We will provide another update later this afternoon.

For updates in real-time, you can check here at https://14west.statuspage.io/.

If you have any questions, concerns, or comments for the team, please email globalcommandcenter@14west.us.

Thank you,
Global Command Center
Posted Oct 16, 2020 - 12:05 EDT
Identified
We have found that the process that feeds data into Advantage is causing some customers to be incorrectly flagged as "Bounced" for lists, AMB, PRO, and CIR. The team is currently working to resolve the root cause and restore all data to its correct state.

At this time, customers incorrectly marked as "Bounced" will not receive emails. The team has developed a solution to this issue and it is currently being tested. We are also actively working to repair all incorrect data and hope to have that completed by the end of the day tomorrow. We will provide another update on our progress tomorrow morning.

For updates in real-time, you can check here at https://14west.statuspage.io/.

If you have any questions, concerns, or comments for the team, please email globalcommandcenter@14west.us.

Thank you,
Global Command Center
Posted Oct 15, 2020 - 17:18 EDT
Investigating
We are currently investigating reports of customers being incorrectly marked as "Bounced" on Advantage lists. We are working to resolve the issue as quickly as possible and will provide an update when we have more information.

If you have any questions, please contact us at GlobalCommandCenter@14West.US.

Thank you,
Global Command Center
Posted Oct 15, 2020 - 14:44 EDT
This incident affected: Blueshift.