We've found the root cause of the issues and have implemented fixes.
Our Kafka provider enforces rate limits, which we had recently exceeded. Their system rejected our writes, which caused our publish requests to hang. We've increased the limits and plan to migrate from the hosted solution to a BYOC (bring your own cloud) solution for more bandwidth and control in the future.
The inability to publish events caused us to lose some data, because some events recorded on devices never reached the backend. For revenue tracking, however, the providers should retry their webhooks, so that data should fill in.