Wednesday
Room 2
17:40 - 18:40
(UTC+02)
Talk (60 min)
1 Million Ways to Break Production: IoT Systems at Scale
When we work with cloud systems, we are used to controlling the environment. We have fast and reliable networks, continuous deployment, and end-to-end observability. We can trace, monitor, and debug services in real time, and roll back any broken deployment.

IoT systems at scale are different. The production environment is constrained by real-world physics: devices are deployed into parking cellars with poor cellular connectivity, or outdoor garages with flaky Wi-Fi. At this scale, hardware issues show up, cables degrade, and human errors happen.

With 1 million connected devices, different classes of problems show up. That a single device operates correctly on its own does not imply that the system operates correctly for 1 million devices. Synchronized behavior, like devices checking for updates at midnight, quickly turns into a thundering herd that exceeds your service quotas and exhausts your connection pools.

We will talk about how our rapid growth outgrew the initial system design and caused production to go down for several days. Retry logic and home automation systems became self-inflicted DDoS attacks. Non-essential services became global bottlenecks.

This talk is an experience report about the incidents and learnings from running Easee’s IoT Platform through a period where Easee scaled from 0 to 1 million chargers, in a journey of growth, near-collapse, and the way back.
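As a rough illustration of the thundering-herd and retry problems described above, here is a minimal Python sketch of two common mitigations: spreading scheduled checks across a time window, and adding jitter to retry delays. It is not taken from the talk and does not describe Easee's implementation. The function names, the 24-hour window, and the 1 s base and 300 s cap are all assumptions chosen for the example.

```python
import hashlib
import random
from typing import Optional


def update_check_offset_s(device_id: str, window_s: int = 86_400) -> int:
    """Map a device ID to a stable offset (in seconds) inside a window.

    Hypothetical helper: each device checks for updates at its own fixed
    second of the day, not all at midnight.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_s


def backoff_delay_s(
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Exponential backoff with full jitter (assumed parameters).

    Returns a delay drawn uniformly from [0, min(cap_s, base_s * 2**attempt)],
    so devices that fail together do not all retry at the same moment.
    """
    source = rng if rng is not None else random
    exponent = min(max(attempt, 0), 32)  # bound the exponent to avoid overflow
    ceiling = min(cap_s, base_s * (2 ** exponent))
    return source.uniform(0.0, ceiling)
```

A device could compute `update_check_offset_s` once from its serial number and call `backoff_delay_s(attempt)` before each reconnect attempt. Randomizing across the full interval trades a slightly longer average wait for a much flatter load on the backend.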

