CloudFlare Suffers an Hour-Long Outage while Mitigating a DDoS Attack
What started as a small-scale DDoS attack on a computer server wiped out a chunk of the Internet managed by DDoS protection company CloudFlare.
“The outage affected all of CloudFlare’s services including DNS and any services that rely on our web proxy. During the outage, anyone accessing CloudFlare.com or any site on CloudFlare’s network would have received a DNS error,” writes CloudFlare co-founder and CEO Matthew Prince in a blog post.
The outage occurred when CloudFlare’s edge routers failed, cutting the company’s 23 data centers off from the rest of the Internet. The routers could no longer announce the routes that data packets need to reach their destination. Some 785,000 websites, including Wikileaks, 4chan and Metallica.com, were affected.
It started with a DDoS attack against the servers of one of CloudFlare’s customers, something that CloudFlare is extremely good at detecting and fending off. After profiling the attack, the team wrote a traffic filtering rule (drop attack packets of an unusually large size, between 99,971 and 99,985 bytes) and began propagating it to the company’s Juniper edge routers via Flowspec, a BGP extension for distributing filter rules.
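One detail about the rule is worth illustrating. The IPv4 header stores a packet's total length in a 16-bit field, so no IPv4 packet can be larger than 65,535 bytes, and the range quoted above lies entirely beyond that limit. The Python sketch below shows the kind of sanity check that would flag such a range before a rule is pushed out. It is only an illustration: the names PacketLengthRule and validate_rule are hypothetical, and nothing here reflects CloudFlare's actual tooling or the Flowspec wire format.

```python
from dataclasses import dataclass

# The IPv4 "Total Length" header field is 16 bits wide,
# so no IPv4 packet can be larger than this many bytes.
MAX_IPV4_PACKET_BYTES = 65_535


@dataclass(frozen=True)
class PacketLengthRule:
    """Hypothetical stand-in for a 'drop packets in this size range' rule."""
    min_bytes: int
    max_bytes: int

    def matches(self, packet_bytes: int) -> bool:
        # A packet matches when its length falls inside the inclusive range.
        return self.min_bytes <= packet_bytes <= self.max_bytes


def validate_rule(rule: PacketLengthRule) -> list[str]:
    """Return a list of problems found; an empty list means the rule looks sane."""
    problems = []
    if rule.min_bytes < 0:
        problems.append("minimum length is negative")
    if rule.min_bytes > rule.max_bytes:
        problems.append("minimum length exceeds maximum length")
    if rule.max_bytes > MAX_IPV4_PACKET_BYTES:
        problems.append(
            f"maximum length {rule.max_bytes} exceeds the IPv4 limit "
            f"of {MAX_IPV4_PACKET_BYTES} bytes"
        )
    return problems


if __name__ == "__main__":
    # The range reported for the rule involved in the outage.
    reported = PacketLengthRule(min_bytes=99_971, max_bytes=99_985)
    for problem in validate_rule(reported):
        print("rejecting rule:", problem)
```

A check like this says nothing about how a particular router handles an out-of-range rule; it only shows that the range could have been rejected before deployment.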
It is unknown why the rule, instead of dropping the attack traffic, exhausted the routers’ RAM, leaving them half-crashed, unable to route any traffic and unable to respond to remote management requests for a soft reboot.
With many of the edge routers unable to reboot automatically, the remaining routers took on traffic from the entire CloudFlare network and were overloaded in turn. The operations team had to manually power-cycle edge routers in the CloudFlare data centers, a time-consuming physical reboot that accounts for the hour-long outage.
“We let our customer down this morning, but we will learn from the incident and put more controls in place to eliminate problems like this in the future,” Prince said in the official company announcement.