Telecom network outages: causes, effects, and remedies for telecom providers & IT enterprise
Network outages, historically caused by misconfigurations, software defects, or hardware failures, are becoming more disruptive because of hyper-connectivity and over-reliance on concentrated hyperscaler cloud infrastructure. Both expand the “blast radius” of any single point of failure. The latest Cloudflare outage shows that enterprises relying heavily on a small number of major IT providers face critical single points of failure, leading to authentication issues, lost revenue, and broken customer experiences.
Cloudflare is a global cloud services and cybersecurity firm. It provides data centers, website and email security, protection from data loss and defences against cyber threats, among other things. It describes itself as providing an “immune system for the internet”, with technology that sits between its clients and the wider world and blocks billions of cyber threats daily. It also uses its global infrastructure to speed up internet traffic. It makes more than $500m a quarter from nearly 300,000 customers operating in 125 countries, including China. Users of several heavy-traffic websites reported that those sites went offline at the same time as the Cloudflare outage.
Akamai’s Reuben Koh advocates for a distributed compute and edge architecture whose components act as autonomous cells, mitigating systemic risk and improving resilience. He also suggests strategies such as graceful degradation and diversifying cloud providers, which help telecom operators and other organizations limit the spread of outage disruption.
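To make graceful degradation and provider diversification concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Akamai's or any vendor's implementation: the function `fetch_with_fallback` and the two provider stubs are hypothetical names invented for this example. The idea is that a request tries each independent provider in turn and, if all of them fail, serves stale cached content rather than an error.

```python
from typing import Callable, Optional, Sequence, Tuple


def fetch_with_fallback(
    providers: Sequence[Callable[[], str]],
    cached: Optional[str] = None,
) -> Tuple[str, str]:
    """Try each provider in order; degrade to cached content if all fail.

    Returns (content, mode), where mode is "live", "degraded",
    or "unavailable".
    """
    for provider in providers:
        try:
            return provider(), "live"
        except Exception:
            # One provider failing must not take the whole request down.
            continue
    if cached is not None:
        # Graceful degradation: stale content beats an error page.
        return cached, "degraded"
    return "", "unavailable"


def primary_provider() -> str:
    # Hypothetical provider stub that simulates an outage.
    raise ConnectionError("primary provider unreachable")


def secondary_provider() -> str:
    # Hypothetical second, independent provider stub.
    return "fresh content from secondary"


if __name__ == "__main__":
    print(fetch_with_fallback([primary_provider, secondary_provider]))
    print(fetch_with_fallback([primary_provider], cached="stale copy"))
```

In a real deployment the providers would be separate networks or clouds with independent failure modes; the sketch only shows the control flow that keeps a single provider failure from becoming a full outage.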

- Systemic Risk Assessment: Regulations such as the EU’s Digital Operational Resilience Act (DORA), in effect since January 17, 2025, are moving from assessing a single firm’s risk to evaluating the broader market impact of a critical third-party provider failure. Under DORA, ICT third-party providers designated as critical are subject to direct oversight.
- Operational Resilience Mandates: Jurisdictions are pushing firms to demonstrate that they can maintain operations or safely exit a relationship with a non-performing cloud service provider (CSP). This includes requirements for robust contingency and exit plans.
- Geographic Examples:
  - Singapore is framing cloud infrastructure as essential national computing, issuing specific resilience guidelines.
  - Australia has issued warnings to financial institutions regarding over-dependence on a narrow set of US-based hyperscalers.
  - Japan is tightening scrutiny and expectations around managing third-party cloud risks.
- Continuous Discovery & Inventory: Telecom operators must maintain an up-to-date, comprehensive inventory of all APIs (managed, unmanaged, “shadow,” and “zombie”) across the enterprise.
- Shift-Left Security: Integrate security testing and design principles early into the software development lifecycle to identify and remediate vulnerabilities before APIs reach production environments.
- Implement Zero Trust Architecture (ZTA): Adopt a “never trust, always verify” approach, assuming an attacker may already be internal. This means applying strict authentication and authorization controls at the API level, not just the network perimeter.
- Strong Authentication and Authorization: Use robust mechanisms like OAuth 2.0 and OpenID Connect, and apply the principle of least privilege so that entities have only the minimum necessary access.
- Runtime Protection and Monitoring: Implement API gateways for centralized traffic management, rate limiting to prevent Denial-of-Service (DoS) attacks, and use behavioral analytics to detect anomalous activity indicative of abuse.
- Input Validation and Data Handling: Strictly validate and sanitize all data inputs to prevent injection attacks, and ensure APIs only expose necessary information to minimize data leakage.
- Human Oversight in AI: As AI and network automation increase, maintain robust human oversight. Telecom staff should remain closely involved in change management and incident response, because AI systems can behave unpredictably.
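The rate limiting mentioned under runtime protection is commonly implemented with a token bucket. The Python sketch below is one possible illustration, not a specific gateway's API: `TokenBucket`, `PerClientLimiter`, and their parameters are hypothetical names chosen for this example. Each client gets a bucket that refills at a fixed rate, and a request is rejected once the bucket is empty.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so tests can control time
        self.tokens = capacity        # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client ID (for example, an API key)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        if client_id not in self.buckets:
            self.buckets[client_id] = TokenBucket(
                self.rate, self.capacity, self.clock)
        return self.buckets[client_id].allow()


if __name__ == "__main__":
    limiter = PerClientLimiter(rate=5.0, capacity=10.0)
    results = [limiter.allow("client-a") for _ in range(12)]
    print(results.count(True), "allowed,", results.count(False), "rejected")
```

Keeping a separate bucket per client means one abusive caller exhausts only its own allowance. A production gateway would also need to evict idle buckets and share state across gateway instances, which this sketch leaves out.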
Note 1. The UK’s Telecommunications (Security) Act 2021 is a landmark law that sets mandatory security standards for public telecom networks, making cybersecurity a legal duty for providers in order to protect critical infrastructure. It empowers the regulator Ofcom, introduces penalties for non-compliance (up to 10% of turnover), and mandates adherence, through phased deadlines, to the specific security measures in the Code of Practice (CoP), which require strong governance, supply chain security, and proactive threat management.
……………………………………………………………………………………………………………………………………………………………….
Conclusions:
Koh advocates for the implementation of resilient network architectures and improved operational maturity to enhance system fault tolerance. Key steps include distributed design, optimized operational protocols, comprehensive network visibility, and pragmatic capacity planning. These measures are becoming increasingly important as telecommunications infrastructure underpins essential societal functions.
………………………………………………………………………………………………………………………………..
References:
What telecom operators can learn from recent network outages
Cloudflare outage highlights enterprise infrastructure dependence
https://www.techtarget.com/whatis/feature/8-largest-IT-outages-in-history
España hit with major telecom blackout after power outage April 28th
Comcast frequent, intermittent internet outages + long outage in Santa Clara, CA with no auto-recovery!
AT&T wireless outage effected more than 74,000 U.S. customers with service disruptions lasting up to 11 hours for some
Rogers Telecommunications restores service after 19 hour outage disrupting life in Canada
GSMA, ETSI, IEEE, ITU & TM Forum: AI Telco Troubleshooting Challenge + TelecomGPT: a dedicated LLM for telecom applications

