Blog

Engineering deep dives, product updates, and industry insights from the SlamCDN team.

🛡 Security

Understanding Layer 7 DDoS Attacks and How CDNs Mitigate Them

Application-layer attacks are getting more sophisticated. Here's how our edge network identifies and absorbs them without impacting legitimate traffic.

February 10, 2026 · 8 min read

Layer 7 vs. Volumetric Attacks

Most people think of DDoS attacks as brute-force floods — massive volumes of traffic designed to saturate a network link. Those are Layer 3/4 attacks, and they're relatively straightforward to mitigate: you absorb the traffic at the edge with enough network capacity, drop the malicious packets, and move on.

Layer 7 (application-layer) attacks are different. They target the HTTP/HTTPS layer itself, sending requests that look legitimate but are designed to exhaust server resources — CPU, memory, database connections, or application threads. A single Layer 7 request might trigger a database query, a template render, and a session lookup. Multiply that by 50,000 requests per second from a distributed botnet, and even a powerful origin server will buckle.

Why They're Hard to Stop

The challenge with Layer 7 attacks is distinguishing malicious requests from real users. The requests use valid HTTP methods, carry proper headers, and often come from residential IP addresses (compromised IoT devices, browser botnets). Simple rate-limiting by IP isn't enough — attackers rotate through thousands of IPs, and aggressive rate limits will block legitimate users on shared networks (corporate offices, universities, mobile carriers).

SlamCDN's Mitigation Pipeline

Our Layer 7 DDoS mitigation runs as a multi-stage pipeline at every edge PoP. Each request passes through these stages before reaching the customer's origin:

  1. Reputation scoring: Every incoming IP is checked against a continuously updated reputation database. We aggregate signals from across our entire network — if an IP has been involved in attacks against any SlamCDN customer, it gets a risk score. High-risk IPs are challenged or rate-limited immediately.
  2. Behavioral fingerprinting: We analyze request patterns in real-time: request rate, header consistency, TLS fingerprint (JA3), HTTP/2 frame ordering, and navigation patterns. Real browsers have distinctive fingerprints that are difficult for attack tools to replicate perfectly.
  3. Adaptive rate limiting: Rather than a static requests-per-second threshold, our rate limiter adapts to each customer's normal traffic patterns. We establish a baseline over 7 days and flag deviations. A site that normally receives 500 req/s getting 15,000 req/s from a single ASN triggers automatic mitigation.
  4. JavaScript challenge: For suspicious traffic that passes the first three stages, we can inject a lightweight JavaScript challenge that verifies the client is a real browser with a functional JS engine. This stops most headless scripts and simple HTTP clients. The challenge is designed to complete in under 200ms on modern browsers.
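To make the pipeline concrete, here is a deliberately simplified Python sketch. The scores, thresholds, and the lone behavioral signal are invented for illustration; the production pipeline uses far richer signals at every stage:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    ip: str
    ja3: str
    headers: dict = field(default_factory=dict)

# Illustrative values only -- not production scoring.
REPUTATION_DB = {"203.0.113.9": 0.9}   # hypothetical per-IP risk scores
CHALLENGE_THRESHOLD = 0.7
RATE_LIMIT_THRESHOLD = 0.4

def classify(req: Request, recent_rps: float, baseline_rps: float) -> str:
    """Return an action for a request: 'allow', 'rate_limit', or 'challenge'."""
    # Stage 1: reputation scoring
    score = REPUTATION_DB.get(req.ip, 0.0)
    # Stage 2: behavioral fingerprinting (toy signal: missing User-Agent)
    if "User-Agent" not in req.headers:
        score += 0.3
    # Stage 3: adaptive rate limiting -- deviation from the learned baseline
    if baseline_rps > 0 and recent_rps / baseline_rps > 10:
        score += 0.4
    if score >= CHALLENGE_THRESHOLD:
        return "challenge"        # Stage 4: serve the JavaScript challenge
    if score >= RATE_LIMIT_THRESHOLD:
        return "rate_limit"
    return "allow"
```

In this toy model, a request from a known-bad IP gets challenged even at normal traffic rates, while a sudden 30x spike from a clean IP only triggers rate limiting.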
50+ Tbps Scrubbing Capacity · <5ms Detection Latency · $0 Extra Cost

A Real Attack: What 2M req/s Looks Like

In January 2026, one of our e-commerce customers was targeted by a sustained Layer 7 attack peaking at 2.1 million requests per second. The attack used a botnet of approximately 180,000 residential IPs, each sending only 10-12 requests per second — well below any per-IP rate limit that wouldn't also block real users.

Our behavioral fingerprinting caught it within 3 seconds. The attack traffic had two telltale signals: all requests used identical TLS fingerprints (the botnet nodes were running the same HTTP client library), and the request timing was uniformly distributed (real user traffic is bursty, not uniform). We applied targeted rate limits to the specific JA3 fingerprint and the attack was fully mitigated without a single request reaching the customer's origin.
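The timing signal is easy to reason about with a toy statistic: the coefficient of variation (CV) of inter-arrival gaps. A bot firing on a fixed timer produces nearly uniform gaps (CV near 0), while human-driven traffic is bursty (CV near or above 1). This sketch, with an invented 0.1 threshold rather than our production detector, flags suspiciously uniform request timing:

```python
import statistics

def interarrival_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive requests."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean

def looks_automated(timestamps, cv_threshold=0.1):
    # Near-zero variance in request spacing is a strong bot signal.
    return interarrival_cv(timestamps) < cv_threshold

# A node firing exactly every 100 ms vs. bursty human activity
bot = [i * 0.1 for i in range(50)]
human = [0.0, 0.05, 0.07, 0.9, 1.0, 3.2, 3.25, 7.8]
```

A real detector combines this with the other fingerprint dimensions (JA3, header consistency, frame ordering) rather than relying on timing alone.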

What You Can Do

DDoS mitigation is included at no extra cost on all SlamCDN plans. It's always on — there's nothing to configure. For customers who want more control, our Edge Rules engine lets you create custom rules: block specific countries, require challenges for certain paths, or set per-endpoint rate limits. Enterprise customers also get access to our security team for custom rule tuning during active incidents.

🚀 Product

Introducing Origin Shield: Reduce Origin Load by 95%

Origin Shield adds a mid-tier caching layer that collapses requests from all PoPs into a single origin fetch. Now generally available.

January 23, 2026 · 5 min read

The Origin Overload Problem

Here's a scenario every CDN customer eventually faces: you have a resource that's cached at each of your CDN's edge PoPs independently. When the cache expires, every PoP that receives a request for that resource simultaneously fetches it from your origin server. With 95 PoPs, a single cache expiration event can generate 95 concurrent requests to your origin for the exact same file.

This is called the "thundering herd" problem, and it gets worse with more PoPs. For customers with short cache TTLs (common for news sites, API responses, or dynamic content), origin load scales linearly with the number of edge nodes. Your CDN is supposed to reduce origin load, but without request collapsing, adding PoPs can actually increase it.

How Origin Shield Works

Origin Shield introduces a mid-tier caching layer between the edge PoPs and your origin. Instead of 95 PoPs each talking to your origin independently, they route cache-miss traffic through a designated Shield node. The Shield node is the only thing that talks to your origin.

The flow works like this:

  1. A user in Tokyo makes a request. The Tokyo edge PoP checks its local cache — miss.
  2. Instead of going to your origin, Tokyo sends the request to the Shield node (e.g., in Singapore).
  3. The Shield node checks its cache. If it has the resource, it returns it immediately. If not, it fetches from your origin.
  4. Crucially, if the Shield node is already fetching that resource for another PoP, it coalesces the request — Tokyo simply waits for the in-flight fetch to complete and receives the same response.
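Request coalescing is the same idea as the classic "single-flight" pattern. Here is a minimal threaded sketch of it, not our actual Shield implementation:

```python
import threading

class SingleFlight:
    """Coalesce concurrent fetches for the same key into one origin call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> (done event, result holder)

    def fetch(self, key, origin_fetch):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True          # this caller does the real fetch
            else:
                leader = False         # someone else is already fetching
        event, box = entry
        if leader:
            try:
                box["value"] = origin_fetch(key)
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()            # wake every waiting follower
        else:
            event.wait()
        return box["value"]
```

The leader performs the single origin fetch; every concurrent caller for the same key blocks on the event and reuses the result. A production version would also propagate fetch errors to the waiters.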
95% Origin Load Reduction · 12 Shield Locations · <10ms Added Latency

Choosing a Shield Location

We offer 12 Shield locations across North America (Ashburn, Chicago, Los Angeles, Toronto), Europe (Frankfurt, London, Amsterdam, Paris), and Asia Pacific (Singapore, Tokyo, Sydney, Mumbai). The general rule: choose a Shield location that's close to your origin server. If your origin is in AWS us-east-1, pick Ashburn. If it's in eu-west-1, pick London or Amsterdam.

For customers with multiple origins, we support multi-shield configurations — different paths can route through different Shield locations, so your /api traffic routes through the Shield closest to your API server and /static routes through the one closest to your storage backend.
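Conceptually, multi-shield routing is a longest-prefix match from request path to Shield location. A sketch, with hypothetical route tables:

```python
# Hypothetical mapping; longest matching prefix wins, so "/api/v2"
# could override a broader "/api" rule if both were present.
SHIELD_ROUTES = {
    "/api": "ashburn",
    "/static": "frankfurt",
}
DEFAULT_SHIELD = "singapore"

def shield_for(path: str) -> str:
    """Pick the Shield location whose path prefix matches most specifically."""
    best = ""
    for prefix in SHIELD_ROUTES:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return SHIELD_ROUTES.get(best, DEFAULT_SHIELD)
```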

Real-World Results

During the beta period, we measured origin traffic reduction across 240 participating zones:

  • Median origin load reduction: 93% (compared to the same zones without Shield enabled)
  • Cache hit ratio at Shield tier: 89% on average
  • Request coalescing rate: 34% of Shield-tier cache misses were served from an in-flight fetch (meaning those requests would have been duplicate origin fetches without coalescing)
  • Added latency: P50 was 4ms, P99 was 18ms — the extra hop to the Shield node is barely noticeable

Availability and Pricing

Origin Shield is now generally available on Professional ($5.99/mo) and Enterprise plans. It's included in the plan cost — there's no per-request or per-GB surcharge. You can enable it from the dashboard under Zone Settings > Origin Shield, or via the API with a single configuration flag.

📈 Performance

Benchmarking CDN Latency: SlamCDN vs. CloudFront vs. Cloudflare

We ran nearly 20 million synthetic requests from 50 locations over 30 days. Here's what we found about real-world CDN performance.

January 14, 2026 · 10 min read

Methodology

CDN performance claims are everywhere, but independent, reproducible benchmarks are rare. We set out to create one. Our goal wasn't to cherry-pick scenarios where SlamCDN wins — it was to provide honest, useful data about real-world CDN latency across different regions and asset types.

Here's our test setup:

  • Test agents: 50 lightweight VPS instances distributed across 50 cities on 6 continents, running on a mix of AWS, DigitalOcean, Vultr, and Hetzner to avoid provider bias.
  • Test assets: Three cache-warmed files — a 1KB JSON response, a 100KB JavaScript bundle, and a 1MB image — hosted identically on SlamCDN, AWS CloudFront, and Cloudflare.
  • Request pattern: Each agent sent one request per minute to each CDN for each asset size, cycling through all combinations. That's ~6,480,000 data points per CDN over 30 days.
  • Measurement: Full TTFB (Time to First Byte) measured from the agent, including DNS resolution, TCP handshake, TLS negotiation, and server processing time.
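For readers who want to reproduce a sample, a rough agent looks like this in Python. Note that http.client's getresponse() returns once the status line and headers have arrived, so this approximates TTFB rather than capturing the literal first byte; the percentile helper uses the nearest-rank method:

```python
import http.client
import math
import time

def measure_ttfb(host: str, path: str = "/") -> float:
    """One TTFB sample in milliseconds: DNS + TCP + TLS + server time,
    measured until the response status line and headers arrive."""
    start = time.perf_counter()
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path, headers={"User-Agent": "bench/0.1"})
        conn.getresponse()
        return (time.perf_counter() - start) * 1000.0
    finally:
        conn.close()

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=50 for P50, p=95 for P95."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]
```

A real agent would also record DNS, TCP, and TLS phases separately (e.g. via curl's timing variables) to attribute latency rather than just total it.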

Global Results (All Regions Combined)

22ms SlamCDN P50 · 28ms CloudFront P50 · 24ms Cloudflare P50

Across all regions and asset sizes, SlamCDN's global P50 TTFB was 22ms, compared to 28ms for CloudFront and 24ms for Cloudflare. The P95 numbers were 48ms (SlamCDN), 71ms (CloudFront), and 52ms (Cloudflare). P99 was 89ms, 134ms, and 96ms respectively.

Regional Breakdown

North America: All three CDNs performed within 3ms of each other at P50. Dense PoP coverage means routing is efficient across the board. SlamCDN: 14ms, CloudFront: 16ms, Cloudflare: 15ms.

Europe: Similar story — tight clustering at P50. SlamCDN: 16ms, CloudFront: 19ms, Cloudflare: 17ms. The gap widened at P95, where CloudFront showed more variance (62ms vs. 38ms for SlamCDN).

Asia Pacific: This is where differences emerged. SlamCDN's P50 was 28ms, Cloudflare was 31ms, and CloudFront was 42ms. CloudFront's higher latency in APAC correlates with its sparser PoP distribution in Southeast Asia and Oceania.

South America & Africa: All three CDNs showed higher absolute latency (40-65ms P50), which is expected given network infrastructure realities. SlamCDN had a slight edge in South America due to PoPs in São Paulo, Buenos Aires, Santiago, and Bogotá. Africa was the weakest region for all CDNs — SlamCDN (52ms), Cloudflare (48ms), CloudFront (61ms).

What the Numbers Don't Tell You

Raw TTFB is only one dimension of CDN performance. Factors we didn't measure include: cache hit ratios in production (dependent on traffic patterns and configuration), origin fetch performance, purge speed, time-to-first-byte for uncached dynamic content, and video streaming quality. A CDN that's 5ms faster on cached static assets but 500ms slower on cache misses might be worse for your use case.

"We're not claiming SlamCDN is the fastest CDN in every scenario. We're sharing real data so customers can make informed decisions based on their traffic geography and performance requirements."

Raw Data

We've published the full dataset (18.6 million data points) as a public CSV download. The measurement agents' source code is also open-source. We encourage other CDN providers and independent researchers to validate our methodology and run their own benchmarks.

🌐 Infrastructure

Our Journey to 95 PoPs: Lessons From Scaling a Global Network

From 8 edge nodes to 95 PoPs across 6 continents — the technical and operational lessons we learned building a global CDN.

December 20, 2025 · 15 min read

The First 8: Proof of Concept

SlamCDN started in 2022 with 8 edge nodes — all in North America, all running on rented bare-metal servers in Equinix data centers. The architecture was simple: Nginx as a reverse proxy with disk-based caching, a central API for configuration, and a cron job that synced config files to each node every 60 seconds.

It worked. It was also brittle, manually maintained, and entirely dependent on one engineer SSHing into servers to debug issues at 3am. But it proved the core hypothesis: we could deliver content faster than customers hosting on a single-region cloud provider.

8 to 30: Automating Everything

The jump from 8 to 30 PoPs in 2023 forced us to automate or die. We built three things that year:

  • Declarative node provisioning: A Terraform-based system that could deploy a fully configured edge node in any supported data center in under 20 minutes. One config file, one command, one new PoP.
  • Centralized configuration: Replaced cron-based config sync with an event-driven system. Config changes propagate in seconds, not minutes.
  • Automated health checks and failover: Each node reports health metrics every 10 seconds. If a node fails health checks, our Anycast routing system withdraws its BGP announcement within 30 seconds, routing traffic to the next-nearest healthy PoP.
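The failure-detection half of that loop can be sketched as a consecutive-failure counter: three missed 10-second checks gets you to the 30-second withdrawal window described above. The BGP withdrawal itself is simulated by a flag here:

```python
FAIL_LIMIT = 3          # consecutive failed checks before withdrawal
CHECK_INTERVAL = 10     # seconds between health reports; 3 x 10s = 30s

class NodeHealth:
    """Track consecutive health-check failures for one edge node."""
    def __init__(self):
        self.failures = 0
        self.announced = True   # stands in for the BGP announcement

    def report(self, healthy: bool) -> bool:
        """Record one health check; return whether the route stays announced."""
        if healthy:
            self.failures = 0
            self.announced = True       # recovered node rejoins Anycast
        else:
            self.failures += 1
            if self.failures >= FAIL_LIMIT:
                self.announced = False  # withdraw route; traffic shifts away
        return self.announced
```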

30 to 60: Going International

Expanding beyond North America in 2024 introduced problems we hadn't anticipated:

Data sovereignty: Some customers require their content to be cached only in specific regions. A European media company couldn't have their user data cached in servers outside the EU. We built zone-level geo-restriction rules that control which PoPs can cache content for a given zone.

Peering agreements: In North America, peering with major ISPs is relatively standardized. In Southeast Asia, every country has different dominant ISPs, peering policies, and interconnect locations. Our network team spent months building relationships with regional ISPs and IXPs (Internet Exchange Points) in markets like Indonesia, Thailand, and the Philippines.

Hardware variability: Not every data center offers the same hardware. In developed markets, we run NVMe-backed caching on high-spec servers. In emerging markets, we sometimes work with older hardware, spinning disks, and less reliable power. We built our caching software to degrade gracefully — it detects disk speed and automatically adjusts caching strategies.
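The degradation logic boils down to mapping measured throughput to a caching profile. These tiers and numbers are illustrative, not our real configuration:

```python
def caching_strategy(disk_mb_per_s: float) -> dict:
    """Pick cache settings from measured disk throughput (illustrative tiers)."""
    if disk_mb_per_s >= 1000:      # NVMe-class storage
        return {"tier": "nvme", "max_object_mb": 1024, "write_through": True}
    if disk_mb_per_s >= 150:       # SATA SSD-class storage
        return {"tier": "ssd", "max_object_mb": 256, "write_through": True}
    # Spinning disk: cache only smaller, hotter objects and buffer writes
    return {"tier": "hdd", "max_object_mb": 32, "write_through": False}
```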

60 to 95: Diminishing Returns and Strategic Placement

Going from 60 to 95 PoPs was a different exercise than going from 8 to 60. The first 60 PoPs covered the vast majority of global internet users. Each additional PoP beyond that had diminishing returns in terms of latency reduction — but for specific customer use cases, they were essential.

We developed a data-driven model for PoP placement that weighs:

  • Customer traffic volume originating from the region
  • Current latency from the region to the nearest existing PoP
  • ISP peering availability at candidate data centers
  • Operational cost (power, bandwidth, hardware) relative to the latency improvement
  • Regulatory requirements (some markets require in-country data handling)

This model told us, for example, that adding a PoP in Osaka (Japan's second city) would reduce P95 latency for Japanese users by 12ms, because Kansai-region traffic was routing through Tokyo, adding an unnecessary 8ms network hop. For a gaming customer with millions of Japanese users, that 12ms mattered.
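A toy version of such a weighted model might look like this; the weights, the normalization, and the Osaka feature values are all invented for illustration:

```python
# Hypothetical weights; the real model and its inputs are not public.
WEIGHTS = {
    "traffic_share": 0.3,     # fraction of customer traffic from the region
    "latency_gain_ms": 0.3,   # expected P95 reduction vs. nearest existing PoP
    "peering_quality": 0.2,   # 0..1 score for available ISP/IXP peering
    "cost_efficiency": 0.1,   # latency improvement per dollar, normalized 0..1
    "regulatory_need": 0.1,   # 1 if in-country data handling is required
}

def placement_score(candidate: dict) -> float:
    """Weighted sum over the placement factors; higher means build it."""
    feats = dict(candidate)
    # Normalize latency gain against a 50 ms cap so units are comparable
    feats["latency_gain_ms"] = min(feats["latency_gain_ms"], 50) / 50
    return sum(WEIGHTS[k] * feats[k] for k in WEIGHTS)

# Hypothetical feature values for the Osaka example above
osaka = {"traffic_share": 0.08, "latency_gain_ms": 12,
         "peering_quality": 0.9, "cost_efficiency": 0.6, "regulatory_need": 0}
```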

Lessons Learned

1. More PoPs isn't always better. A poorly connected PoP with limited bandwidth can actually hurt performance if traffic gets routed there instead of a better-connected node further away. We've decommissioned 6 PoPs over our history because the network quality didn't meet our standards.

2. Invest in observability early. At 8 nodes, you can SSH in and check logs. At 95, you need centralized metrics, distributed tracing, and automated anomaly detection. We wish we'd built our observability stack at 20 nodes instead of retrofitting it at 50.

3. Relationships matter as much as technology. Data center operators, ISP peering coordinators, and local network engineers are critical partners. The best technology in the world doesn't help if you can't get rack space in the right facility or peering at the right exchange.

"Every PoP we add makes the network a little faster and a little more complex. The engineering challenge is keeping the complexity invisible to customers while making the speed gains real."

🔒 Engineering

Why We Moved to TLS 1.3 Everywhere (and You Should Too)

TLS 1.3 eliminates a full round-trip from the handshake. Here's the performance data from our migration and how it benefits every request.

December 8, 2025 · 7 min read

TLS Handshake: The Hidden Latency Tax

Every HTTPS connection starts with a TLS handshake — a cryptographic negotiation between the client and server that establishes the encryption parameters for the session. In TLS 1.2, this handshake requires two round-trips between client and server before any application data (the actual HTTP request) can be sent.

For a user in Sydney connecting to a server on the US west coast (roughly 160ms RTT), that's 320ms of pure handshake overhead before the first byte of content. On mobile networks where RTT can exceed 200ms, the handshake alone can take over 400ms. For a CDN whose entire purpose is reducing latency, this is unacceptable overhead.

What TLS 1.3 Changes

TLS 1.3 reduces the handshake from two round-trips to one. It achieves this by combining the key exchange and authentication steps into a single flight. For new connections, this cuts handshake latency in half.

Even better, TLS 1.3 supports 0-RTT resumption — if a client has connected to the server before, it can send encrypted application data in the very first packet, eliminating the handshake round-trip entirely. The HTTP request travels alongside the TLS handshake, not after it.

1-RTT New Connections · 0-RTT Resumed Sessions · 23% TTFB Improvement

The Migration

Enabling TLS 1.3 across 95 PoPs wasn't a single config change. Our rollout took three months:

Phase 1 — Audit (2 weeks): We audited all active TLS connections across our network. About 4.2% of client connections were from clients that didn't support TLS 1.3 (older Android versions, legacy corporate proxies, some IoT devices). We needed to ensure these clients would gracefully downgrade to TLS 1.2.
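The downgrade behavior is just a version range on the TLS stack. In Python's ssl module, for example, a server context that prefers TLS 1.3 but still accepts TLS 1.2 from legacy clients looks like:

```python
import ssl

# Server-side context: negotiate TLS 1.3 with modern clients, while
# keeping TLS 1.2 as the floor for the ~4% of legacy clients.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # graceful downgrade floor
ctx.maximum_version = ssl.TLSVersion.TLSv1_3
```

The client and server negotiate the highest version both support, so no per-client logic is needed for the fallback.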

Phase 2 — Canary rollout (4 weeks): We enabled TLS 1.3 on 5 PoPs in low-traffic regions first, monitoring for handshake failures, certificate issues, and client compatibility problems. We caught two issues: a corporate proxy that mishandled TLS 1.3's encrypted handshake extensions, and a middleware library that didn't handle the 1-RTT handshake correctly.

Phase 3 — Global rollout (6 weeks): After validating the canary, we rolled TLS 1.3 out to all remaining PoPs in batches of 10, with 48-hour soak periods between batches. Each batch was monitored for error rates, handshake success rates, and performance regressions.

Performance Results

After enabling TLS 1.3 network-wide, we measured the impact across 2 billion connections over 30 days:

  • New connections (1-RTT): Median TTFB improved by 23% globally. The improvement was proportional to client-server distance — users in regions far from PoPs saw the biggest gains.
  • Resumed connections (0-RTT): 62% of connections from returning visitors used 0-RTT resumption, with a median TTFB improvement of 41%.
  • TLS 1.3 adoption: 95.8% of client connections now negotiate TLS 1.3. The remaining 4.2% fall back to TLS 1.2 seamlessly.

0-RTT Security Considerations

0-RTT resumption introduces a known security tradeoff: the early data is not protected against replay attacks. An attacker who captures a 0-RTT packet could replay it. For idempotent requests (GET, HEAD), this is generally safe — replaying a read request doesn't change server state. For non-idempotent requests (POST, PUT, DELETE), replays could be dangerous.

Our implementation only allows 0-RTT for GET and HEAD requests. POST and other methods always require a full 1-RTT handshake. This gives our customers the latency benefit of 0-RTT for the vast majority of CDN traffic (which is GETs) without the replay risk for state-changing operations.
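The policy itself is a one-line method gate. A sketch:

```python
# Only side-effect-free methods may ride in 0-RTT early data; anything
# state-changing must wait for the full 1-RTT handshake.
SAFE_FOR_EARLY_DATA = frozenset({"GET", "HEAD"})

def accept_early_data(method: str) -> bool:
    return method.upper() in SAFE_FOR_EARLY_DATA
```

Requests rejected by this gate aren't dropped; the server simply declines the early data and the client retries after the handshake completes.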

What You Need to Do

Nothing. TLS 1.3 is enabled by default on all SlamCDN zones. If you're using our CDN, your users are already benefiting from faster handshakes. There's no configuration needed and no additional cost.

🔥 Incident Report

Post-Mortem: EU API Latency Spike on Dec 18

A transparent look at what caused elevated API latency in our EU region, how we detected it, and what we changed to prevent it.

December 19, 2025 · 6 min read

Incident Summary

On December 18, 2025, between 14:22 UTC and 14:47 UTC (25 minutes), API requests routed through our EU region experienced elevated response times. Median API latency in the EU increased from ~45ms to 300-500ms. Global CDN content delivery was not affected — this was isolated to the REST API control plane in our EU cluster.

Timeline

  • 14:00 UTC — A scheduled maintenance window begins for upgrading routing table software on our EU API cluster's load balancers. The maintenance was planned to be zero-downtime with a rolling restart.
  • 14:22 UTC — The second load balancer in the rolling restart comes online with a misconfigured routing table. A typo in the health check path causes it to mark two of six API backend servers as unhealthy, even though they're functioning normally. Traffic that would normally be distributed across six backends is now concentrated on four.
  • 14:24 UTC — Our monitoring system fires a P2 alert: EU API P95 latency has exceeded the 200ms threshold.
  • 14:26 UTC — On-call engineer acknowledges the alert and begins investigation.
  • 14:31 UTC — Root cause identified: the misconfigured health check path (/healthz instead of /health) on the updated load balancer.
  • 14:35 UTC — Engineer corrects the health check path and triggers a load balancer config reload.
  • 14:38 UTC — The two "unhealthy" backends are re-added to the active pool. Traffic distribution normalizes.
  • 14:47 UTC — P95 latency returns to normal levels (<50ms). Incident declared resolved.

Impact

  • Duration: 25 minutes
  • Scope: REST API requests routed through EU region only
  • Severity: Degraded performance (elevated latency), not an outage. All API requests returned correct responses — they were just slower.
  • Data loss: None
  • CDN delivery: Not affected. Edge content delivery operates independently of the API control plane.

Root Cause

The direct cause was a typo in a configuration file. The health check endpoint was specified as /healthz (a convention from Kubernetes) instead of /health (our actual health check endpoint). This caused the load balancer to receive 404 responses from the backends and interpret them as unhealthy.

The deeper cause was a gap in our maintenance procedure. The routing table update was tested in our staging environment, but the staging environment uses /healthz as its health check path (it runs on Kubernetes), while production uses /health. The config passed staging validation but failed in production.

What We're Changing

  1. Config parity enforcement: We're adding automated validation that compares critical configuration values (health check paths, timeout values, backend pool sizes) between staging and production before any maintenance window. Mismatches will block the deployment.
  2. Canary load balancer: Instead of rolling restarts across all load balancers, we'll update one load balancer first and run it in shadow mode for 5 minutes — it receives real traffic but we compare its routing decisions against the existing load balancers. Discrepancies trigger an automatic rollback.
  3. Faster detection: We're lowering our latency alert threshold from P95 > 200ms to P95 > 100ms for the API cluster, which would have caught this incident 90 seconds earlier.
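Item 1 is essentially a diff over critical keys between environments. A minimal sketch, with hypothetical key names:

```python
# Hypothetical critical-config keys; the real list is longer.
CRITICAL_KEYS = ["health_check_path", "timeout_ms", "backend_pool_size"]

def parity_mismatches(staging: dict, production: dict, keys=CRITICAL_KEYS):
    """Return the critical keys whose values differ between environments.
    A non-empty result would block the maintenance deployment."""
    return [k for k in keys if staging.get(k) != production.get(k)]
```

Run against the configs from this incident, the check would have flagged the /healthz vs. /health divergence before the maintenance window opened.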

"We believe that transparency about failures is more valuable than pretending they don't happen. Every incident is an opportunity to make the system more resilient."

Apology

We're sorry for the degraded experience. While 25 minutes of elevated latency on a non-critical path may seem minor, we hold ourselves to a higher standard. Our SLA commitment is 99.99% uptime, and every minute of degradation counts. Affected Enterprise customers will receive SLA credits automatically — no need to file a support request.