6 Months
22 Apps
70-80% Cost Reduction

FlashAcademy: From Ransomware Incident to Modern Platform

Rescue and rebuild a decade-old EdTech infrastructure after a security breach — while the platform serves hundreds of schools across the globe.

The Client

FlashAcademy is a UK-based EdTech company whose app helps K-12 students learn English as an Additional Language (EAL). Used by schools across the world, the platform experiences significant usage spikes during term time — a traffic pattern that demands robust, scalable infrastructure.

The Crisis

I got the call after things had already gone badly wrong.

FlashAcademy's AWS SES credentials had been leaked. Spammers found them and used the account to send thousands of emails — damaging the company's sender reputation and triggering AWS alerts. The team knew they needed help, so they brought in a certified AWS partner to rebuild their infrastructure.

"On the second or third day of that engagement, the partner opened a security group."

An internet bot found the exposed staging MongoDB within hours. The database was encrypted. A ransom note appeared.

That's when FlashAcademy reached out to me. They were pleading for help. I onboarded immediately — before we'd even signed anything — closed the security group, restored the staging database from backups, and started assessing the damage.

What I found was a decade of technical debt.

The Problem

FlashAcademy's infrastructure had grown organically over 10 years, built incrementally by frontend and backend developers who weren't infrastructure specialists. Nobody's fault — it's how startups grow. But the result was a sprawling mess:

Scattered compute

Workloads ran across EC2, Amplify, Lambda, and ECS — with no consistent patterns between them. Each service had its own deployment story.

Security as an afterthought

Secrets lived in environment variables, config files, and S3 buckets. No centralized management, no rotation, no audit trail. The SES leak was a symptom, not an anomaly.

Fragile pipelines

Jenkins jobs that hadn't been touched in years. Deployments were manual, undocumented, and scary. The team's strategy was "deploy and pray."

Bleeding cash

Static EC2 instances ran 24/7 in staging — even weekends, even summers when schools were closed. Nobody knew what "right-sized" looked like because nobody had visibility into actual usage.

Certificate chaos

A custom certbot setup on EC2 instances that required manual intervention whenever certificates expired. Which they did. In production. At inconvenient times.

The company's reputation was getting hammered. Operations were at stake. They needed this fixed — properly this time.

The Journey

Month 1: Triage and Foundation

Before touching anything, I mapped every service, every secret, every deployment path. Found services with no authentication. Found credentials in Git history. Found a "temporary" EC2 instance from 2019 still running.

Stood up new EKS clusters (production and staging) with proper VPC isolation and private subnets. No more public-facing databases.
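
The knobs that mattered most in that build were the private API endpoint and private-only subnets. Below is a minimal boto3 sketch of that shape; the cluster name, role ARN, subnet IDs, and region are placeholders rather than the actual configuration, and the point is only to show which settings keep the control plane and databases off the public internet.

```python
# Minimal sketch: an EKS cluster with a private-only API endpoint on private
# subnets. Name, role ARN, subnet IDs, and region are placeholders.
import boto3

eks = boto3.client("eks", region_name="eu-west-2")

eks.create_cluster(
    name="flashacademy-staging",  # hypothetical cluster name
    version="1.32",
    roleArn="arn:aws:iam::123456789012:role/eks-cluster-role",  # placeholder
    resourcesVpcConfig={
        "subnetIds": ["subnet-0aaa", "subnet-0bbb"],  # private subnets only
        "endpointPublicAccess": False,                # no public API endpoint
        "endpointPrivateAccess": True,                # reachable only from inside the VPC
    },
)
```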

Months 2-3: Containerization

This was the hardest part. Some applications had decent Docker setups. Others had none. A few were architected so poorly — legacy PHP/Apache stacks with hardcoded paths and environment assumptions — that I had to rewrite significant portions to make them container-ready.

Months 4-5: Migration

Moved applications one by one using ArgoCD sync waves:

  • Wave 0: Database migrations
  • Wave 1: Application deployments
  • Wave 2: Ingress exposure

This ordering guaranteed data integrity. Zero downtime across all 22 applications.
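
Sync waves are just annotations that ArgoCD reads off each manifest: lower waves are applied first, and the next wave only starts once the previous one is healthy, which is what puts migrations ahead of the code that depends on them. A minimal sketch of the pattern; the real manifests live as Kustomize-managed YAML, so the Python dicts and resource names here are purely illustrative.

```python
# Sketch: ArgoCD orders resources by the argocd.argoproj.io/sync-wave
# annotation, applying lower waves first and waiting for health before
# moving on. Resource names are hypothetical.

def with_sync_wave(manifest: dict, wave: int) -> dict:
    """Attach an ArgoCD sync-wave annotation to a Kubernetes manifest."""
    metadata = manifest.setdefault("metadata", {})
    metadata.setdefault("annotations", {})["argocd.argoproj.io/sync-wave"] = str(wave)
    return manifest

migration_job = with_sync_wave(
    {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": "db-migrate"}}, 0
)
app_deployment = with_sync_wave(
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "api"}}, 1
)
public_ingress = with_sync_wave(
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress", "metadata": {"name": "api"}}, 2
)
```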

Month 6: Optimization and Handover

Built a custom resource optimization toolkit that queries CloudWatch Container Insights for P75/P95 usage patterns and generates ready-to-apply Kustomize patches. Identified 35-50% over-provisioning across the board.
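
The heart of that toolkit is a percentile query against Container Insights. A rough boto3 sketch of the kind of query involved; the cluster, namespace, and service names are placeholders, and the metric and dimension names follow Container Insights conventions rather than the toolkit's actual code.

```python
# Sketch: fetch p75/p95 CPU utilization for a workload from CloudWatch
# Container Insights, as a basis for right-sizing its requests.
# Cluster, namespace, and service names are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="ContainerInsights",
    MetricName="pod_cpu_utilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "flashacademy-production"},
        {"Name": "Namespace", "Value": "api"},
        {"Name": "Service", "Value": "api"},
    ],
    StartTime=now - timedelta(days=14),  # two weeks of term-time traffic
    EndTime=now,
    Period=3600,                         # hourly datapoints
    ExtendedStatistics=["p75", "p95"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"])
```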

Implemented a staging shutdown CronJob — simple idea, 70-80% cost savings. Sometimes the best optimizations are the obvious ones nobody had time to do.
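
The shutdown is conceptually just "scale every staging Deployment to zero on a schedule, bring it back before the working day". A minimal sketch of the scale-down step, assuming the official Python Kubernetes client running inside a CronJob with in-cluster credentials; the namespace names are placeholders.

```python
# Sketch of the nightly staging shutdown: scale every Deployment in the
# staging namespaces to zero replicas. Intended to run as a CronJob with
# in-cluster credentials; a matching job reverses it in the morning.
# Namespace names are placeholders.
from kubernetes import client, config

STAGING_NAMESPACES = ["staging", "staging-tools"]

def scale_namespace(apps: client.AppsV1Api, namespace: str, replicas: int) -> None:
    """Scale every Deployment in a namespace to the given replica count."""
    for deployment in apps.list_namespaced_deployment(namespace).items:
        apps.patch_namespaced_deployment_scale(
            name=deployment.metadata.name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )

if __name__ == "__main__":
    config.load_incluster_config()  # ServiceAccount token mounted by the CronJob
    apps_api = client.AppsV1Api()
    for ns in STAGING_NAMESPACES:
        scale_namespace(apps_api, ns, replicas=0)
```

With Karpenter consolidating empty nodes away, scaling the pods to zero is what actually releases the spend: the pods disappear, and the nodes follow.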

The Solution

Component | Technology | Purpose
Orchestration | EKS v1.32 | Latest Kubernetes with extended support
Identity | Pod Identity | AWS-native auth, replacing scattered IAM patterns
Secrets | Secrets Manager + CSI Driver | Zero secrets in code or config files
GitOps | ArgoCD + Kustomize | Declarative, auditable, boring deployments
Autoscaling | Karpenter v1.0 | Just-in-time node provisioning
Certificates | cert-manager + Let's Encrypt | Automated TLS; no more 3am pages
Observability | SigNoz + CloudWatch Container Insights | Full-stack visibility
Ingress | NGINX + NLB | Proper load balancing with health checks
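
Of the components above, the secrets layer is worth a concrete illustration: applications get a SecretProviderClass that tells the CSI driver which Secrets Manager entries to mount, so nothing sensitive lives in the repo or the image. A sketch of the shape of one, written as a Python dict for consistency with the other snippets; in the cluster it is a YAML manifest, and the names here are hypothetical.

```python
# Sketch: a SecretProviderClass for the AWS provider of the Secrets Store
# CSI driver. The pod mounts a CSI volume referencing this object, and the
# driver pulls the secret from Secrets Manager at pod start.
# Application and secret names are hypothetical.
secret_provider_class = {
    "apiVersion": "secrets-store.csi.x-k8s.io/v1",
    "kind": "SecretProviderClass",
    "metadata": {"name": "api-secrets", "namespace": "api"},
    "spec": {
        "provider": "aws",
        "parameters": {
            # The AWS provider takes a YAML string listing the objects to mount.
            "objects": (
                '- objectName: "flashacademy/api/database"\n'
                '  objectType: "secretsmanager"\n'
            ),
        },
    },
}
```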

The Results

Metric | Achievement
Infrastructure Consolidated | EC2 + Amplify + Lambda + ECS → Single EKS platform
Applications Managed | 22 (10 production, 12 staging)
Running Workloads | 173 pods across both environments
Secrets Centralized | 19 SecretProviderClass configurations
Pod Identity Roles | 20+ fine-grained IAM roles
Security Contexts | 34 deployments with hardened containers
Staging Cost Reduction | 70-80% (nightly & weekend shutdown)
Resource Right-Sizing | 35-50% additional savings
Deployments | From "deploy and pray" to GitOps with drift detection
Certificates | Fully automated; zero manual intervention

What Made This Work

Private networking, finally

Migrating to private VPCs was a hard nut to crack — untangling years of public-facing services, securing connections between workloads, setting up proper NAT gateways. But it meant no more "exposed MongoDB" incidents.

ArgoCD sync waves

Database migrations run before application deployments. Sounds obvious, but the previous setup didn't guarantee ordering. This alone prevented several potential outages during migration.

Karpenter consolidation

Set WhenEmptyOrUnderutilized with a 1-minute consolidation window. Watched it automatically right-size the cluster overnight. Replaced 5 static nodes with dynamic provisioning.
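
For reference, both of those settings live under the NodePool's disruption block in Karpenter v1. A sketch of where they sit, written as a Python dict for consistency with the other snippets; in the cluster this is a YAML manifest, the NodePool name is a placeholder, and the node template is omitted.

```python
# Sketch: the Karpenter v1 NodePool disruption settings described above.
# NodePool name is a placeholder; requirements/limits/template are omitted.
nodepool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "default"},
    "spec": {
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",  # reclaim empty and underutilized nodes
            "consolidateAfter": "1m",                           # the 1-minute consolidation window
        },
        # "template": {...}  # node requirements omitted for brevity
    },
}
```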

Containerization as forcing function

The painful process of containerizing legacy PHP/Apache apps exposed architectural problems that had been hidden for years. Fixing them made the applications more maintainable, not just more deployable.

Skills Demonstrated

Crisis Response (Immediate triage, security group lockdown, database restoration)
AWS EKS Architecture (v1.32, Pod Identity, private VPC design)
Legacy Modernization (Containerizing PHP/Apache, consolidating EC2/Amplify/Lambda/ECS)
GitOps Implementation (ArgoCD with sync waves, Kustomize overlays, drift detection)
Security Hardening (Secrets Manager CSI, Pod Identity, private networking, read-only containers)
Cost Engineering (Karpenter consolidation, scheduled shutdowns, right-sizing toolkit)

The Outcome

Six months after the ransomware incident, FlashAcademy's infrastructure went from liability to asset:

  • Security: No more scattered secrets, no more public databases, no more "certified partners" opening security groups
  • Reliability: Automated certificates, GitOps deployments, proper health checks
  • Cost: 70-80% staging reduction plus 35-50% right-sizing savings
  • Velocity: Deployments are boring now. That's the goal.

The platform now scales for term-time spikes and shrinks during holidays — automatically. The team can focus on building features for students instead of fighting fires.

This case study represents a real-world infrastructure rescue and modernization project, demonstrating expertise in crisis response, legacy migration, and modern AWS/Kubernetes patterns.

Have a Similar Challenge?

Whether it's rescuing infrastructure, migrating to Kubernetes, or fixing security issues — I'd love to hear about it.
