FlashAcademy: From Ransomware Incident to Modern Platform
Rescue and rebuild a decade-old EdTech infrastructure after a security breach — while the platform serves hundreds of schools across the globe.
The Client
FlashAcademy is a UK-based EdTech company whose app helps K-12 students learn English as an Additional Language (EAL). Used by schools across the world, the platform experiences significant usage spikes during term time — a traffic pattern that demands robust, scalable infrastructure.
The Crisis
I got the call after things had already gone badly wrong.
FlashAcademy's AWS SES credentials had been leaked. Spammers found them and used the account to send thousands of emails — damaging the company's sender reputation and triggering AWS alerts. The team knew they needed help, so they brought in a certified AWS partner to rebuild their infrastructure.
"On the second or third day of that engagement, the partner opened a security group."
An internet bot found the exposed staging MongoDB within hours. The database was encrypted. A ransom note appeared.
That's when FlashAcademy reached out to me. They were pleading for help. I onboarded immediately — before we'd even signed anything — closed the security group, restored the staging database from backups, and started assessing the damage.
What I found was a decade of technical debt.
The Problem
FlashAcademy's infrastructure had grown organically over 10 years, built incrementally by frontend and backend developers who weren't infrastructure specialists. Nobody's fault — it's how startups grow. But the result was a sprawling mess:
Scattered compute
Workloads ran across EC2, Amplify, Lambda, and ECS — with no consistent patterns between them. Each service had its own deployment story.
Security as an afterthought
Secrets lived in environment variables, config files, and S3 buckets. No centralized management, no rotation, no audit trail. The SES leak was a symptom, not an anomaly.
Fragile pipelines
Jenkins jobs that hadn't been touched in years. Deployments were manual, undocumented, and scary. The team's strategy was "deploy and pray."
Bleeding cash
Static EC2 instances ran 24/7 in staging — even weekends, even summers when schools were closed. Nobody knew what right-sized meant because nobody had visibility into actual usage.
Certificate chaos
A custom Certbot setup on EC2 instances that required manual intervention whenever certificates expired. Which they did. In production. At inconvenient times.
The company's reputation was getting hammered. Operations were at stake. They needed this fixed — properly this time.
The Journey
Month 1: Triage and Foundation
Before touching anything, I mapped every service, every secret, every deployment path. Found services with no authentication. Found credentials in Git history. Found a "temporary" EC2 instance from 2019 still running.
Stood up new EKS clusters (production and staging) with proper VPC isolation and private subnets. No more public-facing databases.
Months 2-3: Containerization
This was the hardest part. Some applications had decent Docker setups. Others had none. A few were architected so poorly — legacy PHP/Apache stacks with hardcoded paths and environment assumptions — that I had to rewrite significant portions to make them container-ready.
Months 4-5: Migration
Moved applications one by one using ArgoCD sync waves:
- Wave 0: Database migrations
- Wave 1: Application deployments
- Wave 2: Ingress exposure
This ordering guaranteed data integrity. Zero downtime across all 22 applications.
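In practice this ordering comes down to one annotation that ArgoCD honours when syncing: it finishes each wave (and waits for it to be healthy) before starting the next. A minimal sketch of the pattern in Python — the helper and the example manifests are illustrative, not the actual FlashAcademy tooling:

```python
# Minimal sketch: tagging manifests with ArgoCD sync waves.
# The annotation key is ArgoCD's standard mechanism; the helper and the
# example manifests below are illustrative only.

SYNC_WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"

def assign_sync_wave(manifest: dict, wave: int) -> dict:
    """Attach the sync-wave annotation so ArgoCD applies this resource in order."""
    annotations = manifest.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[SYNC_WAVE_ANNOTATION] = str(wave)  # ArgoCD expects a string value
    return manifest

# Wave 0: run schema migrations as a Job before anything else.
migration_job = assign_sync_wave(
    {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": "db-migrate"}},
    wave=0,
)

# Wave 1: roll out the application Deployment once migrations have finished.
app_deployment = assign_sync_wave(
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "api"}},
    wave=1,
)

# Wave 2: expose the service via Ingress only after the app is healthy.
ingress = assign_sync_wave(
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress", "metadata": {"name": "api"}},
    wave=2,
)
```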
Month 6: Optimization and Handover
Built a custom resource optimization toolkit that queries CloudWatch Container Insights for P75/P95 usage patterns and generates ready-to-apply Kustomize patches. Identified 35-50% over-provisioning across the board.
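The toolkit itself isn't reproduced here, but the approach looks roughly like this: pull percentile usage for each workload from Container Insights, then emit a patch you can review and commit. The metric and dimension names, cluster and workload names, and the hardcoded request values below are assumptions for illustration:

```python
# Rough sketch of the right-sizing approach (not the actual toolkit):
# query P75/P95 pod CPU utilization from CloudWatch Container Insights,
# then build a Kustomize-style strategic-merge patch for review.
import json
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

def cpu_percentiles(cluster: str, namespace: str, workload: str) -> dict:
    """Return peak P75/P95 pod CPU utilization over the last 14 days."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="ContainerInsights",
        MetricName="pod_cpu_utilization",      # metric name assumed from a typical setup
        Dimensions=[
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "Namespace", "Value": namespace},
            {"Name": "PodName", "Value": workload},
        ],
        StartTime=now - timedelta(days=14),
        EndTime=now,
        Period=3600,  # hourly datapoints
        ExtendedStatistics=["p75", "p95"],
    )
    points = resp["Datapoints"]
    if not points:
        raise RuntimeError(f"no Container Insights datapoints for {workload}")
    return {
        "p75": max(p["ExtendedStatistics"]["p75"] for p in points),
        "p95": max(p["ExtendedStatistics"]["p95"] for p in points),
    }

def kustomize_cpu_patch(workload: str, request_m: int, limit_m: int) -> dict:
    """Build a patch setting CPU request/limit for a Deployment (container name simplified)."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": workload},
        "spec": {"template": {"spec": {"containers": [{
            "name": workload,
            "resources": {
                "requests": {"cpu": f"{request_m}m"},
                "limits": {"cpu": f"{limit_m}m"},
            },
        }]}}},
    }

if __name__ == "__main__":
    stats = cpu_percentiles("flashacademy-prod", "default", "api")  # hypothetical names
    print(f"api CPU p75={stats['p75']:.1f}% p95={stats['p95']:.1f}%")
    print(json.dumps(kustomize_cpu_patch("api", request_m=250, limit_m=500), indent=2))
```

Translating percentiles into concrete request values is where the real toolkit earns its keep; the point is that the output is a reviewable patch, not an automatic change.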
Implemented a staging shutdown CronJob — simple idea, 70-80% cost savings. Sometimes the best optimizations are the obvious ones nobody had time to do.
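The shutdown job amounts to very little code. A minimal sketch of the idea, assuming the kubernetes Python client and namespace names that are illustrative rather than the real ones:

```python
# Sketch of the staging shutdown idea: scale every Deployment in selected
# namespaces to zero, run from a Kubernetes CronJob each evening (a mirror
# job scales things back up each morning). Namespace names are assumptions.
from kubernetes import client, config

STAGING_NAMESPACES = ["staging", "staging-tools"]  # assumed names

def scale_namespace(apps: client.AppsV1Api, namespace: str, replicas: int) -> None:
    """Scale every Deployment in the namespace to the given replica count."""
    for deploy in apps.list_namespaced_deployment(namespace).items:
        name = deploy.metadata.name
        apps.patch_namespaced_deployment_scale(
            name, namespace, {"spec": {"replicas": replicas}}
        )
        print(f"{namespace}/{name} -> {replicas} replicas")

if __name__ == "__main__":
    config.load_incluster_config()  # running inside the cluster as a CronJob pod
    apps_api = client.AppsV1Api()
    for ns in STAGING_NAMESPACES:
        scale_namespace(apps_api, ns, replicas=0)
```

The savings only materialize because Karpenter then consolidates the emptied nodes away — scaling pods to zero on static instances would have saved nothing.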
The Solution
| Component | Technology | Purpose |
|---|---|---|
| Orchestration | EKS v1.32 | Latest Kubernetes with extended support |
| Identity | Pod Identity | AWS-native auth, replacing scattered IAM patterns |
| Secrets | Secrets Manager + CSI Driver | Zero secrets in code or config files |
| GitOps | ArgoCD + Kustomize | Declarative, auditable, boring deployments |
| Autoscaling | Karpenter v1.0 | Just-in-time node provisioning |
| Certificates | cert-manager + Let's Encrypt | Automated TLS — no more 3am pages |
| Observability | SigNoz + CloudWatch Container Insights | Full-stack visibility |
| Ingress | NGINX + NLB | Proper load balancing with health checks |
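The secrets piece deserves a closer look, since "zero secrets in code" is where most of the original risk lived. A SecretProviderClass maps Secrets Manager entries to files the CSI driver mounts into the pod. A minimal sketch, rendered as a Python dict with illustrative names (not the actual secret paths):

```python
# Sketch of the secrets pattern (names are illustrative): a SecretProviderClass
# tells the Secrets Store CSI driver which Secrets Manager entries to mount,
# so nothing sensitive lives in the image, the repo, or a ConfigMap.
import yaml  # PyYAML

secret_provider_class = {
    "apiVersion": "secrets-store.csi.x-k8s.io/v1",
    "kind": "SecretProviderClass",
    "metadata": {"name": "api-secrets", "namespace": "production"},
    "spec": {
        "provider": "aws",
        "parameters": {
            # The AWS provider takes the object list as a YAML-formatted string.
            "objects": yaml.safe_dump([
                {"objectName": "flashacademy/api/database-url", "objectType": "secretsmanager"},
                {"objectName": "flashacademy/api/ses-credentials", "objectType": "secretsmanager"},
            ]),
        },
    },
}

print(yaml.safe_dump(secret_provider_class, sort_keys=False))
```

Each workload's pod references its SecretProviderClass through a CSI volume, and Pod Identity scopes the pod's service account to read only those entries — which is how the 19 SecretProviderClass configurations and 20+ fine-grained IAM roles fit together.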
The Results
| Metric | Achievement |
|---|---|
| Infrastructure Consolidated | EC2 + Amplify + Lambda + ECS → Single EKS platform |
| Applications Managed | 22 (10 production, 12 staging) |
| Running Workloads | 173 pods across both environments |
| Secrets Centralized | 19 SecretProviderClass configurations |
| Pod Identity Roles | 20+ fine-grained IAM roles |
| Security Contexts | 34 deployments with hardened containers |
| Staging Cost Reduction | 70-80% (nightly & weekend shutdown) |
| Resource Right-Sizing | 35-50% additional savings |
| Deployments | From "deploy and pray" to GitOps with drift detection |
| Certificates | Fully automated — zero manual intervention |
What Made This Work
Private networking, finally
Migrating to private VPCs was a hard nut to crack — untangling years of public-facing services, securing connections between workloads, setting up proper NAT gateways. But it meant no more "exposed MongoDB" incidents.
ArgoCD sync waves
Database migrations run before application deployments. Sounds obvious, but the previous setup didn't guarantee ordering. This alone prevented several potential outages during migration.
Karpenter consolidation
Set the consolidation policy to WhenEmptyOrUnderutilized with a 1-minute consolidation window. Watched Karpenter automatically right-size the cluster overnight. Replaced 5 static nodes with dynamic provisioning.
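For reference, those two settings live in the NodePool's disruption block. A sketch rendered as a Python dict, with the rest of the spec omitted and the name illustrative:

```python
# Sketch of the Karpenter v1 disruption settings described above.
# The rest of the NodePool spec (template, limits, node class) is omitted.
import yaml  # PyYAML

node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "default"},
    "spec": {
        "disruption": {
            # Consolidate nodes that are empty OR underutilized...
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            # ...once they have been in that state for one minute.
            "consolidateAfter": "1m",
        },
    },
}

print(yaml.safe_dump(node_pool, sort_keys=False))
```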
Containerization as forcing function
The painful process of containerizing legacy PHP/Apache apps exposed architectural problems that had been hidden for years. Fixing them made the applications more maintainable, not just more deployable.
The Outcome
Six months after the ransomware incident, FlashAcademy's infrastructure went from liability to asset:
- Security: No more scattered secrets, no more public databases, no more "certified partners" opening security groups
- Reliability: Automated certificates, GitOps deployments, proper health checks
- Cost: 70-80% staging reduction plus 35-50% right-sizing savings
- Velocity: Deployments are boring now. That's the goal.
The platform now scales for term-time spikes and shrinks during holidays — automatically. The team can focus on building features for students instead of fighting fires.
This case study represents a real-world infrastructure rescue and modernization project, demonstrating expertise in crisis response, legacy migration, and modern AWS/Kubernetes patterns.
Have a Similar Challenge?
Whether it's rescuing infrastructure, migrating to Kubernetes, or fixing security issues — I'd love to hear about it.