Nightcrawler: Mapping the Entire Internet

4.3B Addresses · 7 Full Scans · $2,000 Total Cost

Build a distributed reconnaissance platform capable of scanning the entire IPv4 address space to understand the global attack surface — what's exposed, what's vulnerable, and what the internet actually looks like at scale.

The Origin

I wrote my first exploit when I was 14. A remote code execution against sendmail — it gave me my first root shells. Security and hacking fascinated me from that moment.

In 1997, I read "The Art of Port Scanning" in Phrack magazine. Nmap became my Swiss Army knife. I installed Linux at 15 just to hack. The good old days.

Two decades later, I had a question: What would it take to scan the entire internet?

Nmap is a powerful tool, but I don't think it was ever meant to map 4.3 billion addresses. I wanted to take it to its absolute limit — and myself in the process.

The Problem

Traditional network scanning doesn't work at internet scale:

  • Sequential scanning would take lifetimes. At 1,000 hosts per hour, covering 4.3 billion addresses works out to roughly 490 years.
  • Raw scan data is massive. Multi-terabyte XML datasets that no database could query efficiently.
  • No enrichment pipeline. Scan results without geographic context, DNS correlation, or vulnerability mapping are just noise.
  • Worker coordination. No existing framework could distribute scan jobs across a global fleet of workers with proper fault tolerance.
  • Query latency. Security researchers need sub-second queries across billions of documents. Traditional databases couldn't deliver.

I needed to build something new.

The Journey

Year One: Building the Foundation

I was the sole contributor. What started as hacked-together scripts evolved into a full platform. The growth was rapid — necessity drove architecture.

The first challenge was worker coordination. I implemented the ZeroMQ Ventilator-Worker-Sink pattern: a central ventilator distributes jobs, workers pull on-demand, results flow to a sink. Lockless, fault-tolerant, horizontally scalable. Add workers, get proportional speed.
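
In skeleton form, the pattern looks roughly like this with pyzmq. The ports, the job payload, and the run_scan/persist stubs are illustrative, not the production wire format:

```python
# Skeleton of the ZeroMQ ventilator -> workers -> sink pipeline (pyzmq).
# Ports, payload shape, and the run_scan/persist stubs are illustrative only.
import zmq

def run_scan(cidr):
    """Placeholder for the actual Nmap invocation on one block."""
    return {"hosts_up": 0}

def persist(result):
    """Placeholder for writing a result batch to PostgreSQL."""
    print(result)

def ventilator(jobs, bind_addr="tcp://*:5557"):
    """PUSH scan jobs; connected workers pull them on demand."""
    sender = zmq.Context.instance().socket(zmq.PUSH)
    sender.bind(bind_addr)
    for job in jobs:                     # e.g. {"cidr": "203.0.113.0/24"}
        sender.send_json(job)

def worker(pull_addr="tcp://localhost:5557", sink_addr="tcp://localhost:5558"):
    """PULL a job, scan it, PUSH the result to the sink."""
    ctx = zmq.Context.instance()
    jobs = ctx.socket(zmq.PULL)
    jobs.connect(pull_addr)
    results = ctx.socket(zmq.PUSH)
    results.connect(sink_addr)
    while True:
        job = jobs.recv_json()
        results.send_json({"cidr": job["cidr"], "result": run_scan(job["cidr"])})

def sink(bind_addr="tcp://*:5558"):
    """Collect results from the whole worker fleet for persistence."""
    collector = zmq.Context.instance().socket(zmq.PULL)
    collector.bind(bind_addr)
    while True:
        persist(collector.recv_json())
```

A PUSH socket round-robins messages across every connected PULL peer, so adding a worker adds another consumer with no locks and no queue table to poll. That is where the "add workers, get proportional speed" property comes from.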

The Data Processing Wall

Eight months in, I completed the first full internet scan. Then came the real problem: processing the data.

I tried SQL databases first. They collapsed under the volume.

I tried a graph database. It crashed.

Then I tried Hadoop with Pig Latin scripts. I remember the moment the pipeline finally ran to completion against 100 million records. Breakthrough.

The Recon Optimization

By the fourth scan, I'd developed a two-phase methodology: lightweight ping scans filter live hosts from /24 blocks before expensive service enumeration. This cut wasted cycles by 95% with almost no loss of coverage.
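
A simplified sketch of that two-phase flow using python-libnmap. The scan flags, the port count, and the example /24 are illustrative rather than the exact production options:

```python
# Two-phase reconnaissance sketch with python-libnmap.
# Scan flags, port count, and the example /24 are illustrative.
from libnmap.process import NmapProcess
from libnmap.parser import NmapParser

def live_hosts(block):
    """Phase 1: cheap ping sweep (-sn) to find responsive hosts in a block."""
    ping = NmapProcess(targets=block, options="-sn -n")
    ping.run()
    report = NmapParser.parse(ping.stdout)
    return [host.address for host in report.hosts if host.is_up()]

def enumerate_services(hosts):
    """Phase 2: expensive service/OS detection, run only against live hosts."""
    if not hosts:
        return None
    deep = NmapProcess(targets=hosts, options="-sV -O --top-ports 1000")  # -O needs root
    deep.run()
    return NmapParser.parse(deep.stdout)

report = enumerate_services(live_hosts("203.0.113.0/24"))
```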

I could cross-reference results against the previous three scans. Patterns emerged. The internet started making sense.

The Debugging Story

During one full scan processing run, the Hadoop pipeline kept crashing a day into execution. The error was bizarre — an internal buffer overflow in a map-reduce step. I increased the buffer. It still crashed.

I spent days dissecting the source data, about a terabyte of chunked Avro files, eliminating batches one by one until I found the culprit: a single IP in Pakistan.

Investigation revealed an FTP server with a directory listing that returned 10+ megabytes of filenames. The field was simply too large for the pipeline's assumptions. I added field trimming to the data processing and moved on.
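
The fix boiled down to capping oversized fields before serialization. A minimal sketch of the idea, with an assumed 64 KB cap and generic field handling (the real pipeline trimmed against its own Avro schemas):

```python
# Sketch of defensive field trimming before Avro serialization.
# The 64 KB cap and the generic handling are illustrative assumptions.
MAX_FIELD_BYTES = 64 * 1024  # anything larger (e.g. a 10 MB FTP listing) gets truncated

def trim_fields(record, max_bytes=MAX_FIELD_BYTES):
    """Truncate oversized string fields so downstream buffers never overflow."""
    trimmed = {}
    for key, value in record.items():
        if isinstance(value, str) and len(value.encode("utf-8")) > max_bytes:
            trimmed[key] = value.encode("utf-8")[:max_bytes].decode("utf-8", "ignore")
        else:
            trimmed[key] = value
    return trimmed
```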

That's internet-scale engineering: your pipeline will encounter things nobody anticipated.

The Solution

A 5-layer distributed architecture:

Layer | Technology | Purpose
Orchestration | ZeroMQ (REQ/REP, PUSH/PULL) | Job distribution with <100ms latency
Scanning | Nmap + python-libnmap | Service detection, OS fingerprinting, NSE scripts
Persistence | PostgreSQL + SQLAlchemy | Task queue with status tracking
ETL Pipeline | Apache Avro + Hadoop | Schema-enforced serialization for TB-scale processing
Search | Elasticsearch + Kibana | Geospatial indexing, sub-second queries
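
For the persistence layer, here is a minimal sketch of what a task-queue table with status tracking can look like in SQLAlchemy. The table name, columns, status values, and connection string are assumptions, not the project's actual schema:

```python
# Minimal SQLAlchemy sketch of a scan-task queue with status tracking.
# Table name, columns, status values, and the DSN are illustrative assumptions.
import enum
from datetime import datetime
from sqlalchemy import Column, DateTime, Enum, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskStatus(enum.Enum):
    pending = "pending"
    dispatched = "dispatched"
    done = "done"
    failed = "failed"

class ScanTask(Base):
    __tablename__ = "scan_tasks"
    id = Column(Integer, primary_key=True)
    cidr = Column(String, nullable=False)                   # e.g. "203.0.113.0/24"
    status = Column(Enum(TaskStatus), default=TaskStatus.pending, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, onupdate=datetime.utcnow)  # temporal status tracking

engine = create_engine("postgresql+psycopg2://scanner@localhost/nightcrawler")
Base.metadata.create_all(engine)  # requires a running PostgreSQL instance
```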

Data Flow:

Ventilator → Workers (N) → Sink → PostgreSQL → Avro Export → Hadoop → ES Bulk Import → Kibana
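
The last hop of that flow, bulk-loading Avro records into Elasticsearch, can be sketched like this. The index name, document fields, and chunk size are assumptions:

```python
# Sketch of the Avro -> Elasticsearch bulk-import hop.
# Index name, document fields, and chunk size are assumptions.
from fastavro import reader
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def actions(avro_path, index="nightcrawler-hosts"):
    """Stream Avro records as bulk actions, one denormalized document per host."""
    with open(avro_path, "rb") as fh:
        for record in reader(fh):
            yield {
                "_index": index,
                "_id": record["ip"],
                "_source": record,
            }

bulk(es, actions("hosts-chunk-0001.avro"), chunk_size=5000)
```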

The Results

Metric | Achievement
Coverage | 4.3 billion IPv4 addresses — the entire internet
Full Scans Completed | 7
Time per Full Scan | ~2 weeks (with 20 warmed-up workers)
Workers at Peak | 25, distributed globally
Total Infrastructure Cost | ~$2,000 over 18 months
Database Scale | 100M+ task rows
Search Index | 3B+ Elasticsearch documents
Query Latency | Sub-second across billions of records
Avro Schemas | 8 data models
CLI Commands | 13 specialized controllers

What I Learned

The internet is vast

There are dark corners where packets go and nothing comes back. Entire ranges that behave strangely. Patterns that only emerge at planetary scale.

The internet is fragile

The sheer volume of misconfigured services, exposed databases, and forgotten infrastructure is staggering. It amazes me that it's still operational.

The data problem is the hard problem

Scanning is straightforward. Processing terabytes of scan data, enriching it with GeoIP and CVE correlations, and making it queryable in sub-second time — that's where the real engineering lives.
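
To give a flavor of the enrichment side, here is a sketch of GeoIP tagging with the geoip2 reader. The database path and the chosen output fields are assumptions:

```python
# Sketch of GeoIP enrichment with the geoip2 reader.
# The .mmdb path and the output fields are assumptions.
import geoip2.database
import geoip2.errors

def enrich_geo(record, reader):
    """Attach country and coordinates to a host record, if the IP is in the database."""
    try:
        city = reader.city(record["ip"])
    except geoip2.errors.AddressNotFoundError:
        return record
    record["country"] = city.country.iso_code
    record["location"] = {"lat": city.location.latitude,
                          "lon": city.location.longitude}
    return record

with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    host = enrich_geo({"ip": "8.8.8.8"}, reader)
```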

With will and determination, almost anything can be accomplished

I took a tool I'd loved since 1997 and pushed it further than I ever imagined. And I pushed myself in the process.

Technical Highlights

  • Lockless job distribution: ZeroMQ REQ/REP eliminates database polling bottlenecks — workers pull jobs on demand and the ventilator atomically marks them as dispatched
  • Two-phase reconnaissance: Lightweight ping scans filter live hosts before expensive service enumeration, reducing wasted cycles by 95%
  • CPE→CVE correlation: Automatic vulnerability scoring via NVD data — extracts CPE identifiers from banners, joins them against a CVE database, and computes CVSS scores per host (see the sketch after this list)
  • Denormalized search schema: Flattened host/service/OS data into single documents — trades storage for query performance, enabling single-query threat hunting
  • DTD-validated parsing: XML validation catches malformed scan results before pipeline corruption
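
As a rough illustration of the CPE→CVE step above, here is a sketch that joins extracted CPE strings against a locally built CVE index. The index structure, the example entry, and the scoring rule are assumptions, not the project's actual implementation:

```python
# Sketch of CPE -> CVE correlation against a locally mirrored NVD dataset.
# The cve_index structure, example entry, and scoring rule are illustrative.
from collections import defaultdict

# CPE string -> list of (CVE id, CVSS base score), built offline from NVD feeds.
cve_index = {
    "cpe:/a:proftpd:proftpd:1.3.5": [("CVE-2015-3306", 10.0)],
}

def correlate(host):
    """Join each service's CPE identifiers against the CVE index and score the host."""
    findings = defaultdict(list)
    for service in host.get("services", []):
        for cpe in service.get("cpes", []):          # CPEs extracted from Nmap banners
            findings[service["port"]].extend(cve_index.get(cpe, []))
    all_scores = [score for hits in findings.values() for _, score in hits]
    host["cves"] = dict(findings)
    host["max_cvss"] = max(all_scores, default=0.0)  # worst-case score for the host
    return host

host = correlate({"ip": "203.0.113.7",
                  "services": [{"port": 21, "cpes": ["cpe:/a:proftpd:proftpd:1.3.5"]}]})
```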

Skills Demonstrated

Distributed Systems Architecture (ZeroMQ patterns, worker fleet coordination, fault-tolerant queues)
Big Data Engineering (Apache Avro, Hadoop ETL, TB-scale processing)
Search Infrastructure (Elasticsearch cluster design, geospatial indexing, bulk optimization)
Security Engineering (Nmap orchestration, CVE correlation, CPE extraction)
Database Design (PostgreSQL with ARRAY/ENUM types, temporal status tracking)
Systems Thinking (End-to-end pipeline from raw scans to queryable intelligence)

The Personal Meaning

This project represents a graduation of sorts. From a teenager writing sendmail exploits to an engineer who mapped the entire internet. From reading "The Art of Port Scanning" in 1997 to building a platform that took that art to its logical extreme.

Nightcrawler is archived now — I have bills to pay, and it wasn't generating revenue. But the lessons live on in everything I build. The instinct for scale. The comfort with distributed systems. The understanding that the internet is both more fragile and more resilient than anyone realizes.

Would I bring it back? Absolutely. With modern LLMs, the intelligence layer could go places I couldn't imagine in 2016.

Why This Matters

Nightcrawler wasn't a product. It was a proof of capability.

If you can scan 4.3 billion addresses, process terabytes of data, and make it queryable in sub-second time — you can build almost anything. The same architectural patterns apply to:

  • Large-scale data pipelines
  • Distributed task processing
  • Real-time analytics platforms
  • Security monitoring systems

This is what internet-scale engineering looks like.

This case study documents a personal security research project that successfully mapped the entire IPv4 address space seven times, demonstrating expertise in distributed systems, big data processing, and search infrastructure at planetary scale.

Have a Similar Challenge?

Whether it's distributed systems, big data pipelines, or internet-scale infrastructure — I'd love to hear about it.
