
Address Search

Python · gRPC · Docker · Flask · Protocol Buffers

A multi-container distributed system with gRPC, load balancing, and LRU caching for Madison property lookups.

For my Big Data Systems class, I built a fault-tolerant distributed system that serves Madison property addresses through a multi-container architecture. This project introduced me to the fascinating world of building resilient systems that can gracefully handle failures.
The system consists of two layers:

  • Dataset Layer: Two replicated gRPC servers hosting the complete Madison address dataset, ensuring data availability even if one server fails
  • Cache Layer: Three HTTP servers with built-in LRU caches that distribute requests across the dataset servers and implement retry logic with exponential backoff

The cache layer provides:

  • Load balancing through round-robin request distribution between the two dataset replicas
  • Automatic failover with up to 5 retry attempts when a server goes down
  • An LRU cache of size 3, each entry storing up to 8 addresses for one zipcode, to reduce dataset server load (a sketch of this cache follows the list)
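
To make the eviction policy concrete, here is a minimal sketch of such a zipcode-keyed LRU cache built on Python's OrderedDict. The class and method names are my own illustration, not the project's actual code.

```python
from collections import OrderedDict

class ZipcodeLRUCache:
    """Tiny LRU cache mapping a zipcode to its (up to 8) addresses."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = OrderedDict()  # zipcode -> list of address strings

    def get(self, zipcode):
        if zipcode not in self.entries:
            return None
        self.entries.move_to_end(zipcode)  # mark as most recently used
        return self.entries[zipcode]

    def put(self, zipcode, addresses):
        self.entries[zipcode] = addresses[:8]  # keep at most 8 addresses per zipcode
        self.entries.move_to_end(zipcode)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used zipcode
```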

The architecture uses Protocol Buffers for efficient serialization, gRPC for high-performance inter-service communication, and Flask for the HTTP interface. Everything is orchestrated with Docker Compose, making the entire system reproducible and deployable with a single command.
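
To show how these pieces fit together, here is a hedged sketch of a cache-layer Flask endpoint calling a dataset replica over gRPC. The addresses_pb2 modules, the AddressLookup service, and all field and hostname choices are hypothetical stand-ins; the real generated stubs depend on the project's .proto file.

```python
import grpc
from flask import Flask, jsonify

# Hypothetical modules generated by protoc from the project's .proto file.
import addresses_pb2
import addresses_pb2_grpc

app = Flask(__name__)

# Channels to the two dataset replicas; hostnames assume Docker Compose service names.
channels = [grpc.insecure_channel("dataset-1:5000"),
            grpc.insecure_channel("dataset-2:5000")]
stubs = [addresses_pb2_grpc.AddressLookupStub(channel) for channel in channels]

@app.route("/lookup/<zipcode>")
def lookup(zipcode):
    # A real handler would consult the LRU cache first and add failover (sketched below).
    request = addresses_pb2.AddressRequest(zipcode=zipcode)
    response = stubs[0].GetAddresses(request)
    return jsonify({"addresses": list(response.addresses)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```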

What fascinated me most was designing the retry logic and cache invalidation strategy. When a dataset server fails, the cache layer automatically routes requests to the healthy server after a 100ms sleep. The LRU cache ensures that even if all dataset servers are down, users can still access recently queried zipcodes, providing graceful degradation rather than complete failure.
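
Here is an illustrative version of that failover loop, combining round-robin replica selection with up to 5 attempts and a backoff that starts at the 100ms pause mentioned above; the stubs list and RPC names carry over from the hypothetical sketch earlier.

```python
import time

import grpc

import addresses_pb2  # hypothetical protoc-generated module from the earlier sketch

MAX_RETRIES = 5

def lookup_with_failover(stubs, zipcode, start=0):
    """Round-robin across dataset replicas, retrying with exponential backoff."""
    for attempt in range(MAX_RETRIES):
        # Alternate between replicas so a dead server is skipped on the next try.
        stub = stubs[(start + attempt) % len(stubs)]
        try:
            request = addresses_pb2.AddressRequest(zipcode=zipcode)
            return stub.GetAddresses(request)
        except grpc.RpcError:
            # Server down or unreachable: pause (100ms, 200ms, 400ms, ...) and fail over.
            time.sleep(0.1 * (2 ** attempt))
    raise RuntimeError("all dataset servers are unavailable")
```

If every attempt fails, the handler can still answer from whatever the LRU cache holds, which is exactly the graceful-degradation behavior described above.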

This project opened my eyes to the complexities of distributed systems and fault tolerance. It made me reflect on BadgerBase, my course search platform serving 2000+ UW-Madison students. Currently, BadgerBase runs on a single-instance architecture: a MySQL database on Railway and a Hono API, each without a replica or failover path. While the GitHub Actions-based data pipelines provide some resilience through scheduled retries, the core infrastructure lacks the redundancy I implemented in this gRPC project.

Moving forward, I'm considering how to apply these fault-tolerance principles to BadgerBase. Some ideas include:

  • Implementing read replicas for the MySQL database to handle increased load and provide failover capability
  • Adding a Redis caching layer, similar to the LRU cache in this project, to reduce database queries (see the sketch after this list)
  • Deploying multiple API server instances behind a load balancer
  • Setting up health checks and automatic container restarts
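
As a starting point for the Redis idea, here is a hedged cache-aside sketch using the redis-py client. It is written in Python for consistency with the rest of this post even though BadgerBase's API is built with Hono, and the key scheme, TTL, and fetch_course_from_mysql callback are hypothetical.

```python
import json

import redis  # redis-py client

# Hypothetical connection details; in production these would come from config.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
TTL_SECONDS = 300  # assumed freshness window for course data

def get_course(course_id, fetch_course_from_mysql):
    """Cache-aside lookup: try Redis first, fall back to MySQL and backfill."""
    key = f"course:{course_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: query the database, then populate the cache with a TTL.
    course = fetch_course_from_mysql(course_id)
    cache.set(key, json.dumps(course), ex=TTL_SECONDS)
    return course
```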

Building fault-tolerant systems is incredibly exciting to me because it's about designing for reality—systems fail, networks partition, and servers crash. Creating software that anticipates and handles these failures gracefully is both an engineering challenge and an art form. This project was just the beginning of my journey into distributed systems, and I can't wait to apply these lessons to make BadgerBase more resilient for its growing user base.
