If you read my first blog on system design, you already know the why: why system design matters, what components a system has, and how to think about building software at scale.
But knowing the why is not enough.

Today, we go one level deeper. We talk about the actual building blocks that every large-scale system is built on top of. Things like: what even is a server? What is latency, and why does it matter? How do companies like Netflix and Swiggy handle millions of requests without their systems crashing?
These are not just interview questions. These are real engineering problems that real teams solve every single day.
So let’s get into it. No jargon, no over-complication. Just clean explanations with real examples.
Why You Should Study System Design (Even if You’re a Fresher)
Let’s start with a very honest question: why should a beginner care about system design at all?

Here’s a scenario most developers go through.
You build a project in college. A simple backend in Node.js or Django. A database. A basic frontend. You show it to friends, it works perfectly. You feel great.
Then you graduate, join a company, and realise the real world is completely different.
Real applications have to handle thousands of users at the same time. They have to be up 24×7. They cannot crash when traffic spikes. Data cannot be lost. Responses have to be fast, even for users sitting in different cities or countries.
Your college project was a prototype. A production system is a different beast entirely.
And that is exactly why system design matters: it teaches you how to think beyond just writing code. It teaches you how to build systems that actually survive the real world.
Even if you are a beginner today, learning system design early will change how you think about software. You will stop thinking in functions and start thinking in systems. That shift in thinking is what separates a good engineer from a great one.
What is a Server? (Really)
Okay, let’s start from absolute zero.
A server is simply a computer, a physical machine, that runs your application code and responds to requests coming from users.

When you run your Node.js application on your laptop, it runs on http://localhost:8080. That localhost is just a fancy name for your own machine. The IP address behind it is 127.0.0.1, the loopback address, which always points back to your own machine.
Now, when someone on the internet wants to access a website like https://swiggy.com, here is what actually happens under the hood:
- Your browser takes swiggy.com and sends it to a DNS resolver (DNS stands for Domain Name System). Think of DNS like a phone directory: it converts human-readable names like swiggy.com into actual IP addresses like 13.245.88.21.
- Your browser now has the IP address. It sends the request directly to that IP.
- The server at that IP receives the request. But a server can run multiple applications at the same time (just like your laptop runs Chrome, Spotify, and VS Code together). The server uses the port number to figure out which application should handle the request.
- The correct application processes it and sends back a response.
So when you type https://swiggy.com, you are essentially typing 13.245.88.21:443 – where 443 is the default port for HTTPS.
People don’t memorise IP addresses, so they buy domain names and point them to their server’s IP. That is the whole magic.
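You can see the name-to-IP step yourself with Python’s standard library. A minimal sketch, using localhost and the default HTTPS port mentioned above (nothing here talks to a real DNS server for an external domain):

```python
import socket

# Resolve a hostname to an IP address, the same job a DNS resolver does.
def resolve(hostname: str) -> str:
    return socket.gethostbyname(hostname)

# Default ports: the port tells the server WHICH application on the
# machine should handle the request.
DEFAULT_PORTS = {"https": 443, "http": 80}

ip = resolve("localhost")                # localhost always maps to 127.0.0.1
print(ip)                                # -> 127.0.0.1
print(f"{ip}:{DEFAULT_PORTS['https']}")  # -> 127.0.0.1:443
```

Swap in any real domain for "localhost" and resolve() returns that server’s public IP, exactly the lookup your browser performs before sending a request.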
How Do Companies Deploy Their Applications?
In college, your application ran on your laptop. In real life, companies rent servers from cloud providers.
The big names here are:
- AWS (Amazon Web Services)
- GCP (Google Cloud Platform)
- Azure (Microsoft)
These cloud providers give you a virtual machine, a software-defined computer running on their physical hardware. In AWS, this is called an EC2 Instance. You can run your application on it, and since it has a public IP address, anyone on the internet can access it.
The process of putting your code from your laptop onto this virtual machine and making it live on the internet is called deployment.
Latency and Throughput: Two Numbers That Define Your System’s Performance

You will hear these two words constantly in system design. Let’s make sure you understand them deeply, not just definitionally.
Latency
Latency is the time it takes for a single request to go from the client to the server and come back with a response.
It is measured in milliseconds (ms).
If you open Zomato on your phone and the food listings appear in 300ms, the latency is 300ms. If it takes 3 seconds, the latency is 3000ms, and you are already getting frustrated.
There is another related term, Round Trip Time (RTT): the total time for a request to leave your device, reach the server, and return. Strictly speaking, latency can also refer to the one-way delay, but in practice RTT and latency are often used interchangeably.
Low latency = fast system. High latency = slow, painful experience.
Throughput
Throughput is the number of requests a system can handle per second.
Every server has a limit. If your server can handle 500 requests per second, and suddenly 2000 requests come in at the same time, your server will either slow down massively or crash.
Throughput is measured in Requests Per Second (RPS) or Transactions Per Second (TPS).
A simple way to remember the difference:
Latency is how fast one car travels from point A to point B.
Throughput is how many cars a highway can handle in one hour.
You can have a fast highway (low latency), but if it is narrow (low throughput), it will jam up the moment volume increases.
In an ideal system, you want low latency AND high throughput: fast responses, and the ability to handle a large number of them simultaneously.
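To make the two numbers concrete, here is a tiny self-contained sketch. The handle_request function with its 2 ms of “work” is a made-up stand-in for a real server, purely for illustration:

```python
import time

# Stand-in for handling one request (the 2 ms of "work" is invented).
def handle_request():
    time.sleep(0.002)

# Latency: the time for ONE request, in milliseconds.
start = time.perf_counter()
handle_request()
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.1f} ms")

# Throughput: how many requests complete in a fixed window (0.5 s here),
# reported as requests per second (RPS).
count, window_start = 0, time.perf_counter()
while time.perf_counter() - window_start < 0.5:
    handle_request()
    count += 1
print(f"throughput: {count / 0.5:.0f} RPS")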
Why Does This Matter in Real Life?
Let’s say you built a payment gateway.
- If latency is high, users wait 5-6 seconds after clicking “Pay Now”. They panic, hit the button again, and create duplicate transactions.
- If throughput is low, during a big sale (like Flipkart’s Big Billion Day), thousands of simultaneous payments will queue up, time out, and fail.
Both problems cost real money and real trust. This is why engineers obsess over these two numbers.
Scaling: The Most Important Concept in System Design
Here is a scenario that happens more often than you think.
A startup builds a product. It launches. Gets featured on some blog or goes viral on Twitter. Suddenly, instead of 100 users, they have 50,000 users hitting the server at the same time.
And the server crashes.
This is not a code bug. The code is fine. The problem is that the system was never designed to handle that kind of load.
Scaling is the solution. It means increasing your system’s capacity to handle more users and more traffic.
There are two fundamentally different ways to scale: Vertical Scaling and Horizontal Scaling. Let’s understand both properly.
Vertical Scaling (Scaling Up or Down)
Vertical scaling means making your existing machine more powerful.
More RAM. More CPU cores. Faster storage. Bigger machine.
Imagine you have one waiter at a restaurant. Customers keep coming. To handle more customers, you train that one waiter to work faster, give them better shoes, a better notepad, a faster POS system. You are making the same person more capable, that is vertical scaling.
In technical terms, if your EC2 instance has 4 GB RAM and 2 CPUs, you upgrade it to 32 GB RAM and 8 CPUs. Same machine, more power.
When is Vertical Scaling used?
- SQL databases: Most relational databases like PostgreSQL and MySQL are much easier to run on a single powerful machine than distributed across many servers. Vertical scaling is the default choice here.
- Stateful applications: If your application stores some state in memory (like a session or a cache), splitting it across multiple machines gets complicated. One powerful machine keeps things simple.
The Problem with Vertical Scaling
Hardware has limits. You cannot infinitely upgrade a machine. At some point, the biggest server money can buy is not enough. And that one machine is also a single point of failure: if it goes down, your entire system goes down with it.
This is why vertical scaling alone is never the complete answer.
Horizontal Scaling (Scaling Out or In)
Horizontal scaling means adding more machines instead of making one machine bigger.
Going back to the restaurant example, instead of training one super-waiter, you hire more waiters. More staff, more capacity. That is horizontal scaling.
In technical terms, instead of upgrading one EC2 instance, you add more EC2 instances and distribute the traffic across all of them.
But here is the question: how do users know which server to go to?
Users are not technical. You cannot ask them to remember multiple IP addresses. They should just type yourdomain.com and the system should handle the rest.
This is where the Load Balancer comes in.
Load Balancer: The Traffic Cop of Your System
A Load Balancer sits in front of all your servers. Every request from every user goes to the load balancer first. The load balancer then decides which server should handle that request and forwards it accordingly.

Users never directly talk to your servers. They talk to the load balancer, which acts as the single entry point.
The load balancer uses different strategies to decide which server gets the next request:
- Round Robin: Distribute requests one by one in rotation. S1 gets request 1, S2 gets request 2, S3 gets request 3, back to S1. Simple and fair.
- Least Connections: Send the next request to whichever server is currently handling the fewest active requests. Smarter than round robin because not all requests take the same time to process.
- IP Hashing: The same user always gets sent to the same server. Useful when session data is stored locally on the server.
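All three strategies can be sketched in a few lines. The server names S1–S3 are placeholders; in production, a load balancer like NGINX or an AWS ALB implements these for you:

```python
import hashlib
from itertools import cycle

servers = ["S1", "S2", "S3"]

# Round Robin: hand out servers in strict rotation.
rr = cycle(servers)
def round_robin():
    return next(rr)

# Least Connections: pick the server with the fewest active requests.
active = {s: 0 for s in servers}
def least_connections():
    return min(active, key=active.get)

# IP Hashing: the same client IP always maps to the same server.
def ip_hash(client_ip: str) -> str:
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print([round_robin() for _ in range(4)])  # -> ['S1', 'S2', 'S3', 'S1']
active.update({"S1": 5, "S2": 1, "S3": 3})
print(least_connections())                # -> S2
print(ip_hash("10.0.0.7") == ip_hash("10.0.0.7"))  # -> True (sticky)
```

Note how IP hashing gives you “stickiness” for free, which is exactly why it helps when session data lives on a specific server.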
Benefits of Horizontal Scaling with a Load Balancer:
- No single point of failure: if one server crashes, the others keep handling traffic
- Easy to add or remove servers based on demand
- You can scale to virtually unlimited capacity by just adding more machines
The Trade-off:
Horizontal scaling introduces complexity. If your application stores user session data in memory on the server, and the user’s next request goes to a different server, that session data is missing. You have to redesign your application to be stateless, all state lives in a shared database or cache, not on the individual server.
This is a critical design decision that affects your entire architecture.
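Here is a minimal sketch of the stateless approach. A plain dict stands in for a shared store like Redis or a database; the point is that any server can read any user’s session:

```python
# Shared store every server can reach. In production this would be
# Redis or a database; a plain dict stands in here.
shared_sessions = {}

# Stateless design: no server keeps sessions in its own memory.
# Login can happen on one server, the next request on another.

def login(user_id: str, token: str):
    shared_sessions[token] = user_id   # any server can write

def handle(server: str, token: str):
    user = shared_sessions.get(token)  # any server can read
    return f"{server} serving {user}" if user else f"{server}: please log in"

login("alice", "tok-123")              # request landed on S1
print(handle("S2", "tok-123"))         # -> S2 serving alice
print(handle("S3", "unknown-token"))   # -> S3: please log in
```

Because the session lives in the shared store rather than in one server’s memory, the load balancer is free to send each request anywhere.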
Auto Scaling: Letting the System Scale Itself
Manual scaling has a big problem: humans are slow.
Imagine it is midnight and suddenly a news article about your product goes viral. Traffic spikes 10x in 5 minutes. By the time your team wakes up, sees the alerts, and manually adds servers, half your users have already left because your site was down.
Auto Scaling solves this. It is a mechanism where your infrastructure automatically adds or removes servers based on real-time traffic and resource usage, without any human intervention.
Here is how it works in practice:
Let’s say you define a rule: “If the CPU usage of my servers goes above 70% for more than 2 minutes, automatically launch one more server and add it to the load balancer.”
Similarly: “If CPU usage drops below 20% for 10 minutes, terminate one server to save cost.”
The Auto Scaling system monitors these metrics continuously. When the threshold is crossed, it automatically provisions or terminates servers. This entire process happens in minutes, not hours.
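The two rules above can be written as a simple decision function. The thresholds and time windows are the example values from the text, not universal defaults; real auto scaling systems evaluate metrics in much the same shape:

```python
# Example rules: scale up if CPU > 70% for 2 minutes,
# scale down if CPU < 20% for 10 minutes.
SCALE_UP_CPU, SCALE_UP_MINUTES = 70, 2
SCALE_DOWN_CPU, SCALE_DOWN_MINUTES = 20, 10

def scaling_decision(cpu_history):
    """cpu_history: one average-CPU sample per minute, newest last."""
    if len(cpu_history) >= SCALE_UP_MINUTES and all(
        c > SCALE_UP_CPU for c in cpu_history[-SCALE_UP_MINUTES:]
    ):
        return "add one server"
    if len(cpu_history) >= SCALE_DOWN_MINUTES and all(
        c < SCALE_DOWN_CPU for c in cpu_history[-SCALE_DOWN_MINUTES:]
    ):
        return "remove one server"
    return "do nothing"

print(scaling_decision([55, 72, 81]))   # -> add one server
print(scaling_decision([15] * 10))      # -> remove one server
print(scaling_decision([40, 50, 60]))   # -> do nothing
```

The “for more than 2 minutes” part matters: it stops a single noisy spike from launching servers you don’t need.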

The Cost Angle: Why Auto Scaling Saves Real Money
Let’s say your application needs 10 servers during normal hours and 100 servers during peak hours (like dinner time for a food delivery app).
Option 1: Keep 100 servers running always, you always have capacity, but you pay for 90 servers that are sitting idle for most of the day.
Option 2: Auto Scaling, during off-peak hours, only 10 servers run. During dinner time, it scales up to 100 automatically. You only pay for what you use.
In cloud computing, you are billed per hour per server. For a mid-size company with 50 servers, auto scaling can save lakhs of rupees every month in infrastructure costs.
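The arithmetic behind the savings is easy to check. Assuming a made-up rate of $0.10 per server-hour and a 4-hour daily peak (both numbers are illustrative, not real AWS pricing):

```python
# Rough daily cost comparison for the 10-vs-100-servers scenario.
RATE = 0.10                # $ per server-hour (illustrative)
PEAK_HOURS, OFF_HOURS = 4, 20

always_on = 100 * 24 * RATE                           # 100 servers, all day
auto_scaled = (100 * PEAK_HOURS + 10 * OFF_HOURS) * RATE

print(f"always on:   ${always_on:.2f}/day")           # -> $240.00/day
print(f"auto scaled: ${auto_scaled:.2f}/day")         # -> $60.00/day
print(f"savings:     {1 - auto_scaled / always_on:.0%}")  # -> 75%
```

A 75% cut on the compute bill, just from not paying for idle machines, and the ratio only improves as the gap between peak and off-peak traffic grows.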
How Do You Know Where to Set the Threshold?
This is a great question that most beginners miss.
You do not guess. You do Load Testing.
Load testing means simulating high traffic on your system before it goes live, and observing at what point your server starts degrading: CPU goes above 80%, response times double, errors appear.
That point tells you: “One server can handle X users before it starts struggling.” You set your auto scaling threshold a little before that point, say 70% CPU, so a new server is ready before users start experiencing slowness.
Tools for load testing include k6, Locust, JMeter, and Artillery.
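The core idea of those tools fits in a few lines. This sketch fires 200 concurrent “requests” at a fake 5 ms handler and reports latency percentiles; a real load test would hit a staging server over HTTP instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one API call; k6, Locust, JMeter, or Artillery would
# send real HTTP requests to a staging server instead.
def fake_request(_):
    start = time.perf_counter()
    time.sleep(0.005)                            # pretend 5 ms of server work
    return (time.perf_counter() - start) * 1000  # latency in ms

# Fire 200 "requests" with 20 concurrent workers.
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = list(pool.map(fake_request, range(200)))

latencies.sort()
print(f"p50 latency: {latencies[len(latencies) // 2]:.1f} ms")
print(f"p99 latency: {latencies[int(len(latencies) * 0.99)]:.1f} ms")
```

In a real test you would keep raising max_workers until p99 latency or the error rate shoots up; that breaking point is where your auto scaling threshold goes.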
Putting It All Together: A Real-World Picture
Let’s combine everything we have learned and trace a real request through a properly designed, scalable system.
You open Swiggy on your phone and search for “biryani near me.”
- Your app sends a request to api.swiggy.com
- DNS resolves this to the IP address of Swiggy’s Load Balancer
- The load balancer picks the least-loaded server from a pool of, say, 50 servers and forwards your request
- The server processes your request, queries the database for nearby restaurants, applies filters, ranks results
- The response comes back through the load balancer to your phone
- This entire round trip happens in under 200ms, that is low latency
- While your request is being processed, 100,000 other users are simultaneously making their own requests, the load balancer is distributing all of them, that is high throughput
- At 8 PM, dinner rush hits. CPU across all servers spikes. Auto Scaling kicks in, adds 20 more servers automatically. Load balancer starts sending traffic to them as well. Users notice nothing.
This is not magic. This is careful system design.
Key Takeaways from This Blog
Let me summarise the core ideas before we close:

Servers are machines that run your application. In the real world, you rent them from cloud providers like AWS as virtual machines (EC2 instances).
Latency is how fast a single request gets processed. Lower is better.
Throughput is how many requests your system can handle per second. Higher is better.
Vertical Scaling means making one machine more powerful. It has limits and creates a single point of failure.
Horizontal Scaling means adding more machines. It requires a Load Balancer to distribute traffic, and it requires your application to be stateless.
Auto Scaling means the system automatically adds or removes servers based on traffic patterns, saving cost and preventing outages.
What’s Coming Next?
In the next blog, we will cover:
- Back-of-the-envelope estimation: how to estimate how many servers, how much storage you need before even writing a line of code
- CAP Theorem: one of the most important and misunderstood ideas in distributed systems
- Database Scaling: indexing, partitioning, master-slave architecture, and sharding
These topics are where system design really starts getting interesting, and also where most interviews go deep.
If you have any questions or want me to explain anything in more detail, drop a comment. I read every single one.
Follow for the next part of this series. And if this helped you, sharing it with one friend who is preparing for system design interviews is the best thing you can do.