Tinyurl system design

 

Image for post

What is TinyURL?

TinyURL is a URL shortening web service, which provides short aliases for redirection of long URLs.

TIPS: Always clarify the assumptions for use case of the system you are going to design.

Understanding the scale:

  1. UserBase: 30M users

2. Short URL length: 7 characters ( su.com/_______) [shorturl.com]

Data capacity model: How much data we insert into the system? Attributes we are gonna save are as follows:

Long URL: 2KB → 2048 chars

Short URL: 17 bytes → 17 chars

Created at: (epoch time) 7 bytes → 7 chars

Expire at: 7 bytes → 7 chars

Data per entry: → 2.031 KB

For 30 M users: 60.7 GB/Month,0.7 TB/year, 3.6 TB/5 years. Based on these data we can assume read/write our DB needs which will eventually help us to decide for RDBMS or NoSQL DB.

7 char random ID.

Image for post

We have two algorithms for implementing generating 7 chars unique ID:

  1. Base 62: Take int/long and gives out base62 output which contains int or char(small/caps), safest bet.
  2. MD5:Take unique string (long URL) and gives base62 output, gives a lengthy output of which we take only 7 chars which can result in lots of collisions. same unique id leads to data corruption.
  3. why base 62 why not base 10? if we use base 10 our unique id comprises only of ints and in base 62 it’s a combination of A-Z, a-z,0–9. Base 10 gives only 10⁷ i.e 10 Million combinations whereas BASE 62 gives 62⁷ i.e 3.5 Trillion.

BASE 62 conversion code:

Image for post

Which DB to use?

RDBMS: Gives ACID but the problem is scaling. we can use sharding to scale but sharding increases complexity when we want our system to be horizontally scaled as we have 30M users there will be lots of reads and writes.

NoSQL: The problem is Eventual consistency but highly available and easy to scale.

Now we have the mapping of our long-URL to the short-URL and we need to insert this record in the database.

long URL: www.longurl.com/1230–93908435430923somerandomnumbers

short URL: www.su.com/efrjktl

We can’t directly insert this mapping into the database. Before insert, we need to check whether the newly generated short URL already exists in our DB as we have generated a unique id from some random number, the possibility is we have already generated that random number. If already present, generate a new unique id. This works fine when we have only one system and not dealing with distributed system, with no parallel process trying to insert at one point in time.

Image for post

What if you have two different app servers and at one point in time they convert long URL A and long URL B to a short URL. However both of them got the same random number, as a result, generate same 7 char short URL, i.e we have same short URL for different long URL. Both the app server checks if the same short URL exists in DB or not and both of them insert mapping into the table. Now data is nicely corrupted.

one way to resolve that is to have a unique id constraint on a short URL and this way whoever trying to insert last will need to rework their mapping. In RDBMS, we have DB level function like put if not exists which avoid such kind of corruption. In NoSQL, it’s not possible.

Technique 2: Let’s use MD5 to resolve these issues. Now our shorl URL generator is based on M5. what happens here if we give the same long URL it always gives the same hash code (consists of caps, small letters, and numbers) i.e same short URL. From the hash code, we use only 7 char and insert them into the DB. Again there are chances of collision in the DB. The good thing is there’s always a mapping present in the DB i.e if two users are trying to access the same long URL ‘A’, chances are there that particular mapping is present in the DB and hence no need to insert. What if the long URL is different?? we need to change the input and generate the mapping and if not present insert it into the DB. Again here we can’t use NoSQL.

Technique 3: use counter, it’s thread-safe.

Image for post

This works really well until we scale as we can’t maintain one counter for all the servers.

To really resolve this we have to use the apache foundation library-zookeeper.

Image for post

Zookeeper is distributed coordination service to manage multiple host machines.

Image for post

Zookeeper names the server and assigns an ID to these servers as they are added to the distributed network, chose the master, makes note of all the active nodes and absent nodes. It also keeps some configuration information for all these hosts. Also, provide a coordinated DB-like filesystem where all these hosts can write the data. And that's how race condition is avoided.

Distributed URL shortener using ZOOKEEPER with ZERO collision

Image for post

We have already seen how to resolve the collision problem. The only issue we were facing is how to maintain the counter, from where to start the count in case a service goes down.

Let’s understand how these app servers get counter reaches?

Consider each and every app server that has a counter running in it. In this system, we have added these servers recently. How do and from where do these servers get to know from where to begin the count and when they will get out of count. As soon as added to the system, they will contact the zookeeper and register themselves in the zookeeper. Zookeeper knows okay three AS are online and these servers now ask for the ranges In the memory of ZK we have the range. out of 3.5 trillion combinations we have to break in some range. Let’s say 1L-2L is one batch and so on..they are unassigned and are free.

As soon as these services ask for range, counter ranges are assigned to each of the services. let’s say from 1L-2L is assign to A1. so when a request comes it uses one of the numbers from this range and gets converted to the shortURL. since we know that there is no possibility of collision we will straight add it into the DB → high performance, and cache this entry in the cache.

what if A3 goes down? this server has the range 3l-4l. zookeeper immediately gets to know about this. this range is unused and will be kept unused for a while (mark for already use).

If a new server, A4 gets added, it talks to the zookeeper and gets all the configuration and new range (4L-5L) what if all the counter runs out here? This means that the server knows it has no number to count, it immediately content zookeeper, zookeeper marks (4L-5L ) as used and assigned a new range to the a$.

This works seamlessly. Here we can go for any DB, RDBMS, or NoSQL, highly scalable.

That’s all about “TinyURL system design”. I hope you get the basic idea behind this. In case of any doubts and improvement feel free to comment.

Please like, share, and comment if you want to add something or if you have any queries. Stay tuned for more such blogs.

Comments