In this problem we are required to build a Storage Service that supports the following features –
● Storage should be able to store very large files (For example, In TBs)
● File Upload and Download should be possible
● Storage should be reliable and durable, and the files stored should not be lost.
● Compression and Encryption should be possible (Optional)
● Analytics should be possible (Optional)
How do we store large/ media files?
There are many systems like Facebook newsfeed and others (which allows users to write posts). These posts can include both text and media (images, videos or any other) files. We don't store the media files in the database. We only store the metadata for the post (user_id, post_id, text, timestamp, etc.). The images/media are stored on different Storage Systems; and from that particular storage system - we get a URL to access the media file. This URL is stored in the database file.
We will further discuss on how to store these large files (not only images but very large files, say a 50 TB file). A large file can be a large video file or a log file containing the actions of the users (login, logout, and other interactions with the system that includes request and response details etc.), and it can keep increasing in size.
One way to store a large file is to divide it into chunks and store the chunks on to different machines. So, suppose a 50 TB file is divided into chunks. Choosing the size of the chunk could be tricky - What will be the size of the chunks? If you divide a 50 TB file into chunks of 1 MB each, the number of parts will be
50TB / 1MB = (50 * 10^6) MB / 1 MB = 5 * 10^7 parts
From this, we can conclude that if we keep the size of the chunk very small, then the number of file parts will be very high. It can result in issues like -
Collation of the parts: Concatenating too many files/ chunks and returning them to the client will be overhead.
Cost of entries: We must keep metadata for the chunks, i.e., for a given chunk of a file, it is present on which machine. If we store metadata for every file, this will also be overhead.
HDFS
HDFS stands for Hadoop Distributed File System. Below are certain terminologies related to HDFS:
● The default chunk size is 128 MB in HDFS 2.0. However, in HDFS 1.0, it was 64 MB.
● The metadata table we maintain to store chunk information is known as the ‘NameNode Server’. It keeps mapping that chunks are present on which machine (aka Data Nodes) for a certain file. Say, for File 1, chunk 1 is present on Data Node 3 (machine).
● In HDFS, there will be only one name node server, and it will be replicated.
Now, the question could be on what basis the chunk size is selected as 128 MB.
The reason is that large file systems are built / optimized for certain operations like storing, downloading large files, or doing some analytics. And based on the types of operations, benchmarking is done to choose a proper chunk size. It is like ‘what is the normal file size for which most people are using the system’ and keeping chunk size accordingly so that system's performance is best.
For example,
Chunk size of X1 performance is P1
Chunk size of X2 performance is P2,
Similarly, doing benchmarking for different chunk sizes and then choosing the chunk size that gives the best performance.
In a nutshell, we can say benchmarking is done for the most common operations which people will be doing while using the system, and HDFS comes up with a value of default chunk size.
Making System Reliable: We know that to make the distributed system reliable, we never store data on a single machine; we replicate it on multiple machines. Here also, a chunk cannot be stored on just a single machine. To make the system reliable, chunk in stored on multiple machines. We will keep chunks on different data nodes and replicate them on other data nodes so that even if a machine goes down, we do not lose a particular chunk.
Rack Aware Algorithm: For more reliability, keep chunks data on different machines that resides on different racks so that we do not lose our data even if a rack goes down. We avoid replicating the chunks on the machines of the same rack. This is because if there comes an issue with the power supply, the rack will go down, and data won't be available anywhere else.
So, this was about chunk divisions and storing them on HDD.
Question: How the division of large input file happens into various chunks?
The answer is it depends on the use case:
● Suppose there is a client who wants to upload a large file. The client requests the App Server (Storage Service) and starts sending the stream of data. The app server on the other side has a client (HDFS client) running on it.
● HDFS also has a NameNode server to store metadata and Data Nodes to store the actual data.
● The app server will call the name node server to get the default chunk size, NameNode server will respond to it (say, the default chunk size is 128 MB).
● Now, the app server knows that it needs to make chunks of 128 MB. As soon as, the App Server collects 128 MB of data (equal to the chunk size) from the data stream, it sends the data to a data node after storing metadata about the chunk. Metadata about the chunk is stored in the name node server. For example, for a given File F1, chunk n Cn is stored in Data Node3 DN3.
● The client keeps on sending a stream of data, and again when the data received by the app server becomes equal to chunk size 128 MB (or the app server receives the end of the file), metadata about the chunk is stored in the name node server first and then chunk it send to the data node.
Briefly, the app server keeps receiving data; as soon as it reaches the threshold, it asks the name node server, 'where to persist it?', then it stores the data on the hard disk on a particular data node received from the name node server.
Few points to consider:
For a file of 200MB, if the default chunk size is 128 MB, then it will be divided into two chunks, one of 128 MB and the other of 72 MB because it is the only data App Server will be receiving for the given file before the end of the data stream is reached.
The chunks will not be saved on a single machine. The data will be replicated on various machines, and we can have a master-slave architecture where the data saved on one node is replicated to two different nodes.
We don’t expect very good latency for storage systems with large files since there is only a single stream of data.
Downloading a File
Similar to upload, the client requests the app server to download a file.
● Suppose the app server receives a request for downloading file F1. It will ask the name node server about the related information of the file, how many chunks are present, and from which data nodes to get those chunks.
● The name node server returns the metadata, say for File 1, go to data node 2 for chunk 1, to data node 3 for chunk 2, and so on. The application server will go to the particular data nodes and will fetch the data.
● As soon as the app server receives the first chunk, it sends the data to the client in a data stream. It is similar to what happened during the upload. Next, we receive the subsequent chunks and do the same.
Thank you so much for reading this article. If you found this helpful, please do share it with others and on social media. It would mean a lot to me.