HYBRID FILE SYSTEM STORING LARGE FILES ON CLOUD
Linux File System
Course and Teammates
For this project, I implemented CloudFS, a cloud-backed local file system that uses the local SSD to store small files and Amazon S3 to store large files. To users, CloudFS has the exact same API and behavior as a linux file system. A key goal of CloudFS is to minimize cloud cost, including cost of capacity, operation, and data transfer. To minimize cloud cost, I implemented segment-level deduplication using rabin fingerprinting, and write-back caching. In addition, CloudFS supported taking snapshots of the file system and restoring from snapshots.
Due to CMU's academic integrity policy, I will not post my code. Instead, I will focus on the design principles and major design decisions I made.
Nov - Dec 2019
Scoreboard: 2nd place out of 88
Score based on total cloud usage cost
Linux system calls supported
init, destroy, getattr, getxattr, setxattr, readdir, opendir, mkdir, rmdir, mknod, open, release, read, write, truncate, unlink, link,
symlink, readlink, utimens, access, chmod, ioctl
CloudFS is essentially a wrapper application outside of a basic linux file system. We attempt to achieve the best of both worlds: the extensibility of cloud storage, and the fast access speed and low cost of local SSDs. We use the FUSE (Filesystem in Userspace) framework to run linux file system code in user space. Several key functionalities CloudFS provides on top of a linux file system are:
To save space on the local SSD, all files larger than 64 KB are stored in AWS S3. For these large files, we only store their metadata locally. When users need to access a large file, we download the large file temporarily from cloud.
To minimize cloud storage cost, we use deduplication to store duplicate content across files only once on cloud. We use Rabin Fingerprinting, which splits files into varying size segments. The splitting algorithm is content-specific and can detect duplications even with data misalignment.
We build a write-back cache on the local SSD that persists across mounts to further decrease cloud cost. I chose to use the LRU-K policy for cache-eviction.
We also support taking snapshots of the file system, and restoring/installing/deleting snapshots. Snapshots are packaged into tar files and uploaded to cloud.
This project took around 3000 lines of C to complete. Due to this being a course project, an assumption was that the system would not crash between operations, so I did not implement techniques such as write-ahead logging to support recovery from failures.
In this hybrid file system, only small files are stored locally. Large files residing on the cloud (call these cloud files) are stored locally as proxy files with the same path and file names. I use the extended attributes of linux files to maintain filesize, blockcount, and modified time of proxy files, as well as whether a file is a proxy file, how many opens are on a file (open_count), and whether a cloud file opened locally has been modified (dirty). I did not have to store the other file attributes because they will be correctly maintained by the proxy file.
When release or truncate is called on a file with open_count 0, if the file size grows above 64 KB, it will be uploaded to Amazon S3 and the local file is truncated to become a proxy file. When open is called on a cloud file, I download the file from cloud back to its local path, and and recover its extended attributes. When the cloud file is released again, if it is still larger than 64 KB, I truncate it back to a proxy file and upload to cloud if it is dirty. This whole process becomes more involved when we add deduplication.
There were a lot of edge cases to consider. For example, for read-only files that do not allow modifications to extended attributes, I temporarily change the permissions to 777 when modifying.
To save cloud storage cost, we use deduplication to store duplicate segments across files only once on cloud. We separate each file into segments by Rabin Fingerprinting, which uses rolling hash. Each segment is represented by its MD5 hash and each proxy file of some cloud file maintains a list of segment MD5 hashes. I used a hash table to store the length and reference counts (refcount) of all segments. A segment is deleted from cloud when its refcount reaches 0, and uploaded to S3 with its MD5 hash as its S3 object name when it first enters the hash table. The hash table is serialized and persisted to disk upon unmount.
The largest challenge is when accessing cloud files. It is costly to read all segments of a large cloud file back into a local file. Since all linux read/write/truncate operations provide offsets or sizes, we only download necessary segments depending on the offset from the cloud.
For example, to write bytes to a cloud file at an offset, the segment hashes strictly before the offset remain untouched, so we can leave them be. We start from the segment that previously contains the offset, and combine the old bytes before the offset with the new bytes to write. We feed these bytes to Rabin Fingerprinting repeatedly to generate new segments. Note that old bytes after the modification range are also untouched. However, the boundaries of Rabin Fingerprinting have changed, so we have to recalculate the boundaries and segment hashes for them too.
One optimization is to process segments after the modification range greedily, and to stop processing as soons as Rabin boundaries stop changing. Another optimization is to avoid writing actual file content to the proxy file, and to instead use a fixed buffer to download segment content and process one segment at a time from cloud. We track the new list of segment hashes, and update the proxy file only in the end.
Efficient processing of Rabin Fingerprinting is tricky and I wrote a small library to deal with the different cases of read/write/truncate.
To further save the cloud cost of uploading or downloading segments, we implement a local persistent write-back cache that caches segments. The cache is essentially a hidden folder stored on the SSD, where each segment in the cache is a file in the folder with its hash as the filename. All read/writes of a segment in cache gets a cache hit and avoids cloud operations. Reading a segment not in cache triggers a fetch from cloud into cache. Writing a segment not in cache inserts it into the cache. We keep a cache array of segment hashes in memory to facilitate operations.
One key design decision is the eviction policy to use when the cache is full (total segment size above 32KB). I chose LRU-K. Usually, a simple LRU policy works fine, but a main weakness of LRU is that it doesn’t take into account how many times a segment has been accessed in history. In general, at any point in a trace, a segment that was frequently accessed in the past will likely be frequently accessed in the future so we should favor keeping it in the cache. LRU-K examines the Kth last access time instead of just the last, so frequently accessed elements will likely have a recent Kth last access time and be kept in cache.
One optimization is to count close or consecutive accesses to the same segment as a single access, because of the highly correlated nature of segment accesses in CloudFS. Another optimization is to only upload segments that have refcount 1 to cloud when evicting from cache, to avoid re-uploading.
Snapshots allow us to time travel back to an earlier state of the file system. CloudFS supports several operations including taking a snapshot, installing and uninstalling a snapshot at a specific path (like mounting), restoring the whole file system to a snapshot, and deleting a snapshot.
To take a snapshot, the user of CloudFS calls the linux ioctl call with a specific argument. We first persist in-memory data structures such as the hashtable and cache array to disk. Then we traverse the whole file system using DFS and package all normal files and directory files into a tar file using libarchive.
One issue is libarchive cannot recover the extended attributes of read-only files when untarring a tar file. Therefore, we save the permissions of all files to its extended attributes and do chmod 777 before tarring them, and recover the correct permissions when untarring.
Another issue is that when attempting to recover to an older snapshot, some segments present in that snapshot might have been deleted from cloud. To prevent this, we need to increment the refcount of all segments present in a snapshot. We also need to decrement the refcount of these segments upon deleting or restoring the snapshot. Therefore, when traverse the file system to package the tar file, we create a separate file to track all segment hashes in the system, and upload this file to cloud too. We can download this file when deleting or restoring the snapshot to decrement the refcounts.
This project is challenging mainly because it is written in C. Thus we had to implement many low-level functionality on our own, and learn unintuitive libraries such as libarchive and uthash. I learned a lot of nitty-gritty details about the linux file system calls and various combinations of flags (open alone had flags such as O_APPEND/O_CREAT/O_EXCL...). I learned to write modular code when adding more and more functionality to the system such as caching and snapshotting.
There were also a lot of design decisions for us to make, many of which I discussed above. Another example of an interesting trade-off is the average segment size used for Rabin Fingerprinting.
A smaller segment size means more segments, more uploads to the cloud, as well as more metadata in the hashtable and in each proxy file to store and process, resulting in higher cloud cost. However, smaller segment size also means that there is higher probability to detect duplicate content, which might save cloud storage space. After experimentation, I chose 4KB as my average segment size, which is moderately between the 32KB cache size and 64KB large file threshold size.