1 Introduction
In this assignment, you will implement a simple storage application that uses deduplication. The deduplication is based on Rabin fingerprinting.
2 System Overview
You're going to implement a single Java program called MyDedup. MyDedup supports the following operations: upload, download, and delete.
2.1 Upload
The upload operation includes the following functions:
- Chunking. It reads in the pathname of a file and divides the input file into chunks using Rabin fingerprinting.
- Identifying unique chunks. Only unique data chunks will be uploaded to the cloud. We use a fingerprint index to record and identify unique chunks.
2.1.1 Procedures
To upload a file, MyDedup first loads a metadata file named mydedup.index, which stores the fingerprint index that keeps all fingerprints of currently stored files and identifies the unique chunks. It then uploads the unique chunks to the cloud and updates the fingerprint index with the fingerprints of the new unique chunks. Finally, the up-to-date fingerprint index is stored back in the index file for the next run, and the program quits.
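As a starting point, the sketch below shows one possible in-memory index layout persisted with Java object serialization; the format of mydedup.index is up to you (see the assumptions below), and all class and field names are illustrative, not mandated by this assignment.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

// Illustrative sketch only: one possible layout for the fingerprint index.
// The assignment leaves the on-disk format of mydedup.index to you.
class ChunkRef implements Serializable {
    String containerId;   // which container stores the chunk
    int offset;           // byte offset of the chunk inside the container
    int length;           // chunk length in bytes
    int refCount;         // number of files still referencing the chunk
}

class FingerprintIndex implements Serializable {
    // MD5 fingerprint (hex string) -> chunk location and reference count
    HashMap<String, ChunkRef> entries = new HashMap<>();

    // Start with an empty index if mydedup.index does not exist yet.
    static FingerprintIndex load(String path) throws IOException, ClassNotFoundException {
        File f = new File(path);
        if (!f.exists()) return new FingerprintIndex();
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return (FingerprintIndex) in.readObject();
        }
    }

    // Persist the up-to-date index before the program quits.
    void save(String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(this);
        }
    }
}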
We make the following assumptions.
- The file may be a binary file (e.g., VM image) of size up to 20 MiB.
- The mydedup.index file and all file recipes are uploaded to the cloud before the client quits. If mydedup.index does not exist, MyDedup starts with an empty index structure. The formats of the index file and file recipes are up to your choice.
- After a file is uploaded, the file will be immutable (no modification), but it may be deleted.
- We assume that there is no crash of the client or the cloud. We also assume that only one client is using the cloud at any time (so there is no synchronization issue).
- We identify files using their upload pathnames. Different uploaded files must have different pathnames.
- Files are always uploaded/downloaded in units of containers. All unique chunks to be uploaded are packed into a container of size up to 1 MiB. Specifically, the client maintains an in-memory buffer and adds each new unique chunk to the buffer. If adding a new unique chunk would cause the buffer to go beyond the container size limit, the client flushes all chunks in the buffer as a new container and uploads the container to the cloud. Note that after the client reaches the end of a file, it should always upload all chunks remaining in the buffer as a new container to the cloud (see the sketch after this list).
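The buffering rule in the last assumption can be sketched as follows, assuming containers are written to the local backend under data/ with sequential names; the class, method, and file names are illustrative, not part of the specification.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the in-memory container buffer (names are illustrative).
class ContainerBuffer {
    static final int CONTAINER_LIMIT = 1024 * 1024;  // 1 MiB cap per container
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private int nextContainerId = 0;

    // Add one unique chunk; flush first if the chunk would overflow the container.
    void addChunk(byte[] chunk) throws IOException {
        if (buf.size() + chunk.length > CONTAINER_LIMIT) flush();
        buf.write(chunk);
    }

    // Write out whatever is buffered as a new container under data/.
    void flush() throws IOException {
        if (buf.size() == 0) return;
        Path dir = Paths.get("data");
        Files.createDirectories(dir);
        Files.write(dir.resolve("container-" + nextContainerId++), buf.toByteArray());
        buf.reset();
    }
}

Note that the client must still call flush() when it reaches the end of a file, so that the final, possibly partially filled container is also stored.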
After each file upload, you should report the statistics. Note that the statistics are cumulative (i.e., they include all files that have been stored). Specifically, report the following details (excluding the metadata statistics):
- Total number of files that have been stored
- Total number of pre-deduplicated chunks in storage
- Total number of unique chunks in storage
- Total number of bytes of pre-deduplicated chunks in storage (denoted by \(s_1\))
- Total number of bytes of unique chunks in storage (denoted by \(s_2\))
- Total number of containers in storage
- Deduplication ratio: \(s_1 / s_2\) (rounded to two decimal places); see the example after this list.
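For example, if the cumulative pre-deduplicated size is \(s_1 = 20{,}971{,}520\) bytes and the cumulative unique size is \(s_2 = 8{,}388{,}608\) bytes, the reported deduplication ratio is \(20971520 / 8388608 = 2.50\).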
2.1.2 Chunking
We use Rabin fingerprinting for variable-size chunking; see the lecture notes for details. In particular, we divide chunks by checking whether an offset satisfies the anchor-point criterion. The criterion uses an anchor mask that contains multiple 1-bits: if the bitwise AND of the Rabin fingerprint at an offset and the anchor mask equals zero, that offset is an anchor point.
A data chunk is defined by the byte range starting from the first byte right after the previous anchor point (or from the beginning of the file) and ending at the current anchor point. When we reach the end of the file, we simply produce a chunk between the last anchor point and the end of the file, even though this final chunk can be very small.
MyDedup takes the following command-line input parameters: (i) min_chunk, the minimum chunk size (in bytes), (ii) avg_chunk, the average chunk size (in bytes), and (iii) max_chunk, the maximum chunk size (in bytes). Each chunk size parameter is required to be a power of 2; return an error otherwise. We assume that the multiplier \(d\) is equal to 257.
Chunks are identified by their MD5 fingerprints.
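For reference, below is a minimal sketch of one common formulation of Rabin-based chunking, assuming a fixed window of min_chunk bytes, the fingerprint taken modulo avg_chunk, and an anchor mask of avg_chunk - 1; the exact rolling-hash definition in the lecture notes takes precedence, and all class and method names here are illustrative.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Sketch of Rabin-based variable-size chunking. Assumptions (check against
// the lecture notes): window size = min_chunk, modulus = avg_chunk,
// anchor mask = avg_chunk - 1, multiplier d = 257.
class ChunkerSketch {
    static List<byte[]> chunk(byte[] data, int minChunk, int avgChunk, int maxChunk) {
        List<byte[]> chunks = new ArrayList<>();
        final long d = 257;
        final long mask = avgChunk - 1;            // assumed anchor mask
        long dPow = 1;                             // d^(minChunk - 1) mod avgChunk
        for (int i = 0; i < minChunk - 1; i++) dPow = (dPow * d) % avgChunk;

        int start = 0;
        while (start < data.length) {
            int end = Math.min(start + maxChunk, data.length);
            int cut = end;                         // max_chunk or end-of-file forces a cut
            long fp = 0;
            for (int i = start; i < end; i++) {
                int window = i - start + 1;
                if (window <= minChunk) {
                    // Build up the fingerprint of the first window byte by byte.
                    fp = (fp * d + (data[i] & 0xff)) % avgChunk;
                } else {
                    // Slide the window: drop the outgoing byte, add the incoming one.
                    long outgoing = (data[i - minChunk] & 0xff) * dPow % avgChunk;
                    fp = ((fp - outgoing + avgChunk) * d + (data[i] & 0xff)) % avgChunk;
                }
                // Anchor point: fingerprint AND anchor mask equals zero, and the
                // minimum chunk size has been reached.
                if (window >= minChunk && (fp & mask) == 0) {
                    cut = i + 1;
                    break;
                }
            }
            byte[] c = new byte[cut - start];
            System.arraycopy(data, start, c, 0, c.length);
            chunks.add(c);
            start = cut;
        }
        return chunks;
    }

    // Chunks are identified by their MD5 fingerprints (hex-encoded here).
    static String md5Hex(byte[] chunk) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(chunk);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }
}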
2.2 Download
Given the pathname, MyDedup retrieves chunks (in containers) and reconstructs the original file.
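A minimal sketch of the reconstruction step is shown below, assuming the file recipe is an ordered list of (container, offset, length) entries; the actual recipe format is up to you, and all names here are illustrative.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Sketch of download: replay the file recipe in order, copying each chunk's
// byte range out of its container. The recipe layout here is an assumption.
class Downloader {
    static class RecipeEntry {
        String containerId;
        int offset, length;
        RecipeEntry(String c, int o, int l) { containerId = c; offset = o; length = l; }
    }

    static void reconstruct(List<RecipeEntry> recipe, String outputPath) throws IOException {
        try (FileOutputStream out = new FileOutputStream(outputPath)) {
            for (RecipeEntry e : recipe) {
                // Containers live under data/ in the local storage backend.
                // (A real implementation would cache containers rather than
                // re-reading them for every chunk.)
                byte[] container = Files.readAllBytes(Paths.get("data", e.containerId));
                out.write(container, e.offset, e.length);
            }
        }
    }
}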
2.3 Delete
Given the pathname, MyDedup deletes the file from the cloud. If a deleted chunk is no longer shared by any other file, its entry in the index structure should be removed. However, its physical copy may remain in the original container. If all chunks in a container are deleted and not shared by any other file, the container should also be physically removed from the storage backend.
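The bookkeeping implied here can be sketched as follows, reusing the illustrative ChunkRef structure from the earlier index sketch and counting, per container, how many chunks are still referenced by the index; this is only one possible way to track when a container can be physically removed.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of delete bookkeeping (names reuse the earlier illustrative sketches).
class Deleter {
    // Decrement reference counts for every chunk of the deleted file; drop
    // index entries that are no longer shared, and remove a container once
    // none of its chunks are referenced by any file.
    static void deleteFile(List<String> fileFingerprints,
                           Map<String, ChunkRef> index) throws IOException {
        Map<String, Integer> liveChunksPerContainer = new HashMap<>();
        for (ChunkRef ref : index.values()) {
            liveChunksPerContainer.merge(ref.containerId, 1, Integer::sum);
        }
        for (String fp : fileFingerprints) {
            ChunkRef ref = index.get(fp);
            if (ref == null) continue;
            if (--ref.refCount == 0) {
                index.remove(fp);                      // entry removed from the index
                int left = liveChunksPerContainer.merge(ref.containerId, -1, Integer::sum);
                if (left == 0) {
                    // No file references any chunk in this container: reclaim it.
                    Files.deleteIfExists(Paths.get("data", ref.containerId));
                }
            }
        }
    }
}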
2.4 Local Storage Backend
Your program supports the local storage backend, which is used to mimic a cloud storage site. All data chunks will be stored under the directory data/.
2.5 Sample Input/Output Format
Upload:
java MyDedup upload <min_chunk> <avg_chunk> <max_chunk> <file_to_upload>
Report Output:
Total number of files that have been stored: 10
Total number of pre-deduplicated chunks in storage: ...
(The remaining lines of the report follow the statistics listed in Section 2.1.1.)