Question: 1 Introduction This assignment is to implement a simple storage application using deduplication. The deduplication is based on Rabin fingerprinting. 2 System Overview You're going
Introduction
This assignment is to implement a simple storage application using deduplication. The deduplication is based on Rabin fingerprinting.
System Overview
You're going to implement a single Java program called MyDedup. MyDedup supports the following operations: upload and download.
Upload
The upload operation includes the following functions:
Chunking. It reads in the pathname of a file and divides the input file into chunks using Rabin fingerprinting.
Identifying unique chunks. Only unique data chunks will be uploaded to the cloud. We use a fingerprint index to record and identify unique chunks.
Procedures
To upload a file, MyDedup first loads a metadata file named mydedup. index, which specifies the fingerprint index that keeps all fingerprints of currently stored files and identifies the unique chunks. It then uploads the unique chunks to the cloud. It also updates the fingerprint index with the fingerprints of the new unique chunks. Finally, the uptodate fingerprint index will be stored in the index file again for the next run, and the program quits.
We make the following assumptions.
The file may be a binary file eg VM image of size up to MiB
The mydedup.index file and all file recipes are uploaded to the cloud before the client quits. If mydedup.index does not exist, MyDedup starts with an empty index structure. The formats of the index file and file recipes are up to your choice.
After a file is uploaded, the file will be immutable no modification but it may be deleted.
We assume that there is no crash of the client or the cloud. We also assume that only one client is using the cloud at any time so there is no synchronization issue We identify files using their upload pathnames. Different uploaded files must have different pathnames.
Files are always uploadeddownloaded in units of containers. All unique chunks to be uploaded are packed into a container of size up to MiB Specifically, the client maintains an inmemory buffer. It adds each new unique chunk to the buffer. If adding a new unique chunk causes the buffer to go beyond container size limit the client flushes all chunks in the buffer as a new container and uploads the container to the cloud. Note that after the client reaches the end of a file, it should always upload all chunks in the buffer as a new container to the cloud.
After each file upload, you should report the statistics. Note that the statistics are cumulative ie including all files that have been stored Specifically, we report the following details excluding the metadata statistics:
Total number of files that have been stored
Total number of prededuplicated chunks in storage
Total number of unique chunks in storage
Total number of bytes of prededuplicated chunks in storage denoted by s
Total number of bytes of unique chunks in storage denoted by s
Total number of containers in storage
Deduplication ratio: s srounded to two decimal places
Chunking
We use Rabin fingerprinting for variablesize chunking; see lecture notes for details. In particular, we divide chunks by checking if an offset matches an anchorpoint criterion. An anchor point is determined by the anchor mask with multiple bits. If the bitwise AND operation of the Rabin fingerprint and the anchor mask is equal to zero, we have an anchor point.
A data chunk is defined by the byte range starting from the first byte right after the previous anchor point or the beginning of the file to the end of the current anchor point. While we reach the end of the file, we simply produce a chunk between the last anchor point and the end of the file, even though the length of the new chunk can be very small.
MyDedup takes the following parameters from the command line as input parameters: i minchunk, the minimum chunk size in bytesii avgchunk, the average chunk size in bytes and iii maxchunk, the maximum chunk size in bytes Each chunk size parameter is required to be a power of ; please return an error otherwise. We assume that the multiplier d is equal to
Chunks are identified based on MD
Download
Given the pathname, MyDedup retrieves chunks in containers and reconstructs the original file.
Delete
Given the pathname, MyDedup deletes the file from the cloud. If a deleted chunk is no longer shared by any other file, its entry in the index structure should be removed. However, its physical copy may remain in the original container. If all chunks in a container are deleted and not shared by any other file, the container should also be physically removed from the storage backend. Local Storage Backend
Your program supports the local storage backend, which is used to mimic a cloud storage site. All data chunks will be stored under the directory data
Sample InputOutput Format
Upload:
java MyDedup upload
Report Output:
Total number of files that have been stored:
Total number of prededuplicated ch
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
