
2. Your Assignment

This programming assignment covers sorting with Hadoop and Spark on multiple nodes. You must use a Chameleon node with Bare Metal Provisioning (https://www.chameleoncloud.org). You must deploy Ubuntu Linux 22.04 on "compute-haswell" nodes at the IIT sites. Once you create a lease (up to 7 days are allowed), start your physical node, and Linux boots, you will find yourself with a physical node with 24 CPU cores, 48 hardware threads, 128GB of memory, and a 250GB SSD. You will install your favorite virtualization tools (e.g. VirtualBox, LXD/KVM, QEMU) and use them to deploy VMs of the following sizes: tiny.instance (4 cores, 8GB RAM, 20GB disk), small.instance (4 cores, 8GB RAM, 45GB disk), and large.instance (16 cores, 32GB RAM, 180GB disk). One possible way to create these VMs is sketched after this section. This assignment is broken down into several parts, as outlined below:

Hadoop File System and Hadoop Install: Download, install, configure, and start HDFS (part of Hadoop, https://hadoop.apache.org) on a virtual cluster with 1 large.instance + 1 tiny.instance, and then again on a virtual cluster with 4 small.instances + 1 tiny.instance. You must set replication to 2 (instead of the default 3), or you won't have enough storage capacity to conduct your experiments on the 24GB dataset (see the replication sketch below).

Datasets: Once HDFS is operational, you must generate your datasets with gensort (http://www.ordinal.com/gensort.html); you will create 4 workloads: data-3GB, data-6GB, data-12GB, and data-24GB (see the gensort sketch below). You may not have enough room to store them all and run your compute workloads, so make sure to clean up after each run. Remember that you will typically need 6X the storage, as you have the original input data (2x)
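One way to create the VM shapes above is with LXD virtual machines (one of the tools the assignment lists). The following is a minimal sketch only, assuming LXD is already installed and initialized on the bare-metal node; the instance names (hdfs-master, hdfs-worker1) and the ubuntu:22.04 image alias are my own choices, not part of the assignment.

    # Sketch: a large.instance (16 cores, 32GB RAM, 180GB disk) as an LXD VM.
    # Instance names and image alias are assumptions; adjust to your setup.
    lxc init ubuntu:22.04 hdfs-master --vm \
        -c limits.cpu=16 -c limits.memory=32GiB
    lxc config device override hdfs-master root size=180GiB
    lxc start hdfs-master

    # Same pattern for a tiny.instance (4 cores, 8GB RAM, 20GB disk).
    lxc init ubuntu:22.04 hdfs-worker1 --vm \
        -c limits.cpu=4 -c limits.memory=8GiB
    lxc config device override hdfs-worker1 root size=20GiB
    lxc start hdfs-worker1

Creating the instance with lxc init (rather than lxc launch) keeps it stopped, so the root-disk size can be overridden before the first boot.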
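For the replication requirement, HDFS reads the default block replication from the dfs.replication property in etc/hadoop/hdfs-site.xml. Below is a minimal sketch; the HADOOP_HOME path is an assumption, and the rest of the cluster configuration (core-site.xml, workers file, SSH setup) is omitted.

    # Sketch: set default block replication to 2, then bring HDFS up.
    # The install path is an assumption; point it at your actual Hadoop directory.
    HADOOP_HOME=~/hadoop
    cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>
    EOF

    # Format the namenode once, then start the HDFS daemons.
    $HADOOP_HOME/bin/hdfs namenode -format
    $HADOOP_HOME/sbin/start-dfs.sh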
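To build the workloads, note that gensort writes fixed 100-byte records, so the record count determines the file size. The sketch below assumes the gensort binary is on the PATH, treats 1 GB as 10^9 bytes (so data-3GB is 30,000,000 records), and uses an HDFS directory /inputs of my own choosing.

    # Sketch: generate the four workloads (100-byte records, GB taken as 10^9 bytes).
    # Record counts and HDFS paths are assumptions, not prescribed by the assignment.
    gensort -a  30000000 data-3GB     # ~3 GB
    gensort -a  60000000 data-6GB     # ~6 GB
    gensort -a 120000000 data-12GB    # ~12 GB
    gensort -a 240000000 data-24GB    # ~24 GB

    # Copy one input into HDFS (replication 2 applies as configured above),
    # then remove the local copy to conserve the limited disk space.
    hdfs dfs -mkdir -p /inputs
    hdfs dfs -put data-3GB /inputs/
    rm data-3GB

Generating and loading one workload at a time, and cleaning up between runs, helps stay within the VM disk sizes noted above.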
