As we saw in lecture, hashing is very important for managing large data systems. For example,...
Fantastic news! We've Found the answer you've been seeking!
Question:
Transcribed Image Text:
As we saw in lecture, hashing is very important for managing large data systems. For example, it is used to map from data/search keys to the location where that data is stored (memory address, DB block, machine/disk). Another common application is evenly dividing a set of inputs into buckets to process them in parallel, for example when counting large numbers of tuples. In this problem, we'll get a little more intuition about hash functions and collision resolution by looking at some examples. Question 1.1 - Some Intuition about Hash Collisions The details of how different hash functions work, including how they are implemented and their statistical properties, are mostly beyond the scope of this class. However, a feature common to all hash functions is collisions. When dealing with hashing, it's helpful to have some intuition about the frequency of collisions. Recall that a collision occurs whenever for two distinct elements a and b and a hash function h(x), h(a)=h(b). Assume you have a hash function (the actual process by which it operates is unknown) with the following properties: • The outputs (hash values) are of size 8 bits (there are 256 possible hash values/buckets). The hash function distributes its outputs evenly across the entire output range (this property is very desirable in hash functions and is called uniformity). What is the probability of having at least one collision if we are hashing 5 inputs? Give your answer as a numerical value (not an equation) but show some work/reasoning. A simple numerical answer with no reasoning will not NOTE: One of the interesting (and somewhat counter-intuitive) facts about hash functions is that the probability of collision rises much faster than intuition might suggest. Even when hashing relatively small numbers of inputs (w.r.t the output range), the collision probability rapidly approaches 1. See https://en.wikipedia.org/wiki/Birthday_problem for more interesting discussion on this problem. Question 1.2 -Thinking about Collision Resolution In most applications of hashing, how collisions are handled is an important part of the implementation. One example of this is in the implementation of hash partition join, which we discussed in lecture. Consider the case where we wish to join two relations R and S on a shared attribute A. The first step in the process, partitioning, is done using a hash function. We hash tuples from both relations using their attribute A in order to divide them up evenly into a finite number of buckets B. In this process of partitioning, hash collisions may occur. Explain why having many hash collisions in our buckets would harm the efficiency of the hash partition join algorithm. Considering this, a method is needed to address collisions. As described in lecture, this can be done with double hashing. If the initial partition resulted in collisions, these can be resolved by doing a second pass and re-hashing each collided element with a new hash function (on attribute A). The result of this second hash is used as an offset to determine how many bins to move the tuple over. For example, if a tuple is hashed initially to bin 3 and there is a collision, if h2(tuple) = 1 then the tuple will be moved to bin 3 + 1 = 4 (if this again results in a collision, the tuple can be shifted again by h2(tuple) until the collision is resolved). In practice, a common choice for this second hash function is Hash2(key) = K-(key mod K), where K is a prime smaller than the number of buckets B. You are given B = 5. You are also given the following relation R. A C E 93 Red 28 Purple 70 Orange 11 545 Blue 6 88 Brown The initial hash function is h₁(A) = A mod B. The second hash function is h₂(A) = 3 - (A mod 3). 5 7 3 As part of the first step of a hash partition join, partition R using the given hash function. Assume that the tuples are hashed into buckets in the order they appear in the table (from top to bottom). Resolve any collisions with the provided second hash function. You can indicate your answer in the form of a table or by listing the tuple(s) in each bucket (B0, B1, ...). Bonus Note: There are many other collision resolution strategies beyond those mentioned here, each with different advantages and disadvantages which can be analyzed probabilistically. Question 1.3 - Computing Counts of Pairs As seen above, hash functions are useful whenever we need to (relatively) evenly distribute data. Another such application is dividing across multiple machines to process in parallel. Imagine you have created a music app. Users of your app can login, start a session, and then play songs. The database which forms the backend to your app contains a table with tuples (user_id, session id, song id) which represent every time a user has played a song. You wish to see which pairs of songs are most frequently listened to together in the same session. You are given the following information: • There are 10 million users • There are 1 thousand songs • There are 4 billion sessions • The user IDs are 3 bytes • The song IDs are 2 bytes • The session IDs are 4 bytes • The avg. number of unique songs played per session is 5 Assume all your data is stored on hard disks and use the values for disk seek/scan times from the top of the assignment. Assume that data is stored sequentially on disk. Calculate the total time required to compute the counts for each distinct pair. Do not count pairs with the same song (i.e. Song A - Song A) and assume that when counting we do not count different orderings of the same songs as distinct pairs (i.e. if Song A and Song B are played in the same session, we only count the pair Song A - Song B and do not count the pair Song B - Song A). Assume that it takes 1 hr. to sort all pairs using the external merge sort algorithm. Please show some work/reasoning. A simple numerical answer with no reasoning will not count for full points. Question 1.4 - Parallelizing Counting with Hashing Now imagine you have a hash function which maps from any 64-bit input uniformly to a 32-bit hash value. Explain how you could use it to parallelize the counting across n machines. What would the total required time to compute the counts for each pair be once you've used your hash function to divide the pair tuples across 6 separate disks (which can write in parallel)? Please show some work/reasoning. A simple numerical answer with no reasoning will not count. Question 1.5 - Thinking a Little More About Hashing Consider the application of hashing where we want to parallelize a task across n machines (where n is reasonably small, say 100) versus the task of using hashing as a way to map keys to a location (e.g. generating IDS). For each of these two applications, is a low collision rate important? For each of these two applications, is uniformity important? Why? As we saw in lecture, hashing is very important for managing large data systems. For example, it is used to map from data/search keys to the location where that data is stored (memory address, DB block, machine/disk). Another common application is evenly dividing a set of inputs into buckets to process them in parallel, for example when counting large numbers of tuples. In this problem, we'll get a little more intuition about hash functions and collision resolution by looking at some examples. Question 1.1 - Some Intuition about Hash Collisions The details of how different hash functions work, including how they are implemented and their statistical properties, are mostly beyond the scope of this class. However, a feature common to all hash functions is collisions. When dealing with hashing, it's helpful to have some intuition about the frequency of collisions. Recall that a collision occurs whenever for two distinct elements a and b and a hash function h(x), h(a)=h(b). Assume you have a hash function (the actual process by which it operates is unknown) with the following properties: • The outputs (hash values) are of size 8 bits (there are 256 possible hash values/buckets). The hash function distributes its outputs evenly across the entire output range (this property is very desirable in hash functions and is called uniformity). What is the probability of having at least one collision if we are hashing 5 inputs? Give your answer as a numerical value (not an equation) but show some work/reasoning. A simple numerical answer with no reasoning will not NOTE: One of the interesting (and somewhat counter-intuitive) facts about hash functions is that the probability of collision rises much faster than intuition might suggest. Even when hashing relatively small numbers of inputs (w.r.t the output range), the collision probability rapidly approaches 1. See https://en.wikipedia.org/wiki/Birthday_problem for more interesting discussion on this problem. Question 1.2 -Thinking about Collision Resolution In most applications of hashing, how collisions are handled is an important part of the implementation. One example of this is in the implementation of hash partition join, which we discussed in lecture. Consider the case where we wish to join two relations R and S on a shared attribute A. The first step in the process, partitioning, is done using a hash function. We hash tuples from both relations using their attribute A in order to divide them up evenly into a finite number of buckets B. In this process of partitioning, hash collisions may occur. Explain why having many hash collisions in our buckets would harm the efficiency of the hash partition join algorithm. Considering this, a method is needed to address collisions. As described in lecture, this can be done with double hashing. If the initial partition resulted in collisions, these can be resolved by doing a second pass and re-hashing each collided element with a new hash function (on attribute A). The result of this second hash is used as an offset to determine how many bins to move the tuple over. For example, if a tuple is hashed initially to bin 3 and there is a collision, if h2(tuple) = 1 then the tuple will be moved to bin 3 + 1 = 4 (if this again results in a collision, the tuple can be shifted again by h2(tuple) until the collision is resolved). In practice, a common choice for this second hash function is Hash2(key) = K-(key mod K), where K is a prime smaller than the number of buckets B. You are given B = 5. You are also given the following relation R. A C E 93 Red 28 Purple 70 Orange 11 545 Blue 6 88 Brown The initial hash function is h₁(A) = A mod B. The second hash function is h₂(A) = 3 - (A mod 3). 5 7 3 As part of the first step of a hash partition join, partition R using the given hash function. Assume that the tuples are hashed into buckets in the order they appear in the table (from top to bottom). Resolve any collisions with the provided second hash function. You can indicate your answer in the form of a table or by listing the tuple(s) in each bucket (B0, B1, ...). Bonus Note: There are many other collision resolution strategies beyond those mentioned here, each with different advantages and disadvantages which can be analyzed probabilistically. Question 1.3 - Computing Counts of Pairs As seen above, hash functions are useful whenever we need to (relatively) evenly distribute data. Another such application is dividing across multiple machines to process in parallel. Imagine you have created a music app. Users of your app can login, start a session, and then play songs. The database which forms the backend to your app contains a table with tuples (user_id, session id, song id) which represent every time a user has played a song. You wish to see which pairs of songs are most frequently listened to together in the same session. You are given the following information: • There are 10 million users • There are 1 thousand songs • There are 4 billion sessions • The user IDs are 3 bytes • The song IDs are 2 bytes • The session IDs are 4 bytes • The avg. number of unique songs played per session is 5 Assume all your data is stored on hard disks and use the values for disk seek/scan times from the top of the assignment. Assume that data is stored sequentially on disk. Calculate the total time required to compute the counts for each distinct pair. Do not count pairs with the same song (i.e. Song A - Song A) and assume that when counting we do not count different orderings of the same songs as distinct pairs (i.e. if Song A and Song B are played in the same session, we only count the pair Song A - Song B and do not count the pair Song B - Song A). Assume that it takes 1 hr. to sort all pairs using the external merge sort algorithm. Please show some work/reasoning. A simple numerical answer with no reasoning will not count for full points. Question 1.4 - Parallelizing Counting with Hashing Now imagine you have a hash function which maps from any 64-bit input uniformly to a 32-bit hash value. Explain how you could use it to parallelize the counting across n machines. What would the total required time to compute the counts for each pair be once you've used your hash function to divide the pair tuples across 6 separate disks (which can write in parallel)? Please show some work/reasoning. A simple numerical answer with no reasoning will not count. Question 1.5 - Thinking a Little More About Hashing Consider the application of hashing where we want to parallelize a task across n machines (where n is reasonably small, say 100) versus the task of using hashing as a way to map keys to a location (e.g. generating IDS). For each of these two applications, is a low collision rate important? For each of these two applications, is uniformity important? Why?
Expert Answer:
Answer rating: 100% (QA)
To address your questions well need to tackle them one by one so lets start with the first question ... View the full answer
Related Book For
Financial Reporting Financial Statement Analysis and Valuation a strategic perspective
ISBN: 978-1337614689
9th edition
Authors: James M. Wahlen, Stephen P. Baginski, Mark Bradshaw
Posted Date:
Students also viewed these databases questions
-
Describe the C's of credit- Working Capital, Current Asset Management, and Current Liability Management.
-
Planning is one of the most important management functions in any business. A front office managers first step in planning should involve determine the departments goals. Planning also includes...
-
Case Study: Quick Fix Dental Practice Technology requirements Application must be built using Visual Studio 2019 or Visual Studio 2017, professional or enterprise. The community edition is not...
-
In an organization, leaders treat employees as ends when they ______. Multiple choice question. restrict employees' choices allow employees to create their own purposes assume that the employees'...
-
Calculate the torque produced by a 50-N perpendicular force at the end of a 0.2-m-long wrench.
-
Explain the concept of resource leveling. Why is it an important consideration? What are the key steps involved in this process?
-
Risk associated with investments can be categorized as diversifi able risk and undiversifi able risk. This is one reason mutual funds are so popularmany investors money is pooled and numerous stocks...
-
The following (1 through 18) are the balance-related, transaction-related, and presentation and disclosure related audit objectives. Balance-Related Audit Objectives 1. Existence 2. Completeness 3....
-
Please help me solve parts a b and c of this question. Question #1 Your firm is considering a new risky capital expansion program. Based on the information in Exhibit 1, answer the following...
-
(a) Prepare the following consolidated financial statements for Year 6: (i) Income statement (ii) Statement of financial position (b) Calculate goodwill impairment loss and profit attributable to...
-
i+1? (25) What should be the value of ik if the Thevenin equivalent of the 2-terminal a-b in Figure 4 is v = (n =4, R1 = }N, R2 = 12, i, = 3i2) R, iz is i a n:1 + is V3 V. R Figure 4
-
Find the derivative of f(x). f(x) = X 3 2 Write your answer as a constant times a power of x. f'(x) =
-
Over the past decade, technology has evolved so rapidly that it's hard to remember what life was like before the Internet. We have been introduced to smartphones, tablets, wearable technology since...
-
Q 4 A long - lived tangible asset is impaired when a company is not able to recover the asset s carrying amount either through using it or by selling it . The management to identify whether the asset...
-
Discuss how management control can be done throughout the responsibility centres ( RC ) . Use examples.
-
The value of y in the following system of linear equations is: 3.2x+3.4y= 6.6 3.5x+1.4y= 6.9
-
Treasury Stock Transactions Sun Dance Gardens Inc. develops and produces spraying equipment for lawn maintenance and industrial uses. On June 3 of the current year, Sun Dance Gardens Inc. reacquired...
-
In the circuit shown in Figure 4, a battery supplies a constant voltage of 40 V, the inductance is 2 H, the resistance is 10, and l(0) = 0. (a) Find l(t). (b) Find the current after 0.1s.
-
In August 2009, The Walt Disney Company announced that it would acquire Marvel Entertainment, Inc., in a $4 billion cash and common stock deal. On a per-share basis, the consideration given by Disney...
-
Prior to Year 8, Cooper Corporation engaged in a wide variety of industries, including weapons manufacturing under government contracts, information technologies, commercial aircraft manufacturing,...
-
When a firm incurs costs on an item to be used in operations, management must decide whether to treat the cost as an asset or an expense. Assume that a company used cash to acquire machinery expected...
-
A superball is dropped from a 2.00 m height. After it rebounds, it reaches a new height of 1.65 m. Assuming a constant coefficient of restitution, find the (ideal) total distance the ball will travel...
-
Here are some telescoping series problems: a. Verify that \[\sum_{n=1}^{\infty} \frac{1}{(n+2)(n+1)}=\sum_{n=1}^{\infty}\left(\frac{n+1}{n+2}-\frac{n}{n+1} ight)\] b. Find the \(n\)th partial sum of...
-
Determine the radius and interval of convergence of the following infinite series: a. \(\sum_{n=1}^{\infty}(-1)^{n} \frac{(x-1)^{n}}{n}\). b. \(\sum_{n=1}^{\infty} \frac{x^{n}}{2^{n} n!}\). c....
Study smarter with the SolutionInn App