Question: Data clean-up using R code

Data clean-up using R code: We have a .csv file with only one column/field, which contains job descriptions. Many of the descriptions are nearly identical (differing only slightly) or are misspelled. Our goal is to compare those nearly similar job descriptions (with a fuzzy match/pattern match) and change them to a single standard description. Note that we do not want to reduce the number of rows, so duplicate entries are expected (that's fine). If we have 5000 rows now, we will still have 5000 rows after the data transformation/clean-up.

Here are some examples, cluster by cluster:

Cluster 1:
  manager (say occurs 100 times)
  management (occurs 10 times)
  manger - in - training (occurs 5 times)
  manager - (occurs 3 times)
  manager) (occurs 4 times)
  management (occurs 3 times)

1. Create a look-up table containing the patterns we will try to match against the actual csv file entries, e.g. for Cluster 1 the pattern can be 'manage'.
2. Use fuzzy match logic, a regular expression (regex), or any other pattern matching algorithm to match against the look-up table and convert each entry to a meaningful value, e.g. all Cluster 1 entries (125 rows in total) should be converted to 'manager' because they match the pattern 'manage'.
3. Make the R code scalable so that new patterns can be added to the table for comparison with the data.

Some other cluster examples for unit testing the R code:

Cluster 2 (match with pattern "engine"):
  engineer
  engineering
  engineer
  engineering
  sr. engineer
  engineer/sales

Cluster 3 (match with pattern 'intern'):
  intern
  internship
  intern.
  intern/co-op
  co-op/interm
  intern/assistant
  worker/intern
  (intern)
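One possible way to approach the three requirements is sketched below in base R. It is only an illustration under stated assumptions, not a verified solution: the names `lookup`, `standardise_jobs`, `jobs`, and `cleaned_jobs` are hypothetical, the file name in the comment is a placeholder, and the edit-distance threshold of 1 is a guess that would need tuning on the real data. The sketch tries an exact substring match first and falls back to base R's approximate matcher `agrepl()` to catch misspellings such as "manger" or "interm".

```r
# Look-up table: one row per cluster. Supporting a new cluster only
# requires adding a row here (requirement 3). Patterns and target values
# are taken from the question; the object names are hypothetical.
lookup <- data.frame(
  pattern   = c("manage", "engine", "intern"),
  canonical = c("manager", "engineer", "intern"),
  stringsAsFactors = FALSE
)

# Map each description to the canonical value of the first look-up row
# whose pattern it contains, exactly or within `max_distance` edits.
# Entries that match no pattern are returned unchanged, so the number
# of rows never changes.
standardise_jobs <- function(x, lookup, max_distance = 1) {
  cleaned <- tolower(trimws(x))
  out     <- x
  matched <- rep(FALSE, length(x))
  for (i in seq_len(nrow(lookup))) {
    exact <- grepl(lookup$pattern[i], cleaned, fixed = TRUE)
    fuzzy <- agrepl(lookup$pattern[i], cleaned,
                    max.distance = max_distance, fixed = TRUE)
    hit <- (exact | fuzzy) & !matched   # earlier rows take precedence
    out[hit] <- lookup$canonical[i]
    matched  <- matched | hit
  }
  out
}

# In practice the vector would come from the csv file, for example:
# jobs <- read.csv("jobs.csv", stringsAsFactors = FALSE)[[1]]  # placeholder name
jobs <- c("manager", "management", "manger - in - training", "manager -",
          "manager)", "sr. engineer", "engineer/sales", "internship",
          "co-op/interm", "(intern)", "accountant")

cleaned_jobs <- standardise_jobs(jobs, lookup)
data.frame(original = jobs, cleaned = cleaned_jobs)
```

Because approximate matching looks for the pattern anywhere inside the string, a threshold of one edit can also produce false positives (for example, a description containing "winter" is within one edit of "intern"). Setting `max_distance = 0` reduces the sketch to plain substring matching, and reviewing a frequency table of original versus cleaned values is a reasonable check before overwriting the column.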
