
Lab: Create a topical crawler in R
Due: Submit your R code on Moodle along with a screenshot of your results.

Requirements:

- Follow the basic crawling algorithm provided in the slides
- Crawl 50 pages for your repository
- Only store websites that contain at least one term in the body text from a list of keywords chosen by your group
- Store the following information in a character vector:

Error checking requirements:
- If the link in the frontier is a link to a jpg, go to the next item in the frontier
- If the retrieved page is less than 10 characters, go to the next item in the frontier
- Check for relative/absolute paths when adding to the frontier (see the sketch after this list)
- You may come across other implementation challenges during testing
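One possible way to handle the relative/absolute path check is sketched below. It reuses the "domain" and "anchor" variables that appear later in the sample code (the sample snippets themselves simply skip links that do not start with "http"); treat this as an illustration, not the required implementation.

link <- str_trim(anchor[[i]][1])
if(!str_detect(link,"^http")){
  #relative path: prepend the page's domain to make it absolute
  link <- paste0(domain,"/",str_replace(link,"^/",""))
}
frontier <- append(frontier,link)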

Hints: Packages that will be useful: RCurl, XML, stringr, httr

getURL call: doc <- tryCatch(getURL(exploredlink),error=function(cond){return("")})

get the title:
titleText <- xmlToDataFrame(nodes = getNodeSet(doc, "//title"))
titleText <- as.vector(titleText$text)
titleText <- unique(titleText)

Retrieves the body text from a page:
bodyText <- tryCatch(htmlToText(content(GET(exploredlink),type="text/html",as="text")),error=function(cond){return("")})

Parses words into a vector:
bodyText <- str_split(tolower(str_replace_all(str_replace_all(bodyText,"(\\t|\\r|\\n)"," "),"\\s{2,}"," "))," ")[[1]]

Parsing links from a page:
anchor <- getNodeSet(doc, "//a")
anchor <- sapply(anchor, function(x) xmlGetAttr(x, "href"))

The any() function checks whether a logical vector contains at least one TRUE value

x %in% y returns a logical vector indicating, for each element of x, whether it appears in y
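Taken together, these two hints give the keyword test required above; a minimal sketch, assuming bodyText and topicwords are defined as in the sample code:

#TRUE if at least one topic word appears among the page's words
if(any(topicwords %in% bodyText)){
  #store this page in the repository
}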

SAMPLE CODE FOR USE:

#Write a topical crawler using the information provided below:

##Start your code with these libraries:
library(RCurl)
library(XML)
library(stringr)
library(httr)

htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)

  ###--- LOCAL FUNCTIONS ---###
  # Determine how to grab html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if(file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }
    # if input is html text (the tag in this check was lost in the handout; "</html>" is a reasonable choice)
    if(grepl("</html>", input, fixed = TRUE)) return(input)
    # if input is a URL, probably should use a regex here instead?
    if(!grepl(" ", input)) {
      # download SSL certificate in case of https problem
      if(!file.exists("cacert.perm")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.perm")
      return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
    }
    # return NULL if none of the conditions above apply
    return(NULL)
  }

  # convert HTML to plain text
  convert_html_to_text <- function(html) {
    doc <- htmlParse(html, asText = TRUE)
    text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    return(text)
  }

  # format text vector into one character string
  collapse_text <- function(txt) {
    return(paste(txt, collapse = " "))
  }

  ###--- MAIN ---###
  # STEP 1: Evaluate input
  html.list <- lapply(input, evaluate_input)
  # STEP 2: Extract text from HTML
  text.list <- lapply(html.list, convert_html_to_text)
  # STEP 3: Return text
  text.vector <- sapply(text.list, collapse_text)
  return(text.vector)
}

###Run the function code for htmlToText() (Be sure this function is listed in your Environment)
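A quick sanity check that the function loaded correctly might look like this (assuming network access; the URL is just an example):

testText <- htmlToText("http://www.r-project.org")
substr(testText, 1, 200) #first 200 characters of the page's visible text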

###Load the first element in the frontier to an "exploredlink" variable

frontier <- c("http://www.cnn.com","http://www.kdnuggets.com","http://news.google.com")

topicwords<-c("technology","school","web","mining","news")

num <- 50 #total number of items to crawl
result <- c()
j <- 0 #number of items in the repository

while (j < num){

  if(length(frontier)<1){ break }

  #grab the first item in the frontier and place in the "exploredlink" variable
  exploredlink <- frontier[1]
  frontier <- frontier[-1]

  if(str_detect(exploredlink,"\\.jpg$")){ next }

  #fill in your code here
}

############ USEFUL CODE SNIPPETS ########

#How to get HTML
doc <- tryCatch(getURL(exploredlink),error=function(cond){return("")})

if(str_length(doc)<10){ next }

doc <- htmlParse(doc)

domain<-str_extract(exploredlink,pattern = ".*\\.com")

if(is.na(domain)){ next }
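Note that the ".*\.com" pattern above only matches .com sites, so pages from any other domain are skipped by the is.na() check. If your group wants to crawl other top-level domains as well, one possible alternative (an assumption, not part of the original handout) is:

#extract scheme + host for any domain, e.g. "http://news.google.com"
domain <- str_extract(exploredlink, "^https?://[^/]+")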

###

#How to get a title
titleText <- tryCatch(xmlToDataFrame(nodes = getNodeSet(doc, "//title")),error=function(cond){return("")})
if(titleText==""){ next }
titleText <- as.vector(titleText$text)
titleText <- unique(titleText)

###

#How to get body text
bodyText <- tryCatch(htmlToText(content(GET(exploredlink),type="text/html",as="text")),error=function(cond){return("")})

bodyText <- str_split(tolower(str_replace_all(str_replace_all(bodyText,"(\\t|\\r|\\n)"," "),"\\s{2,}"," "))," ")[[1]]

###

#How to get links from a page
anchor <- getNodeSet(doc, "//a")
anchor <- sapply(anchor, function(x) xmlGetAttr(x, "href"))

if(length(anchor)>0){
  temp <- c()
  for(i in 1:length(anchor)){
    if(is.null(anchor[[i]])){ next }
    if(!str_detect(anchor[[i]][1],"^http")){ next }
    if(str_detect(anchor[[i]][1],domain)){ next }
    temp <- append(temp,str_trim(anchor[[i]][1]))
  }
  anchor <- temp
  rm(temp)
  frontier <- append(frontier,anchor)
  frontier <- unique(frontier)
}

###
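One possible way to fill in the loop body using the snippets above is sketched here; exactly which fields you store in "result" is up to your group (the URL/title pairing below is only an illustration):

#keep the page only if its body text mentions at least one topic word
if(any(topicwords %in% bodyText)){
  result <- append(result, paste(exploredlink, titleText[1], sep=" | "))
  j <- j + 1 #one more item in the repository
}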
