Lab: Create a topical crawler in R Due: Submit your R code on Moodle along with a screenshot of your results.
Requirements:
- Follow the basic crawling algorithm provided in the slides
- Crawl 50 pages for your repository
- Only store websites whose body text contains at least one term from a list of keywords chosen by your group
- Store the following information in a character vector:
Error checking requirements:
- If the link in the frontier is a link to a jpg, go to the next item in the frontier
- If the retrieved page is less than 10 characters long, go to the next item in the frontier
- Check for relative/absolute paths when adding to the frontier
- You may come across other implementation challenges during testing
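The requirements above can be sketched as a frontier loop. This is a minimal skeleton, not a finished solution: the seed URL, the keyword vector, and the storage format are placeholders for your group's own choices, and the fetch call follows the getURL hint below.

```r
# Skip rules from the error-checking requirements, split into helpers
is_image_link <- function(url) grepl("\\.jpe?g$", tolower(url))  # jpg links
too_short     <- function(page) nchar(page) < 10                 # near-empty pages

# Sketch of the basic crawl loop (fill in the commented steps yourself)
crawl <- function(seed, keywords, limit = 50) {
  frontier   <- seed
  visited    <- character(0)
  repository <- character(0)   # character vector of stored page info
  while (length(frontier) > 0 && length(repository) < limit) {
    exploredlink <- frontier[1]
    frontier     <- frontier[-1]
    if (exploredlink %in% visited || is_image_link(exploredlink)) next
    visited <- c(visited, exploredlink)
    doc <- tryCatch(RCurl::getURL(exploredlink), error = function(cond) "")
    if (too_short(doc)) next
    # ... extract the body text, keep the page only if any keyword matches,
    # ... parse its links, resolve relative paths, and extend the frontier
  }
  repository
}
```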
Hints: Packages that will be useful: RCurl, XML, stringr, httr
getURL call:
  doc <- tryCatch(getURL(exploredlink), error = function(cond) { return("") })
Get the title (note: getNodeSet needs a parsed document, so parse the raw HTML first):
  doc <- htmlParse(doc)
  titleText <- xmlToDataFrame(nodes = getNodeSet(doc, "//title"))
  titleText <- as.vector(titleText$text)
  titleText <- unique(titleText)
Retrieves the body text from a page:
  bodyText <- tryCatch(htmlToText(content(GET(exploredlink), type = "text/html", as = "text")), error = function(cond) { return("") })
Parses words into a vector (the inner pattern replaces tabs, carriage returns, and newlines with spaces; the outer one collapses runs of whitespace):
  bodyText <- str_split(tolower(str_replace_all(str_replace_all(bodyText, "(\\t|\\r|\\n)", " "), "\\s{2,}", " ")), " ")[[1]]
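For illustration, the same cleanup can be done with base R alone (gsub/strsplit instead of stringr), assuming `raw` holds a page's body text:

```r
# Lowercase, collapse all whitespace runs to single spaces, split into words
raw   <- "Create a\tTopical   Crawler\nin R"
words <- strsplit(tolower(gsub("\\s+", " ", raw)), " ")[[1]]
words  # "create" "a" "topical" "crawler" "in" "r"
```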
Parsing links from a page:
  anchor <- getNodeSet(doc, "//a")
  anchor <- sapply(anchor, function(x) xmlGetAttr(x, "href"))
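To satisfy the relative/absolute path requirement, a small helper can normalize each href before it goes into the frontier. `resolve_link` and its root-relative-only handling are an assumption for illustration, not part of the assignment; extend it as your testing demands:

```r
# Normalize an href pulled from an <a> tag against the site's base URL.
# Returns NA for links this sketch does not handle (mailto:, fragments, ...).
resolve_link <- function(href, base) {
  if (is.null(href) || is.na(href)) return(NA_character_)
  if (grepl("^https?://", href)) return(href)        # already absolute
  if (grepl("^/", href)) return(paste0(base, href))  # root-relative path
  NA_character_                                      # everything else: skip
}
resolve_link("/about", "https://example.com")  # "https://example.com/about"
```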
The any() function returns TRUE if a logical vector contains any TRUE values
x %in% y checks each element of x for membership in y, returning a logical vector
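Combining the two hints above: assuming `keywords` is your group's term list and `bodyText` is the word vector from the parsing hint, one expression decides whether to store the page.

```r
keywords <- c("crawler", "frontier")             # example terms, not yours
bodyText <- c("a", "topical", "crawler", "in", "r")
any(keywords %in% bodyText)  # TRUE: "crawler" appears in the body text
```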
SAMPLE CODE FOR USE:
#Write a topical crawler using the information provided below:
##Start your code with these libraries:
library(RCurl)
library(XML)
library(stringr)
library(httr)
htmlToText <- function(input, ...) {
  ###--- PACKAGES ---###
  require(RCurl)
  require(XML)

  ###--- LOCAL FUNCTIONS ---###
  # Determine how to grab html for a single input element
  evaluate_input <- function(input) {
    # if input is a .html file
    if (file.exists(input)) {
      char.vec <- readLines(input, warn = FALSE)
      return(paste(char.vec, collapse = ""))
    }
    # if input is html text
    if (grepl("</html>", input, fixed = TRUE)) {
      return(input)
    }
    # otherwise assume input is a url and try to download it
    tryCatch(getURL(input, followlocation = TRUE, ...),
             error = function(cond) { return("") })
  }

  # Convert raw html to plain text, dropping script and style content
  convert_html_to_text <- function(html) {
    doc <- htmlTreeParse(html, useInternalNodes = TRUE)
    text <- xpathSApply(doc,
                        "//text()[not(ancestor::script)][not(ancestor::style)]",
                        xmlValue)
    paste(text, collapse = " ")
  }

  ###--- MAIN ---###
  sapply(input, function(x) convert_html_to_text(evaluate_input(x)),
         USE.NAMES = FALSE)
}
