Project Description Web Crawler
For this project, you will be completing a web crawler. The web crawler will start at some web page, follow a random link on that page, then follow a random link on that new page, and so on, keeping a list of the pages that it has visited. Your part of this project will be to complete the linked-list data structure that stores the web addresses that the crawler visits. Note that this project deals with a singly-linked list (unlike the doubly-linked list we discussed in class). Singly-linked lists are simpler, because each node just points to the next node (i.e., no previous pointers). The last node in the list has a NULL next pointer to indicate the end. (FYI, the disadvantage with singly-linked lists is that you can't traverse them backwards, while doubly-linked lists let you go forward and backward.)
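For illustration, here is a minimal sketch of that idea in C. The names struct node and countNodes are only for this example; the project's actual struct listNode is defined in crawler.c below.

struct node{
    char addr[1000];      /* the stored web address            */
    struct node *next;    /* the next node, or NULL at the end */
};

/* forward-only traversal: follow the next pointers until NULL */
int countNodes(const struct node *p){
    int n = 0;
    while(p != NULL){
        n++;
        p = p->next;
    }
    return n;
}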
The file crawler.c contains some code already written. You should only modify that file in the places that are marked "TODO: complete this". There are comments to help guide you through the code. Note that there are four functions that operate on the linked list that you need to complete: contains, insertBack, printAddresses, and destroyList. Each of these functions must be recursive. The comments indicate how those functions should behave. Only "printAddresses" should print any output to the terminal.
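The recursive style these functions must follow comes from viewing a list as either NULL (empty) or one node followed by a smaller list. As an illustration of that pattern, here is one possible sketch of contains, written against the declarations in crawler.c below; it is an example, not the only correct answer.

/* an empty list contains nothing; otherwise check this node,
 * then look for addr in the rest of the list */
int contains(const struct listNode *pNode, const char *addr){
    if(pNode == NULL){
        return 0;
    }
    if(strcmp(pNode->addr, addr) == 0){
        return 1;
    }
    return contains(pNode->next, addr);
}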
How to Run the Program
When completed, the program will take a web address and a number of hops from the command line and try to crawl the web for the given number of hops starting from the given address. It will print out the addresses that it found as it was crawling, along with some messages if it found a cycle or if it ran into a dead end. For example, if you run
./crawler "http://www.ipfw.edu" 10
you might see output like
seed = 1486570735
Cycle detected on hop 2: address http://www.ipfw.edu
Cycle detected on hop 8: address http://www.ipfw.edu/search/sitemap.html
http://www.ipfw.edu
http://www.ipfw.edu/search/sitemap.html
http://library.ipfw.edu/
http://library.ipfw.edu/about/gifts-endowment.html
http://ipfw.edu/accreditation/
http://www.ipfw.edu/vcsa/
http://calendar.ipfw.edu/
http://www.ipfw.edu/
http://inside.ipfw.edu/
http://inside.ipfw.edu/tagged/iu
http://inside.ipfw.edu/tagged/ipfw
Because the crawler follows random links, your output will probably not match this exactly. The randomness can make debugging difficult, because running the program multiple times will usually lead to different execution paths. To alleviate this problem, the program prints out "seed = " at the beginning. If you add this seed as an additional command-line argument when running your program, like
./crawler "http://www.ipfw.edu" 10 1486570735
then you should get exactly the same behavior as when previously using that seed (until the web pages change!). You can do this with any seed value in order to get the same behavior multiple times.
A couple of other notes that arise from dealing with real data:
1. You might get errors from the Python script "getLinks.py" that parses the web pages (because they might be malformed or in an unexpected language). If so, just rerun your program to get a different random crawling trajectory.
2. Because of the network traffic involved in fetching these pages, you should limit the number of hops to 50 or fewer when you are running your project.
To successfully complete the assignment, you must complete the four linked-list functions in crawler.c so that they are RECURSIVE functions that behave as the comments require. Your project should not have any memory leaks. Use the following command to find leaks:
valgrind --leak-check=yes ./crawler
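For example, reusing the start address and hop count from the run above (any modest hop count works):

valgrind --leak-check=yes ./crawler "http://www.ipfw.edu" 10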
Here is the code; you need to complete some of its functions:
crawler.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_ADDR_LENGTH 1000

/*
 * a node in our linked-list of web addresses
 */
struct listNode{
    char addr[MAX_ADDR_LENGTH];
    struct listNode *next;
};

/*
 * returns 1 if the list starting at pNode contains the address addr,
 * and returns 0 otherwise
 */
int contains(const struct listNode *pNode, const char *addr);

/*
 * inserts the address addr as a new listNode at the end of
 * the list
 */
void insertBack(struct listNode *pNode, const char *addr);

/*
 * prints the addresses from pNode to the end of the list,
 * one on each line
 */
void printAddresses(const struct listNode *pNode);

/*
 * frees the memory associated with this node and all subsequent nodes
 */
void destroyList(struct listNode *pNode);

/*
 * srcAddr should be a web address (e.g., http://www.yahoo.com).
 * link should point to an array of maxLinkLength characters.
 * getLink returns 1 if a link was found and 0 otherwise.
 * If a link was found, "link" will be filled in with the web address.
 */
int getLink(const char* srcAddr, char* link, const int maxLinkLength);

int main(int argc, char** argv){
    long seed;
    char startAddr[MAX_ADDR_LENGTH];
    char destAddr[MAX_ADDR_LENGTH];
    int hopNum, numHops;
    struct listNode *pListStart;

    if(argc < 3 || argc > 4){
        fprintf(stderr, "USAGE: %s startAddr numHops [rand seed]\n", argv[0]);
        return -1;
    }

    /* initialization */
    if(argc >= 4){
        seed = atol(argv[3]);
    }
    else{
        seed = time(NULL);
    }
    printf("seed = %ld\n", seed);
    srand(seed);

    strncpy(startAddr, argv[1], MAX_ADDR_LENGTH);
    startAddr[MAX_ADDR_LENGTH - 1] = '\0';
    numHops = atoi(argv[2]);

    pListStart = malloc(sizeof(struct listNode));
    if(pListStart == NULL){
        fprintf(stderr, "ERROR: could not allocate memory\n");
        return -2;
    }
    strncpy(pListStart->addr, startAddr, MAX_ADDR_LENGTH);
    pListStart->next = NULL;

    /* start the crawling */
    for(hopNum = 1; hopNum <= numHops; hopNum++){
        int res = getLink(startAddr, destAddr, MAX_ADDR_LENGTH);
        if(!res){
            printf("Dead end on hop %d: no outgoing links\n", hopNum);
            break;
        }

        if(contains(pListStart, destAddr)){
            printf("Cycle detected on hop %d: address %s\n", hopNum, destAddr);
            hopNum--; /* try again for this hop in the next iteration */
        }
        else{
            insertBack(pListStart, destAddr);
            strncpy(startAddr, destAddr, MAX_ADDR_LENGTH);
        }
    }

    printAddresses(pListStart);
    destroyList(pListStart);

    return 0;
}

/*
 * returns 1 if the list starting at pNode contains the address addr,
 * and returns 0 otherwise
 */
int contains(const struct listNode *pNode, const char *addr){
    /* TODO: complete this */
}

/*
 * inserts the address addr as a new listNode at the end of
 * the list
 */
void insertBack(struct listNode *pNode, const char *addr){
    /* TODO: complete this */
}

/*
 * prints the addresses from pNode to the end of the list,
 * one on each line
 */
void printAddresses(const struct listNode *pNode){
    /* TODO: complete this */
}

/*
 * frees the memory associated with this node and all subsequent nodes
 */
void destroyList(struct listNode *pNode){
    /* TODO: complete this */
}

int getLink(const char* srcAddr, char* link, const int maxLinkLength){
    const int bufSize = 1000;
    char buffer[bufSize];
    int numLinks = 0;
    FILE *pipe;

    /* fetch the page and run it through the link-extracting Python script */
    snprintf(buffer, bufSize, "curl -s \"%s\" | python getLinks.py", srcAddr);

    pipe = popen(buffer, "r");
    if(pipe == NULL){
        fprintf(stderr, "ERROR: could not open the pipe for command %s\n", buffer);
        return 0;
    }

    /* the script prints the number of links, then one link per line */
    fscanf(pipe, "%d\n", &numLinks);

    if(numLinks > 0){
        int linkNum;
        double r = (double)rand() / ((double)RAND_MAX + 1.0);

        /* read link lines until we reach the one whose 1/numLinks-wide
           slice of [0,1) contains r, i.e., pick a link uniformly at random */
        for(linkNum = 0; linkNum < numLinks; linkNum++){
            if(fgets(buffer, bufSize, pipe) == NULL){
                break;
            }
            if(r < (linkNum + 1.0) / numLinks){
                break;
            }
        }

        /* copy the chosen link into the caller's buffer and strip the newline */
        strncpy(link, buffer, maxLinkLength);
        link[maxLinkLength - 1] = '\0';
        link[strcspn(link, "\r\n")] = '\0';
    }

    pclose(pipe);

    if(numLinks > 0){
        return 1;
    }
    else{
        return 0;
    }
}
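For reference, here is one possible recursive completion of the other three functions. Treat it as a sketch that follows the comments above (a matching sketch of contains appears earlier in this description); it is not the only acceptable solution.

/* if this is the last node, hang a new node off of it;
 * otherwise recurse toward the end of the list */
void insertBack(struct listNode *pNode, const char *addr){
    if(pNode->next == NULL){
        struct listNode *pNew = malloc(sizeof(struct listNode));
        if(pNew == NULL){
            fprintf(stderr, "ERROR: could not allocate memory\n");
            return;
        }
        strncpy(pNew->addr, addr, MAX_ADDR_LENGTH);
        pNew->addr[MAX_ADDR_LENGTH - 1] = '\0';
        pNew->next = NULL;
        pNode->next = pNew;
    }
    else{
        insertBack(pNode->next, addr);
    }
}

/* print this node's address on its own line, then the rest of the list */
void printAddresses(const struct listNode *pNode){
    if(pNode == NULL){
        return;
    }
    printf("%s\n", pNode->addr);
    printAddresses(pNode->next);
}

/* free the rest of the list first, then this node */
void destroyList(struct listNode *pNode){
    if(pNode == NULL){
        return;
    }
    destroyList(pNode->next);
    free(pNode);
}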