Project Description Web Crawler
For this project, you will be completing a web crawler. The web crawler will start at some web page, follow a random link on that page, then follow a random link on that new page, and so on, keeping a list of the pages that it has visited. Your part of this project will be to complete the linked-list data structure that stores the web addresses that the crawler visits. Note that this project deals with a singly-linked list (unlike the doubly-linked list we discussed in class). Singly-linked lists are simpler, because each node just points to the next node (i.e., no previous pointers). The last node in the list has a NULL next pointer to indicate the end. (FYI, the disadvantage with singly-linked lists is that you can't traverse them backwards, while doubly-linked lists let you go forward and backward.)
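For illustration, here is a minimal sketch of that idea in C. The names struct node and countNodes are only for this example; the project's actual struct listNode is defined in crawler.c below.

struct node{
    char addr[1000];      /* the stored web address            */
    struct node *next;    /* the next node, or NULL at the end */
};

/* forward-only traversal: follow the next pointers until NULL */
int countNodes(const struct node *p){
    int n = 0;
    while(p != NULL){
        n++;
        p = p->next;
    }
    return n;
}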
The file crawler.c contains some code already written. You should only modify that file in the places that are marked "TODO: complete this". There are comments to help guide you through the code. Note that there are four functions that operate on the linked list that you need to complete: contains, insertBack, printAddresses, and destroyList. Each of these functions must be recursive. The comments indicate how those functions should behave. Only "printAddresses" should print any output to the terminal.
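The recursive style these functions must follow comes from viewing a list as either NULL (empty) or one node followed by a smaller list. As an illustration of that pattern, here is one possible sketch of contains, written against the declarations in crawler.c below; it is an example, not the only correct answer.

/* an empty list contains nothing; otherwise check this node,
 * then look for addr in the rest of the list */
int contains(const struct listNode *pNode, const char *addr){
    if(pNode == NULL){
        return 0;
    }
    if(strcmp(pNode->addr, addr) == 0){
        return 1;
    }
    return contains(pNode->next, addr);
}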
How to Run the Program
When completed, the program will take a web address and a number of hops from the command line and try to crawl the web for the given number of hops starting from the given address. It will print out the addresses that it found as it was crawling, along with some messages if it found a cycle or if it ran into a dead end. For example, if you run
./crawler "http://www.ipfw.edu" 10
you might see output like
seed = 1486570735
Cycle detected on hop 2: address http://www.ipfw.edu
Cycle detected on hop 8: address http://www.ipfw.edu/search/sitemap.html
http://www.ipfw.edu
http://www.ipfw.edu/search/sitemap.html
http://library.ipfw.edu/
http://library.ipfw.edu/about/gifts-endowment.html
http://ipfw.edu/accreditation/
http://www.ipfw.edu/vcsa/
http://calendar.ipfw.edu/
http://www.ipfw.edu/
http://inside.ipfw.edu/
http://inside.ipfw.edu/tagged/iu
http://inside.ipfw.edu/tagged/ipfw
Because the crawler follows random links, your output will probably not match this exactly. The randomness can make debugging difficult, because running the program multiple times will usually lead to different execution paths. To alleviate this problem, the program prints out "seed = " at the beginning. If you add this seed as an additional command-line argument when running your program, like
./crawler "http://www.ipfw.edu" 10 1486570735
then you should get exactly the same behavior as when previously using that seed (until the web pages change!). You can do this with any seed value in order to get the same behavior multiple times.
A couple of other notes that arise from dealing with real data:
1. You might get errors from the Python script "getLinks.py" that parses the web pages (because they might be malformed or in an unexpected language). If so, just rerun your program to get a different random crawling trajectory.
2. Because of the network traffic involved in fetching these pages, you should limit the number of hops to 50 or fewer when you are running your project.
To successfully complete the assignment, you must complete the four linked-list functions in crawler.c so that they are RECURSIVE functions that behave as the comments require. Your project should not have any memory leaks. Use the following command to find leaks:
valgrind --leak-check=yes ./crawler
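For example, reusing the start address and hop count from the run above (any modest hop count works):

valgrind --leak-check=yes ./crawler "http://www.ipfw.edu" 10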
Here is the code; you need to complete some of its functions:
crawler.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_ADDR_LENGTH 1000

/*
 * a node in our linked-list of web addresses
 */
struct listNode{
    char addr[MAX_ADDR_LENGTH];
    struct listNode *next;
};

/*
 * returns 1 if the list starting at pNode contains the address addr,
 * and returns 0 otherwise
 */
int contains(const struct listNode *pNode, const char *addr);

/*
 * inserts the address addr as a new listNode at the end of
 * the list
 */
void insertBack(struct listNode *pNode, const char *addr);

/*
 * prints the addresses from pNode to the end of the list,
 * one on each line
 */
void printAddresses(const struct listNode *pNode);

/*
 * frees the memory associated with this node and all subsequent nodes
 */
void destroyList(struct listNode *pNode);

/*
 * srcAddr should be a web address (e.g., http://www.yahoo.com).
 * link should point to an array of maxLinkLength characters.
 * getLink returns 1 if a link was found and 0 otherwise.
 * If a link was found, "link" will be filled in with the web address.
 */
int getLink(const char* srcAddr, char* link, const int maxLinkLength);

int main(int argc, char** argv){
    long seed;
    char startAddr[MAX_ADDR_LENGTH];
    char destAddr[MAX_ADDR_LENGTH];
    int hopNum, numHops;
    struct listNode *pListStart;

    if(argc < 3 || argc > 4){
        fprintf(stderr, "USAGE: %s startAddr numHops [rand seed]\n", argv[0]);
        return -1;
    }

    /* initialization */
    if(argc >= 4){
        seed = atol(argv[3]);
    }
    else{
        seed = time(NULL);
    }
    printf("seed = %ld\n", seed);
    srand(seed);

    strncpy(startAddr, argv[1], MAX_ADDR_LENGTH);
    startAddr[MAX_ADDR_LENGTH - 1] = '\0';
    numHops = atoi(argv[2]);

    pListStart = malloc(sizeof(struct listNode));
    if(pListStart == NULL){
        fprintf(stderr, "ERROR: could not allocate memory\n");
        return -2;
    }
    strncpy(pListStart->addr, startAddr, MAX_ADDR_LENGTH);
    pListStart->next = NULL;

    /* start the crawling */
    for(hopNum = 1; hopNum <= numHops; hopNum++){
        int res = getLink(startAddr, destAddr, MAX_ADDR_LENGTH);
        if(!res){
            printf("Dead end on hop %d: no outgoing links\n", hopNum);
            break;
        }

        if(contains(pListStart, destAddr)){
            printf("Cycle detected on hop %d: address %s\n", hopNum, destAddr);
            hopNum--; /* try again for this hop in the next iteration */
        }
        else{
            insertBack(pListStart, destAddr);
            strncpy(startAddr, destAddr, MAX_ADDR_LENGTH);
        }
    }

    printAddresses(pListStart);
    destroyList(pListStart);

    return 0;
}

/*
 * returns 1 if the list starting at pNode contains the address addr,
 * and returns 0 otherwise
 */
int contains(const struct listNode *pNode, const char *addr){
    /* TODO: complete this */
}

/*
 * inserts the address addr as a new listNode at the end of
 * the list
 */
void insertBack(struct listNode *pNode, const char *addr){
    /* TODO: complete this */
}

/*
 * prints the addresses from pNode to the end of the list,
 * one on each line
 */
void printAddresses(const struct listNode *pNode){
    /* TODO: complete this */
}

/*
 * frees the memory associated with this node and all subsequent nodes
 */
void destroyList(struct listNode *pNode){
    /* TODO: complete this */
}

int getLink(const char* srcAddr, char* link, const int maxLinkLength){
    const int bufSize = 1000;
    char buffer[bufSize];
    int numLinks = 0;
    FILE *pipe;

    /* fetch the page and run it through the link-extracting Python script */
    snprintf(buffer, bufSize, "curl -s \"%s\" | python getLinks.py", srcAddr);

    pipe = popen(buffer, "r");
    if(pipe == NULL){
        fprintf(stderr, "ERROR: could not open the pipe for command %s\n", buffer);
        return 0;
    }

    /* the script prints the number of links, then one link per line */
    fscanf(pipe, "%d\n", &numLinks);

    if(numLinks > 0){
        int linkNum;
        double r = (double)rand() / ((double)RAND_MAX + 1.0);

        /* read link lines until we reach the one whose 1/numLinks-wide
           slice of [0,1) contains r, i.e., pick a link uniformly at random */
        for(linkNum = 0; linkNum < numLinks; linkNum++){
            if(fgets(buffer, bufSize, pipe) == NULL){
                break;
            }
            if(r < (linkNum + 1.0) / numLinks){
                break;
            }
        }

        /* copy the chosen link into the caller's buffer and strip the newline */
        strncpy(link, buffer, maxLinkLength);
        link[maxLinkLength - 1] = '\0';
        link[strcspn(link, "\r\n")] = '\0';
    }

    pclose(pipe);

    if(numLinks > 0){
        return 1;
    }
    else{
        return 0;
    }
}
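For reference, here is one possible recursive completion of the other three functions. Treat it as a sketch that follows the comments above (a matching sketch of contains appears earlier in this description); it is not the only acceptable solution.

/* if this is the last node, hang a new node off of it;
 * otherwise recurse toward the end of the list */
void insertBack(struct listNode *pNode, const char *addr){
    if(pNode->next == NULL){
        struct listNode *pNew = malloc(sizeof(struct listNode));
        if(pNew == NULL){
            fprintf(stderr, "ERROR: could not allocate memory\n");
            return;
        }
        strncpy(pNew->addr, addr, MAX_ADDR_LENGTH);
        pNew->addr[MAX_ADDR_LENGTH - 1] = '\0';
        pNew->next = NULL;
        pNode->next = pNew;
    }
    else{
        insertBack(pNode->next, addr);
    }
}

/* print this node's address on its own line, then the rest of the list */
void printAddresses(const struct listNode *pNode){
    if(pNode == NULL){
        return;
    }
    printf("%s\n", pNode->addr);
    printAddresses(pNode->next);
}

/* free the rest of the list first, then this node */
void destroyList(struct listNode *pNode){
    if(pNode == NULL){
        return;
    }
    destroyList(pNode->next);
    free(pNode);
}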