Question:

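// Standard headers required by the code below.
#include <cctype>
#include <fstream>
#include <set>
#include <stdexcept>
#include <string>
#include "md_parser.h"
#include "util.h"

using namespace std;

typedef enum { NORMALTEXT, LINKTEXT, ISLINK, LINKURL } PARSE_STATE_T;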

void MDParser::parse(std::string filename, std::set<std::string>& allSearchableTerms, std::set<std::string>& allOutgoingLinks)
{
    // Attempts to open the file.
    ifstream wfile(filename.c_str());
    if(!wfile) {
        throw invalid_argument("Bad webpage filename in MDParser::parse()");
    }

    // Remove any contents of the sets before starting to parse.
    allSearchableTerms.clear();
    allOutgoingLinks.clear();

    // The initial state is parsing a normal term.
    PARSE_STATE_T state = NORMALTEXT;

    // Initialize the current term and link as empty strings.
    string term = "";
    string link = "";

    // Get the first character from the file.
    char c = wfile.get();

    // Continue reading from the file until input fails.
    while(!wfile.fail()) {
        // Logic for parsing a normal term.
        if(state == NORMALTEXT) {
            // ADD YOUR CODE HERE

        }
        // Logic for parsing a link.
        else if (state == LINKTEXT) {
            // ADD YOUR CODE HERE

        }
        else if( state == ISLINK ) {
            // ADD YOUR CODE HERE

        }
        // Else we are in the LINKURL state.
        else {
            // ADD YOUR CODE HERE

        }
        // Attempt to get another character from the file.
        c = wfile.get();
    }
    // ADD ANY REMAINING CODE HERE

    // Close the file.
    wfile.close();
}

std::string MDParser::display_text(std::string filename)
{
    // Attempts to open the file.
    ifstream wfile(filename.c_str());
    if (!wfile) {
        throw std::invalid_argument("Bad webpage filename in MDParser::display_text()");
    }
    std::string retval;

    // The initial state is parsing normal text.
    PARSE_STATE_T state = NORMALTEXT;

    char c = wfile.get();

    // Continue reading from the file until input fails.
    while (!wfile.fail()) {
        // Logic for normal text.
        if (state == NORMALTEXT) {
            // An opening bracket starts the anchor text of a possible link.
            // The bracket itself is kept in the displayed text.
            if (c == '[') {
                state = LINKTEXT;
            }
            retval += c;
        }
        // Logic for the anchor text of a link.
        else if (state == LINKTEXT) {
            // A closing bracket ends the anchor text.
            if (c == ']') {
                state = ISLINK;
            }
            retval += c;
        }
        else if (state == ISLINK) {
            // Only a '(' immediately after ']' makes this a link.
            if (c == '(') {
                state = LINKURL;
            }
            else {
                state = NORMALTEXT;
                retval += c;
            }
        }
        // Else we are in the LINKURL state.
        else {
            // Skip the URL; a closing parenthesis ends the link.
            if (c == ')') {
                state = NORMALTEXT;
            }
        }
        c = wfile.get();
    }
    return retval;
}

Parsing Web Pages

Your first challenge is to complete a simplified MD parser. We want our search engine to be able to support alternate file formats (TXT, MD, HTML, etc.), so we created an abstract PageParser class with a parse method:

virtual void parse(std::string filename, std::set<std::string>& allSearchableWords, std::set<std::string>& allOutgoingLinks) = 0;

In general, we want to parse files and find all the searchable terms. To simplify our definition of searchable terms, we treat letters and numbers as term characters and all other characters as special characters. Any special character separates words, while letters and numbers together form words (aka "terms"). For instance, the string

Computer-Science, 104 is really, really5times, really#great?I don't_know!

should be parsed into the search terms: "Computer", "Science", "104", "is", "really", "really5times", "really", "great", "I", "don", "t", "know".

Thus, during parsing, any contiguous sequence of alphanumeric characters forms a search term. All other characters (special characters) are used to split search terms and, for the sake of searching, can be discarded. In addition, you may want to convert searchable terms to a standard (canonical) form so that a search for computer would match a webpage containing Computer. We have provided some functions in util.h/cpp that can help you convert to a standard case.

In addition to parsing search terms, the parsers will implement a display_text function to generate a displayable text string. This function strips out links from the text contents of a file and only shows the anchor text.

TXT File Parsing

We have provided an implementation of a .txt file parser that you may use for reference when completing the following MD parser. We assume a .txt file contains no hyperlinks to other pages, so we only need to parse the text for search terms.
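The term-splitting rule described under Parsing Web Pages can be sketched as follows. This is only an illustration under stated assumptions: extractTerms is a hypothetical helper that is not part of the provided starter code, and lowercase is chosen here as the canonical form (the assignment's own util.h helpers could be used instead).

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper (not in the starter code): splits text into unique
// search terms. A term is a maximal run of alphanumeric characters; every
// other character acts as a separator and is discarded.
std::set<std::string> extractTerms(const std::string& text) {
    std::set<std::string> terms;
    std::string term;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            // Assumption: lowercase is used as the canonical case.
            term += static_cast<char>(std::tolower(c));
        } else if (!term.empty()) {
            terms.insert(term);
            term.clear();
        }
    }
    // Do not lose a term that runs up to the end of the input.
    if (!term.empty()) {
        terms.insert(term);
    }
    return terms;
}
```

Because the result is a std::set, the repeated word "really" in the example string is stored only once.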
Markdown Parsing

You should complete the derived MD parser class in md_parser.cpp that implements the parse function to parse a simplified Markdown format. We will only support normal text and links in our Markdown format and parser. In addition to text, you should be able to parse MD links of the form [anchor text](link_to_file), where anchor text is any text that should actually be displayed on the webpage and contains searchable terms, while (link_to_file) is a hyperlink (or just a file path) to the desired webpage file. A few notes about these links:

- The anchor text inside the [] could be anything, except it will not contain any [, ], (, or ). It should be parsed like the normal text described in the previous paragraph.
- A valid link will have the ( immediately following the ]. If that is not the case, then the text is not a link.
- You may assume the link_to_file text will not have any spaces and should be read as a single string (don't split on any special characters).
- There may be text immediately after the closing ). You should just treat it as a new word.
- Text in parentheses that is NOT preceded immediately by [] should not be considered a link but just normal text. So in the text "ArrayLists (aka vectors) support O(1) access", aka, vectors, and 1 should be considered normal text and not a link.

The goal of the parser is to extract all unique search terms and identify all the links (i.e. all the link_to_file strings found in the (...) part of a link) and return them in the allSearchableTerms and allOutgoingLinks sets that were passed by reference to the function. If the contents of a file are:

[Hello ] world[hi](data.txt)bye. (Table, t-bone) steak.

...then allSearchableTerms should contain: Hello, world, hi, bye, Table, t, bone, steak. In addition, allOutgoingLinks should contain just data.txt. Note that you can return the words in any normalized case you like that would make case-insensitive searching easier.

You may implement the Markdown parser as you see fit.
However, we recommend using a finite state machine (FSM) approach that reads the file character by character and uses "states" to determine how to handle each character and whether the text is a normal term, a link, etc. The diagram below shows a potential FSM for parsing Markdown. Here we assume we read 1 character (i.e. c) at each iteration until we reach the end of the file, and that we process c as well as use it to transition between states. We can use the isalnum function from the cctype library in C++ to check whether a character is a valid character for a search term. In addition, we assume we maintain two strings, term and link, to which we can append characters until we are ready to split and start a new term/link.

State diagram for parsing MD pages into terms and links (start state: NORMALTEXT):

- NORMALTEXT: if isalnum(c), append c to term; otherwise add term to the results and clear term. Move to LINKTEXT when c == '['.
- LINKTEXT: if isalnum(c), append c to term; otherwise add term to the results and clear term. Move to ISLINK when c == ']'.
- ISLINK: when c == '(', clear link to "" and move to LINKURL. When c != '(', append c to term if isalnum(c) and move back to NORMALTEXT.
- LINKURL: if c != ')', append c to link; otherwise add link to the results and move to NORMALTEXT.

Note: By default, stay in the current state unless the shown transition condition is true.
