Question: I want to add another step into the following code: remove duplicate URLs.

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import csv

my_url = 'https://www.census.gov/programs-surveys/popest/about/schedule.html'

# opening up connection, grabbing the page
page = urlopen(my_url)

# html parsing
soup = BeautifulSoup(page, 'html.parser')

# save as csv file
with open('index.csv', 'w') as csv_file:
    writer = csv.writer(csv_file)
    for link in soup.find_all('a', href=True):
        url = link.get('href')
        url = urljoin(my_url, url)
        print(url)
        writer.writerow([url])

I am trying to add this part:

# remove duplicate links
file = open('index.csv', 'w')
links = {}
for link in soup.find_all('a', href=True):
    url = link.get('href')
    url = urljoin(my_url, url)
    if url not in links:
        file.write("%s " % url)
        links[url] = True
file.close()

It doesn't seem to be working. I want to find all links on the page, convert relative links to absolute URLs, remove duplicate links, and save them as a CSV file.
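Two things keep the second snippet from doing what is wanted: opening 'index.csv' in 'w' mode a second time truncates whatever the csv.writer loop already wrote, and file.write("%s " % url) produces one space-separated line rather than CSV rows. Instead of a second pass, the de-duplication can be folded into the original loop by remembering which URLs have already been written. Below is a minimal sketch of that idea; it keeps the same URL and filename from the question, and the set name "seen" is just an illustrative choice:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import csv

my_url = 'https://www.census.gov/programs-surveys/popest/about/schedule.html'

# open the connection and parse the page
page = urlopen(my_url)
soup = BeautifulSoup(page, 'html.parser')

seen = set()  # URLs already written, used to skip duplicates

with open('index.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for link in soup.find_all('a', href=True):
        url = urljoin(my_url, link['href'])  # make relative links absolute
        if url not in seen:                  # write each URL only once
            seen.add(url)
            print(url)
            writer.writerow([url])

A set is enough here because only membership is needed (the dict with True values in the original attempt works the same way, just less idiomatically), and passing newline='' when opening the file is the csv module's recommended way to avoid blank rows on Windows.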
