News Article Scraping || Python || BeautifulSoup
I was learning about sentiment analysis and needed news articles in CSV format. To get them, I decided to scrape the articles with "BeautifulSoup", a Python package for parsing HTML and XML documents. It creates a parse tree for each parsed page, and that tree can be used to extract data from the HTML, which makes it useful for web scraping.
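To make the parse tree idea concrete, here is a minimal sketch that runs without any network access. The HTML snippet is made up for illustration; only the class names are borrowed from the page scraped in this post, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not copied from any real page.
html = """
<div class="element element--article">
  <div class="article__content">Stocks rise, bonds fall</div>
</div>
<div class="element element--article">
  <div class="article__content">Oil prices steady</div>
</div>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Collect the stripped text of every div carrying the article__content class.
headlines = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="article__content")
]
print(headlines)
```

Once the tree is built, find_all returns every matching tag, and get_text pulls the text out of each one.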
What I did, and how, is shown in the code below.
# Importing libraries
import sys
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

pagesToGet = 1
upperframe = []

# Open the output file once, before the loop, so later pages do not overwrite earlier ones
filename = "NEWS.csv"
f = open(filename, "w", encoding='utf-8')
headers = "Statement\n"
f.write(headers)

for page in range(1, pagesToGet + 1):
    print('processing page :', page)
    url = 'https://www.marketwatch.com/markets?mod=top_nav/?page=' + str(page)
    print(url)

    # The request can raise an exception (network error, timeout), so it goes in a try-except block
    try:
        # Named "response" so it does not overwrite the loop variable "page"
        response = requests.get(url)
    except Exception as e:
        error_type, error_obj, error_info = sys.exc_info()
        print('ERROR FOR LINK:', url)
        print(error_type, 'Line:', error_info.tb_lineno)
        continue  # skip this page and move on to the next one
    time.sleep(2)  # pause between requests

    soup = BeautifulSoup(response.text, 'html.parser')
    frame = []
    links = soup.find_all('div', attrs={'class': 'element element--article '})
    print(len(links))

    for j in links:
        content = j.find("div", attrs={'class': 'article__content'})
        if content is None:
            continue  # this block has no article text
        Statement = content.text.strip()
        frame.append(Statement)
        # Replace commas with ^ so the text does not break the CSV columns
        f.write(Statement.replace(",", "^") + ",\n")
    upperframe.extend(frame)

f.close()
data = pd.DataFrame(upperframe, columns=['Statement'])
data
In the code above, I'm scraping data from the website "marketwatch.com" and saving the text of each article block both to NEWS.csv and to a pandas DataFrame.
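The script writes each CSV row by hand and replaces commas with ^ so the text does not split into extra columns. As an alternative sketch (not what the script above does), Python's standard csv module quotes such fields automatically, so the original commas survive. The helper name write_statements and the sample statements are hypothetical, chosen only for this illustration.

```python
import csv
import io


def write_statements(statements, fileobj):
    """Write one statement per row under a 'Statement' header (hypothetical helper)."""
    writer = csv.writer(fileobj)
    writer.writerow(["Statement"])
    for statement in statements:
        # Fields containing commas are quoted by csv.writer automatically.
        writer.writerow([statement])


# An in-memory buffer stands in for NEWS.csv so the example leaves no file behind.
buffer = io.StringIO()
write_statements(["Stocks rise, bonds fall", "Oil steady"], buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the statements with their commas intact.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```

To write a real file instead, pass a file opened with open("NEWS.csv", "w", newline="", encoding="utf-8") in place of the buffer.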