News Article Scraping || Python || BeautifulSoup

I was learning about sentiment analysis and needed news articles in CSV format. To get them, I decided to scrape the articles from the web with BeautifulSoup, a Python package for parsing HTML and XML documents. It creates a parse tree for each parsed page, which can be used to extract data from HTML and is therefore useful for web scraping.
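To make the parse-tree idea concrete, here is a minimal sketch. The HTML snippet and its class name are made up for illustration and are not MarketWatch's actual markup; only `BeautifulSoup`, `find_all`, and `find` are the library's own calls.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example (not taken from any real site)
html = """
<html><body>
  <div class="article"><h3>Stocks rise</h3><p>Markets closed higher.</p></div>
  <div class="article"><h3>Oil slips</h3><p>Crude fell on Monday.</p></div>
</body></html>
"""

# Build the parse tree with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# Walk the tree: find every article container, then the headline inside it
headlines = [
    div.find("h3").text.strip()
    for div in soup.find_all("div", attrs={"class": "article"})
]
print(headlines)
```

The same two steps, `find_all` for the containers and `find` for the piece of text inside each one, are what the scraper relies on.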

The code below shows what I did and how.

# Importing libraries
import sys
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

pagesToGet = 1          # number of pages to scrape
filename = "NEWS.csv"   # output file

upperframe = []         # statements collected across all pages

# Open the output file once, so that later pages do not overwrite earlier ones
with open(filename, "w", encoding="utf-8") as f:
    f.write("Statement\n")

    for page in range(1, pagesToGet + 1):
        print('processing page :', page)
        url = 'https://www.marketwatch.com/markets?mod=top_nav/?page=' + str(page)
        print(url)

        # The request can fail (network error, timeout), so wrap it in try-except
        try:
            # Store the result in its own name instead of reusing the loop variable
            response = requests.get(url, timeout=30)
        except Exception:
            error_type, error_obj, error_info = sys.exc_info()
            print('ERROR FOR LINK:', url)
            print(error_type, 'Line:', error_info.tb_lineno)
            continue
        time.sleep(2)  # pause between requests

        soup = BeautifulSoup(response.text, 'html.parser')
        frame = []
        # Find the article containers on the page
        links = soup.find_all('div', attrs={'class': 'element element--article '})
        print(len(links))

        for j in links:
            content = j.find("div", attrs={'class': 'article__content'})
            if content is None:
                # Skip containers that have no article text
                continue
            Statement = content.text.strip()
            frame.append(Statement)
            # Replace commas with "^" so each statement stays in a single CSV column
            f.write(Statement.replace(",", "^") + ",\n")
        upperframe.extend(frame)

# Load everything that was scraped into a DataFrame
data = pd.DataFrame(upperframe, columns=['Statement'])
data


In the code above, I'm scraping data from the website "marketwatch.com". The text of each article block is written to NEWS.csv and also collected into a pandas DataFrame.
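The scraper replaces commas with "^" so that a statement does not spill into extra CSV columns. An alternative, shown here only as a sketch, is to let Python's standard `csv` module quote the fields so the original commas survive. The sample statements are made up, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical statements, invented for this example
statements = ["Stocks rise, bonds fall", 'Fed says "no change"']

# In-memory stand-in for open("NEWS.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Statement"])      # header row
for statement in statements:
    writer.writerow([statement])    # fields containing commas or quotes get quoted

csv_text = buffer.getvalue()

# Reading the text back returns the statements unchanged
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```

With this approach the sentiment analysis step later sees the original punctuation rather than "^" characters.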


