Working On Finance Dataset || KPMG || Python
Here I have a dataset on which we can practice our Python and finance skills. I got this dataset from a website that provides virtual internships.
One of them is the KPMG Virtual Internship, which is offered through InsideSherpa.
The KPMG internship gives you the dataset and asks you to do:
- Data Quality Assessment
- Data Insights
- Data Insights and Presentation
But here I'm going to work on the Data Quality Assessment with Python. So let's start!
In this virtual internship, you are provided with a dataset; you can download it directly from the InsideSherpa KPMG Virtual Internship portal.
You'll get 5 sheets in this dataset. The first one is only the instruction sheet, which you can read for the do's and don'ts; the remaining 4 sheets contain the data that we have to assess.
The first thing you have to do is open this xlsx file in your Python environment; I'm using Google Colab.
I've opened the dataset in my environment like this.
This is how I've successfully uploaded the dataset to Google Colab. Now you can call your dataset on screen and do the Data Quality Assessment.
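If it helps, here is a minimal sketch of uploading the file inside a Colab notebook with the google.colab files helper (the file name is simply whatever you downloaded from the portal):

from google.colab import files

# Opens a file picker in the notebook; the uploaded file lands in the working directory
uploaded = files.upload()
print(list(uploaded.keys()))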
Now here I have a suggestion for you: in this dataset, every sheet has an extra line written at the top, and that line will make it difficult for you to run pandas commands on the dataset.
Delete this particular line from every sheet of the dataset at the start. Deleting this line is not part of the Data Quality Assessment at all; we are deleting it only to make the assessment easier for us.
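If you'd rather not edit the workbook by hand, pandas can skip that top line when reading each sheet instead. A minimal sketch, assuming the extra line is exactly one row above the real header:

import pandas as pd

# skiprows=1 ignores the extra title row so the real column names become the header
df = pd.read_excel('/KPMG_VI_New_raw_data_update_final.xlsx',
                   sheet_name="Transactions", skiprows=1)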
Now we will call the data frame and start with the assessment.
import pandas as pd
import numpy as np

# Load the "Transactions" sheet from the uploaded workbook into a DataFrame
df = pd.read_excel('/KPMG_VI_New_raw_data_update_final.xlsx', sheet_name="Transactions")
(Screenshot: this is how you can copy the file path.)
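By the way, if you want to confirm the names of the other data sheets before loading them, pandas can list them for you. A minimal sketch using the same path:

xlsx = pd.ExcelFile('/KPMG_VI_New_raw_data_update_final.xlsx')

# Prints all sheet names in the workbook, including the instruction sheet
print(xlsx.sheet_names)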
Now the first thing that I'll do is run df.info() to get information about the DataFrame.
This gives the basic information about the dataset: how many rows and columns there are, how many non-null values each column has, and what data type each column uses.
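For reference, the call looks like this (df.shape is just a quick extra check on the row and column counts):

# Summary of columns, non-null counts and dtypes
df.info()

# Quick check of (rows, columns)
print(df.shape)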
You can see that the 12th column, "product_first_sold_date", should be in datetime format, but instead it is stored as a timestamp, so we have to change it to datetime. To do that:
# Convert the column to datetime and keep only the date part
df['product_first_sold_date'] = pd.to_datetime(df['product_first_sold_date']).dt.date
(Screenshots: df.info() output before and after the conversion.)
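To double-check the conversion, you can peek at the column afterwards; a quick sketch:

# The values should now look like calendar dates rather than raw timestamps
print(df['product_first_sold_date'].head())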
Now we will check for null values.
# Count of missing values in each column
df.isnull().sum()
Here we can see how many null values each column contains, so we can report them later.
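If you also want to report the missing values as percentages, a small sketch like this works (the rounding is just for readability):

# Share of missing values per column, as a percentage
null_pct = df.isnull().sum() / len(df) * 100
print(null_pct.round(2))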
Now we will go through the columns and see the status of the data in each one.
df.columns
If we want to check the column 'online_order' and see how many values are in there and how they are distributed, we'll use the value_counts() function.
df['online_order'].value_counts()
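You can repeat this for every column; one rough sketch is to loop over the columns and print the number of distinct values, then call value_counts() on the columns that look categorical:

# Number of distinct values per column, to spot which ones are categorical
for col in df.columns:
    print(col, df[col].nunique())

# Detailed breakdown for one categorical column, including missing values
print(df['online_order'].value_counts(dropna=False))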
So this is how you can work on this dataset. I've done data cleaning and data sorting so far, which is what the internship website asks for at this stage.
You can go further and create graphs from this data; they will give you more insight into the dataset and let you present it in a better way.
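For example, a simple bar chart of online vs. offline orders can be drawn with matplotlib; this is just a sketch, assuming the column is still named 'online_order':

import matplotlib.pyplot as plt

# Bar chart of online vs. offline order counts
df['online_order'].value_counts().plot(kind='bar')
plt.title('Online vs. offline orders')
plt.xlabel('online_order')
plt.ylabel('count')
plt.show()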