Steps You Should Do When You’re Practicing on a Kaggle Dataset

Leonie M Windari
5 min read · Jul 13, 2021

Are you:

  • A beginner in Data Science?
  • Looking to practice your Data Science skills?
  • Ready to start your first project?

If you checked every item, then you’re in the right place!

In this post I want to share a step-by-step process for analyzing a dataset.

I’m not a professional. I’m still a beginner in Data Science, but this is the process I arrived at after working on a few projects (even though they were mainly for practice).

1. Pick the dataset you want to analyze

The first step is simply to search for a dataset on Kaggle. Go to Kaggle, open the Datasets page, and browse what’s there. Choose a dataset that suits you or that you’re interested in.

Make sure to choose a dataset that’s not too difficult for you. For example, the Heart Attack Analysis & Prediction Dataset may be too difficult if you don’t know much biology, while the World Happiness Report 2021 seems like a good option!

This will help you later, when you’re doing Exploratory Data Analysis and drawing conclusions!

2. Determine the goal and purpose

I think this step is the most important one! People often go straight to processing the data without knowing why (this is what I used to do!). So it’s really important to ask yourself a few questions.

  • What is the goal/purpose? What are we trying to achieve?

Don’t overthink this! Just read the dataset description and ask yourself, “What do I want to know from this dataset?”

For example, for the World Happiness Report 2021, you may be wondering: Which country ranks as the happiest? What metrics are used to determine happiness? What makes people in some countries happier than others? Is it because of GDP?
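Once the questions are written down, each one usually maps to a line or two of code. Here is a minimal sketch using pandas, assuming made-up rows and hypothetical column names (`country`, `score`, `gdp_per_capita`); the real Kaggle file will look different.

```python
import pandas as pd

# Placeholder rows with hypothetical column names, not real report values.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "score": [7.4, 6.8, 6.1, 5.2],
    "gdp_per_capita": [10.9, 10.5, 10.6, 9.1],
})

# Question: which country has the highest score?
happiest = df.loc[df["score"].idxmax(), "country"]

# Question: does the score move together with GDP?
# Pearson correlation lies between -1 and 1. It shows association, not cause.
gdp_corr = df["score"].corr(df["gdp_per_capita"])

print(happiest)
print(round(gdp_corr, 2))
```

A positive correlation would only hint that GDP and happiness go together; it would not answer the “is it because of GDP?” question on its own.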

  • What type of analysis do you want to do?

The second part is to determine what type of analysis you’re going to do on the dataset. Is it descriptive analysis, where you summarize past data? Diagnostic analysis, where you look for the cause of what happened? Predictive analysis, where you predict future outcomes? Or prescriptive analysis, where you decide what action to take based on the previous analyses?

  • What are your desired outcomes/end results?

The third part is to determine the desired output of your analysis. Will it be a graph, such as a line chart? What kind of data visualization do you want to end up with?

I usually write it down on a piece of paper first. This way, I can also think about which visualization fits the dataset best.

For example, if you want to show the happiness score of each country by rank, you may prefer to use a horizontal bar chart.
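A ranked horizontal bar chart like that could be sketched as follows. This assumes pandas and matplotlib are installed, and the countries and scores are placeholders rather than real report values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data, not real World Happiness Report values.
scores = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "happiness_score": [6.1, 7.4, 5.2, 6.8],
})

# barh draws the first row at the bottom, so sorting ascending
# puts the highest score at the top of the chart.
ranked = scores.sort_values("happiness_score", ascending=True)

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(ranked["country"], ranked["happiness_score"])
ax.set_xlabel("Happiness score")
ax.set_title("Happiness score by country (placeholder data)")
fig.tight_layout()
```

In a notebook you would drop the `matplotlib.use("Agg")` line and let the figure display inline, or call `fig.savefig(...)` to write it to a file.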

  • Success Criteria

This step may not be needed for every dataset; it all depends on your type of analysis! If you’re doing predictive analysis or fitting a model to your dataset, you have to decide which metric you’ll use to judge whether the model is successful. For example, mean squared error for a regression model, or the area under the ROC curve (AUC) for a classification model.
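To make those two metrics concrete, here is a small pure-Python sketch of both. The function names and toy numbers are my own for illustration; in a real project you would more likely use a library implementation such as the ones in scikit-learn.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences. Lower is better (regression)."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Higher is better (classification).
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values, only to show how the functions are called.
mse = mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(mse, auc)
```

The pairwise AUC above is fine for small examples but slow on large datasets, since it compares every positive with every negative.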

3. Write the Detailed Steps of the Data Analysis!

I find it easier to break down each step before actually starting the data analysis. Here are the steps I usually write down when I’m analyzing a dataset.

  • Read the dataset’s introduction or explanation

Before doing any kind of data analysis, I make sure to read the explanation of the dataset first: how the data was collected, when it was collected, and what each metric in the dataset means.

  • Check for any missing value/data

After that, I check whether the dataset has missing data. If it does, I decide what to do with it. You can drop it, fill it with the median, the mean, or another plausible value, or simply ignore it. Usually, if more than 50% of the data is missing, I choose to drop it, but it depends on the size of the dataset. If the dataset has around 100,000 rows, I might choose to keep it.
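Here is one way the missing-value check might look in pandas. It is only a sketch: the columns are placeholders, and I am reading the 50% rule as “drop a column when more than half of its values are missing”, which is an assumption.

```python
import numpy as np
import pandas as pd

# Placeholder data with some gaps.
df = pd.DataFrame({
    "score": [7.4, np.nan, 6.1, 5.2],
    "gdp": [10.9, 10.5, np.nan, 9.1],
    "notes": [np.nan, np.nan, np.nan, "estimated"],
})

# Share of missing values per column, between 0 and 1.
missing_share = df.isna().mean()

# Assumed rule: drop columns where more than half of the values are missing.
mostly_empty = missing_share[missing_share > 0.5].index
cleaned = df.drop(columns=mostly_empty)

# Fill the remaining numeric gaps with each column's median.
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

print(missing_share)
print(cleaned)
```

Whether the median, the mean, or dropping rows is the better choice depends on the dataset, so treat the fill step as one option among several.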

  • Write down what data cleaning you should do on the dataset

This step will probably take around 80% of your time! If you’re lucky, the dataset you’re dealing with may not need much cleaning, but in real life there is no such thing as clean data.

This step is a long one, so I’ll write a separate article about it!

But what I usually do is:

I make sure each column’s datatype matches the data. If not, I change the datatype.
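In pandas, checking and changing datatypes could look like the sketch below. The columns and values are made up; the point is only to show `astype`, `pd.to_numeric`, and `pd.to_datetime`.

```python
import pandas as pd

# Placeholder data where everything was read in as text.
df = pd.DataFrame({
    "year": ["2019", "2020", "2021"],
    "score": ["7.4", "n/a", "6.1"],
    "recorded_on": ["2019-03-20", "2020-03-20", "2021-03-19"],
})

# Clean integer text can be converted directly.
df["year"] = df["year"].astype(int)

# errors="coerce" turns values that are not numbers (like "n/a") into NaN.
df["score"] = pd.to_numeric(df["score"], errors="coerce")

# Date text becomes a proper datetime column.
df["recorded_on"] = pd.to_datetime(df["recorded_on"])

print(df.dtypes)
```

Note that coercing creates new missing values, so it is worth re-running the missing-value check afterwards.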

I categorize the variables by type: categorical or numerical, discrete or continuous, or temporal (datetime). Categorizing the variables by type makes it easier to write down the data cleaning you need, because variables of the same type usually need similar cleaning steps.
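A rough way to group columns by type in pandas is sketched below. Treating integer columns as discrete and float columns as continuous is only a heuristic I am assuming here, and the column names are placeholders.

```python
import pandas as pd

# Placeholder data with one column of each kind.
df = pd.DataFrame({
    "region": ["Europe", "Asia", "Europe"],
    "score": [7.4, 5.9, 6.8],
    "rank": [1, 3, 2],
    "recorded_on": pd.to_datetime(["2021-03-19", "2021-03-19", "2021-03-19"]),
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
datetime_cols = df.select_dtypes(include="datetime").columns.tolist()

# Whatever is neither numeric nor datetime is treated as categorical here.
categorical_cols = [c for c in df.columns if c not in numeric_cols + datetime_cols]

# Heuristic: integer columns are discrete, float columns are continuous.
discrete_cols = [c for c in numeric_cols if pd.api.types.is_integer_dtype(df[c])]
continuous_cols = [c for c in numeric_cols if pd.api.types.is_float_dtype(df[c])]

print(categorical_cols, discrete_cols, continuous_cols, datetime_cols)
```

The heuristic can be wrong (an ID stored as an integer is not really a discrete measurement), so it is worth reviewing the lists by hand.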

I see whether I need to transform the data (for example, normalization or standardization, smoothing, or aggregation).
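As a small illustration of two of those transformations, here is min-max normalization and z-score standardization written out with pandas on placeholder values.

```python
import pandas as pd

# Placeholder values.
scores = pd.Series([5.2, 6.1, 6.8, 7.4], name="score")

# Min-max normalization: rescale the values to the range [0, 1].
normalized = (scores - scores.min()) / (scores.max() - scores.min())

# Standardization (z-score): shift to mean 0 and scale to standard deviation 1.
# ddof=0 uses the population standard deviation; pandas defaults to ddof=1.
standardized = (scores - scores.mean()) / scores.std(ddof=0)

print(normalized.tolist())
print(standardized.round(3).tolist())
```

Both formulas divide by a spread, so a constant column (where max equals min) would need special handling.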

I check for outliers. This step depends on what analysis you’re going to do. But if you’re analyzing a dataset with numeric values, you may find outliers that could affect your analysis or models. So, you can drop some outliers here.
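One common way to flag outliers is the 1.5 × IQR rule. The sketch below uses that rule on placeholder values; it is a convention I am assuming for illustration, not the only valid choice.

```python
import pandas as pd

# Placeholder values with one obviously extreme entry.
values = pd.Series([5.2, 5.9, 6.1, 6.4, 6.8, 7.4, 25.0], name="score")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
kept = values[~is_outlier]

print(outliers.tolist())
```

Before dropping anything, it helps to look at the flagged rows: an extreme value can be a typo, but it can also be a real and interesting observation.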

4. Do Exploratory Data Analysis!

After your data is clean, you can analyze and explore the data however you like. Note that during this step you may discover data cleaning steps you missed and have to go back to the previous step, but that’s okay! Exploring the data is also how you find out whether your data is really clean.
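A first pass at exploring could be as simple as the sketch below: summary statistics, a group comparison, a correlation table, and one more check for leftover gaps. The data and column names are placeholders.

```python
import pandas as pd

# Placeholder data.
df = pd.DataFrame({
    "region": ["Europe", "Europe", "Asia", "Asia"],
    "score": [7.4, 6.8, 5.9, 5.2],
    "gdp": [10.9, 10.5, 10.6, 9.1],
})

# Count, mean, std, min, quartiles, and max for the numeric columns.
summary = df.describe()

# Average score per region.
by_region = df.groupby("region")["score"].mean()

# Pairwise correlations between the numeric columns.
correlations = df[["score", "gdp"]].corr()

# Exploring doubles as a cleanliness check: are any gaps left?
missing_left = df.isna().sum()

print(summary)
print(by_region)
print(correlations)
print(missing_left)
```

If something looks odd here (a negative count, an impossible maximum, a leftover gap), that is the signal to go back to the cleaning step.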

5. Write down your summary

After you’re done with EDA, you can write down the summary of your analysis, along with suggestions or even recommendations. Doing this will deepen your knowledge of the dataset and improve your comprehension skills! When you’re analyzing a dataset, you might deal with data you don’t understand well, for example the heart attack analysis. But that’s the challenge: getting used to dealing with unfamiliar data.

So, good luck!

You can go to my blog or GitHub to see my documentation of some projects (it’s not the best, I know, but I’m also learning!)

This step-by-step article is inspired by Krish Naik’s YouTube video. You can check out his video down below. He makes a lot of data science videos, so make sure to check out his channel!
