Answering a Data Science Newbie’s FAQs with an Analysis of the 2020 Kaggle Machine Learning & Data Science Survey

Leonie M Windari
Feb 5, 2021

This post focuses on the analysis and results; you can view my code on my GitHub.

Tags: python, jupyter notebook, pandas, matplotlib, data wrangling

Project Background

Data science jobs are hot in the industry right now. Many industries are realizing how important data is for their business, and the number of data science jobs is expected to keep increasing each year.

Kaggle recently held a survey of the Kaggle community, some of whom are professionals. Most of the questions take the form “What … do you use on a regular basis?” I hope to see which programming languages Data Analysts and Data Scientists usually use, which machine learning algorithms they usually use, what the most important parts of their role are, and more.

With that, I hope this post can answer some of the questions that most data science newbies and students ask about data science roles. Data science is a really big body of knowledge, so it is easy to get dizzy and not know what to learn first. For this project, my focus is on the Data Analyst and Data Scientist roles. If you know what Data Analysts and Data Scientists usually use, you can simply learn that first.

P.S.: This reminds me of a TED Talk that says that to master something, you don’t have to learn everything; you just have to know the pattern.

Dataset

Early this year, Kaggle held an analytics competition, the ‘2020 Kaggle Machine Learning & Data Science Survey’. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners.

  • Methodology: The dataset comes from a survey that Kaggle ran by sending an invitation to the entire Kaggle community by e-mail. It was also promoted on the Kaggle website and Twitter. The survey ran from 7 to 30 October 2020 and received 20,036 responses from 171 different countries and territories.
  • Note: The questions are a mix of single-choice and multiple-choice questions. For single-choice questions, each answer was recorded in one column. For multiple-choice questions, the responses were split into multiple columns (one column per answer choice).
  • Source: Kaggle
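
The column layout in the note matters when counting answers. Here is a minimal pandas sketch of the idea, not the code from my notebook: the data is a made-up toy table, and the column names (`role`, `lang_Part_1`, ...) are assumptions that only imitate the "one column per answer" layout.

```python
import pandas as pd

# Toy stand-in for the survey file; names and values are hypothetical.
responses = pd.DataFrame({
    # Single-choice question: one column holds the answer.
    "role": ["Data Analyst", "Data Scientist", "Data Scientist", "Data Analyst"],
    # Multiple-choice question: one column per option, holding the option
    # text when it was selected and a missing value otherwise.
    "lang_Part_1": ["Python", "Python", "Python", None],
    "lang_Part_2": [None, "R", None, "R"],
    "lang_Part_3": ["SQL", "SQL", None, "SQL"],
})

# Single-choice: tally the one column directly.
role_counts = responses["role"].value_counts()

# Multiple-choice: count the non-missing cells in each option column and
# label each count with the option text stored in that column.
option_cols = [c for c in responses.columns if c.startswith("lang_Part_")]
language_counts = pd.Series(
    {responses[c].dropna().iloc[0]: responses[c].notna().sum() for c in option_cols}
).sort_values(ascending=False)

print(role_counts)
print(language_counts)
```

Because a respondent can tick several options, the multiple-choice counts can add up to more than the number of respondents.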

Results

What percentage of the respondents are Data Analysts/Data Scientists?

Figure 1. Percentage of respondents by role

One of the survey questions asked respondents to select the title most similar to their current role (or most recent title if retired).

From the figure, we can see that Data Scientist is the second most common role among respondents, while Data Analyst is the sixth. This is good, because it means both roles have enough respondents for us to draw conclusions about the overall Data Analyst/Data Scientist population.
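
A percentage-by-role figure like this can be computed in one line with pandas. The sketch below uses made-up answers rather than the survey data, and the variable names are my own for illustration.

```python
import pandas as pd

# Made-up answers to the "current role" question (not survey data).
roles = pd.Series(
    ["Student", "Student", "Student", "Data Scientist", "Data Scientist",
     "Software Engineer", "Data Analyst", "Student"],
    name="role",
)

# normalize=True returns fractions of the total; multiply by 100 for percent.
role_share = (roles.value_counts(normalize=True) * 100).round(1)

# Rank 1 is the most common role; tied roles share the better rank.
role_rank = role_share.rank(ascending=False, method="min").astype(int)

print(role_share)
print(role_rank["Data Scientist"], role_rank["Data Analyst"])
```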

Am I too old to start being a Data Analyst/ Data Scientist?

Figure 2. Age range for Data Analyst/ Data Scientist

From the figure, we can see that the 25–29 age range has the largest number of Data Analysts and Data Scientists, followed by the 22–24, 30–34, and 35–39 ranges.

But it’s hard to conclude that most Data Analysts and Data Scientists are in the 25–29 age range.

It is possible that the sample doesn’t represent the whole Data Analyst and Data Scientist community, but rather the Kaggle community. If we look at the top 5 age ranges, they span ages 22 to 44. This is likely because most people of those ages are still in their working careers.

That said, age is just a number. You can start anything at any age; it’s never too late to start something.

Gender Differences in Data Analyst/ Data Scientist Roles

Figure 3. Gender differences in Data Analyst/ Data Scientist roles

We all know that most people in the IT industry are men, and the figure shows this too.

There is a huge gap between men and women in the Data Scientist role, where the ratio of men to women is 5 : 1. For Data Analysts the gap is smaller but still considerable, with a ratio of 3 : 1.
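
For anyone who wants to reproduce this kind of ratio, here is a small sketch with `pd.crosstab`. The rows are made up and chosen only to mirror the 5 : 1 and 3 : 1 ratios above; the column names are assumptions, not the survey’s.

```python
import pandas as pd

# Made-up respondents; the counts are for illustration only.
df = pd.DataFrame({
    "role":   ["Data Scientist"] * 6 + ["Data Analyst"] * 4,
    "gender": ["Man"] * 5 + ["Woman"] + ["Man"] * 3 + ["Woman"],
})

# Rows are roles, columns are genders, cells are respondent counts.
counts = pd.crosstab(df["role"], df["gender"])

# Number of men per woman in each role.
ratio = counts["Man"] / counts["Woman"]

for role, value in ratio.items():
    print(f"{role}: {value:.0f} : 1")
```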

Do you need a high level of education to become a Data Analyst/ Data Scientist?

Figure 4. Education level by role

Looking at the figure, the answer is: no, you don’t. You can clearly see that some Data Analysts and Data Scientists have no formal education past high school, although the majority have a Master’s degree, followed by a Bachelor’s degree and a Doctoral degree.

But that doesn’t mean you don’t have to study to become a Data Analyst/Data Scientist. It is possible that the people who answered that they have no formal education past high school went through a bootcamp or learned through online courses.

Where should I take my data science courses?

Figure 5. Data science course platforms used by Data Analysts/Data Scientists

This is probably one of the most frequently asked questions when you’re starting your data science journey. When asked on which platforms they had begun or completed data science courses, the majority of Data Analysts and Data Scientists chose Coursera. For Data Analysts, it was followed by Udemy, Kaggle Learn Courses, and DataCamp; for Data Scientists, by Udemy, University Courses, and Kaggle Learn Courses.

While many of you are still wondering which data science course is the best and where you should enroll, I’d say they are all equal; it’s all up to you. As they say, what you know becomes nothing if you can’t use it. You can enroll in the best data science course and still not understand anything if you don’t put that knowledge into practice.

My advice is, rather than spending your time contemplating which one to choose, just pick one at random and judge it for yourself. You won’t know if you don’t try. Besides, no course is the best for everyone, only the one that suits you.

Do I have to know how to code?

Figure 6. Years of coding experience by role

This is probably another FAQ if you’re coming from a non-IT background: do I have to know how to code?

The majority of Data Analysts and Data Scientists do. Most Data Analysts have been coding for 1–2 years, while most Data Scientists have been coding for 3–5 years.

There are also respondents who said they have never written code, and most of them are Data Analysts. That is plausible, as not every Data Analyst has to know how to code. What matters is their analytical skills; code is basically just a tool that helps us analyze huge amounts of data.

If you are a data analyst at a small company, I think simply using Excel can work and you don’t have to know how to code. But for a data scientist, I think you have to know how to code (I don’t know how you could be a data scientist without coding, so please enlighten me).

But all in all, code is just something that helps us and makes things easier. You use code to analyze data that is too big for Excel to handle, and you use code to clean up your data faster and in the way you want it.

What programming language should I learn?

Figure 7. Regularly used programming languages by role

This is probably one of the most frequently asked questions when you start learning data science, and I’m sure there is a lot of debate about it too.

The majority of Data Analysts and Data Scientists use Python on a regular basis (I use Python myself), followed by SQL and R.

Many people starting their data science journey ask this question, and I will give the same answer as before. First, you have to know SQL, as it is what you use to retrieve data from databases. Second, rather than agonizing over whether to choose Python or R, I say pick one of them and master it. I don’t advise learning both languages at the same time (though it’s okay if you do). I think that if you know the basics and know what you want to do, the language is not a problem.

Do Data Analyst/ Data Scientist have to know machine learning? What machine learning algorithm is usually used?

Figure 8. Years of using machine learning methods

When asked how many years they have used machine learning methods, most Data Analysts said under 1 year, while most Data Scientists said 1–2 years.

Some respondents answered that they do not use machine learning methods, and most of them are Data Analysts.

So, should you learn machine learning? I say yes, but you don’t have to master all of it. Let’s see which machine learning algorithms they use most on a regular basis.

From the graph (Figure 9), we can see that the most used machine learning methods are Linear or Logistic Regression, followed by Decision Trees or Random Forests, and Gradient Boosting Machines (xgboost, lightgbm, etc.).

Some respondents also answered ‘None’, and most of them are Data Analysts. That is plausible, since data analysts usually don’t use machine learning much.

So, should you learn it? I say yes, but just the basics, like linear/logistic regression and classification methods. You don’t have to know all of it; it’s better to know a few machine learning methods and master them than to know all of them only on the surface.

Figure 9. Regularly used machine learning algorithm

What integrated development environments (IDEs) should I use?

Figure 10. Regularly used IDEs

Most Data Analysts and Data Scientists say they regularly use Jupyter (JupyterLab, Jupyter Notebooks, etc.) as their IDE, followed by RStudio and Visual Studio Code (VSCode). This makes sense, since most of them chose Python as the programming language they use regularly.

Personally, I like to use Jupyter Notebook a lot because it’s convenient for data analysis and data visualization.

What data visualization libraries or tools should I learn?

Figure 11. Regularly used visualization libraries

What is an analysis without a visualization?

With visualization, we can see the result of an analysis clearly without having to think much. For example, we can draw a conclusion more quickly from a graph than from a table of numbers, right?

When asked “What data visualization libraries or tools do you use on a regular basis?”, Data Analyst and Data Scientist respondents chose Matplotlib most often, followed by Seaborn, Ggplot, and Plotly.

Matplotlib is a great library for data visualization because it comes equipped with the basic chart types (histogram, bar graph, scatter plot). Seaborn, which is built on top of Matplotlib, is geared more toward visualizing relationships between variables (heatmap, boxplot).
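
As a small illustration of a Matplotlib bar graph, here is a sketch that puts the two roles side by side so the difference is visible at a glance. The counts are made up (they are not the survey results), and the output file name is just a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

# Made-up counts per library for each role (not survey results).
libraries = ["Matplotlib", "Seaborn", "Plotly"]
analyst = [30, 20, 10]
scientist = [50, 40, 15]

x = range(len(libraries))
width = 0.4

fig, ax = plt.subplots()
# Shift each role's bars half a bar-width left or right of the tick.
ax.bar([i - width / 2 for i in x], analyst, width, label="Data Analyst")
ax.bar([i + width / 2 for i in x], scientist, width, label="Data Scientist")
ax.set_xticks(list(x))
ax.set_xticklabels(libraries)
ax.set_ylabel("Respondents")
ax.legend()
fig.savefig("viz_libraries.png")
```

In a Jupyter notebook you can drop the `matplotlib.use("Agg")` line and the chart will show up inline.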

Which activities make up an important part of their role at work (Data Analyst vs Data Scientist)?

Figure 12. Important activities of a Data Analyst/ Data Scientist role

What makes up the work of a Data Analyst and a Data Scientist?

In the survey, when respondents were asked to select any activities that make up an important part of their role at work, both Data Analysts and Data Scientists most often chose Analyze and understand data to influence product or business decisions. For Data Analysts, the second most picked answer was Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data, while for Data Scientists it was Build prototypes to explore applying machine learning to new areas.

Here we can see that for both Data Analysts and Data Scientists, analyzing and understanding data is the most important part of their role. The second is building or running the data infrastructure for Data Analysts, and building prototypes that apply machine learning for Data Scientists. We can see that machine learning is a really important part of being a Data Scientist.

What are their favorite media sources that report on data science topics, and where do they publicly share or deploy their data analysis or machine learning applications?

Figure 13. Favorite media sources of data science

When asked “Who/what are your favorite media sources that report on data science topics?”, most Data Analysts and Data Scientists chose Kaggle as their favorite media source, followed by YouTube, Blogs, and Twitter.

Kaggle is a good platform for data science: it has lots of practice datasets, even real-life datasets you can analyze, it regularly holds competitions, and it even has learning courses.

But the result could be biased, since Kaggle itself conducted the survey among the Kaggle community.

If you’re trying to get into a data science job, it is good to have a portfolio to show your work. When asked “Where do you publicly share or deploy your data analysis or machine learning applications?”, most Data Analysts and Data Scientists chose GitHub as their first pick, followed by I do not share my work publicly, and Kaggle.

Figure 14. Platform to share data analysis applications

GitHub is a great place to build your portfolio, and it’s a well-known platform for that. Many Data Analysts and Data Scientists don’t share their work publicly, possibly because they analyze data that is not open to the public.

What big data products (relational databases, data warehouses, data lakes, or similar) do they use on a regular basis?

Figure 15. Regularly used big data products

If you want to be a Data Analyst/Scientist, you have to know how to get your data. That’s why you have to learn SQL: think of it as the language you use to ask for your data from someone called ‘database’.
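
To make the idea of "asking the database" concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `orders` table and its rows are hypothetical and live only in memory; a real job would connect to something like MySQL or PostgreSQL, but the SQL itself looks much the same.

```python
import sqlite3

# In-memory database with a hypothetical table, just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 35.5), ("ana", 14.5)],
)

# Ask the database for the total spend per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

print(rows)
```

The query describes what you want (totals per customer) and the database works out how to fetch it, which is why SQL is worth learning early.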

Here we can see that Data Analysts/Scientists mostly use MySQL as their regular big data product, followed by Microsoft SQL Server and PostgreSQL. Quite a lot answered ‘None’, possibly because they don’t use a big data product to retrieve their data.

What business intelligence tools do you use on a regular basis?

Figure 16. Regularly used Business Intelligence tools

Business intelligence (BI) tools are basically data visualization tools, but more interactive. Where matplotlib lets you make a plot, BI tools let you build a dashboard and a story for your data, so it’s more efficient to look at. You can choose how you want to look at your data on a single screen, rather than going through multiple screens of different visualizations.

Data Analysts and Data Scientists chose Tableau as their most regularly used business intelligence tool, followed by Microsoft Power BI and Google Data Studio.

Conclusion

  1. Data Scientist is the second most common role among respondents, while Data Analyst is the sixth.
  2. Most Data Scientists and Data Analysts fall in the 25–29 age range.
  3. The ratio of men to women is 3 : 1 in the Data Analyst role and 5 : 1 in the Data Scientist role.
  4. Most Data Analysts and Data Scientists have a Master’s degree, followed by a Bachelor’s and a Doctoral degree.
  5. Data Analysts and Data Scientists most often chose Coursera as the platform where they began or completed their data science courses.
  6. Most Data Analysts have been writing code and/or programming for 1–2 years, while most Data Scientists have for 3–5 years.
  7. Both Data Analysts and Data Scientists chose Python as the programming language they use regularly.
  8. Most Data Analysts have used machine learning methods for under 1 year, while most Data Scientists have for 1–2 years.
  9. Both Data Analysts and Data Scientists chose Linear or Logistic Regression as the machine learning algorithm they use regularly.
  10. Both Data Analysts and Data Scientists chose Jupyter (JupyterLab, Jupyter Notebooks, etc.) as the IDE they use regularly.
  11. Both Data Analysts and Data Scientists chose Matplotlib as the data visualization library or tool they use regularly.
  12. Both Data Analysts and Data Scientists chose Analyze and understand data to influence product or business decisions as the most important activity of their role at work.
  13. Both Data Analysts and Data Scientists chose Kaggle as their favorite media source that reports on data science topics.
  14. Both Data Analysts and Data Scientists chose GitHub as the platform where they publicly share or deploy their data analysis or machine learning applications.
  15. Both Data Analysts and Data Scientists chose MySQL as their regularly used big data product.
  16. Both Data Analysts and Data Scientists chose Tableau as their most regularly used business intelligence tool.

Self Evaluation:

  • Proofread once more
  • Add the question details to the explanations
  • Change the visualizations (so you can immediately see the difference between the two roles)

P.S.: Forgive me if I’m lacking in many ways. I am still learning, and I hope you can point out where you think I was wrong or lacking by commenting below so I can improve. Thank you and have a nice day!
