This post is part of a series.
In the previous post, I talked about my goal for my YouTube channel, which is to gain mastery of data science.
See slide 1
And now, in this post, I want to talk about the first big project that I want to undertake in the context of pursuing that goal.
Defining the Scope of the Project
Namely, the project is to finish 5 Kaggle competitions in the top 10%. To be more specific, I want to finish 5 “completed” Kaggle competitions in the top 10%.
See slide 2
The reason I want to tackle only “completed” competitions (instead of “active” ones) is that, this way, I can look up what other people have done to get good results and then learn from that. By taking this approach, I should be able to actually get into the top 10% of the competitions. Another, more practical, reason for doing only “completed” competitions is that I don’t have to wait until the competition is finished to see how I have actually ranked.
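To make the “top 10%” target concrete, here is a minimal sketch of how a final rank could be checked against it. The function names are hypothetical, and the rule used (rank divided by the number of teams is at most 10%) is an assumption for illustration; it is not necessarily how Kaggle itself defines percentiles or medals.

```python
def top_percent_cutoff(num_teams: int, percent: int = 10) -> int:
    """Return the worst rank that still lies inside the top `percent` percent.

    Assumption: a rank counts as "top 10%" if rank / num_teams <= 0.10.
    Integer arithmetic is used to avoid floating point rounding issues.
    """
    return num_teams * percent // 100


def is_in_top_percent(rank: int, num_teams: int, percent: int = 10) -> bool:
    """Check whether a 1-based leaderboard rank lies inside the top `percent` percent."""
    return 1 <= rank <= top_percent_cutoff(num_teams, percent)


# Example with a made-up leaderboard size of 1234 teams.
print(top_percent_cutoff(1234))      # worst rank that still counts
print(is_in_top_percent(123, 1234))  # inside the cutoff
print(is_in_top_percent(124, 1234))  # just outside the cutoff
```

With a made-up leaderboard of 1234 teams, rank 123 would still count under this rule and rank 124 would not.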
Okay, so that is what the project is generally about. However, on top of that, I want to introduce two more restrictions in order to narrow down the scope of this project.
See slide 3
Namely, the first one is that I am only going to consider competitions with tabular data. So, at the moment, I am not interested in competitions about computer vision or natural language processing. That is something that I want to do more in depth later on.
And the second restriction is that I only want to use “traditional” machine learning algorithms like logistic regression or random forests. So, I am not interested in using any deep learning algorithms. That is also something that I want to do more in depth later on.
Okay, so that is the scope of the project. And now, let me list all the things that I think I will need in order to complete this project.
See slide 4
Road Map for Completing the Project
The first thing that I will need is some knowledge about Kaggle itself.
See slide 5
That’s just so that I understand how the website is generally structured. This way, I am best prepared to make the most of all the available resources and therefore learn as much as possible from the competitions.
Next up, I will need some knowledge about machine learning.
See slide 6
And here, in particular, I need to know how the machine learning process works, that is, how to prepare the data and how to train and evaluate the machine learning models. And then, I also need to know how the machine learning algorithms themselves work.
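As a rough illustration of that “process”, here is a minimal sketch using Scikit-learn. The data is synthetic (generated with `make_classification`), and the choice of a scaled logistic regression and an 80/20 split are assumptions made only for this example, not the setup used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic tabular data standing in for a real competition data set.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Step 1: prepare the data by holding out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: train a model (feature scaling followed by logistic regression).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 3: evaluate the model on the rows it has not seen during training.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Hold-out accuracy: {accuracy:.3f}")
```

The same three steps (prepare, train, evaluate) stay in place when the algorithm is swapped out, which is why the process can be learned separately from the algorithms.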
The reason I separated the “process” from the “algorithms” is that, technically, you only need to know how the “process” works in order to finish a Kaggle competition. You could, for example, just use a grid search to find the best values for the hyperparameters of an algorithm. However, you are obviously better prepared to train the algorithms if you know how they actually work. So that’s why I made this separation.
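To show what such a grid search might look like, here is a minimal sketch with Scikit-learn’s `GridSearchCV` and a random forest. The synthetic data and the specific hyperparameter values in `param_grid` are assumptions picked only to keep the example small; they are not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic tabular data standing in for a real competition data set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Candidate hyperparameter values (illustrative only).
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

# Try every combination in the grid and score each with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Note that nothing in this code requires knowing how a random forest works internally. That knowledge matters when deciding which hyperparameters and value ranges are worth putting into the grid in the first place.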
Okay, so then, lastly, I will also need some knowledge about the actual tools to implement the project. And here, first and foremost, I need to know Python.
See slide 7
This is probably the most used programming language in machine learning. So, it makes sense to have a solid understanding of it. And then, with this as the foundation, I will also need to know a few specific Python libraries.
See slide 8
So, in order to load and manipulate the data, I will need to know Pandas. Then, in order to visualize the data and do exploratory data analysis, I will need to know a plotting library. And here, I am actually not sure yet which library I should learn. That’s why I wrote down both Matplotlib and Plotly. And lastly, in order to do the actual machine learning, I will need to know Scikit-learn.
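As a small taste of the “load and manipulate” step, here is a minimal Pandas sketch. The table is invented for this example (the column names `id`, `age`, `price` and `label` are hypothetical and not taken from any Kaggle data set), and it is read from an in-memory string so that the snippet runs without a file.

```python
import io

import pandas as pd

# A tiny, made-up CSV table; note the missing age in the third row.
csv_text = """id,age,price,label
1,22,7.25,0
2,38,71.28,1
3,,7.92,1
4,35,53.10,1
5,35,8.05,0
"""

# Load the data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Manipulate: fill the missing age with the median of the known ages.
df["age"] = df["age"].fillna(df["age"].median())

# Summarize: average price for each value of the label column.
mean_price_by_label = df.groupby("label")["price"].mean()

print(df)
print(mean_price_by_label)
```

Filling missing values and grouping by a target column are typical first steps before any plotting or modeling happens.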
So, those are all the things that I think I will need to know in order to complete the project. And for all of those topics, I will create tutorials.
And here, you might think that it is not really necessary to study all those things in depth, if I just want to “finish” 5 Kaggle competitions. And you might be right if I actually just wanted to “finish” them. In that case, I could, for example, simply look at the Kaggle forums to see what other people have done and then just copy the things that I need to get into the top 10%.
However, at the end of the project, my goal is to be able to basically pick a competition at random and then produce very good results without looking anything up in the Kaggle forums. And therefore, I will need a solid understanding of all the listed things.
So now, let me write out the order in which I actually intend to learn them so that you know what tutorials you can expect next.
See slide 9
So first, I will cover Python, Pandas and a plotting library. Then, I will get to the machine learning part. And here, I have also listed all the algorithms that I intend to cover. And, as you can see, I have actually already covered three of those, namely Decision Trees, Random Forests and Naive Bayes. After that, I will cover Scikit-learn and Kaggle. And then, finally, I will do the 5 Kaggle competitions. And here, I have also listed the first two Kaggle competitions that I want to do, namely the Titanic competition and the House Prices competition. Those are both competitions in which a lot of people have already participated. So, I think they are a good starting point for this project.
Okay, so this is the road map for completing the project. So, we are basically at the end of this post. But before I actually end it, let me first refer back to the previous post.
Prediction for the Number of Subscribers
Namely, as I mentioned earlier, in that post I stated that my goal is to gain mastery of data science.
See slide 10
And the way I decided to measure that “mastery” is by using the number of subscribers that I have on my YouTube channel.
So now, I would like to make a prediction about how the number of subscribers will grow during and after the project. And my prediction is that I will be at the stage of 20,000 – 40,000 subscribers after I have uploaded the last video for this project (my hope is that I will be even one stage above that at 100,000 subscribers).
See slide 11
And the reason I think this will happen is that, by completing the project, I will have basically created a single resource where somebody can go from “knowing nothing about Python” to “getting good results in a Kaggle competition”. And I think that this would be a very valuable resource for somebody who wants to learn about machine learning.
So, we will see how good that prediction actually is, once I am done. And now, there is only one thing left to say, namely: Let’s get to work.