Right now, I’m in an unusual position. On the one hand I’m writing a book (The Science of Data Science), which I hope will be as inclusive and as easy to read as possible. On the other, I’m trying to settle on a topic for my PhD thesis, which means going out to the edges of the known and poking around to see where the wall gives.
In the mornings I’m reading about tensor algebras and convolutional rings, and in the evenings I’m trying to write the sentence that most clearly explains linear regression. It’s a jarring combination.
Thankfully though, these divergent activities have prompted a kind of revelation in me about the best way to learn these skills.
Everyone who wants to teach data science has to make some assumptions. Do I assume that the definition of the standard deviation is known by my readers? Do I assume they know how to write a function in Python?
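To make those two assumptions concrete, here is a minimal sketch of what they amount to together: a Python function that computes the sample standard deviation straight from its definition. The name `sample_std` and the example numbers are illustrative only and don't come from any particular course.

```python
import math


def sample_std(values):
    """Sample standard deviation: the square root of the summed squared
    distances from the mean, divided by n - 1 rather than n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    squared_gaps = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_gaps) / (n - 1))


# Illustrative numbers only.
print(sample_std([2, 4, 4, 4, 5, 5, 7, 9]))
```

If every line of that reads as obvious, you are the reader most courses are written for. If not, you are the reader they lose.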
Assuming these things about the general population is a losing battle. The reason there’s a skills shortage in data science (and analytics more broadly) is that not enough people have these skills, this foundational knowledge. And so what the course makers do is pile on the preparation.
If you take DataCamp’s “Data Scientist with Python” learning track, you’ll take 14 courses (out of 22) before you do any statistics and another 4 before you build your first supervised learning model. The rest is moving data from one place to another, the basics of Python syntax, making charts. (By the way, if you’re thinking that the course for R - the lingua franca of statistics - would be any better, it’s not. In that case you have to take 17 courses of 23 before you get to do Correlation and Regression.)
This overpreparation is dangerous. Yes, they assume nothing about the course taker’s skills, but they do make an even worse assumption - they assume that people are interested.
The average online course has a high dropout rate. We’re talking 80%. I can only imagine that the dropout rate for courses that rely on mathematics, statistics, etc. is higher.
Some people want things to be easy, of course. And some enrol in a course just to see what it’s about. (That last one’s a problem of web design and being too eager to convert webpage viewers into ‘students’, which of course they’re not if they never really take the class.)
But some (if not most) of the responsibility for the dropout phenomenon has to be directed at the course makers and the syllabus writers. They’re piggybacking on the success of buzzwords and making empty promises. The empty promise isn’t that you’ll get new skills, but that their course will be worth your time.
I’m not going to promote my unfinished book as The Way to learn data science, but I think that, for those who are interested in exploring these skills, the following points, which I’ve used to inform my writing, might help.
If you’re looking to learn data science, here’s what you should do:
1. Get a problem
Humans are remarkable creatures. We can learn anything that we truly want to. But we have to be motivated to do so. And I don’t mean motivated in a wishy-washy way. I mean that when a person has a problem, they’ll try to find the solution. They’re motivated by unease.
You can’t really learn anything until you have to use it to solve a problem. But problems that are passed down from on high don’t interest people. That’s why standardised tests suck, and why you use rote memorisation just to pass. The problems are arbitrary.
So think up a non-arbitrary problem. Something you care about a lot. Sports, finance, health. Use something from your own life. Don’t wait for Kaggle to upload a dataset or announce a competition. There’s something you already care enough about - use that.
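As a minimal sketch of what a first pass at your own problem might look like, assume you keep a log of how long you slept and how far you ran the next day. The numbers below are made up for illustration, and `fit_line` is a hypothetical helper that fits an ordinary least squares line using nothing but plain Python.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y move together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be the same value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up personal log: hours slept, kilometres run the next day.
hours_slept = [5.5, 6.0, 7.0, 7.5, 8.0, 6.5]
km_run = [3.0, 4.0, 5.5, 6.0, 7.0, 4.5]

slope, intercept = fit_line(hours_slept, km_run)
print(f"Each extra hour of sleep goes with about {slope:.2f} km more running")
```

A fitted line like this only describes an association in your own log. What matters is that the question came from you, so you will care whether the answer holds up.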
I’ve advised companies elsewhere not to hire people who have only done Kaggle competitions. Here’s why: when you copy a solution you’re admitting that you don’t know the answer, but when you copy a problem you’re admitting you don’t know how to ask questions. Science is asking questions. So ask.
2. Be audacious
You should absolutely try the most audacious thing you can think of to solve your problem. Would you rather sit in an interview and talk about how you’ve been spending your time learning the basic syntax and methods of a programming language or how you tried (and failed) to implement a reinforcement learning agent to trade stocks?
People hire people. People with goals, ambitions, audacity.
Everyone with internet access can answer any trivial question nearly instantly. I’ve never learned pandas. I google that stuff a hundred times a day. It’s irrelevant information. Don’t memorise documentation; that’s not why it exists. Skip all of that stuff and go straight to the good bits.
Nearly everything you try in data science won’t work. That’s the point. It’s what you do when things go wrong that will separate you from the others.
Take stock of what you didn’t understand. Go back through the steps and explain everything to yourself in the plainest language you can muster. When you start using jargon, it’s time to do some digging. Learn the concepts as needed to solve your problems; that’s the quickest way to build something that works.
If you fail enough over a large enough set of interesting problems, you’ll be unstoppable.
You’ll have the intuition necessary to solve problems for any organisation you care to work in or with. That’s the real power of learning data science the right way. Maths, stats, coding - these things aren’t going away but they will be commoditised. The only way to be resilient to that is to know them so deeply and thoroughly that you can leverage them in ways most people can’t.
People believe Seth Godin when he tells them that the old systems are done. That the linear progression of high school to university to a successful career has fallen by the wayside. That the internet has changed all that.
And yet, this is still how we’re teaching technical skills. First you learn x, and then y, and then z.
I bet that some of you are thinking that my method makes no sense. Surely, if you learnt everything you needed to beforehand you’d be able to solve problems quicker later, right?
Maybe. But this is a new field and the data science syllabus is not carved in stone. People shouldn’t have to meander through the topics (and there are a lot of them) according to what someone else thinks they should learn.
Everybody solves problems in a different way and everybody comes to the field with different prior experiences. A course that makes you jump through hoops is just as dangerous as one that throws you in the deep end. Unfortunately, nearly every data science course or book falls into one of these two categories.
So, to learn data science, start solving your own problems.