# The Science of Data Science

#### Solving problems, from first principles.

#### What's in the first chapter?

- Richard Feynman and the Challenger Disaster
- Data Visualisation
- Regression vs Classification vs Clustering
- Linear Regression
- SSE, MSE, RMSE
- Variance and Covariance
- Gradient Descent
- Correlation
- The R-squared Metric
- Feature Importance
- Cross Validation

I’m going to let you in on a little secret – data science always goes wrong.

You collect the wrong data, you use the wrong algorithm, you focus on the wrong problem entirely. All before lunch. Data in the real world is messy, noisy and abundant. Science, on the other hand, is precise, enlightening and in woefully short supply.

Through nine real-life data science projects, you’ll learn how to apply the scientific method to uncompromising data to get usable and informative results, and build robust and reliable decision systems. You’ll also learn what to do when things inevitably go wrong.

Science is about experimentation, utility, and the dogged pursuit of reliable answers. This means breaking things, trying everything in your toolbox, and not accepting results you don’t understand. Accordingly, this book contains a lot of models that do not work, at least not right away. This is a feature, not a bug.

Most data science books aim to teach a syntax or promote a framework; this book aims to elucidate a way of thinking.

Getting the right answer on the first try is often impossible, and yet most data science books will have you believe that this is how it goes. What actually happens in practice is that you’ll apply a book’s methods to your own problem, only to end up with half of a solution. We’ll start from where those books end, and you’ll learn which crucial next steps you should take to turn a middling solution into an effective and robust one.

Data science instruction is usually carried out in an algorithm-first fashion - retrofitting imaginary datasets to specific techniques.

We’ll instead start with real problems, the way scientists usually do, and proceed to solve them through the iterative application of computational statistics.

Throughout the examples, two focus areas will prevail: getting the right data (how to engineer features and select among them using rigorous statistical techniques) and getting the data right (why separability matters and what you can do to help).

This book necessarily contains doses of mathematics and statistics, but fear not: they are used only to explain why certain things happen. You need only remember the intuition behind the formulae to be successful in your own projects.

This is not a machine learning book, or a statistics book, or a programming book. It is all of these things, in only the measure required by the problems we’re trying to solve.

This is a book about how to do real things with real data, and how to do them really well.