Introduction

In this notebook we will go through the main concepts of the pandas library. First, let's import pandas under its common alias, pd.
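A minimal sketch of that import. The small DataFrame is made up purely to confirm that the alias works; its column names are borrowed from the kinds of columns mentioned later in these notes.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
df = pd.DataFrame({
    "zip_code": ["94110", "10001"],
    "revenue": [120.5, 98.0],
})

print(df.shape)
print(df.dtypes)
```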

Source: fast.ai course link

Random forest is a nearly universal machine learning technique.

It can be used for both regression (the target is a continuous variable) and classification (the target is a categorical variable) problems.

It also works with columns of any kind, such as pixel values, zip codes, revenue, etc.

In general, a random forest is hard to overfit badly, and it is very easy to stop it from overfitting.

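In general, you do not need a separate validation set. A random forest can tell you how well it generalizes even if you only have one dataset, because each tree can be scored on the rows it did not see during training (the out-of-bag estimate).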
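As a rough illustration of the single-dataset point, scikit-learn can score each row using only the trees that did not train on it when `oob_score=True` is set. The synthetic data from `make_regression` and the parameter values below are assumptions chosen for the sketch, not part of the course material.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for this sketch)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# oob_score=True asks the forest to evaluate each row with the trees
# whose bootstrap sample did not include that row
model = RandomForestRegressor(
    n_estimators=100, oob_score=True, random_state=0, n_jobs=-1
)
model.fit(X, y)

train_r2 = model.score(X, y)   # R^2 on the rows the forest was fit on
oob_r2 = model.oob_score_      # R^2 estimated from out-of-bag rows

print(f"training R^2: {train_r2:.3f}")
print(f"out-of-bag R^2: {oob_r2:.3f}")
```

The out-of-bag score is normally lower than the training score, and the gap between the two gives a hint of how much the forest is overfitting.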

It makes few (if any) statistical assumptions: it does not assume that the data is normally distributed or that relationships are linear, and you do not need to specify interactions.

It requires very little feature engineering, so it is a great place to start. In many situations, you do not have to take the log of the data or multiply features together to create interaction terms.

Most machine learning models (including random forest) cannot directly use categorical columns; these have to be converted to numbers first.
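One common way to turn a categorical column into numbers is pandas' category dtype together with its integer codes. This is a sketch on a made-up column (`usage_band` is a hypothetical name), not necessarily the exact approach the course uses.

```python
import pandas as pd

# Hypothetical categorical column with one missing value
df = pd.DataFrame({"usage_band": ["High", "Low", "Medium", "Low", None]})

# Convert the string column to pandas' category dtype
df["usage_band"] = df["usage_band"].astype("category")

# Each category gets an integer code; missing values get -1
df["usage_band_code"] = df["usage_band"].cat.codes

print(list(df["usage_band"].cat.categories))
print(df)
```

The codes follow the sorted order of the categories, which is alphabetical here. If the categories have a meaningful order (Low < Medium < High), that order would need to be set explicitly.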

RandomForestRegressor and RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn import metrics
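To show how these imports fit together, here is a minimal sketch that fits one model of each kind and scores it with `sklearn.metrics`. The synthetic datasets, the train/test split, and the hyperparameters are assumptions made for illustration only.

```python
from sklearn import metrics
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Regression: continuous target, scored with R^2
Xr, yr = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(Xr_train, yr_train)
r2 = metrics.r2_score(yr_test, reg.predict(Xr_test))

# Classification: categorical target, scored with accuracy
Xc, yc = make_classification(n_samples=400, n_features=8, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(Xc_train, yc_train)
acc = metrics.accuracy_score(yc_test, clf.predict(Xc_test))

print(f"regressor R^2: {r2:.3f}")
print(f"classifier accuracy: {acc:.3f}")
```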