The objective of this notebook is to attempt to predict the type of sentence given information on the crime using machine learning. See below for examples of the inputs and outputs.
# note: to run the two cells below, you will first have to load the dataset (in a cell below)
# This is an example of the input data that we have to make the prediction (X/input)
sentencing_processed.sample(2).drop(target_raw+target_processed, axis=1)
# These are all of the Sentence Types that can be given (y/output/target)
list(sentencing_processed["categorical_sentence"].unique())
The improvement in accuracy when using CatBoost is fairly modest (roughly 10 percentage points over the DummyClassifier baseline: 0.69 versus 0.58 in the table below). This might be because the information in each row is not sufficient to discern the true gravity of a crime.
An example is UPDATED_OFFENSE_CATEGORY. Two cases might have the same value in
this feature but may be vastly different:
# Training Examples | Model | Test Accuracy (default hyper-parameters) | Time to run |
---|---|---|---|
10,000 | DummyClassifier | 0.58 | 0s |
10,000 | LogisticRegression | 0.65 | 1min 41s |
10,000 | RandomForest | 0.64 | 26s |
10,000 | XGBoost | 0.67 | 11s |
10,000 | CatBoost | 0.67 | 15s |
50,000 | DummyClassifier | 0.58 | 0s |
50,000 | RandomForest | 0.67 | 2min 55s |
50,000 | XGBoost | 0.67 | 1min 13s |
50,000 | CatBoost | 0.69 | 20s |
100,000 | DummyClassifier | 0.58 | 0s |
100,000 | CatBoost | 0.69 | 22s |
100,000 | XGBoost | 0.67 | 1min 54s |
184,000 | DummyClassifier | 0.58 | 0s |
184,000 | CatBoost | 0.69 | 23s |
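Rows like the ones in the table above could be collected with a loop such as the following. This is only a minimal sketch: it uses synthetic data from `make_classification` and two scikit-learn models as stand-ins, so the sample sizes, models, and numbers are assumptions and will not match the table.

```python
import time

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the sentencing dataset).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

models = {
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

rows = []
for n in (200, 1000):  # number of training examples to use
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train[:n], y_train[:n])
        elapsed = time.perf_counter() - start
        # Always score on the same held-out test set so rows are comparable.
        rows.append((n, name, model.score(X_test, y_test), elapsed))

for n, name, accuracy, elapsed in rows:
    print(f"{n} | {name} | {accuracy:.2f} | {elapsed:.2f}s |")
```

Scoring every row on the same test set is what makes the accuracy column comparable across training sizes.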
... [CatBoost] is very stable to changing hyperparameters when you have enough training data. It usually provides almost optimal results with default parameters, so one could save some time on parameter tunning.
... [SHAP] is a game theoretic approach to explain the output of any machine learning model.
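To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a tiny made-up scoring function by averaging each feature's marginal contribution over every feature ordering. The function `toy_model`, its coefficients, and the feature names are hypothetical; the shap library itself relies on much faster, model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def toy_model(features):
    """Hypothetical scoring function with an interaction term (not the notebook's model)."""
    return (2.0 * features["age"]
            + 1.0 * features["charges"]
            + 0.5 * features["age"] * features["charges"])


def exact_shapley(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)  # start from the baseline input
        previous_output = model(current)
        for name in ordering:
            current[name] = instance[name]  # switch this feature to its real value
            new_output = model(current)
            totals[name] += new_output - previous_output
            previous_output = new_output
    return {name: total / len(orderings) for name, total in totals.items()}


instance = {"age": 1.0, "charges": 2.0}
baseline = {"age": 0.0, "charges": 0.0}
contributions = exact_shapley(toy_model, instance, baseline)
print(contributions)
```

The contributions always add up to the difference between the model output for the instance and for the baseline, which is the property that makes the per-feature values in the SHAP plots interpretable.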
Action | Feature | Updated Value | Updated Sentence Category |
---|---|---|---|
Start | Start | Black Male Age z=0.5 "Possession of Controlled Substance" | Prison |
Change | Race | White | Probation |
Revert | Race | Black | Prison |
Change | Gender | Female | Probation |
Increase | Age | z=0.9 | Prison |
Revert | Age | z=0.5 | Probation |
Change | Crime | "Armed Robbery" | Prison |
This project was fairly complex and the subject matter is rather serious.
Please read the disclaimers and warnings in Readme.md before drawing any conclusions.
import numpy as np
import pandas as pd
from datetime import datetime
import time
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV
import re
import random
import scipy.stats
import matplotlib.pyplot as plt
! pip install shap
import shap
shap.initjs()
# import scipy.stats
# import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.base import BaseEstimator, ClassifierMixin
# from sklearn.feature_selection import RFE, RFECV
! pip install catboost
! pip install scikit-optimize
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier#, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet, SGDClassifier, SGDRegressor
from catboost import CatBoostClassifier, CatBoostRegressor, Pool, CatBoost
from lightgbm import LGBMClassifier, LGBMRegressor
from xgboost import XGBClassifier, XGBRegressor
# Bizarre problem: after importing any of lightgbm, xgboost, Ridge, Lasso, ElasticNet, SGDClassifier or SGDRegressor,
# enabling the options below breaks any operation on a dataframe (including df.head()), causing it to hang indefinitely.
# pd.options.display.max_rows = None # to stop pandas from truncating the displayed rows
# pd.options.display.max_columns = None # to stop pandas from not displaying all columns because of screen width
# pd.options.display.max_colwidth = 100 # to prevent pandas from truncating very long column values. Set to 0.
# Update path before using - must point to the Cook County Sentencing Dataset _after_ it has been pre-processed by Sentencing_data_cleaning.ipynb.
data_path_colab = "/content/drive/My Drive/Colab Notebooks/Sentencing_processed_data.csv"
data_path_local = "Sentencing_processed_data.csv"
sentencing_processed = pd.read_csv(data_path_colab,
parse_dates=["DISPOSITION_DATE", "SENTENCE_DATE",
"INCIDENT_BEGIN_DATE", "INCIDENT_END_DATE",
"ARREST_DATE", "ARRAIGNMENT_DATE", "RECEIVED_DATE"],
index_col=0)
sentencing = sentencing_processed.copy()
# Time encoding for "ARREST_DATE"
sentencing["month"] = sentencing["ARREST_DATE"].apply(lambda x: x.month)
# sin/cos for seasonality
sentencing["month_sin"] = np.sin(2*np.pi*sentencing["month"]/12)
sentencing["month_cos"] = np.cos(2*np.pi*sentencing["month"]/12)
# linear encoding
min_date = min(sentencing["ARREST_DATE"])
sentencing["days_number"] = (sentencing["ARREST_DATE"] - min_date).dt.days
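The reason for the sin/cos pair is that raw month numbers treat December (12) and January (1) as far apart, while on the unit circle they are neighbors. A minimal sketch of that effect, using a hypothetical helper `encode_month` that mirrors the formula above:

```python
import numpy as np


def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])


december, january, june = encode_month(12), encode_month(1), encode_month(6)

# Raw month numbers put December and January 11 units apart.
raw_gap = abs(12 - 1)
# On the circle, adjacent months are a short distance apart and opposite months are far apart.
dec_jan_gap = np.linalg.norm(december - january)
dec_jun_gap = np.linalg.norm(december - june)
print(raw_gap, round(float(dec_jan_gap), 3), round(float(dec_jun_gap), 3))
```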
df_train_1, df_test_1 = train_test_split(sentencing, test_size=0.2, random_state=123)
print(len(df_test_1) / (len(df_test_1) + len(df_train_1)))
len(df_train_1)
y_train_type = df_train_1["categorical_sentence"] # for categorical_sentence
y_train_length = df_train_1["sentence_period_years"] # for sentence_period_years
y_test_type = df_test_1["categorical_sentence"]
y_test_length = df_test_1["sentence_period_years"]
# categorize features for preprocessing
# includes all features that are an outcome of the judicial process and not the crime except for any features
# that are used to classify the crime (since I have no other way of knowing what the crime is).
drop_features = ["CASE_ID", "CASE_PARTICIPANT_ID", "CHARGE_ID", "CHARGE_VERSION_ID", "LENGTH_OF_CASE_in_Days", "SENTENCE_PHASE",
"SENTENCE_TYPE", "COMMITMENT_TYPE", "CURRENT_SENTENCE", "SENTENCE_JUDGE",
"CHARGE_DISPOSITION_REASON", "COURT_NAME", "COURT_FACILITY", "RECEIVED_DATE",
"DISPOSITION_DATE", "SENTENCE_DATE", "INCIDENT_BEGIN_DATE", "INCIDENT_END_DATE", "ARRAIGNMENT_DATE",
"ARREST_DATE", "month", "DISPOSITION_CHARGED_AOIC"]
# drop length of case since this is information from after sentencing
# Drop sentence/commitment type since it has been merged into categorical_sentence
# CHARGE_DISPOSITION_REASON - too many missing values
# ARREST_DATE is dropped because we use the processed date features. month is an intermediate column.
numeric_features = ["AGE_AT_INCIDENT", "month_sin", "month_cos", "days_number", "CHARGE_COUNT"]
# features to be one-hot encoded
categorical_features = ["OFFENSE_CATEGORY", "DISPOSITION_CHARGED_OFFENSE_TITLE", "CHARGE_DISPOSITION",
"GENDER", "RACE", "UPDATED_OFFENSE_CATEGORY",
"DISPOSITION_CHARGED_CHAPTER", "DISPOSITION_CHARGED_ACT", "DISPOSITION_CHARGED_SECTION",
"DISPOSITION_CHARGED_CLASS", "INCIDENT_CITY", "LAW_ENFORCEMENT_AGENCY", "UNIT",
"PRIMARY_CHARGE"] #use one-hot encoding with drop first
# UNIT is department of police force which is involved
# AOIC refers to Administrative Office of the Illinois Courts ID
# PRIMARY_CHARGE is boolean
# ordinal encoding
ordinal_features = []
# what we are predicting (y)
target_raw = ["COMMITMENT_TERM", "COMMITMENT_UNIT"] # raw target; will be dropped
target_processed = ["categorical_sentence", "sentence_period_years"]
drop_features = drop_features + target_raw + target_processed
temp_a = list(drop_features + numeric_features + categorical_features + ordinal_features)
temp_a.sort()
temp_b = list(sentencing.columns)
temp_b.sort()
assert (temp_a == temp_b), "Columns do not match"
# Drop target columns - skip if running again
# df_train = df_train.drop(columns=target_raw+target_processed, axis=1, errors='ignore')
# df_test = df_test.drop(columns=target_raw+target_processed, axis=1, errors='ignore')
df_train = df_train_1[numeric_features+categorical_features].copy() # copy to avoid pandas SettingWithCopyWarning on the assignments below
df_test = df_test_1[numeric_features+categorical_features].copy()
df_train[numeric_features] = df_train[numeric_features].astype('float') # ensure all numeric fields are float
# df_train["PRIMARY_CHARGE"] = df_train["PRIMARY_CHARGE"].astype(str) # convert boolean to string
df_train[categorical_features] = df_train[categorical_features].astype(str) # ensure no floats
df_test[numeric_features] = df_test[numeric_features].astype('float')
# df_test["PRIMARY_CHARGE"] = df_test["PRIMARY_CHARGE"].astype(str)
df_test[categorical_features] = df_test[categorical_features].astype(str)
# Note: The features kept above omit the sentence type, which would be needed to predict the sentence duration.
# Ignoring this for now, but it means that duration should not be predicted until this is fixed.
categorical_transformer_cat = Pipeline([
('imputer', SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='?')),
])
categorical_transformer_ohe = Pipeline([
('imputer', SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='?')),
('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore')) # note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
])
numeric_transformer_cat = Pipeline([
('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
('scaler', StandardScaler())
])
numeric_transformer_ohe = Pipeline([
('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
('scaler', StandardScaler())
])
preprocessor_cat = ColumnTransformer([
('numeric', numeric_transformer_cat, numeric_features),
('categorical', categorical_transformer_cat, categorical_features)
], remainder='drop')
preprocessor_ohe = ColumnTransformer([
('numeric', numeric_transformer_ohe, numeric_features),
('categorical', categorical_transformer_ohe, categorical_features)
], remainder='drop')
df_train.isna().sum()
preprocessor_ohe.fit(df_train)
preprocessor_cat.fit(df_train)
ohe = preprocessor_ohe.named_transformers_['categorical'].named_steps['onehot']
ohe_feature_names = list(ohe.get_feature_names_out(categorical_features)) # get_feature_names() was removed in scikit-learn 1.2
new_columns_cat = numeric_features + categorical_features
new_columns_ohe = numeric_features + ohe_feature_names
X_train_cat = pd.DataFrame(preprocessor_cat.transform(df_train), index=df_train.index, columns=new_columns_cat)
X_test_cat = pd.DataFrame(preprocessor_cat.transform(df_test), index=df_test.index, columns=new_columns_cat)
X_train_ohe = pd.DataFrame(preprocessor_ohe.transform(df_train), index=df_train.index, columns=new_columns_ohe)
X_test_ohe = pd.DataFrame(preprocessor_ohe.transform(df_test), index=df_test.index, columns=new_columns_ohe)
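Because the numeric pipelines end in StandardScaler, numeric columns such as AGE_AT_INCIDENT in the transformed frames are z-scores rather than years. The sketch below shows how a z-score could be mapped back to the original unit; the ages are made up for illustration. In this notebook the fitted scaler should be reachable through `preprocessor_cat.named_transformers_['numeric'].named_steps['scaler']`, although that access path is not exercised here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[18.0], [25.0], [32.0], [47.0], [60.0]])  # made-up ages, for illustration only
scaler = StandardScaler().fit(ages)

z_scores = scaler.transform(ages)
recovered = scaler.inverse_transform(z_scores)  # back to years

# Manual version of the same inversion for a single z-score.
z = 0.5
age_for_z = z * scaler.scale_[0] + scaler.mean_[0]
print(recovered.ravel(), age_for_z)
```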
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
# replace any [, ], < in feature name since XGBoost has problems with it
X_train_ohe.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train_ohe.columns.values]
X_test_ohe.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test_ohe.columns.values]
# replace special characters since LightGBM has problems otherwise - the code below causes issues with XGBoost because it creates non-unique feature names
# X_train_ohe.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in X_train_ohe.columns]
# X_test_ohe.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in X_test_ohe.columns]
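One possible way around the non-unique names problem is to sanitize the names and then add a numeric suffix whenever two cleaned names collide. This is only a sketch: `sanitize_unique` is a hypothetical helper, the example column names are invented, and it has not been tried against LightGBM or XGBoost in this notebook.

```python
import re


def sanitize_unique(names):
    """Replace non-alphanumeric characters with "_" and suffix collisions to keep names unique."""
    used = set()
    result = []
    for name in names:
        clean = re.sub(r"[^0-9a-zA-Z_]", "_", str(name))
        candidate, suffix = clean, 1
        # Keep bumping the suffix until the name has not been used yet.
        while candidate in used:
            suffix += 1
            candidate = f"{clean}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


columns = ["UNIT_District 1", "UNIT_District-1", "RACE_White [Hispanic]"]  # invented examples
clean_columns = sanitize_unique(columns)
print(clean_columns)
```

Since the train and test frames share the same column order, applying the helper to each of them would produce the same renamed columns in both.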
modelDC = DummyClassifier(strategy="most_frequent")
# All of the non-CatBoost models are disabled because they only run on CPU and
# are too slow (takes about 1 day). You can uncomment these lines to run them
# but be sure to initially run on a sample of n=1000 or less.
# modelLR = LogisticRegression(max_iter=1000)
# modelXG = XGBClassifier()
# modelLB = LGBMClassifier()
# Note change task_type to 'CPU' if computer does not have an NVIDIA graphics card and CUDA installed.
# Note: running on CPU is about 400 times slower (and will take about a day).
modelCB = CatBoostClassifier(cat_features=categorical_features, task_type="GPU", verbose=1000)
Run Time with n=100
n=len(X_train_cat) # full dataset
# Removing all except Catboost since they are too slow for running on the whole dataset.
print("DummyClassifier")
%timeit -n1 -r1 modelDC.fit(X_train_ohe.head(n), y_train_type.head(n))
# print("\nLogisticRegression")
# %timeit -n1 -r1 modelLR.fit(X_train_ohe.head(n), y_train_type.head(n))
# print("\nXGBoost")
# %timeit -n1 -r1 modelXG.fit(X_train_ohe.head(n), y_train_type.head(n))
print("\nCatBoost")
%timeit -n1 -r1 modelCB.fit(X_train_cat.head(n), y_train_type.head(n))
# LightGBM has problems with the feature names
# print("\nLightGBM")
# %timeit -n1 -r1 modelLB.fit(X_train_ohe.head(n), y_train_type.head(n))
# Print default catboost parameters
# modelCB.get_all_params()
print("DummyClassifier")
print(f"Train: {modelDC.score(X_train_ohe.head(n), y_train_type.head(n))}")
print(f"Test: {modelDC.score(X_test_ohe.head(n), y_test_type.head(n))}")
# print("\nLogisticRegression")
# print(f"Train: {modelLR.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelLR.score(X_test_ohe.head(n), y_test_type.head(n))}")
# print("\nXGBoost")
# print(f"Train: {modelXG.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelXG.score(X_test_ohe.head(n), y_test_type.head(n))}")
print("\nCatBoost")
print(f"Train: {modelCB.score(X_train_cat.head(n), y_train_type.head(n))}")
print(f"Test: {modelCB.score(X_test_cat.head(n), y_test_type.head(n))}")
# print("\nLightGBM")
# print(f"Train: {modelLB.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelLB.score(X_test_ohe.head(n), y_test_type.head(n))}")
explainer = shap.TreeExplainer(modelCB)
shap_values = explainer.shap_values(X_train_cat)
# shap_values.shape
shap.initjs()
# order was verified manually by looking at which column of predict_proba was the highest
class_labels = y_train_type.unique()
class_labels.sort()
class_labels
shap.summary_plot(shap_values, X_train_cat, class_names=class_labels)
Notes:
The only features that make sense in these plots are:
Note: Ignore days_number (an integer value of the date of arrest from some starting date) and all of the other features.
Below, I have analyzed feature importance for 3 out of the 13 possible classes in the target (categorical_sentence).
For predicting bootcamp, the most important feature is AGE_AT_INCIDENT. Younger people very rarely get a BootCamp sentence.
The effect of age on the likelihood of being sentenced to Life is not monotonic (there is no clear gradation from red to blue).
Charge count very clearly affects the probability of getting a Life sentence, with more charges making a Life sentence more likely.
class_number = 9
print(f"Shap plot for prediction class - categorical_sentence={class_labels[class_number]}")
shap.summary_plot(shap_values[class_number], X_train_cat)
Age very strongly affects the likelihood of being sentenced to Drug Court Probation, with older people being much more likely to be given this sentence.
Also, from the charge count, it is clear that the more charges associated with a person, the less likely they are to be sentenced to Drug Court Probation (and therefore given a much more severe punishment).
class_number = 3
print(f"Shap plot for prediction class - categorical_sentence={class_labels[class_number]}")
shap.summary_plot(shap_values[class_number], X_train_cat)
I tested this for CatBoost, and optimizing hyperparameters does not seem worthwhile since CatBoost automatically selects very good hyperparameters (this is a feature of CatBoost).
Even after running tens of folds with parameters fairly close to the values CatBoost selects, the best score was still slightly lower than with the automatic parameters.
The code below is left for reference.
https://effectiveml.com/using-grid-search-to-optimise-catboost-parameters.html
CatBoost hyperparameters max reasonable range:
Recommended parameter space:
https://github.com/talperetz/hyperspace/tree/master/GBDTs
Another good resource:
https://towardsdatascience.com/https-medium-com-talperetz24-mastering-the-new-generation-of-gradient-boosting-db04062a7ea2
# catboost_params_space = {
# "n_estimators" : scipy.stats.randint(low=500, high=1500), # too large takes too long to run
# "learning_rate": [0.001, 0.01, 0.1, 1],
# "l2_leaf_reg": scipy.stats.randint(low=1, high=100),
# "bootstrap_type": ["Bernoulli", "No", "Poisson", "Bayesian"],
# "one_hot_max_size": scipy.stats.randint(low=10, high=50),
# "max_depth": scipy.stats.randint(low=1, high=11), # too large takes too long to run
# "random_strength": scipy.stats.randint(low=0, high=50),
# "border_count": scipy.stats.randint(low=100, high=255),
# }
# Ran above configuration for 15 folds and it was worse than default parameters
# which means that catboost does a good job selecting default parameters.
# Trying to optimize only a few parameters that are known to be important.
catboost_params_space = {
"n_estimators" : scipy.stats.randint(low=500, high=1500), # default=1000; too large takes too long to run
"one_hot_max_size": scipy.stats.randint(low=2, high=20), # default=2
"max_depth": scipy.stats.randint(low=3, high=11), # default=6; too large takes too long to run
}
# Randomized Search. Do not run in parallel if using CatBoost.
random_search = RandomizedSearchCV(modelCB, param_distributions = catboost_params_space,
n_iter = 5, cv=2, verbose=2,
random_state=123, n_jobs=1)
# Uncomment this code to run. Warning will take a long time even when running on GPU.
# n=len(X_train_cat)
# %timeit -n1 -r1 random_search.fit(X_train_cat.head(n), y_train_type.head(n));
# print(f"Best Parameters - {random_search.best_params_}")
# print(f"Best Score - {random_search.best_score_}")
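To preview which parameter combinations a randomized search would draw without fitting any model, the same kind of parameter space can be passed to scikit-learn's `ParameterSampler`. The sketch below repeats the space in a separate variable (`params_space`, a name chosen here) so that it runs on its own. Note that the upper bound of `scipy.stats.randint` is exclusive, so `max_depth` is sampled from 3 to 10.

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

# Same ranges as the search space above, repeated so the sketch is self-contained.
params_space = {
    "n_estimators": scipy.stats.randint(low=500, high=1500),
    "one_hot_max_size": scipy.stats.randint(low=2, high=20),
    "max_depth": scipy.stats.randint(low=3, high=11),
}

# Draw 5 candidate settings; nothing is trained here.
candidates = list(ParameterSampler(params_space, n_iter=5, random_state=123))
for candidate in candidates:
    print(candidate)
```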
# Each crime may have multiple charges, so we must ensure that two rows from the same case do not end up in different train/test splits
# because this would leak data into the test set.
# Note that during cross validation, I ignore this because it is too much trouble to do the splits properly, so the train scores will likely be inflated
# compared to the test scores, but since the test split is done correctly, the test score will be valid.
# Technically, I shouldn't need to do this if my understanding of how sentencing works (each case gets an independent sentence) is correct,
# but doing this to be on the safe side.
# sentencing_sorted = sentencing.sort_values(by=["CASE_ID"])
# random.seed(a=123, version=2) # for reproducibility
# df_train = pd.DataFrame()
# df_test = pd.DataFrame()
# split_ratio = 0.2
# # estimated to take exactly 1 hour on 233k examples
# previous_id = 0
# last_set = 0; # "train" or "test"
# total_length = len(sentencing_sorted)
# for i in range(len(sentencing_sorted)):
# if (i%100 == 0): print(i, time.time())
# curr_line = sentencing_sorted.iloc[i]
# if curr_line["CASE_ID"] == previous_id:
# df_train.append(curr_line) if (last_set=="train") else df_test.append(curr_line)
# else:
# # sample random number to decide which dataset
# if (random.random() < split_ratio):
# last_set = "train"
# df_train.append(curr_line)
# else:
# last_set = "test"
# df_test.append(curr_line)
# print("Done split")
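A group-aware split like the one attempted in the commented-out loop above can also be done with scikit-learn's `GroupShuffleSplit`, which keeps all rows sharing a group key on the same side of the split. The sketch below uses a small invented frame rather than the sentencing data, and it was not used for the results reported in this notebook. Note that `test_size` here is a fraction of the groups (cases), not of the rows.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Invented toy data: several rows (charges) can share one CASE_ID.
toy = pd.DataFrame({
    "CASE_ID": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "CHARGE_COUNT": [2, 2, 1, 3, 3, 3, 1, 2, 2, 1],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
train_idx, test_idx = next(splitter.split(toy, groups=toy["CASE_ID"]))
toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]

# No case should appear on both sides of the split.
shared_cases = set(toy_train["CASE_ID"]) & set(toy_test["CASE_ID"])
print(len(toy_train), len(toy_test), shared_cases)
```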
# do nothing estimator; https://scikit-learn.org/stable/developers/develop.html
# Useful if you want to use remainder="drop" when building ColumnTransformer,
# but you still want to keep select features without processing them.
class Nothing(BaseEstimator, ClassifierMixin):
    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param

    def fit(self, X, y):
        # Do nothing
        print()
        return self

    def predict(self, X):
        # Do nothing
        return None

    def transform(self, data):
        # return data without doing anything
        return data

passthrough_transformer = Pipeline([
    ('do_nothing', Nothing())
])
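As an alternative to a custom do-nothing estimator, `ColumnTransformer` accepts the string `"passthrough"` in place of a transformer, which keeps the selected columns unchanged while `remainder="drop"` still discards everything else. A minimal sketch on an invented frame (the `unused` column is hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "AGE_AT_INCIDENT": [20.0, 30.0, 40.0],
    "CHARGE_COUNT": [1.0, 2.0, 3.0],
    "unused": [0, 0, 0],  # hypothetical column that should be dropped
})

preprocessor = ColumnTransformer([
    ("scaled", StandardScaler(), ["AGE_AT_INCIDENT"]),
    ("kept", "passthrough", ["CHARGE_COUNT"]),  # kept as-is, no processing
], remainder="drop")

transformed = preprocessor.fit_transform(toy)
print(transformed)
```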
Something to try which might improve the model:
example = X_train_cat.head(1).copy()
example
# Prediction at start
modelCB.predict(example.head(1))
# Change race from Black to White:
example.loc[56210,"RACE"] = "White"
modelCB.predict(example)
# Change race back to Black:
example.loc[56210,"RACE"] = "Black"
modelCB.predict(example)
# Change Gender from Male to Female:
example.loc[56210,"GENDER"] = "Female"
modelCB.predict(example)
# Make the person older than the current age (z-score=0.489372)
example.loc[56210,"AGE_AT_INCIDENT"] = 0.9
modelCB.predict(example)
# Make the person younger again
example.loc[56210,"AGE_AT_INCIDENT"] = 0.489372
modelCB.predict(example)
# Change crime from "Possession of Controlled Substance" to "Armed Robbery"
example.loc[56210,"OFFENSE_CATEGORY"] = "Armed Robbery"
example.loc[56210,"DISPOSITION_CHARGED_OFFENSE_TITLE"] = "ARMED ROBBERY"
example.loc[56210,"UPDATED_OFFENSE_CATEGORY"] = "Armed Robbery"
example.loc[56210,"DISPOSITION_CHARGED_CHAPTER"] = "720"
example.loc[56210,"DISPOSITION_CHARGED_ACT"] = "5"
example.loc[56210,"DISPOSITION_CHARGED_SECTION"] = "18-2(a)(2)"
example.loc[56210,"DISPOSITION_CHARGED_CLASS"] = "X"
modelCB.predict(example)
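The change-and-revert steps above mutate `example` in place and rely on the hard-coded index 56210. A small helper that applies the changes to a copy could make such what-if checks easier to repeat. This is a sketch only: `predict_with_changes` is a hypothetical helper, and `RuleModel` is an invented stand-in so the code runs without the trained CatBoost model; its outputs say nothing about the real model.

```python
import pandas as pd


def predict_with_changes(model, row, changes):
    """Return the model's prediction for a copy of `row` with `changes` applied."""
    modified = row.assign(**changes)  # assign() returns a new frame; `row` is untouched
    return model.predict(modified)


class RuleModel:
    """Hypothetical stand-in for the trained classifier, used only to run the sketch."""

    def predict(self, frame):
        return ["Prison" if crime == "Armed Robbery" else "Probation"
                for crime in frame["OFFENSE_CATEGORY"]]


row = pd.DataFrame({"GENDER": ["Male"], "OFFENSE_CATEGORY": ["Narcotics"]}, index=[56210])
model = RuleModel()

before = predict_with_changes(model, row, {})
after = predict_with_changes(model, row, {"OFFENSE_CATEGORY": "Armed Robbery"})
print(before, after, row.loc[56210, "OFFENSE_CATEGORY"])
```

Because every call starts from the unmodified row, no explicit revert step is needed between experiments.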