Predicting Sentence Type from Data on the Crime¶

The objective of this notebook is to attempt to predict the type of sentence given information on the crime using machine learning. See below for examples of the inputs and outputs.

# note: to run the two cells below, you will first have to load the dataset (in a cell below) 
# This is an example of the input data that we have to make the prediction (X/input)
sentencing_processed.sample(2).drop(target_raw+target_processed, axis=1)

# These are all of the Sentence Types that can be given (y/output/target)
list(sentencing_processed["categorical_sentence"].unique())

Results and discussion¶

Test scores¶

DummyClassifier - 0.58
CatBoost - 0.69

The improvement in accuracy when using CatBoost is fairly modest (about 11 percentage points over the DummyClassifier baseline). This might be because the information in each row is not sufficient to discern the true gravity of a crime.
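
To make the comparison concrete, the sketch below shows what a most-frequent baseline does and how the gap to a model can be measured. The label list is made up for illustration (it is not the sentencing data); only the two accuracy figures, 0.58 and 0.69, come from the scores above.

```python
from collections import Counter

def most_frequent_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Hypothetical toy labels, for illustration only.
toy_labels = ["Prison"] * 6 + ["Probation"] * 3 + ["Boot Camp"]
toy_baseline = most_frequent_accuracy(toy_labels)

# Test scores reported above.
dummy_acc, catboost_acc = 0.58, 0.69
absolute_gain = catboost_acc - dummy_acc    # difference in accuracy
relative_gain = absolute_gain / dummy_acc   # gain relative to the baseline

print(f"toy baseline accuracy: {toy_baseline:.2f}")
print(f"absolute gain: {absolute_gain * 100:.0f} percentage points")
print(f"relative gain: {relative_gain:.0%}")
```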

An example is UPDATED_OFFENSE_CATEGORY. Two cases might have the same value in this feature but may be vastly different:

For instance, 'Burglary' may be tried as a misdemeanour (which usually carries a prison sentence of less than 1 year) if it took place in a warehouse, did not involve weapons and the value of goods stolen was low.
On the other hand, if a burglary was carried out with weapons, injured innocent bystanders and involved goods of large monetary value, then it will very likely be tried as a felony with a minimum of 1 year in prison.

Model Information¶

Model Selection¶

I have experimented with CatBoost, XGBoost and SKLearn's Logistic Regression and RandomForestClassifier.
I was only able to train on the entire dataset in a reasonable amount of time using CatBoost, which implements gradient-boosted decision trees and is well known for its strong performance. A distinctive feature of CatBoost is that it handles categorical variables natively (hence the 'Cat' in CatBoost) and does not require one-hot encoding. This is particularly useful for this dataset since most of the features are categorical and one-hot encoding produces about 6000 features. Also, CatBoost can run on the GPU, which makes it very fast.
I tried XGBoost on a smaller subset of the dataset. XGBoost was fairly fast on GPU, but I was unable to run it on the entire dataset without Colab crashing due to insufficient RAM. I believe this might be because XGBoost uses the one-hot encoded data and the large number of features causes a problem. Accuracy-wise, XGBoost performed very similarly to CatBoost (both without tuning).
I also tried SKLearn LogisticRegression and RandomForestClassifier. Both of these algorithms are too slow to run on the entire dataset because they do not have GPU support.
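
The width of a one-hot encoded table is the total number of distinct values across all categorical columns, which is why high-cardinality columns inflate the feature count. A small sketch on made-up rows (the values below are hypothetical and the helper is not part of the notebook):

```python
def one_hot_width(rows, categorical_columns):
    """Number of columns produced by one-hot encoding the given columns."""
    return sum(len({row[col] for row in rows}) for col in categorical_columns)

# Hypothetical rows, for illustration only.
rows = [
    {"RACE": "A", "GENDER": "Male", "INCIDENT_CITY": "City 1"},
    {"RACE": "B", "GENDER": "Female", "INCIDENT_CITY": "City 2"},
    {"RACE": "A", "GENDER": "Male", "INCIDENT_CITY": "City 3"},
    {"RACE": "C", "GENDER": "Male", "INCIDENT_CITY": "City 4"},
]
columns = ["RACE", "GENDER", "INCIDENT_CITY"]

# 3 categorical columns become 3 + 2 + 4 = 9 one-hot columns.
print(one_hot_width(rows, columns))
```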
Summary of test scores:

# Training Examples	Model	Test Accuracy (default hyper-parameters)	Time to run
10,000	DummyClassifier	0.58	0s
10,000	LogisticRegression	0.65	1min 41s
10,000	RandomForest	0.64	26s
10,000	XGBoost	0.67	11s
10,000	CatBoost	0.67	15s

50,000	DummyClassifier	0.58	0s
50,000	RandomForest	0.67	2min 55s
50,000	XGBoost	0.67	1min 13s
50,000	CatBoost	0.69	20s

100,000	DummyClassifier	0.58	0s
100,000	CatBoost	0.69	22s
100,000	XGBoost	0.67	1min 54s

184,000	DummyClassifier	0.58	0s
184,000	CatBoost	0.69	23s

Hyper-Parameter Tuning¶

Since CatBoost was the only model that works with the entire dataset in a reasonable amount of time, it was the only model that I attempted to tune.
I used RandomizedSearchCV, but even with 15 different hyper-parameters, I was unable to beat the accuracy of the default hyper-parameters.
CatBoost is known for picking good default hyper parameters. This is what the lead Developer for CatBoost has said about hyper parameter tuning:
... [CatBoost] is very stable to changing hyperparameters when you have enough training data. It usually provides almost optimal results with default parameters, so one could save some time on parameter tunning.
I did not attempt hyper-parameter tuning on the other models since I could not run them using the entire dataset.
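
For readers unfamiliar with randomized search, the idea is to sample a fixed number of hyper-parameter combinations from given distributions and keep the best-scoring one. The sketch below shows that sampling loop in plain Python. The scoring function is a made-up stand-in for cross-validated accuracy, and the parameter ranges are illustrative rather than the ones used in this notebook.

```python
import random

def sample_params(rng):
    """Draw one hyper-parameter combination (illustrative ranges)."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform between 0.001 and 1
        "max_depth": rng.randint(4, 10),
    }

def toy_score(params):
    """Hypothetical stand-in for cross-validated accuracy; best at lr=0.1, depth=6."""
    return 1.0 - abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["max_depth"] - 6)

def random_search(n_iter, seed=0):
    """Evaluate n_iter random combinations and return the best (score, params)."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_iter)]
    return max(((toy_score(p), p) for p in candidates), key=lambda pair: pair[0])

best_score, best_params = random_search(n_iter=15)
print(best_score, best_params)
```

With a small number of draws from a space like this, it is quite possible that no sample lands closer to the optimum than a well-chosen default.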

Feature Importance Analyzed Using SHAP ¶

SHAP is a popular way to analyse why a model behaves the way it does. This is what the documentation says about it:
... [SHAP] is a game theoretic approach to explain the output of any machine learning model.
There was nothing particular that stood out. The features seem to affect the sentence type as you would expect.
The only mildly interesting observation is that Race and Gender are not particularly important in predicting the sentence type.
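
As background, the quantity SHAP estimates is the Shapley value: each feature's average marginal contribution over all orders in which features can be added. The sketch below computes exact Shapley values for a toy three-feature function by brute force. The function, feature names and baseline are invented for illustration, and the SHAP library itself uses much faster tree-specific algorithms.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orders."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)        # start from the baseline input
        previous = model(current)
        for name in order:
            current[name] = x[name]     # switch one feature to its actual value
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orders) for name, total in totals.items()}

def toy_model(f):
    """Hypothetical scoring function with an interaction between b and c."""
    return 2.0 * f["a"] + f["b"] * f["c"]

x = {"a": 3.0, "b": 2.0, "c": 1.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between `b` and `c`.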

Effect of Perturbing Input on Predicted Sentence Type¶

This is a test with a sample size of 1, so no firm conclusion can be drawn, but perturbing the demographic data of the example seemed to affect the output in exactly the ways I would expect.
Unfortunately, this suggests a bias in the sentencing which theoretically should not exist if the sentencing system is fair.
This goes against the finding from the SHAP analysis that features like Race and Gender are less significant in predicting the sentence type.
Perhaps that finding holds only when comparing race and gender to features like the crime committed.
However, in this example, the demographic changes were enough to change the sentence type.
Summary of experiment:

Action	Feature	Updated Value	Updated Sentence Category
Start	Start	Black Male Age z=0.5 "Possesion of Controlled Substance"	Prison
Change	Race	White	Probation
Revert	Race	Black	Prison
Change	Gender	Female	Probation
Increase	Age	z=0.9	Prison
Revert	Age	z=0.5	Probation
Change	Crime	"Armed Robbery"	Prison

Code in Appendix at bottom of Notebook.
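
The shape of the experiment can be expressed as a small helper that copies an example, changes one field, and reports the prediction before and after. This is a sketch only: `toy_predict` is a made-up rule standing in for the trained model, it deliberately uses only non-demographic fields, and its behaviour says nothing about the real data.

```python
def perturb_and_predict(predict, example, feature, new_value):
    """Return (old_prediction, new_prediction) after changing one feature on a copy."""
    changed = dict(example)         # leave the original example untouched
    changed[feature] = new_value
    return predict(example), predict(changed)

def toy_predict(row):
    """Hypothetical stand-in for model.predict on a single row."""
    if row["crime"] == "Armed Robbery" or row["age_z"] > 0.8:
        return "Prison"
    return "Probation"

start = {"crime": "Possession of Controlled Substance", "age_z": 0.5}
for feature, value in [("age_z", 0.9), ("crime", "Armed Robbery")]:
    before, after = perturb_and_predict(toy_predict, start, feature, value)
    print(f"{feature} -> {value!r}: {before} -> {after}")
```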

Further Work¶

An obvious next step would be to predict the duration of the predicted sentence. For example, once you predict a Prison sentence, what should its duration be?
I was intending to do this but could not because of time constraints.
The data has already been preprocessed with a column called 'sentence_period_years' for use in this step.
It would be interesting to run more of the input-perturbation experiments to see if my findings from the one experiment hold statistically for the dataset in general.
The performance of the non-CatBoost models on the entire dataset and with tuning is unknown. Since the accuracy scores were very close between all of the models, and since the other models do not advertise that they pick good default hyper-parameters, it is possible that they might do better than CatBoost.

Important Information¶

This project was fairly complex and the subject matter is rather serious.
Please read the disclaimers and warnings in Readme.md before drawing any conclusions.

Notice on Running this Notebook¶

CatBoost must be run on a GPU, otherwise the notebook will take hours to run. You can run it for free (which is what I did) on Google Colab.
Click here for instructions on how to enable GPU on Google Colab.

import numpy as np
import pandas as pd
from datetime import datetime
import time

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV

import re
import random
import scipy.stats

import matplotlib.pyplot as plt
! pip install shap
import shap
shap.initjs()

Collecting shap
  Downloading https://files.pythonhosted.org/packages/a8/77/b504e43e21a2ba543a1ac4696718beb500cfa708af2fb57cb54ce299045c/shap-0.35.0.tar.gz (273kB)
     |████████████████████████████████| 276kB 4.6MB/s eta 0:00:01
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from shap) (1.18.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from shap) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from shap) (0.22.2.post1)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from shap) (1.0.3)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.6/dist-packages (from shap) (4.38.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->shap) (0.14.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->shap) (1.12.0)
Building wheels for collected packages: shap
  Building wheel for shap (setup.py) ... done
  Created wheel for shap: filename=shap-0.35.0-cp36-cp36m-linux_x86_64.whl size=394123 sha256=d99fd6e292a222d69a3ee80191a774e9d0b3ef04cca902f8587284412b753425
  Stored in directory: /root/.cache/pip/wheels/e7/f7/0f/b57055080cf8894906b3bd3616d2fc2bfd0b12d5161bcb24ac
Successfully built shap
Installing collected packages: shap
Successfully installed shap-0.35.0

# import scipy.stats

# import matplotlib.pyplot as plt
# %matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.base import BaseEstimator, ClassifierMixin
# from sklearn.feature_selection import RFE, RFECV

! pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/94/ec/12b9a42b2ea7dfe5b602f235692ab2b61ee1334ff34334a15902272869e8/catboost-0.22-cp36-none-manylinux1_x86_64.whl (64.4MB)
     |████████████████████████████████| 64.4MB 47kB/s 
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.0.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from catboost) (3.2.1)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.18.2)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from catboost) (1.12.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost) (2.8.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (0.10.0)
Installing collected packages: catboost
Successfully installed catboost-0.22

! pip install scikit-optimize

Collecting scikit-optimize
  Downloading https://files.pythonhosted.org/packages/5c/87/310b52debfbc0cb79764e5770fa3f5c18f6f0754809ea9e2fc185e1b67d3/scikit_optimize-0.7.4-py2.py3-none-any.whl (80kB)
     |████████████████████████████████| 81kB 3.3MB/s eta 0:00:011
Requirement already satisfied: scikit-learn>=0.19.1 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (0.22.2.post1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (1.18.2)
Collecting pyaml>=16.9
  Downloading https://files.pythonhosted.org/packages/15/c4/1310a054d33abc318426a956e7d6df0df76a6ddfa9c66f6310274fb75d42/pyaml-20.4.0-py2.py3-none-any.whl
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (0.14.1)
Requirement already satisfied: scipy>=0.18.0 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (1.4.1)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.6/dist-packages (from pyaml>=16.9->scikit-optimize) (3.13)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-20.4.0 scikit-optimize-0.7.4

from sklearn.dummy import DummyClassifier, DummyRegressor

from sklearn.ensemble import RandomForestClassifier#, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet, SGDClassifier, SGDRegressor

from catboost import CatBoostClassifier, CatBoostRegressor, Pool, CatBoost
from lightgbm import LGBMClassifier, LGBMRegressor
from xgboost import XGBClassifier, XGBRegressor

# Bizarre problem: it seems that when you import one of lightgbm, xgboost, Ridge, Lasso, ElasticNet, SGDClassifier, SGDRegressor
# while the options below are enabled, any operation on a dataframe (including df.head()) hangs indefinitely.

# pd.options.display.max_rows = None  # to stop pandas from truncating the displayed rows of long dataframes
# pd.options.display.max_columns = None  # to stop pandas from not displaying all columns because of screen width
# pd.options.display.max_colwidth = 100  # To prevent pandas from truncating very long columns. Set to 0.

Load Data¶

# Update path before using - Must be given the Cook County Sentencing Dataset _after_ it has been pre-processed by Sentencing_data_cleaning.ipynb.
data_path_colab = "/content/drive/My Drive/Colab Notebooks/Sentencing_processed_data.csv"
data_path_local = "Sentencing_processed_data.csv"
sentencing_processed = pd.read_csv(data_path_colab,
                                  parse_dates=["DISPOSITION_DATE", "SENTENCE_DATE",
                                                 "INCIDENT_BEGIN_DATE", "INCIDENT_END_DATE",
                                                 "ARREST_DATE", "ARRAIGNMENT_DATE", "RECEIVED_DATE"],
                                  index_col=0)

Columns (8,9,12,15) have mixed types.Specify dtype option on import or set low_memory=False.

sentencing = sentencing_processed.copy()

# Time encoding for "ARREST_DATE"
sentencing["month"] = sentencing["ARREST_DATE"].dt.month

# sin/cos for seasonality
sentencing["month_sin"] = np.sin(2*np.pi*sentencing["month"]/12)
sentencing["month_cos"] = np.cos(2*np.pi*sentencing["month"]/12)

# linear encoding
min_date = sentencing["ARREST_DATE"].min()  # Series.min() skips missing dates (NaT), unlike the builtin min()
sentencing["days_number"] = (sentencing["ARREST_DATE"] - min_date).dt.days
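
The sine/cosine encoding above places the months on a circle, so December and January end up close together, which a plain integer month does not achieve. A minimal standalone sketch of that property (the helper names are illustrative and not part of the notebook):

```python
import math

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))

# Adjacent months are equally far apart, including across the year boundary.
print(round(distance(12, 1), 3), round(distance(6, 7), 3))
# Months half a year apart are the farthest apart (opposite points on the circle).
print(round(distance(1, 7), 3))
```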

df_train_1, df_test_1 = train_test_split(sentencing, test_size=0.2, random_state=123)
print(len(df_test_1) / (len(df_test_1) + len(df_train_1)))

0.2

len(df_train_1)

184852

y_train_type = df_train_1["categorical_sentence"]  # for categorical_sentence
y_train_length = df_train_1["sentence_period_years"]  # for sentence_period_years

y_test_type = df_test_1["categorical_sentence"]
y_test_length = df_test_1["sentence_period_years"]

Preprocess Data for Model Input¶

# categorize features for preprocessing

# includes all features that are an outcome of the judicial process and not the crime except for any features
# that are used to classify the crime (since I have no other way of knowing what the crime is). 
drop_features = ["CASE_ID", "CASE_PARTICIPANT_ID", "CHARGE_ID", "CHARGE_VERSION_ID", "LENGTH_OF_CASE_in_Days", "SENTENCE_PHASE",
                "SENTENCE_TYPE", "COMMITMENT_TYPE", "CURRENT_SENTENCE", "SENTENCE_JUDGE",
                "CHARGE_DISPOSITION_REASON", "COURT_NAME", "COURT_FACILITY", "RECEIVED_DATE",
                "DISPOSITION_DATE", "SENTENCE_DATE", "INCIDENT_BEGIN_DATE", "INCIDENT_END_DATE", "ARRAIGNMENT_DATE",
                "ARREST_DATE", "month", "DISPOSITION_CHARGED_AOIC"]
# drop length of case since this is information from after sentencing
# Drop sentence/commitment type since it has been merged into categorical_sentence
# CHARGE_DISPOSITION_REASON - too many missing values
# ARREST_DATE is dropped because we use the processed date features. month is an intermediate column.

numeric_features = ["AGE_AT_INCIDENT", "month_sin", "month_cos", "days_number", "CHARGE_COUNT"]

# categorical features: passed as-is to CatBoost, one-hot encoded for the other models
categorical_features = ["OFFENSE_CATEGORY", "DISPOSITION_CHARGED_OFFENSE_TITLE", "CHARGE_DISPOSITION",
                        "GENDER", "RACE", "UPDATED_OFFENSE_CATEGORY",
                        "DISPOSITION_CHARGED_CHAPTER", "DISPOSITION_CHARGED_ACT", "DISPOSITION_CHARGED_SECTION",
                        "DISPOSITION_CHARGED_CLASS", "INCIDENT_CITY", "LAW_ENFORCEMENT_AGENCY", "UNIT",
                        "PRIMARY_CHARGE"]
# UNIT is department of police force which is involved
# AOIC refers to Administrative Office of the Illinois Courts ID
# PRIMARY_CHARGE is boolean

# ordinal encoding
ordinal_features = []


# what we are predicting (y)
target_raw = ["COMMITMENT_TERM", "COMMITMENT_UNIT"] # raw target; will be dropped
target_processed = ["categorical_sentence", "sentence_period_years"]

drop_features = drop_features + target_raw + target_processed

temp_a  = list(drop_features + numeric_features + categorical_features + ordinal_features)
temp_a.sort()

temp_b = list(sentencing.columns)
temp_b.sort()

assert (temp_a == temp_b), "Columns do not match"
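
When the assertion above fails, it does not say which columns are the problem. A sketch of a more informative check using set differences (the function name and the example column lists are illustrative):

```python
def check_column_partition(expected, actual):
    """Raise ValueError naming columns that are unassigned, unknown or listed twice."""
    missing = sorted(set(actual) - set(expected))   # in the data, not assigned to any group
    unknown = sorted(set(expected) - set(actual))   # assigned to a group, not in the data
    duplicated = sorted({c for c in expected if expected.count(c) > 1})
    if missing or unknown or duplicated:
        raise ValueError(
            f"unassigned: {missing}; not in data: {unknown}; listed twice: {duplicated}"
        )

# Hypothetical column lists, for illustration only.
groups = ["CASE_ID", "RACE", "GENDER"]
data_columns = ["CASE_ID", "RACE", "GENDER", "UNIT"]
try:
    check_column_partition(groups, data_columns)
except ValueError as err:
    message = str(err)
    print(message)
```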

# Drop target columns - skip if running again
# df_train = df_train.drop(columns=target_raw+target_processed, axis=1, errors='ignore')
# df_test = df_test.drop(columns=target_raw+target_processed, axis=1, errors='ignore')

df_train = df_train_1[numeric_features+categorical_features].copy()  # copy to avoid SettingWithCopyWarning
df_test = df_test_1[numeric_features+categorical_features].copy()


df_train[numeric_features] = df_train[numeric_features].astype('float')  # ensure all numeric fields are float
# df_train["PRIMARY_CHARGE"] = df_train["PRIMARY_CHARGE"].astype(str)  # convert boolean to string
df_train[categorical_features] = df_train[categorical_features].astype(str)  # ensure no floats

df_test[numeric_features] = df_test[numeric_features].astype('float')
# df_test["PRIMARY_CHARGE"] = df_test["PRIMARY_CHARGE"].astype(str)
df_test[categorical_features] = df_test[categorical_features].astype(str)

# Note: The code above will lead to missing data for predicting the sentence duration as you should know what kind of sentence is given.
# Ignoring for now, but this means that I should not predict duration without fixing this

categorical_transformer_cat = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='?')),
])

categorical_transformer_ohe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='?')),
    ('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

numeric_transformer_cat = Pipeline([
    ('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
    ('scaler', StandardScaler())
])

numeric_transformer_ohe = Pipeline([
    ('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
    ('scaler', StandardScaler())
])

preprocessor_cat = ColumnTransformer([
    ('numeric', numeric_transformer_cat, numeric_features),
    ('categorical', categorical_transformer_cat, categorical_features)
], remainder='drop')

preprocessor_ohe = ColumnTransformer([
    ('numeric', numeric_transformer_ohe, numeric_features),
    ('categorical', categorical_transformer_ohe, categorical_features)
], remainder='drop')

df_train.isna().sum()

AGE_AT_INCIDENT                      2383
month_sin                            3752
month_cos                            3752
days_number                          3752
CHARGE_COUNT                            0
OFFENSE_CATEGORY                        0
DISPOSITION_CHARGED_OFFENSE_TITLE       0
CHARGE_DISPOSITION                      0
GENDER                                  0
RACE                                    0
UPDATED_OFFENSE_CATEGORY                0
DISPOSITION_CHARGED_CHAPTER             0
DISPOSITION_CHARGED_ACT                 0
DISPOSITION_CHARGED_SECTION             0
DISPOSITION_CHARGED_CLASS               0
INCIDENT_CITY                           0
LAW_ENFORCEMENT_AGENCY                  0
UNIT                                    0
PRIMARY_CHARGE                          0
dtype: int64

preprocessor_ohe.fit(df_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('numeric',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0)),
                                                 ('scaler',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True))],
                                          verbose=False),
                                 ['...
                                                                handle_unknown='ignore',
                                                                sparse=False))],
                                          verbose=False),
                                 ['OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_OFFENSE_TITLE',
                                  'CHARGE_DISPOSITION', 'GENDER', 'RACE',
                                  'UPDATED_OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_CHAPTER',
                                  'DISPOSITION_CHARGED_ACT',
                                  'DISPOSITION_CHARGED_SECTION',
                                  'DISPOSITION_CHARGED_CLASS', 'INCIDENT_CITY',
                                  'LAW_ENFORCEMENT_AGENCY', 'UNIT',
                                  'PRIMARY_CHARGE'])],
                  verbose=False)

preprocessor_cat.fit(df_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('numeric',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0)),
                                                 ('scaler',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True))],
                                          verbose=False),
                                 ['...
                                                                strategy='constant',
                                                                verbose=0))],
                                          verbose=False),
                                 ['OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_OFFENSE_TITLE',
                                  'CHARGE_DISPOSITION', 'GENDER', 'RACE',
                                  'UPDATED_OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_CHAPTER',
                                  'DISPOSITION_CHARGED_ACT',
                                  'DISPOSITION_CHARGED_SECTION',
                                  'DISPOSITION_CHARGED_CLASS', 'INCIDENT_CITY',
                                  'LAW_ENFORCEMENT_AGENCY', 'UNIT',
                                  'PRIMARY_CHARGE'])],
                  verbose=False)

preprocessor_cat.fit(df_train)
preprocessor_ohe.fit(df_train)

ohe = preprocessor_ohe.named_transformers_['categorical'].named_steps['onehot']
ohe_feature_names = list(ohe.get_feature_names(categorical_features))

new_columns_cat = numeric_features + categorical_features
new_columns_ohe = numeric_features + ohe_feature_names

X_train_cat = pd.DataFrame(preprocessor_cat.transform(df_train), index=df_train.index, columns=new_columns_cat)
X_test_cat  = pd.DataFrame(preprocessor_cat.transform(df_test), index=df_test.index,  columns=new_columns_cat)

X_train_ohe = pd.DataFrame(preprocessor_ohe.transform(df_train), index=df_train.index, columns=new_columns_ohe)
X_test_ohe  = pd.DataFrame(preprocessor_ohe.transform(df_test), index=df_test.index,  columns=new_columns_ohe)

regex = re.compile(r"[\[\]<]")
# replace any [, ], < in feature name since XGBoost has problems with it
X_train_ohe.columns = [regex.sub("_", str(col)) for col in X_train_ohe.columns]
X_test_ohe.columns = [regex.sub("_", str(col)) for col in X_test_ohe.columns]

# replace special characters since LightGBM has problems otherwise - the code below causes issues with XGBoost because it creates non-unique feature names
# X_train_ohe.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in X_train_ohe.columns]
# X_test_ohe.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in X_test_ohe.columns]
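
One way around the non-unique names mentioned in the comment above would be to append a counter whenever sanitising produces a name that has already been used. This is a sketch of that idea, not something the notebook ran, and `sanitise_unique` is a hypothetical helper:

```python
def sanitise_unique(names):
    """Replace non-alphanumeric characters with '_' and keep the results unique."""
    seen = set()
    result = []
    for name in names:
        base = "".join(c if c.isalnum() else "_" for c in str(name))
        candidate, suffix = base, 1
        while candidate in seen:        # collision: add a numeric suffix
            candidate = f"{base}_{suffix}"
            suffix += 1
        seen.add(candidate)
        result.append(candidate)
    return result

# "a<b" and "a[b" would both become "a_b" without the suffix.
print(sanitise_unique(["a<b", "a[b", "RACE_White"]))
```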

Note that I am not removing DISPOSITION_CHARGED_OFFENSE_TITLE and UPDATED_OFFENSE_CATEGORY, both of which are open to interpretation and therefore depend on the judgement to some extent.
Also leaving CHARGE_DISPOSITION. All cases in this dataset resulted in a conviction, but the sentence might depend on whether the defendant pleaded guilty or not.

Train and Run Models¶

modelDC = DummyClassifier(strategy="most_frequent")

# All of the non-CatBoost models are disabled because they only run on CPU and
# are too slow (about 1 day). You can uncomment these lines to run them,
# but be sure to run on a sample of n=1000 or less first.

# modelLR = LogisticRegression(max_iter=1000)

# modelXG = XGBClassifier()

# modelLB = LGBMClassifier()

# Note: change task_type to 'CPU' if the computer does not have an NVIDIA graphics card with CUDA installed.
# Note: running on CPU is about 400 times slower (and will take about a day).
modelCB = CatBoostClassifier(cat_features=categorical_features, task_type="GPU", verbose=1000)

Run Time with n=100

LogisticRegression - 0.5s (1)
XGBoost - 10s (20)
CatBoost - 90s (180)

Logistic regression n=10000 - 80s

n=len(X_train_cat) # full dataset
# Removing all except Catboost since they are too slow for running on the whole dataset.

print("DummyClassifier")
%timeit -n1 -r1 modelDC.fit(X_train_ohe.head(n), y_train_type.head(n))

# print("\nLogisticRegression")
# %timeit -n1 -r1 modelLR.fit(X_train_ohe.head(n), y_train_type.head(n))

# print("\nXGBoost")
# %timeit -n1 -r1 modelXG.fit(X_train_ohe.head(n), y_train_type.head(n))

print("\nCatBoost")
%timeit -n1 -r1 modelCB.fit(X_train_cat.head(n), y_train_type.head(n))


# Has problems with the feature names
# print("\nLightGBM")
# %timeit -n1 -r1 modelLB.fit(X_train_ohe.head(n), y_train_type.head(n))

DummyClassifier
1 loop, best of 1: 131 ms per loop

CatBoost
Learning rate set to 0.188242
0:	learn: 1.6173931	total: 24.4ms	remaining: 24.4s
999:	learn: 0.6428163	total: 20.5s	remaining: 0us
1 loop, best of 1: 22.5 s per loop

# Print default catboost parameters
# modelCB.get_all_params()

print("DummyClassifier")
print(f"Train: {modelDC.score(X_train_ohe.head(n), y_train_type.head(n))}")
print(f"Test: {modelDC.score(X_test_ohe.head(n), y_test_type.head(n))}")

# print("\nLogisticRegression")
# print(f"Train: {modelLR.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelLR.score(X_test_ohe.head(n), y_test_type.head(n))}")

# print("\nXGBoost")
# print(f"Train: {modelXG.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelXG.score(X_test_ohe.head(n), y_test_type.head(n))}")

print("\nCatBoost")
print(f"Train: {modelCB.score(X_train_cat.head(n), y_train_type.head(n))}")
print(f"Test: {modelCB.score(X_test_cat.head(n), y_test_type.head(n))}")


# print("\nLightGBM")
# print(f"Train: {modelLB.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelLB.score(X_test_ohe.head(n), y_test_type.head(n))}")

DummyClassifier
Train: 0.5798097937809708
Test: 0.580421093631662

CatBoost
Train: 0.7223941315214334
Test: 0.6939172959989613

Shap Analysis of the Effect of Features on Sentence Type¶

Importance of Feature in Determining Prediction¶

From the bar chart below, the features which most affect the outcome are UPDATED_OFFENSE_CATEGORY and OFFENSE_CATEGORY, which is expected since these describe the crime.
days_number also seems to be important. This is the number of days from the earliest arrest date in the dataset to the date of the arrest.
AGE_AT_INCIDENT also seems to be an important factor in deciding the type of sentence.
It is nice to see that RACE is not at the top of the features that determine the sentence type.

explainer = shap.TreeExplainer(modelCB)
shap_values = explainer.shap_values(X_train_cat)
# shap_values.shape
shap.initjs()

Setting feature_perturbation = "tree_path_dependent" because no background data was given.

# order was verified manually by looking at which column of predict_proba was the highest
class_labels = y_train_type.unique()
class_labels.sort()
class_labels

array(['Boot Camp', 'Conditional Discharge', 'Court Supervision',
       'Drug Court Probation', 'Drug School', 'Gang Probation',
       'Intensive Drug Probation Services',
       'Intensive Probation Services', 'Juvenile IDOC', 'Life', 'Prison',
       'Probation', 'Sex Offender Probation'], dtype=object)

shap.summary_plot(shap_values, X_train_cat, class_names=class_labels)

How Features Affect Likelihood of Predicting a Given Class¶

Notes:

The rows at the top of each plot are the most important.
Only numeric features have colour in the plot.
- For numeric features, the x-axis shows whether a given feature value makes the class more (to the right) or less (to the left) likely to be predicted (i.e. its contribution to the model's output for this class).
The only features that make sense in these plots are:
- AGE_AT_INCIDENT
- CHARGE_COUNT (number of other charges associated with a single trial. For example, a person would have 2 associated charges if he commits burglary and in the process also seriously injures someone else)
Note: Ignore days_number (an integer value of the date of arrest from some starting date) and all of the other features.

Below, I have analyzed feature importance for 3 out of the 13 possible classes in the target (categorical_sentence).

For predicting Boot Camp, the most important feature is AGE_AT_INCIDENT. Younger people very rarely get a Boot Camp sentence.

The effect of age on the likelihood of being sentenced to Life is not monotonic (there is no clear gradation from red to blue).
Charge count very clearly affects the probability of getting a Life sentence, with more charges making a Life sentence more likely.

class_number = 9
print(f"Shap plot for prediction class - categorical_sentence={class_labels[class_number]}")
shap.summary_plot(shap_values[class_number], X_train_cat)

Shap plot for prediction class - categorical_sentence=Life
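
The "monotonic or not" reading of the plot can be made less subjective with a rank correlation between a feature's values and its SHAP values. A Spearman correlation near +1 or -1 suggests a monotonic effect, while a value near 0 suggests a non-monotonic (or absent) one. The sketch below implements Spearman correlation with the standard library only. The toy numbers are invented to mimic the two patterns described above and are not read from the real plot.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values: SHAP rises steadily with charge_count ...
toy_charge_count = [1, 2, 3, 4, 5]
toy_shap_charge = [-0.2, -0.1, 0.0, 0.3, 0.6]
# ... but rises and then falls with (z-scored) age.
toy_age = [-2.0, -1.0, 0.0, 1.0, 2.0]
toy_shap_age = [-0.3, 0.2, 0.4, 0.2, -0.3]

rho_charge = spearman(toy_charge_count, toy_shap_charge)
rho_age = spearman(toy_age, toy_shap_age)
print(f"charge_count: {rho_charge:.2f}, age: {rho_age:.2f}")
```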

Age very strongly affects the likelihood of being sentenced to Drug Court Probation, with older people being much more likely to be given this sentence.
Also, from the charge count, it is clear that the more charges associated with a person, the less likely they are to be sentenced to Drug Court Probation (presumably because they are given a much more severe punishment instead).

class_number = 3
print(f"Shap plot for prediction class - categorical_sentence={class_labels[class_number]}")
shap.summary_plot(shap_values[class_number], X_train_cat)

Shap plot for prediction class - categorical_sentence=Drug Court Probation

Tuning Hyper-Parameters with RandomizedSearchCV¶

I tested this for CatBoost and it seems that it is not worth optimizing the hyper-parameters, since CatBoost automatically selects very good hyper-parameters (this is a feature of CatBoost).
Even after running tens of folds with parameters fairly close to the values CatBoost selects, the best score was still slightly lower than with the automatic parameters.
I am leaving the code below for reference.

https://effectiveml.com/using-grid-search-to-optimise-catboost-parameters.html
CatBoost hyperparameters max reasonable range:

n_estimators=[0,5000]
learning_rate=[0.0001,1]
l2_leaf_reg=[0,100]
bootstrap_type=["Bayesian", "Bernoulli", "Poisson", "No"]
one_hot_max_size=[1,255]
max_depth=[1,16]
random_strength=[0,500]
border_count = [0,255]
ctr_border_count = [0,255]

Recommended parameter space:
https://github.com/talperetz/hyperspace/tree/master/GBDTs

Another good resource:
https://towardsdatascience.com/https-medium-com-talperetz24-mastering-the-new-generation-of-gradient-boosting-db04062a7ea2

# catboost_params_space = {
#     "n_estimators" : scipy.stats.randint(low=500, high=1500),  # too large takes too long to run
#     "learning_rate": [0.001, 0.01, 0.1, 1],
#     "l2_leaf_reg": scipy.stats.randint(low=1, high=100),
#     "bootstrap_type": ["Bernoulli", "No", "Poisson", "Bayesian"],
#     "one_hot_max_size": scipy.stats.randint(low=10, high=50),
#     "max_depth": scipy.stats.randint(low=1, high=11),  # too large takes too long to run
#     "random_strength": scipy.stats.randint(low=0, high=50),
#     "border_count": scipy.stats.randint(low=100, high=255),
# }

# Ran above configuration for 15 folds and it was worse than default parameters
# which means that catboost does a good job selecting default parameters.
# Trying to optimize only a few parameters that are known to be important.

catboost_params_space = {
    "n_estimators" : scipy.stats.randint(low=500, high=1500), # default=1000; too large takes too long to run
    "one_hot_max_size": scipy.stats.randint(low=2, high=20),  # default=2
    "max_depth": scipy.stats.randint(low=3, high=11),         # default=6; too large takes too long to run
}

# Randomized Search. Do not run in parallel if using CatBoost.
random_search = RandomizedSearchCV(modelCB, param_distributions = catboost_params_space, 
                                   n_iter = 5, cv=2, verbose=2,
                                   random_state=123, n_jobs=1)

# Uncomment this code to run. Warning: this will take a long time even when running on GPU.

# n=len(X_train_cat)
# %timeit -n1 -r1 random_search.fit(X_train_cat.head(n), y_train_type.head(n));

# print(f"Best Parameters - {random_search.best_params_}")
# print(f"Best Score - {random_search.best_score_}")
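
For readers unfamiliar with what a randomized search does, the idea can be sketched in a few lines of plain Python: draw n_iter random configurations from the parameter space, score each one, and keep the best. The sketch below is not the RandomizedSearchCV implementation. The scoring function is a hypothetical stand-in for a cross-validated accuracy, and the inclusive bounds mirror the fact that scipy.stats.randint(low, high) excludes high.

```python
import random

# Inclusive (low, high) bounds, mirroring randint(low, high) with `high` excluded.
toy_param_space = {
    "n_estimators": (500, 1499),
    "one_hot_max_size": (2, 19),
    "max_depth": (3, 10),
}


def sample_params(rng, space):
    """Draw one random configuration from the space."""
    return {name: rng.randint(low, high) for name, (low, high) in space.items()}


def toy_score(params):
    # Hypothetical stand-in for a cross-validated score; it simply
    # penalises distance from max_depth=6 and ignores everything else.
    return 0.69 - 0.005 * abs(params["max_depth"] - 6)


def random_search(space, score_fn, n_iter, seed):
    """Evaluate n_iter random configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng, space)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(toy_param_space, toy_score, n_iter=5, seed=123)
print(best_params, round(best_score, 4))
```

Because only a handful of configurations are sampled, the best one found can easily score below a well-chosen default, which is consistent with the observation above.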

Appendix¶

# Each crime may have multiple charges, so we must ensure that two charges from the same case (CASE_ID) do not end up in different train/test splits
# because this would leak data into the test set.

# Note that during cross validation, I ignore this because it is too much trouble to do the splits properly, so the cross-validation (train) scores will likely be inflated
# compared to the test scores, but since the test split is done correctly, the test score will be valid.

# Technically, I shouldn't need to do this if my understanding of how sentencing works is correct (each case gets an independent sentence),
# but I am doing this to be on the safe side.

# sentencing_sorted = sentencing.sort_values(by=["CASE_ID"])
# random.seed(a=123, version=2)  # for reproducibility
# train_rows = []  # collect rows in lists; DataFrame.append does not modify in place
# test_rows = []
# split_ratio = 0.2  # fraction of cases assigned to the test set

# # row-by-row iteration is slow (originally estimated at about 1 hour on 233k examples)
# previous_id = None
# last_set = None  # "train" or "test"

# total_length = len(sentencing_sorted)

# for i in range(total_length):
#     if (i%100 == 0): print(i, time.time())
#     curr_line = sentencing_sorted.iloc[i]
#     if curr_line["CASE_ID"] != previous_id:
#         # new case: sample a random number to decide which dataset it goes to
#         last_set = "test" if (random.random() < split_ratio) else "train"
#         previous_id = curr_line["CASE_ID"]
#     # rows with the same CASE_ID follow the set chosen for the first row
#     (train_rows if last_set == "train" else test_rows).append(curr_line)

# df_train = pd.DataFrame(train_rows)
# df_test = pd.DataFrame(test_rows)
# print("Done split")
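
A cheaper way to keep all charges of a case in the same split is to decide the split from the CASE_ID itself, for example by hashing it. This is a different technique from the row-by-row loop above, shown here only as a sketch. The function name, the toy CASE_ID values and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def assign_split(case_id, test_fraction=0.2):
    """Map a CASE_ID to "train" or "test"; the same id always gets the same answer."""
    digest = hashlib.sha256(str(case_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # pseudo-uniform bucket in [0, 9999]
    return "test" if bucket < test_fraction * 10_000 else "train"


# Toy rows (hypothetical ids), one row per charge; cases 101 and 103 have two charges.
toy_rows = [
    {"CASE_ID": 101, "CHARGE_ID": 1},
    {"CASE_ID": 101, "CHARGE_ID": 2},
    {"CASE_ID": 102, "CHARGE_ID": 3},
    {"CASE_ID": 103, "CHARGE_ID": 4},
    {"CASE_ID": 103, "CHARGE_ID": 5},
    {"CASE_ID": 104, "CHARGE_ID": 6},
]

train_rows = [r for r in toy_rows if assign_split(r["CASE_ID"]) == "train"]
test_rows = [r for r in toy_rows if assign_split(r["CASE_ID"]) == "test"]

train_ids = {r["CASE_ID"] for r in train_rows}
test_ids = {r["CASE_ID"] for r in test_rows}
print("cases in both splits:", train_ids & test_ids)
```

The realized test fraction is only approximately test_fraction. On the real DataFrame the same function could be applied with sentencing["CASE_ID"].map(assign_split). scikit-learn's GroupShuffleSplit is another option for group-aware splitting.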

# do nothing estimator; https://scikit-learn.org/stable/developers/develop.html
# Useful if you want to use remainder="drop" when building ColumnTransformer,
# but you still want to keep select features without processing them.
# (ColumnTransformer also accepts the string "passthrough" in place of a transformer.)
from sklearn.base import BaseEstimator, TransformerMixin

class Nothing(BaseEstimator, TransformerMixin):
    # A transformer (not a classifier): it only needs fit and transform.

    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param

    def fit(self, X, y=None):
        # Nothing to learn; y is optional so this also works without a target
        return self

    def transform(self, data):
        # return data without doing anything
        return data

passthrough_transformer = Pipeline([
    ('do_nothing', Nothing())
])

Something to try which might improve the model:

Try dropping all rows for a given "CASE_ID" except for the row which had PRIMARY_CHARGE=True.

Testing on Some Synthetic Data¶

example = X_train_cat.head(1).copy()

example

# Prediction at start
modelCB.predict(example.head(1))

array([['Prison']], dtype=object)

# Change race from Black to White:
example.loc[56210,"RACE"] = "White"
modelCB.predict(example)

array([['Probation']], dtype=object)

# Change race back to Black:
example.loc[56210,"RACE"] = "Black"
modelCB.predict(example)

array([['Prison']], dtype=object)

# Change Gender from Male to Female:
example.loc[56210,"GENDER"] = "Female"
modelCB.predict(example)

array([['Probation']], dtype=object)

# Make the person older than the current age (z-score=0.489372)
example.loc[56210,"AGE_AT_INCIDENT"] = 0.9
modelCB.predict(example)

array([['Prison']], dtype=object)

# Make the person younger again
example.loc[56210,"AGE_AT_INCIDENT"] = 0.489372
modelCB.predict(example)

array([['Probation']], dtype=object)

# Change crime from "Possession of a Controlled Substance" to "Armed Robbery"
example.loc[56210,"OFFENSE_CATEGORY"] = "Armed Robbery" 
example.loc[56210,"DISPOSITION_CHARGED_OFFENSE_TITLE"] = "ARMED ROBBERY"
example.loc[56210,"UPDATED_OFFENSE_CATEGORY"] = "Armed Robbery"
example.loc[56210,"DISPOSITION_CHARGED_CHAPTER"] = "720"
example.loc[56210,"DISPOSITION_CHARGED_ACT"] = "5"
example.loc[56210,"DISPOSITION_CHARGED_SECTION"] = "18-2(a)(2)"
example.loc[56210,"DISPOSITION_CHARGED_CLASS"] = "X"
modelCB.predict(example)

array([['Prison']], dtype=object)
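
Note that the cells above modify example in place, so each prediction also reflects the earlier edits (for instance, GENDER is still "Female" when the age and offense are changed). To test one change at a time, each variation can be applied to a fresh copy of the baseline row. Below is a minimal sketch of that pattern using plain dictionaries. The helper name and the rule-based toy_predict are hypothetical stand-ins and say nothing about how modelCB behaves.

```python
def predict_with_changes(predict_fn, baseline, **changes):
    """Predict on a copy of `baseline` with `changes` applied; `baseline` is untouched."""
    modified = dict(baseline)  # shallow copy so the original row is never mutated
    modified.update(changes)
    return predict_fn(modified)


def toy_predict(row):
    # Hypothetical stand-in for modelCB.predict, for illustration only:
    # it looks at the offense category and nothing else.
    if row["OFFENSE_CATEGORY"] == "Armed Robbery":
        return "Prison"
    return "Probation"


baseline = {
    "OFFENSE_CATEGORY": "Narcotics",
    "GENDER": "Male",
    "AGE_AT_INCIDENT": 0.489372,
}

# Each call starts again from the unmodified baseline.
print(predict_with_changes(toy_predict, baseline))
print(predict_with_changes(toy_predict, baseline, GENDER="Female"))
print(predict_with_changes(toy_predict, baseline, OFFENSE_CATEGORY="Armed Robbery"))
print(baseline["GENDER"])
```

With the real one-row DataFrame, modelCB.predict(example.assign(GENDER="Female")) would follow the same idea, since DataFrame.assign returns a modified copy and leaves example unchanged.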

	CASE_ID	CASE_PARTICIPANT_ID	OFFENSE_CATEGORY	PRIMARY_CHARGE	CHARGE_ID	CHARGE_VERSION_ID	DISPOSITION_CHARGED_OFFENSE_TITLE	DISPOSITION_CHARGED_CHAPTER	DISPOSITION_CHARGED_ACT	DISPOSITION_CHARGED_SECTION	DISPOSITION_CHARGED_CLASS	DISPOSITION_CHARGED_AOIC	DISPOSITION_DATE	CHARGE_DISPOSITION	CHARGE_DISPOSITION_REASON	SENTENCE_PHASE	SENTENCE_DATE	SENTENCE_JUDGE	SENTENCE_TYPE	CURRENT_SENTENCE	COMMITMENT_TYPE	COURT_NAME	COURT_FACILITY	LENGTH_OF_CASE_in_Days	AGE_AT_INCIDENT	GENDER	RACE	INCIDENT_BEGIN_DATE	INCIDENT_END_DATE	ARREST_DATE	LAW_ENFORCEMENT_AGENCY	UNIT	INCIDENT_CITY	RECEIVED_DATE	ARRAIGNMENT_DATE	UPDATED_OFFENSE_CATEGORY	CHARGE_COUNT
151276	122808266180	990745020870	Narcotics	True	2294141090346	521928722737	[POSSESSION OF CONTROLLED SUBSTANCE WITH INTEN...	720	570	407(b)(2)	1	5125392	5/18/2016 12:00:00 AM	Plea Of Guilty	NaN	Original Sentencing	5/18/2016 12:00:00 AM	Rickey Jones	Prison	True	Illinois Department of Corrections	District 1 - Chicago	26TH Street	414.0	46.0	Male	Black	2015-02-23	NaT	2015-02-23 18:18:00	CHICAGO PD	District 11 - Harrison	Chicago	2015-02-25	3/31/2015 12:00:00 AM	Narcotics	1
59927	116535666739	911739066938	Armed Robbery	True	2085185279595	473795983033	ARMED ROBBERY	720	5	18-2(a)(2)	X	0012366	4/7/2014 12:00:00 AM	Finding Guilty	NaN	Original Sentencing	4/7/2014 12:00:00 AM	Matthew E Coghlan	Prison	True	Illinois Department of Corrections	District 1 - Chicago	26TH Street	738.0	18.0	Male	Black	2011-11-21	NaT	2012-02-16 12:35:00	CHICAGO PD	NaN	Chicago	2011-12-13	3/30/2012 12:00:00 AM	Armed Robbery	1