Predicting Sentence Type from Data on the Crime

This notebook uses machine learning to try to predict the type of sentence from information about the crime. See below for examples of the inputs and outputs.

In [26]:
# note: to run the two cells below, you will first have to load the dataset (in a cell below) 
# This is an example of the input data that we have to make the prediction (X/input)
sentencing_processed.sample(2).drop(target_raw+target_processed, axis=1)
Out[26]:
CASE_ID CASE_PARTICIPANT_ID OFFENSE_CATEGORY PRIMARY_CHARGE CHARGE_ID CHARGE_VERSION_ID DISPOSITION_CHARGED_OFFENSE_TITLE DISPOSITION_CHARGED_CHAPTER DISPOSITION_CHARGED_ACT DISPOSITION_CHARGED_SECTION DISPOSITION_CHARGED_CLASS DISPOSITION_CHARGED_AOIC DISPOSITION_DATE CHARGE_DISPOSITION CHARGE_DISPOSITION_REASON SENTENCE_PHASE SENTENCE_DATE SENTENCE_JUDGE SENTENCE_TYPE CURRENT_SENTENCE COMMITMENT_TYPE COURT_NAME COURT_FACILITY LENGTH_OF_CASE_in_Days AGE_AT_INCIDENT GENDER RACE INCIDENT_BEGIN_DATE INCIDENT_END_DATE ARREST_DATE LAW_ENFORCEMENT_AGENCY UNIT INCIDENT_CITY RECEIVED_DATE ARRAIGNMENT_DATE UPDATED_OFFENSE_CATEGORY CHARGE_COUNT
151276 122808266180 990745020870 Narcotics True 2294141090346 521928722737 [POSSESSION OF CONTROLLED SUBSTANCE WITH INTEN... 720 570 407(b)(2) 1 5125392 5/18/2016 12:00:00 AM Plea Of Guilty NaN Original Sentencing 5/18/2016 12:00:00 AM Rickey Jones Prison True Illinois Department of Corrections District 1 - Chicago 26TH Street 414.0 46.0 Male Black 2015-02-23 NaT 2015-02-23 18:18:00 CHICAGO PD District 11 - Harrison Chicago 2015-02-25 3/31/2015 12:00:00 AM Narcotics 1
59927 116535666739 911739066938 Armed Robbery True 2085185279595 473795983033 ARMED ROBBERY 720 5 18-2(a)(2) X 0012366 4/7/2014 12:00:00 AM Finding Guilty NaN Original Sentencing 4/7/2014 12:00:00 AM Matthew E Coghlan Prison True Illinois Department of Corrections District 1 - Chicago 26TH Street 738.0 18.0 Male Black 2011-11-21 NaT 2012-02-16 12:35:00 CHICAGO PD NaN Chicago 2011-12-13 3/30/2012 12:00:00 AM Armed Robbery 1
In [19]:
# These are all of the Sentence Types that can be given (y/output/target)
list(sentencing_processed["categorical_sentence"].unique())
Out[19]:
['Life', 'Boot Camp', 'Prison', 'Probation', 'Court Supervision', 'Conditional Discharge', 'Intensive Probation Services', 'Drug Court Probation', 'Intensive Drug Probation Services', 'Gang Probation', 'Sex Offender Probation', 'Drug School', 'Juvenile IDOC']

Results and discussion

Test scores

  • DummyClassifier - 0.58
  • CatBoost - 0.69

The improvement in accuracy when using CatBoost is fairly modest (about 11 percentage points, from 0.58 to 0.69). This might be because the information in each row is not sufficient to discern the true gravity of a crime.
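For context, the DummyClassifier used in this notebook always predicts the most frequent class, so its accuracy is simply the share of that class in the data it is scored on. A minimal sketch with made-up labels (the counts are illustrative and are not the real class distribution; `majority_baseline_accuracy` is a hypothetical helper):

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == most_common_label)
    return most_common_label, hits / len(y_test)

# Illustrative labels only (not taken from the dataset).
y_train = ["Prison"] * 6 + ["Probation"] * 3 + ["Boot Camp"] * 1
y_test = ["Prison"] * 3 + ["Probation"] * 2

label, accuracy = majority_baseline_accuracy(y_train, y_test)
print(label, accuracy)
```

Any model has to beat this number before it can be said to have learned anything from the features.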

An example is UPDATED_OFFENSE_CATEGORY. Two cases might have the same value in this feature but may be vastly different:

  • For instance, a 'Burglary' may be tried as a misdemeanour (which usually carries a prison sentence of less than 1 year) if it took place in a warehouse, did not involve weapons and the value of the goods stolen was low.
  • On the other hand, if a burglary was carried out with weapons, injured innocent bystanders and involved goods of large monetary value, then it will very likely be tried as a felony with a minimum of 1 year in prison.

Model Information

Model Selection

  • I have experimented with CatBoost, XGBoost and SKLearn's Logistic Regression and RandomForestClassifier.
  • I was only able to train on the entire dataset in a reasonable amount of time using CatBoost, which implements gradient-boosted decision trees and is well known for its strong performance. A distinctive feature of CatBoost is that it handles categorical variables natively (hence the 'Cat' in CatBoost) and does not require one-hot encoding. This is particularly useful for this dataset since most of the features are categorical and one-hot encoding produces about 6000 features. CatBoost can also run on the GPU, which makes it very fast.
  • I tried XGBoost on a smaller subset of the dataset. XGBoost was fairly fast on GPU, but I was unable to run it on the entire dataset without Colab crashing due to insufficient RAM. I believe this might be because XGBoost uses the one-hot encoded data and the large number of features causes a problem. Accuracy-wise, XGBoost performed very similarly to CatBoost (both without tuning).
  • I also tried SKLearn LogisticRegression and RandomForestClassifier. Both of these algorithms are too slow to run on the entire dataset because they do not have GPU support.
  • Summary of test scores:
    # Training Examples   Model                Test Accuracy (default hyper-parameters)   Time to run
    10,000                DummyClassifier      0.58                                       0s
    10,000                LogisticRegression   0.65                                       1min 41s
    10,000                RandomForest         0.64                                       26s
    10,000                XGBoost              0.67                                       11s
    10,000                CatBoost             0.67                                       15s
    50,000                DummyClassifier      0.58                                       0s
    50,000                RandomForest         0.67                                       2min 55s
    50,000                XGBoost              0.67                                       1min 13s
    50,000                CatBoost             0.69                                       20s
    100,000               DummyClassifier      0.58                                       0s
    100,000               CatBoost             0.69                                       22s
    100,000               XGBoost              0.67                                       1min 54s
    184,000               DummyClassifier      0.58                                       0s
    184,000               CatBoost             0.69                                       23s
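The feature explosion mentioned for one-hot encoding follows from how the encoding works: every distinct value of every categorical column becomes its own column. A rough sketch on toy rows (the rows and the helper name `one_hot_width` are made up for illustration):

```python
def one_hot_width(rows, categorical_columns):
    """Number of columns produced by one-hot encoding the given columns."""
    return sum(len({row[col] for row in rows}) for col in categorical_columns)

# Toy rows; the values are illustrative only.
rows = [
    {"OFFENSE_CATEGORY": "Narcotics", "INCIDENT_CITY": "Chicago", "GENDER": "Male"},
    {"OFFENSE_CATEGORY": "Armed Robbery", "INCIDENT_CITY": "Chicago", "GENDER": "Male"},
    {"OFFENSE_CATEGORY": "Burglary", "INCIDENT_CITY": "Evanston", "GENDER": "Female"},
]
categorical = ["OFFENSE_CATEGORY", "INCIDENT_CITY", "GENDER"]

width_native = len(categorical)                   # one column per feature
width_one_hot = one_hot_width(rows, categorical)  # one column per distinct value
print(width_native, width_one_hot)
```

With high-cardinality columns such as offence titles or statute sections, the second number grows with the number of distinct values while the first stays fixed.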

Hyper-Parameter Tuning

  • Since CatBoost was the only model that works with the entire dataset in a reasonable amount of time, it was the only model that I attempted to tune.
  • I used RandomizedSearchCV, but even with 15 different hyper-parameters, I was unable to beat the accuracy obtained with the default hyper-parameters.
  • CatBoost is known for picking good default hyper-parameters. This is what the lead developer of CatBoost has said about hyper-parameter tuning:

    ... [CatBoost] is very stable to changing hyperparameters when you have enough training data. It usually provides almost optimal results with default parameters, so one could save some time on parameter tunning.

  • I did not attempt hyper-parameter tuning on the other models since I could not run them using the entire dataset.
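The idea behind a randomized search can be sketched without any library: draw a fixed number of random configurations, score each one, and compare the best against the defaults. This is a conceptual stand-in, not how RandomizedSearchCV is implemented; the search space, the scoring function and the "default" configuration below are all hypothetical.

```python
import random

def random_search(score_fn, space, n_iter, seed=0):
    """Score n_iter randomly drawn configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space (names chosen for illustration).
space = {"depth": [4, 6, 8, 10], "learning_rate": [0.03, 0.1, 0.3]}

def toy_score(params):
    # Invented score that peaks at depth=6, learning_rate=0.1.
    return 1.0 - 0.01 * abs(params["depth"] - 6) - 0.1 * abs(params["learning_rate"] - 0.1)

default_score = toy_score({"depth": 6, "learning_rate": 0.1})
best_params, best_score = random_search(toy_score, space, n_iter=15)
print(best_params, best_score, default_score)
```

When the defaults already sit near the optimum, as in this toy setup, the search can at best match them, which is consistent with the outcome reported above.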

Feature Importance Analyzed Using SHAP

  • SHAP is a popular way to analyse why a model behaves the way it does. This is what the documentation says about it:

    ... [SHAP] is a game theoretic approach to explain the output of any machine learning model.

  • There was nothing particular that stood out. The features seem to affect the sentence type as you would expect.
  • The only mildly interesting observation is that Race and Gender are not particularly important in predicting the sentence type.

Effect of Perturbing Input on Predicted Sentence Type

  • This is a test with a sample size of 1, so no firm conclusion can be drawn, but perturbing the demographic data of an example seemed to affect the output in exactly the ways I would expect.
  • Unfortunately, this indicates a bias in the sentencing which theoretically should not exist if the sentencing system is fair.
  • This goes against the finding from the SHAP analysis that features like Race and Gender are less significant in predicting the sentence type.
    Perhaps this is true when comparing race and gender to features like the crime committed.
    However, in this example, the demographic changes were enough to change the sentence type.
  • Summary of experiment:
    Action     Feature   Updated Value                                                    Updated Sentence Category
    Start      Start     Black, Male, Age z=0.5, "Possession of Controlled Substance"    Prison
    Change     Race      White                                                            Probation
    Revert     Race      Black                                                            Prison
    Change     Gender    Female                                                           Probation
    Increase   Age       z=0.9                                                            Prison
    Revert     Age       z=0.5                                                            Probation
    Change     Crime     "Armed Robbery"                                                  Prison
  • Code in Appendix at bottom of Notebook.
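The mechanics of a single perturbation step can be sketched as follows. This is not the appendix code: `toy_predict` is an invented rule standing in for the trained model, and the feature names `crime` and `age_z` are hypothetical.

```python
def perturb_and_predict(predict_fn, row, feature, new_value):
    """Return (original prediction, prediction after changing one feature)."""
    perturbed = dict(row)              # copy so the original row is untouched
    perturbed[feature] = new_value
    return predict_fn(row), predict_fn(perturbed)

# Stand-in for a trained model; the rule is invented for illustration only.
def toy_predict(row):
    if row["crime"] == "Armed Robbery" or row["age_z"] > 0.8:
        return "Prison"
    return "Probation"

example = {"crime": "Possession of Controlled Substance", "age_z": 0.5}
before, after = perturb_and_predict(toy_predict, example, "age_z", 0.9)
print(before, "->", after)
```

Copying the row before editing it is what makes the "Revert" steps in the table trivial: the original example is never modified.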

Further Work

  • An obvious next step would be to predict the duration of the predicted sentence. For example, once you predict a Prison sentence, what should its duration be?
    I was intending to do this but could not because of time constraints.
    The data has already been preprocessed with a column called 'sentence_period_years' for use in this step.
  • It would be interesting to do more of the perturbed-input experiments to see if my findings from the one experiment hold statistically for the dataset in general.
  • The performance of the non-CatBoost models on the entire dataset and with tuning is unknown. Since the accuracy scores were very close between all of the models, and since the other models do not advertise that they pick good default hyper-parameters, it is possible that they might do better than CatBoost.
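One way the dataset-wide perturbation check could look is a "flip rate": the share of rows whose predicted class changes when one feature is overwritten, with a rough confidence interval. This is a sketch under stated assumptions, not work that was done in this notebook; `toy_predict`, the feature names `group` and `charge_count`, and the rows are all invented, and the interval uses a simple normal approximation.

```python
import math

def flip_rate(predict_fn, rows, feature, new_value):
    """Share of rows whose prediction changes when `feature` is set to `new_value`."""
    candidates = [row for row in rows if row[feature] != new_value]
    if not candidates:
        raise ValueError("no rows to perturb")
    flips = sum(
        1 for row in candidates
        if predict_fn(row) != predict_fn({**row, feature: new_value})
    )
    n = len(candidates)
    rate = flips / n
    half_width = 1.96 * math.sqrt(rate * (1 - rate) / n)  # normal approximation
    return rate, (max(0.0, rate - half_width), min(1.0, rate + half_width))

# Invented stand-in for the trained model.
def toy_predict(row):
    return "Prison" if row["group"] == "A" and row["charge_count"] >= 2 else "Probation"

rows = [
    {"group": "A", "charge_count": 1},
    {"group": "A", "charge_count": 2},
    {"group": "A", "charge_count": 3},
    {"group": "B", "charge_count": 2},
]
rate, (low, high) = flip_rate(toy_predict, rows, "group", "B")
print(rate, low, high)
```

On a real sample the interval narrows as the number of perturbed rows grows, which is what would turn the single-example observation into a statistical statement.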

Important Information

This project was fairly complex and the subject matter is rather serious.
Please read the disclaimers and warnings in Readme.md before drawing any conclusions.

Notice on Running this Notebook

  • CatBoost must run on a GPU or the notebook will take hours to run. You can run it for free (which is what I did) on Google Colab.
  • Click here for instructions on how to enable GPU on Google Colab.
In [ ]:
import numpy as np
import pandas as pd
from datetime import datetime
import time
In [ ]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV
In [ ]:
import re
import random
import scipy.stats
In [8]:
import matplotlib.pyplot as plt
! pip install shap
import shap
shap.initjs()
Collecting shap
  Downloading https://files.pythonhosted.org/packages/a8/77/b504e43e21a2ba543a1ac4696718beb500cfa708af2fb57cb54ce299045c/shap-0.35.0.tar.gz (273kB)
     |████████████████████████████████| 276kB 4.6MB/s eta 0:00:01
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from shap) (1.18.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from shap) (1.4.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from shap) (0.22.2.post1)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from shap) (1.0.3)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.6/dist-packages (from shap) (4.38.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->shap) (0.14.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->shap) (1.12.0)
Building wheels for collected packages: shap
  Building wheel for shap (setup.py) ... done
  Created wheel for shap: filename=shap-0.35.0-cp36-cp36m-linux_x86_64.whl size=394123 sha256=d99fd6e292a222d69a3ee80191a774e9d0b3ef04cca902f8587284412b753425
  Stored in directory: /root/.cache/pip/wheels/e7/f7/0f/b57055080cf8894906b3bd3616d2fc2bfd0b12d5161bcb24ac
Successfully built shap
Installing collected packages: shap
Successfully installed shap-0.35.0
In [ ]:
# import scipy.stats

# import matplotlib.pyplot as plt
# %matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.base import BaseEstimator, ClassifierMixin
# from sklearn.feature_selection import RFE, RFECV
In [10]:
! pip install catboost
Collecting catboost
  Downloading https://files.pythonhosted.org/packages/94/ec/12b9a42b2ea7dfe5b602f235692ab2b61ee1334ff34334a15902272869e8/catboost-0.22-cp36-none-manylinux1_x86_64.whl (64.4MB)
     |████████████████████████████████| 64.4MB 47kB/s 
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.0.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from catboost) (3.2.1)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.18.2)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from catboost) (1.12.0)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost) (2.8.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (0.10.0)
Installing collected packages: catboost
Successfully installed catboost-0.22
In [11]:
! pip install scikit-optimize
Collecting scikit-optimize
  Downloading https://files.pythonhosted.org/packages/5c/87/310b52debfbc0cb79764e5770fa3f5c18f6f0754809ea9e2fc185e1b67d3/scikit_optimize-0.7.4-py2.py3-none-any.whl (80kB)
     |████████████████████████████████| 81kB 3.3MB/s eta 0:00:011
Requirement already satisfied: scikit-learn>=0.19.1 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (0.22.2.post1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (1.18.2)
Collecting pyaml>=16.9
  Downloading https://files.pythonhosted.org/packages/15/c4/1310a054d33abc318426a956e7d6df0df76a6ddfa9c66f6310274fb75d42/pyaml-20.4.0-py2.py3-none-any.whl
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (0.14.1)
Requirement already satisfied: scipy>=0.18.0 in /usr/local/lib/python3.6/dist-packages (from scikit-optimize) (1.4.1)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.6/dist-packages (from pyaml>=16.9->scikit-optimize) (3.13)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-20.4.0 scikit-optimize-0.7.4
In [ ]:
from sklearn.dummy import DummyClassifier, DummyRegressor

from sklearn.ensemble import RandomForestClassifier#, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge, Lasso, ElasticNet, SGDClassifier, SGDRegressor

from catboost import CatBoostClassifier, CatBoostRegressor, Pool, CatBoost
from lightgbm import LGBMClassifier, LGBMRegressor
from xgboost import XGBClassifier, XGBRegressor
In [ ]:
# Bizarre problem: it seems that when you import any of lightgbm, xgboost, Ridge, Lasso, ElasticNet, SGDClassifier or SGDRegressor
# while the options below are enabled, any operation on a dataframe (including df.head()) hangs indefinitely.

# pd.options.display.max_rows = None  # to stop pandas from truncating the displayed rows
# pd.options.display.max_columns = None  # to stop pandas from not displaying all columns because of screen width
# pd.options.display.max_colwidth = 100  # To prevent pandas from truncating very long columns. Set to 0.

Load Data

In [17]:
# Update path before using - Must be given the Cook County Sentencing Dataset _after_ it has been pre-processed by Sentencing_data_cleaning.ipynb.
data_path_colab = "/content/drive/My Drive/Colab Notebooks/Sentencing_processed_data.csv"
data_path_local = "Sentencing_processed_data.csv"
sentencing_processed = pd.read_csv(data_path_colab,
                                  parse_dates=["DISPOSITION_DATE", "SENTENCE_DATE",
                                                 "INCIDENT_BEGIN_DATE", "INCIDENT_END_DATE",
                                                 "ARREST_DATE", "ARRAIGNMENT_DATE", "RECEIVED_DATE"],
                                  index_col=0)
Columns (8,9,12,15) have mixed types.Specify dtype option on import or set low_memory=False.
In [ ]:
sentencing = sentencing_processed.copy()
In [ ]:
# Time encoding for "ARREST_DATE"
sentencing["month"] = sentencing["ARREST_DATE"].dt.month  # NaN where the arrest date is missing

# sin/cos for seasonality
sentencing["month_sin"] = np.sin(2*np.pi*sentencing["month"]/12)
sentencing["month_cos"] = np.cos(2*np.pi*sentencing["month"]/12)

# linear encoding: days since the earliest arrest date in the dataset
min_date = sentencing["ARREST_DATE"].min()  # Series.min() skips missing dates (NaT)
sentencing["days_number"] = (sentencing["ARREST_DATE"] - min_date).dt.days
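The reason for the sin/cos pair is that months are cyclical: as plain integers, December (12) and January (1) look far apart, while on the circle they are neighbours. A small standard-library sketch of the same encoding (the helper names are made up for illustration):

```python
import math

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)

def circular_distance(month_a, month_b):
    """Euclidean distance between two encoded months."""
    (sin_a, cos_a), (sin_b, cos_b) = encode_month(month_a), encode_month(month_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

dec_to_jan = circular_distance(12, 1)   # adjacent months across the year boundary
jun_to_jul = circular_distance(6, 7)    # adjacent months mid-year
jan_to_jul = circular_distance(1, 7)    # opposite sides of the year
print(dec_to_jan, jun_to_jul, jan_to_jul)
```

Both adjacent pairs end up the same distance apart, and months half a year apart are the furthest from each other, which is the behaviour a seasonality feature needs.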
In [22]:
df_train_1, df_test_1 = train_test_split(sentencing, test_size=0.2, random_state=123)
print(len(df_test_1) / (len(df_test_1) + len(df_train_1)))
0.2
In [23]:
len(df_train_1)
Out[23]:
184852
In [ ]:
y_train_type = df_train_1["categorical_sentence"]  # for categorical_sentence
y_train_length = df_train_1["sentence_period_years"]  # for sentence_period_years

y_test_type = df_test_1["categorical_sentence"]
y_test_length = df_test_1["sentence_period_years"]

Preprocess Data for Model Input

In [ ]:
# categorize features for preprocessing

# includes all features that are an outcome of the judicial process and not the crime except for any features
# that are used to classify the crime (since I have no other way of knowing what the crime is). 
drop_features = ["CASE_ID", "CASE_PARTICIPANT_ID", "CHARGE_ID", "CHARGE_VERSION_ID", "LENGTH_OF_CASE_in_Days", "SENTENCE_PHASE",
                "SENTENCE_TYPE", "COMMITMENT_TYPE", "CURRENT_SENTENCE", "SENTENCE_JUDGE",
                "CHARGE_DISPOSITION_REASON", "COURT_NAME", "COURT_FACILITY", "RECEIVED_DATE",
                "DISPOSITION_DATE", "SENTENCE_DATE", "INCIDENT_BEGIN_DATE", "INCIDENT_END_DATE", "ARRAIGNMENT_DATE",
                "ARREST_DATE", "month", "DISPOSITION_CHARGED_AOIC"]
# Drop length of case since this is information from after sentencing
# Drop sentence/commitment type since it has been merged into categorical_sentence
# CHARGE_DISPOSITION_REASON - too many missing values
# ARREST_DATE is dropped because we use the processed date features. month is an intermediary.

numeric_features = ["AGE_AT_INCIDENT", "month_sin", "month_cos", "days_number", "CHARGE_COUNT"]

# features to be one-hot encoded
categorical_features = ["OFFENSE_CATEGORY", "DISPOSITION_CHARGED_OFFENSE_TITLE", "CHARGE_DISPOSITION",
                        "GENDER", "RACE", "UPDATED_OFFENSE_CATEGORY",
                        "DISPOSITION_CHARGED_CHAPTER", "DISPOSITION_CHARGED_ACT", "DISPOSITION_CHARGED_SECTION",
                        "DISPOSITION_CHARGED_CLASS", "INCIDENT_CITY", "LAW_ENFORCEMENT_AGENCY", "UNIT",
                        "PRIMARY_CHARGE"] #use one-hot encoding with drop first
# UNIT is department of police force which is involved
# AOIC refers to Administrative Office of the Illinois Courts ID
# PRIMARY_CHARGE is boolean

# ordinal encoding
ordinal_features = []


# what we are predicting (y)
target_raw = ["COMMITMENT_TERM", "COMMITMENT_UNIT"] # raw target; will be dropped
target_processed = ["categorical_sentence", "sentence_period_years"]

drop_features = drop_features + target_raw + target_processed
In [ ]:
temp_a  = list(drop_features + numeric_features + categorical_features + ordinal_features)
temp_a.sort()

temp_b = list(sentencing.columns)
temp_b.sort()

assert (temp_a == temp_b), "Columns do not match"
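When an assertion like the one above fails, it only says that the lists differ. A slightly more informative variant, sketched here with a hypothetical helper `check_column_partition` and toy column lists, reports which columns are unassigned, unknown, or listed twice:

```python
def check_column_partition(groups, all_columns):
    """Report columns that are unassigned, unknown, or assigned more than once."""
    assigned = [col for group in groups.values() for col in group]
    return {
        "unassigned": sorted(set(all_columns) - set(assigned)),
        "unknown": sorted(set(assigned) - set(all_columns)),
        "duplicated": sorted({col for col in assigned if assigned.count(col) > 1}),
    }

# Toy example with one deliberate duplicate and one forgotten column.
groups = {
    "drop": ["CASE_ID", "SENTENCE_DATE"],
    "numeric": ["AGE_AT_INCIDENT", "CHARGE_COUNT"],
    "categorical": ["GENDER", "RACE", "GENDER"],
}
columns = ["CASE_ID", "SENTENCE_DATE", "AGE_AT_INCIDENT", "CHARGE_COUNT",
           "GENDER", "RACE", "UNIT"]

report = check_column_partition(groups, columns)
print(report)
```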
In [29]:
# Drop target columns - skip if running again
# df_train = df_train.drop(columns=target_raw+target_processed, axis=1, errors='ignore')
# df_test = df_test.drop(columns=target_raw+target_processed, axis=1, errors='ignore')

# .copy() gives independent frames, so the assignments below do not trigger SettingWithCopyWarning
df_train = df_train_1[numeric_features+categorical_features].copy()
df_test = df_test_1[numeric_features+categorical_features].copy()


df_train[numeric_features] = df_train[numeric_features].astype('float')  # ensure all numeric fields are float
# df_train["PRIMARY_CHARGE"] = df_train["PRIMARY_CHARGE"].astype(str)  # convert boolean to string
df_train[categorical_features] = df_train[categorical_features].astype(str)  # ensure no floats (NaN becomes the string 'nan')

df_test[numeric_features] = df_test[numeric_features].astype('float')
# df_test["PRIMARY_CHARGE"] = df_test["PRIMARY_CHARGE"].astype(str)
df_test[categorical_features] = df_test[categorical_features].astype(str)
In [ ]:
# Note: The code above will lead to missing data for predicting the sentence duration as you should know what kind of sentence is given.
# Ignoring for now, but this means that I should not predict duration without fixing this
In [ ]:
categorical_transformer_cat = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='?')),
])

categorical_transformer_ohe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='?')),
    ('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
In [ ]:
numeric_transformer_cat = Pipeline([
    ('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
    ('scaler', StandardScaler())
])

numeric_transformer_ohe = Pipeline([
    ('imputer', SimpleImputer(strategy='median', missing_values=np.nan)),
    ('scaler', StandardScaler())
])
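What the numeric pipeline does can be reproduced by hand: learn the median, mean and standard deviation from the training column, then apply them unchanged to new data. The sketch below uses only the standard library and made-up ages; the helper names are hypothetical, and it assumes the scaler divides by the population standard deviation, which is what StandardScaler does.

```python
import math
import statistics

def fit_numeric(train_values):
    """Learn the imputation median and the scaling mean/std from training data."""
    observed = [v for v in train_values if not math.isnan(v)]
    median = statistics.median(observed)
    filled = [median if math.isnan(v) else v for v in train_values]
    return median, statistics.mean(filled), statistics.pstdev(filled)

def transform_numeric(values, median, mean, std):
    """Impute missing values with the training median, then standardise."""
    return [((median if math.isnan(v) else v) - mean) / std for v in values]

ages_train = [18.0, 46.0, float("nan"), 30.0, 26.0]   # illustrative values
ages_test = [float("nan"), 46.0]

median, mean, std = fit_numeric(ages_train)
scaled_test = transform_numeric(ages_test, median, mean, std)
print(median, mean, std, scaled_test)
```

Fitting on the training split only and reusing those statistics on the test split is the reason the preprocessors in this notebook are fitted on df_train and then used to transform both frames.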
In [ ]:
preprocessor_cat = ColumnTransformer([
    ('numeric', numeric_transformer_cat, numeric_features),
    ('categorical', categorical_transformer_cat, categorical_features)
], remainder='drop')

preprocessor_ohe = ColumnTransformer([
    ('numeric', numeric_transformer_ohe, numeric_features),
    ('categorical', categorical_transformer_ohe, categorical_features)
], remainder='drop')
In [34]:
df_train.isna().sum()
Out[34]:
AGE_AT_INCIDENT                      2383
month_sin                            3752
month_cos                            3752
days_number                          3752
CHARGE_COUNT                            0
OFFENSE_CATEGORY                        0
DISPOSITION_CHARGED_OFFENSE_TITLE       0
CHARGE_DISPOSITION                      0
GENDER                                  0
RACE                                    0
UPDATED_OFFENSE_CATEGORY                0
DISPOSITION_CHARGED_CHAPTER             0
DISPOSITION_CHARGED_ACT                 0
DISPOSITION_CHARGED_SECTION             0
DISPOSITION_CHARGED_CLASS               0
INCIDENT_CITY                           0
LAW_ENFORCEMENT_AGENCY                  0
UNIT                                    0
PRIMARY_CHARGE                          0
dtype: int64
In [35]:
preprocessor_ohe.fit(df_train)
Out[35]:
ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('numeric',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0)),
                                                 ('scaler',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True))],
                                          verbose=False),
                                 ['...
                                                                handle_unknown='ignore',
                                                                sparse=False))],
                                          verbose=False),
                                 ['OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_OFFENSE_TITLE',
                                  'CHARGE_DISPOSITION', 'GENDER', 'RACE',
                                  'UPDATED_OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_CHAPTER',
                                  'DISPOSITION_CHARGED_ACT',
                                  'DISPOSITION_CHARGED_SECTION',
                                  'DISPOSITION_CHARGED_CLASS', 'INCIDENT_CITY',
                                  'LAW_ENFORCEMENT_AGENCY', 'UNIT',
                                  'PRIMARY_CHARGE'])],
                  verbose=False)
In [36]:
preprocessor_cat.fit(df_train)
Out[36]:
ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('numeric',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0)),
                                                 ('scaler',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True))],
                                          verbose=False),
                                 ['...
                                                                strategy='constant',
                                                                verbose=0))],
                                          verbose=False),
                                 ['OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_OFFENSE_TITLE',
                                  'CHARGE_DISPOSITION', 'GENDER', 'RACE',
                                  'UPDATED_OFFENSE_CATEGORY',
                                  'DISPOSITION_CHARGED_CHAPTER',
                                  'DISPOSITION_CHARGED_ACT',
                                  'DISPOSITION_CHARGED_SECTION',
                                  'DISPOSITION_CHARGED_CLASS', 'INCIDENT_CITY',
                                  'LAW_ENFORCEMENT_AGENCY', 'UNIT',
                                  'PRIMARY_CHARGE'])],
                  verbose=False)
In [ ]:
preprocessor_cat.fit(df_train)
preprocessor_ohe.fit(df_train)

ohe = preprocessor_ohe.named_transformers_['categorical'].named_steps['onehot']
ohe_feature_names = list(ohe.get_feature_names(categorical_features))

new_columns_cat = numeric_features + categorical_features
new_columns_ohe = numeric_features + ohe_feature_names
In [ ]:
X_train_cat = pd.DataFrame(preprocessor_cat.transform(df_train), index=df_train.index, columns=new_columns_cat)
X_test_cat  = pd.DataFrame(preprocessor_cat.transform(df_test), index=df_test.index,  columns=new_columns_cat)

X_train_ohe = pd.DataFrame(preprocessor_ohe.transform(df_train), index=df_train.index, columns=new_columns_ohe)
X_test_ohe  = pd.DataFrame(preprocessor_ohe.transform(df_test), index=df_test.index,  columns=new_columns_ohe)

# replace any [, ], < in feature names since XGBoost has problems with them
regex = re.compile(r"\[|\]|<")
X_train_ohe.columns = [regex.sub("_", str(col)) for col in X_train_ohe.columns]
X_test_ohe.columns = [regex.sub("_", str(col)) for col in X_test_ohe.columns]

# replace special characters since LightGBM has problems otherwise - the code below causes issues with XGBoost because it creates non-unique features
# X_train_ohe.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in X_train_ohe.columns]
# X_test_ohe.columns = ["".join (c if c.isalnum() else "_" for c in str(x)) for x in X_test_ohe.columns]
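The non-uniqueness problem mentioned in the comment above is easy to reproduce: the more characters a clean-up rule collapses into "_", the more likely two different feature names become identical. A small sketch with made-up feature names:

```python
import re

# Made-up one-hot feature names that differ only in their special characters.
names = ["TITLE_[THEFT]", "TITLE_<THEFT>", "UNIT_District 1"]

# Narrow rule: only the characters XGBoost rejects.
narrow = [re.sub(r"\[|\]|<", "_", name) for name in names]

# Broad rule: every non-alphanumeric character.
broad = ["".join(c if c.isalnum() else "_" for c in name) for name in names]

def has_duplicates(values):
    """True if at least two entries are identical."""
    return len(set(values)) != len(values)

print(narrow, has_duplicates(narrow))
print(broad, has_duplicates(broad))
```

A check like `has_duplicates` after renaming would catch the collision before the columns reach a model.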
  • Note that I am not removing DISPOSITION_CHARGED_OFFENSE_TITLE and UPDATED_OFFENSE_CATEGORY, both of which are open to interpretation and therefore depend on the judgement to some extent.
  • I am also keeping CHARGE_DISPOSITION. All cases in this dataset ended in a conviction, but the sentence might depend on whether the defendant pleaded guilty or not.

Train and Run Models

In [ ]:
modelDC = DummyClassifier(strategy="most_frequent")
In [ ]:
# All of the non-CatBoost models are disabled because they only run on CPU and
# are too slow (takes about 1 day). You can uncomment these lines to run them
# but be sure to initially run on a sample of n=1000 or less.

# modelLR = LogisticRegression(max_iter=1000)
In [ ]:
# modelXG = XGBClassifier()
In [ ]:
# modelLB = LGBMClassifier()
In [ ]:
# Note: change task_type to 'CPU' if the computer does not have an NVIDIA graphics card and CUDA installed.
# Note: running on CPU is about 400 times slower (and will take about a day).
modelCB = CatBoostClassifier(cat_features=categorical_features, task_type="GPU", verbose=1000)
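If the notebook has to run on machines with and without a GPU, the task type could be chosen at run time instead of edited by hand. The sketch below is only a heuristic and an assumption on my part: it treats the presence of the `nvidia-smi` command-line tool as a sign that an NVIDIA GPU is available, which does not guarantee that CUDA is usable. `pick_task_type` is a hypothetical helper.

```python
import shutil

def pick_task_type():
    """Heuristic: assume an NVIDIA GPU is present if `nvidia-smi` is on PATH."""
    return "GPU" if shutil.which("nvidia-smi") else "CPU"

task_type = pick_task_type()
print(task_type)
# The result could then be passed on, e.g. CatBoostClassifier(..., task_type=task_type).
```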

Run Time with n=100

Values in parentheses are run times relative to LogisticRegression.

  • LogisticRegression - 0.5s (1)
  • XGBoost - 10s (20)
  • CatBoost - 90s (180)
  • LogisticRegression with n=10000 - 80s
In [46]:
n=len(X_train_cat) # full dataset
# Removing all except Catboost since they are too slow for running on the whole dataset.

print("DummyClassifier")
%timeit -n1 -r1 modelDC.fit(X_train_ohe.head(n), y_train_type.head(n))

# print("\nLogisticRegression")
# %timeit -n1 -r1 modelLR.fit(X_train_ohe.head(n), y_train_type.head(n))

# print("\nXGBoost")
# %timeit -n1 -r1 modelXG.fit(X_train_ohe.head(n), y_train_type.head(n))

print("\nCatBoost")
%timeit -n1 -r1 modelCB.fit(X_train_cat.head(n), y_train_type.head(n))


# Has problems with the feature names
# print("\nLightGBM")
# %timeit -n1 -r1 modelLB.fit(X_train_ohe.head(n), y_train_type.head(n))
DummyClassifier
1 loop, best of 1: 131 ms per loop

CatBoost
Learning rate set to 0.188242
0:	learn: 1.6173931	total: 24.4ms	remaining: 24.4s
999:	learn: 0.6428163	total: 20.5s	remaining: 0us
1 loop, best of 1: 22.5 s per loop
In [ ]:
# Print default catboost parameters
# modelCB.get_all_params()
In [49]:
print("DummyClassifier")
print(f"Train: {modelDC.score(X_train_ohe.head(n), y_train_type.head(n))}")
print(f"Test: {modelDC.score(X_test_ohe.head(n), y_test_type.head(n))}")

# print("\nLogisticRegression")
# print(f"Train: {modelLR.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelLR.score(X_test_ohe.head(n), y_test_type.head(n))}")

# print("\nXGBoost")
# print(f"Train: {modelXG.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelXG.score(X_test_ohe.head(n), y_test_type.head(n))}")

print("\nCatBoost")
print(f"Train: {modelCB.score(X_train_cat.head(n), y_train_type.head(n))}")
print(f"Test: {modelCB.score(X_test_cat.head(n), y_test_type.head(n))}")


# print("\nLightGBM")
# print(f"Train: {modelLB.score(X_train_ohe.head(n), y_train_type.head(n))}")
# print(f"Test: {modelLB.score(X_test_ohe.head(n), y_test_type.head(n))}")
DummyClassifier
Train: 0.5798097937809708
Test: 0.580421093631662

CatBoost
Train: 0.7223941315214334
Test: 0.6939172959989613

Shap Analysis of the Effect of Features on Sentence Type

Importance of Feature in Determining Prediction

  • From the bar chart below, the features that most affect the outcome are UPDATED_OFFENSE_CATEGORY and OFFENSE_CATEGORY, which is expected since these describe the crime.
  • days_number also seems to be important. This is the number of days from an arbitrary starting point to the date of the incident.
  • AGE_AT_INCIDENT also seems to be an important factor in deciding the type of sentence.
  • It is nice to see that RACE is not at the top of the features that determine the sentence type.
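
For a multi-class model, the bar chart ranks features by the mean absolute SHAP value, summed over the classes. The sketch below reproduces that ranking by hand on random stand-in data. It assumes the list-of-arrays layout (one array of shape (n_samples, n_features) per class) that this notebook indexes with shap_values[class_number]; the names toy_shap_values and feature_names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["AGE_AT_INCIDENT", "CHARGE_COUNT", "days_number"]  # illustrative subset
n_samples = 100

# Stand-in for explainer.shap_values(...): one (n_samples, n_features) array per class
toy_shap_values = [
    rng.normal(scale=scale, size=(n_samples, len(feature_names)))
    for scale in (1.0, 0.5, 0.1)  # three hypothetical classes
]

# Mean |SHAP| per class and feature -> shape (n_classes, n_features)
per_class_importance = np.array([np.abs(sv).mean(axis=0) for sv in toy_shap_values])

# Total bar length per feature: sum of the per-class contributions
global_importance = per_class_importance.sum(axis=0)

# Sort features from most to least important
ranking = sorted(zip(feature_names, global_importance), key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With the real model, toy_shap_values would be replaced by shap_values and feature_names by X_train_cat.columns.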
In [50]:
explainer = shap.TreeExplainer(modelCB)
shap_values = explainer.shap_values(X_train_cat)
# shap_values.shape
shap.initjs()
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
In [51]:
# order was verified manually by looking at which column of predict_proba was the highest
class_labels = y_train_type.unique()
class_labels.sort()
class_labels
Out[51]:
array(['Boot Camp', 'Conditional Discharge', 'Court Supervision',
       'Drug Court Probation', 'Drug School', 'Gang Probation',
       'Intensive Drug Probation Services',
       'Intensive Probation Services', 'Juvenile IDOC', 'Life', 'Prison',
       'Probation', 'Sex Offender Probation'], dtype=object)
In [52]:
shap.summary_plot(shap_values, X_train_cat, class_names=class_labels)

How Features Affect the Likelihood of Predicting a Given Class

Notes:

  • The rows (features) at the top of each plot are the most important.
  • Only numeric features have colour in the plot (red is a high feature value, blue is a low one).
    • For numeric features, the x-axis shows the SHAP value, i.e. the feature's contribution to the model output for this class: points to the right make the class more likely to be predicted and points to the left make it less likely.
  • The only features that make sense in these plots are:

    • AGE_AT_INCIDENT
    • charge_count (the number of other charges associated with a single trial. For example, a person would have 2 associated charges if they commit burglary and in the process also seriously injure someone else)
  • Note: Ignore days_number (an integer value of the date of arrest from some starting date) and all of the other features.


Below, I have analyzed feature importance for 3 out of the 13 possible classes in the target (categorical_sentence).

For predicting Boot Camp, the most important feature is AGE_AT_INCIDENT. Younger people very rarely get a Boot Camp sentence.

The effect of age on the likelihood of being sentenced to Life is not monotonic (there is no clear gradation from red to blue).
Charge count very clearly affects the probability of getting a Life sentence, with more charges making a Life sentence more likely.

In [53]:
class_number = 9
print(f"Shap plot for prediction class - categorical_sentence={class_labels[class_number]}")
shap.summary_plot(shap_values[class_number], X_train_cat)
Shap plot for prediction class - categorical_sentence=Life

Age very strongly affects the likelihood of being sentenced to Drug Court Probation, with older people being much more likely to be given this sentence.
Also, from the charge count, it is clear that the more charges associated with a person, the less likely they are to be sentenced to Drug Court Probation (presumably because they are given a more severe punishment instead).

In [54]:
class_number = 3
print(f"Shap plot for prediction class - categorical_sentence={class_labels[class_number]}")
shap.summary_plot(shap_values[class_number], X_train_cat)
Shap plot for prediction class - categorical_sentence=Drug Court Probation

Tuning Hyper-Parameters with RandomizedSearchCV

I tested this for CatBoost and it seems that it is not worth optimizing the hyper-parameters, since CatBoost automatically selects very good hyper-parameters (this is a feature of CatBoost).
Even after running tens of folds with parameters fairly close to the values CatBoost selects, the best score was still slightly lower than for the automatic parameters.
Leaving the code below for reference.

https://effectiveml.com/using-grid-search-to-optimise-catboost-parameters.html
CatBoost hyperparameters max reasonable range:

  • n_estimators=[0,5000]
  • learning_rate=[0.0001,1]
  • l2_leaf_reg=[0,100]
  • bootstrap_type=["Bayesian", "Bernoulli", "Poisson", "No"]
  • one_hot_max_size=[1,255]
  • max_depth=[1,16]
  • random_strength=[0,500]
  • border_count = [0,255]
  • ctr_border_count = [0,255]

Recommended parameter space:
https://github.com/talperetz/hyperspace/tree/master/GBDTs

Another good resource:
https://towardsdatascience.com/https-medium-com-talperetz24-mastering-the-new-generation-of-gradient-boosting-db04062a7ea2

In [ ]:
# catboost_params_space = {
#     "n_estimators" : scipy.stats.randint(low=500, high=1500),  # too large takes too long to run
#     "learning_rate": [0.001, 0.01, 0.1, 1],
#     "l2_leaf_reg": scipy.stats.randint(low=1, high=100),
#     "bootstrap_type": ["Bernoulli", "No", "Poisson", "Bayesian"],
#     "one_hot_max_size": scipy.stats.randint(low=10, high=50),
#     "max_depth": scipy.stats.randint(low=1, high=11),  # too large takes too long to run
#     "random_strength": scipy.stats.randint(low=0, high=50),
#     "border_count": scipy.stats.randint(low=100, high=255),
# }

# Ran above configuration for 15 folds and it was worse than default parameters
# which means that catboost does a good job selecting default parameters.
# Trying to optimize only a few parameters that are known to be important.

catboost_params_space = {
    "n_estimators" : scipy.stats.randint(low=500, high=1500), # default=1000; too large takes too long to run
    "one_hot_max_size": scipy.stats.randint(low=2, high=20),  # default=2
    "max_depth": scipy.stats.randint(low=3, high=11),         # default=6; too large takes too long to run
}
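
One detail worth keeping in mind when reading the search space above: scipy.stats.randint draws integers from the half-open interval [low, high), so the upper bound itself is never sampled. The short check below illustrates this for the max_depth entry; the variable names are only for illustration.

```python
from scipy import stats

# Same distribution as the "max_depth" entry in the search space
max_depth_dist = stats.randint(low=3, high=11)

# Draw a reproducible sample and inspect its range
samples = max_depth_dist.rvs(size=1000, random_state=123)
print(f"smallest sampled depth: {samples.min()}")
print(f"largest sampled depth:  {samples.max()}")

# The probability mass at the upper bound is zero; each of the 8 values 3..10 is equally likely
print(f"P(max_depth == 11) = {max_depth_dist.pmf(11)}")
print(f"P(max_depth == 10) = {max_depth_dist.pmf(10)}")
```

If a depth of 11 should be part of the search, high would need to be 12.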
In [ ]:
# Randomized Search. Do not run in parallel if using CatBoost.
random_search = RandomizedSearchCV(modelCB, param_distributions = catboost_params_space, 
                                   n_iter = 5, cv=2, verbose=2,
                                   random_state=123, n_jobs=1)
In [ ]:
# Uncomment this code to run. Warning: this will take a long time even when running on a GPU.

# n=len(X_train_cat)
# %timeit -n1 -r1 random_search.fit(X_train_cat.head(n), y_train_type.head(n));

# print(f"Best Parameters - {random_search.best_params_}")
# print(f"Best Score - {random_search.best_score_}")

Appendix

In [ ]:
# Each crime may have multiple charges, so we must ensure that two rows from the same crime (same CASE_ID) do not end up in different train/test splits,
# because this would leak data into the test set.

# Note that during cross-validation I ignore this because it is too much trouble to do the splits properly, so the train scores will likely be inflated
# compared to the test scores, but since the test split is done correctly, the test score will be valid.

# Technically, I shouldn't need to do this if my understanding of how sentencing works is correct (each case gets an independent sentence),
# but doing this to be on the safe side.
# sentencing_sorted = sentencing.sort_values(by=["CASE_ID"])
# random.seed(a=123, version=2)  # for reproducibility
# train_rows = []  # row positions; DataFrame.append returns a new frame, so it cannot be used in place
# test_rows = []
# split_ratio = 0.2  # fraction of cases assigned to the test set

# # the row-by-row loop was estimated to take about 1 hour on 233k examples
# previous_id = None
# last_set = None  # "train" or "test"

# for i in range(len(sentencing_sorted)):
#     if (i%100 == 0): print(i, time.time())
#     curr_id = sentencing_sorted.iloc[i]["CASE_ID"]
#     if curr_id != previous_id:
#         # new case: sample a random number to decide which dataset it goes to
#         last_set = "test" if (random.random() < split_ratio) else "train"
#         previous_id = curr_id
#     # rows with the same CASE_ID follow the set chosen for the first row of that case
#     (train_rows if last_set == "train" else test_rows).append(i)

# df_train = sentencing_sorted.iloc[train_rows]
# df_test = sentencing_sorted.iloc[test_rows]
# print("Done split")
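
A possibly simpler way to get the same kind of split is scikit-learn's GroupShuffleSplit, which keeps all rows that share a group label on the same side of the split. The sketch below is not the code used for the results in this notebook; it runs on a small made-up table (toy), and only the CASE_ID and CHARGE_COUNT column names are taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the sentencing table: several charges can share one CASE_ID
toy = pd.DataFrame({
    "CASE_ID": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
    "CHARGE_COUNT": [2, 2, 1, 3, 3, 3, 1, 2, 2, 1],
})

# test_size is the fraction of groups (cases), not rows, that go to the test set
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
train_idx, test_idx = next(splitter.split(toy, groups=toy["CASE_ID"]))

df_train = toy.iloc[train_idx]
df_test = toy.iloc[test_idx]

# No CASE_ID should appear on both sides of the split
shared_cases = set(df_train["CASE_ID"]) & set(df_test["CASE_ID"])
print(f"train rows: {len(df_train)}, test rows: {len(df_test)}, shared cases: {len(shared_cases)}")
```

The same idea applies to cross-validation: scikit-learn's GroupKFold builds folds that never split a group, which would address the inflated scores mentioned in the comments above.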
In [ ]:
# do nothing estimator; https://scikit-learn.org/stable/developers/develop.html
# Useful if you want to use remainder="drop" when building ColumnTransformer,
# but you still want to keep select features without processing them.
# (ColumnTransformer also accepts the string "passthrough" as a transformer for this purpose.)
class Nothing(BaseEstimator, ClassifierMixin):

    def __init__(self, demo_param='demo'):
        self.demo_param = demo_param

    def fit(self, X, y=None):
        # Nothing to learn
        return self

    def predict(self, X):
        # Do nothing
        return None

    def transform(self, data):
        # return data without doing anything
        return data

passthrough_transformer = Pipeline([
    ('do_nothing', Nothing())
])

Something to try which might improve the model:

  • Try dropping all rows for a given "CASE_ID" except for the row which had PRIMARY_CHARGE=True.
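
A minimal sketch of that filtering step, on a small made-up table. It assumes that PRIMARY_CHARGE is a boolean column (as it appears in the sample rows at the top of the notebook) and that each case has a single primary charge, which would need to be checked on the real data.

```python
import pandas as pd

# Toy stand-in for the sentencing table
toy = pd.DataFrame({
    "CASE_ID": [1, 1, 2, 3, 3],
    "PRIMARY_CHARGE": [True, False, True, False, True],
    "SENTENCE_TYPE": ["Prison", "Prison", "Probation", "Prison", "Prison"],
})

# Keep only the rows flagged as the primary charge
primary_only = toy[toy["PRIMARY_CHARGE"]]

# Sanity check: at most one row per case should remain
rows_per_case = primary_only.groupby("CASE_ID").size()
print(primary_only)
print(f"max rows per case: {rows_per_case.max()}")
```

If a case can involve several defendants, the check would probably need to group by CASE_PARTICIPANT_ID as well.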

Testing on Some Synthetic Data

In [ ]:
example = X_train_cat.head(1).copy()
In [172]:
example
Out[172]:
AGE_AT_INCIDENT month_sin month_cos days_number CHARGE_COUNT OFFENSE_CATEGORY DISPOSITION_CHARGED_OFFENSE_TITLE CHARGE_DISPOSITION GENDER RACE UPDATED_OFFENSE_CATEGORY DISPOSITION_CHARGED_CHAPTER DISPOSITION_CHARGED_ACT DISPOSITION_CHARGED_SECTION DISPOSITION_CHARGED_CLASS INCIDENT_CITY LAW_ENFORCEMENT_AGENCY UNIT PRIMARY_CHARGE
56210 0.489372 0.738924 1.2862 -0.75152 -0.252813 Narcotics POSSESSION OF A CONTROLLED SUBSTANCE Plea Of Guilty Male Black Narcotics 720 570 402(c) 4 Chicago CHICAGO PD District 10 - Ogden True
In [173]:
# Prediction at start
modelCB.predict(example.head(1))
Out[173]:
array([['Prison']], dtype=object)
In [174]:
# Change race from Black to White:
example.loc[56210,"RACE"] = "White"
modelCB.predict(example)
Out[174]:
array([['Probation']], dtype=object)
In [175]:
# Change race back to Black:
example.loc[56210,"RACE"] = "Black"
modelCB.predict(example)
Out[175]:
array([['Prison']], dtype=object)
In [176]:
# Change Gender from Male to Female:
example.loc[56210,"GENDER"] = "Female"
modelCB.predict(example)
Out[176]:
array([['Probation']], dtype=object)
In [177]:
# Make the person older (current age z-score = 0.489372); GENDER is still "Female" from the previous cell
example.loc[56210,"AGE_AT_INCIDENT"] = 0.9
modelCB.predict(example)
Out[177]:
array([['Prison']], dtype=object)
In [178]:
# Restore the original age (GENDER is still "Female")
example.loc[56210,"AGE_AT_INCIDENT"] = 0.489372
modelCB.predict(example)
Out[178]:
array([['Probation']], dtype=object)
In [179]:
# Change crime from "Possession of a Controlled Substance" to "Armed Robbery"
example.loc[56210,"OFFENSE_CATEGORY"] = "Armed Robbery" 
example.loc[56210,"DISPOSITION_CHARGED_OFFENSE_TITLE"] = "ARMED ROBBERY"
example.loc[56210,"UPDATED_OFFENSE_CATEGORY"] = "Armed Robbery"
example.loc[56210,"DISPOSITION_CHARGED_CHAPTER"] = "720"
example.loc[56210,"DISPOSITION_CHARGED_ACT"] = "5"
example.loc[56210,"DISPOSITION_CHARGED_SECTION"] = "18-2(a)(2)"
example.loc[56210,"DISPOSITION_CHARGED_CLASS"] = "X"
modelCB.predict(example)
Out[179]:
array([['Prison']], dtype=object)
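
The cells above modify example in place, so a change made in one cell carries over to the following ones (GENDER is still "Female" when the age and the offense are changed). One way to keep each experiment independent is to apply the changes to a fresh copy of the original row every time. The sketch below shows the idea; predict_with_changes and ToyModel are hypothetical names, and ToyModel is only a stand-in for modelCB so that the snippet runs on its own.

```python
import pandas as pd


class ToyModel:
    """Hypothetical stand-in for modelCB, only so the sketch is runnable."""

    def predict(self, X):
        # Arbitrary rule for illustration; it says nothing about the real model
        return ["Prison" if age > 0.5 else "Probation" for age in X["AGE_AT_INCIDENT"]]


def predict_with_changes(model, row, changes):
    """Return the prediction for a copy of `row` with `changes` applied."""
    modified = row.assign(**changes)  # assign returns a new DataFrame; `row` is untouched
    return model.predict(modified)


example = pd.DataFrame(
    {"AGE_AT_INCIDENT": [0.489372], "GENDER": ["Male"]}, index=[56210]
)

baseline = predict_with_changes(ToyModel(), example, {})
older = predict_with_changes(ToyModel(), example, {"AGE_AT_INCIDENT": 0.9})
print(f"baseline: {baseline}, older: {older}")
```

With the real model, each counterfactual would then be a single call such as predict_with_changes(modelCB, example, {"RACE": "White"}), always starting from the unmodified row.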