Printing the fitted classifier shows its parameters, including the Gini impurity criterion used for splitting:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
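For reference, the following is a minimal sketch of how such a classifier could have been trained; the feature matrix X, the labels y, and the train/test split are assumptions for illustration rather than part of the original code:
# Minimal sketch -- X (feature DataFrame) and y (labels) are assumed to be
# prepared from the KDD dataset; the split parameters are illustrative only
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

treeclf = DecisionTreeClassifier(criterion='gini', max_depth=7)
treeclf.fit(X_train, y_train)
print(treeclf)   # prints the parameter listing shown above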
To visualize the resulting decision tree as a graph, use the export_graphviz function:
from sklearn.tree import export_graphviz
export_graphviz(treeclf, out_file='tree_kdd.dot', feature_names=X.columns)
At the command line, we run the following to convert the .dot file to a PNG:
# dot -Tpng tree_kdd.dot -o tree_kdd.png
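If calling the dot binary by hand is inconvenient, the same conversion can be sketched from Python with the graphviz package; this alternative is an assumption and not part of the original workflow:
# Optional alternative -- assumes the graphviz Python package and the
# Graphviz binaries are installed
import graphviz

with open('tree_kdd.dot') as f:
    graphviz.Source(f.read()).render('tree_kdd', format='png', cleanup=True)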
We then extract the feature importances and view the top ten:
pd.DataFrame({'feature': X.columns,
              'importance': treeclf.feature_importances_}) \
    .sort_values('importance', ascending=False) \
    .head(10)
The output can be seen as follows:
   | Feature                 | Importance
20 | srv_count               | 0.633722
25 | same_srv_rate           | 0.341769
9  | num_compromised         | 0.013613
31 | dst_host_diff_srv_rate  | 0.010738
1  | src_bytes               | 0.000158
85 | service__red_i          | 0.000000
84 | service__private        | 0.000000
83 | service__printer        | 0.000000
82 | service__pop_3          | 0.000000
75 | service__netstat        | 0.000000
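As an optional follow-up (not from the original text), the same top-ten importances can be plotted as a horizontal bar chart; treeclf and X come from the code above, while the use of matplotlib here is an assumption:
# Optional sketch -- assumes matplotlib is available
import matplotlib.pyplot as plt
import pandas as pd

top10 = (pd.DataFrame({'feature': X.columns,
                       'importance': treeclf.feature_importances_})
           .sort_values('importance', ascending=False)
           .head(10))
top10.plot.barh(x='feature', y='importance', legend=False)
plt.xlabel('importance')
plt.tight_layout()
plt.show()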