This short notebook illustrates basic usage of the OutlierTree
library for explainable outlier detection, using the Titanic dataset. For
more details, see the package's documentation on CRAN or
R's built-in help (e.g. ?outliertree::outlier.tree). For a
more interesting and interactive example, see the documentation of the
main function (outlier.tree), which uses a larger dataset.
The dataset is very popular and can be downloaded from different sources, such as Kaggle or many university webpages. This vignette took it from the following link: https://github.com/jbryer/CompStats/raw/master/Data/titanic3.csv
The data comes bundled in the package so there is no need to download it from the link above.
library(data.table)
library(kableExtra)
library(outliertree)
data("titanic")
titanic |>
    head(5) |>
    kable() |>
    kable_styling()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO |
1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON |
1 | 0 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | 135 | Montreal, PQ / Chesterville, ON |
1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
## Capitalize column names and some values for easier reading
capitalize <- function(x) gsub("^(\\w)", "\\U\\1\\E", x, perl=TRUE)
titanic <- as.data.table(titanic)
titanic[
    , setnames(.SD, names(.SD), capitalize(names(.SD)))
][
    , setnames(.SD, "Sibsp", "SibSp")
][
    , Sex := capitalize(Sex)
] -> titanic
## Convert 'survived' to yes/no for easier reading
titanic[
    , Survived := ifelse(Survived, "Yes", "No")
]
## Some columns are not useful, such as name (an ID), ticket number (another ID),
## or destination (too many values, many non-repeated)
titanic[
    , !c("Name", "Ticket", "Home.dest")
] -> titanic
## Ordinal columns need to be passed as ordered factors
cols_ord <- c("Pclass", "Parch", "SibSp")
titanic[
    , (cols_ord) := lapply(.SD, function(x) factor(x, ordered = TRUE))
    , .SDcols = cols_ord
]
## A look at the processed data
titanic |>
    head(5) |>
    kable() |>
    kable_styling()
Pclass | Survived | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Boat | Body |
---|---|---|---|---|---|---|---|---|---|---|
1 | Yes | Female | 29.00 | 0 | 0 | 211.3375 | B5 | S | 2 | NA |
1 | Yes | Male | 0.92 | 1 | 2 | 151.5500 | C22 C26 | S | 11 | NA |
1 | No | Female | 2.00 | 1 | 2 | 151.5500 | C22 C26 | S | NA | NA |
1 | No | Male | 30.00 | 1 | 2 | 151.5500 | C22 C26 | S | NA | 135 |
1 | No | Female | 25.00 | 1 | 2 | 151.5500 | C22 C26 | S | NA | NA |
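To make the renaming and ordered-factor steps concrete without data.table, here is a minimal base-R sketch on a made-up three-row data frame. The `toy` data and its values are hypothetical and are not taken from the Titanic data. It applies the same `capitalize` helper to the column names and converts an ordinal column to an ordered factor, which is what allows its levels to be compared with `<` and `>`.

```r
## Same helper as in the vignette: upper-case the first character of each string
capitalize <- function(x) gsub("^(\\w)", "\\U\\1\\E", x, perl = TRUE)

## Hypothetical toy data, for illustration only
toy <- data.frame(
    pclass = c(3, 1, 2),
    sex = c("male", "female", "male"),
    stringsAsFactors = FALSE
)

## Capitalize the column names and the values of one column
names(toy) <- capitalize(names(toy))
toy$Sex <- capitalize(toy$Sex)

## An ordered factor keeps the ranking of its levels (1 < 2 < 3)
toy$Pclass <- factor(toy$Pclass, ordered = TRUE)

names(toy)                      # "Pclass" "Sex"
levels(toy$Pclass)              # "1" "2" "3"
toy$Pclass[1] > toy$Pclass[2]   # TRUE: class 3 ranks above class 1
```

A plain (unordered) factor would raise a warning and return NA for the last comparison, which is why the ordinal columns are converted with `ordered = TRUE`.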
library(outliertree)
## Fit model with default hyperparameters
otree <- outlier.tree(titanic)
otree
Reporting top 9 outliers [out of 9 found]
row [171] - suspicious column: [Fare] - suspicious value: [0.00]
distribution: 98.571% >= 25.74 - [mean: 55.22] - [sd: 27.56] - [norm. obs: 69]
given:
[Pclass] = [1]
[Boat] in [1, 15, 5, 5 7, 5 9, 7, 8 10, 9, B, C] (value: C)
row [19] - suspicious column: [Age] - suspicious value: [32.00]
distribution: 96.000% >= 43.00 - [mean: 48.35] - [sd: 3.16] - [norm. obs: 24]
given:
[Cabin] in [A16, A20, B10, B52 B54 B56, B82 B84, C110, C116, C124, C126, C86, C92, D15, D17, D33, D46, E12, E31, E58, E63] (value: D15)
row [897] - suspicious column: [Fare] - suspicious value: [0.00]
distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
given:
[Pclass] = [3]
[SibSp] = [0]
row [899] - suspicious column: [Fare] - suspicious value: [0.00]
distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
given:
[Pclass] = [3]
[SibSp] = [0]
row [964] - suspicious column: [Fare] - suspicious value: [0.00]
distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
given:
[Pclass] = [3]
[SibSp] = [0]
row [1255] - suspicious column: [Fare] - suspicious value: [0.00]
distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
given:
[Pclass] = [3]
[SibSp] = [0]
row [1045] - suspicious column: [Fare] - suspicious value: [15.50]
distribution: 96.774% <= 8.52 - [mean: 7.73] - [sd: 0.28] - [norm. obs: 30]
given:
[Pclass] = [3]
[SibSp] = [0]
[Boat] in [10, 13 15, 13 15 B, 15 16, 16, 6, 9, A, B] (value: 16)
row [1147] - suspicious column: [Fare] - suspicious value: [29.12]
distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
given:
[Pclass] = [3]
[SibSp] = [0]
[Embarked] = [Q]
row [1164] - suspicious column: [Fare] - suspicious value: [24.15]
distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
given:
[Pclass] = [3]
[SibSp] = [0]
[Embarked] = [Q]
Outlier Tree model
Numeric variables: 3
Categorical variables: 5
Ordinal variables: 3
Consists of 220 clusters, spread across 16 tree branches
## Double-check the data (last 2 outliers)
titanic[c(1147, 1164), ]
## Pclass Survived Sex Age SibSp Parch Fare Cabin Embarked Boat Body
## 1: 3 No Female 39 0 5 29.125 <NA> Q <NA> 327
## 2: 3 No Male NA 0 0 24.150 <NA> Q <NA> NA
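As a rough sanity check on these two rows, the sketch below measures how many standard deviations each flagged fare lies from the cluster mean, using the mean and standard deviation printed in the report for that cluster. This is only an illustrative z-score calculation and is not claimed to be the exact criterion the library applies; the variable names are made up for the example.

```r
## Values copied from the report for the cluster
## Pclass = 3, SibSp = 0, Embarked = Q
cluster_mean <- 7.89
cluster_sd   <- 1.17

## Fares of the two flagged rows (1147 and 1164)
flagged_fares <- c(29.125, 24.150)

## Distance from the cluster mean, in standard deviations
z_scores <- (flagged_fares - cluster_mean) / cluster_sd
round(z_scores, 1)
```

Both fares sit more than ten standard deviations above the cluster mean, which is consistent with the report stating that 97.849% of the observations in this cluster have a fare of at most 15.50.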
## Distribution of the group from which those two outliers were flagged
titanic[
    Pclass == 3 &
    SibSp == 0 &
    Embarked == "Q"
][
    , Fare
] |> hist(breaks = 100, col = "navy", xlab="Fare",
          main="Distribution of Fare within cluster")