$\begingroup$

I am attempting to identify possible outliers in data which is skewed to the right and I assume it is Poisson distributed. I am a novice in all things statistics, and the following may be utterly erroneous. However, I am eager to learn.

I have scoured Cross Validated and Stack Overflow for ideas on detecting outliers in situations like mine, but I couldn't find any instance where someone wrote an R script to find outliers in skewed, Poisson distributed data.

My actual data is shown below as the vector parktimes (n=5222). It is the result of a survey in which respondents answered how long it took them (in minutes) to park their car in a postal code area in Helsinki, Finland. Respondents could answer for multiple postal code areas at the same time, leaving the data with some identical timestamps with different values for different postal code areas. Most people reported finding a parking place almost instantaneously, making the data skewed to the right. The allowed range was 0-99, but 99 minutes to find a parking place in Helsinki seems unbelievable, and yet someone answered with that value for multiple postal code areas. I would like to find a statistical way to remove these improbable values if they are indeed outliers. To stay concise, the code below substitutes a running index for the exact timestamps.

Here is a histogram of parktime values made with ggplot:

ggplot(thesisdata, aes(parktime)) + geom_histogram(color = "black", binwidth = 5)

[Figure: parktime histogram]

Using this source and this source I have written an R script that I think detects outliers in my data. Simplified, my attempted outlier detection is as follows:

  1. Import data
  2. Apply Anscombe transform to data.table column parktime like so: anscombe_parktime <- 2.0 * sqrt(parktime + 3.0 / 8.0)
  3. Calculate the probability of observing a point under a Poisson distribution: ppois(anscombe_parktime, mean(anscombe_parktime))
  4. Plot results

With Anscombe transform (y axis is parktime): [Figure: data, assumed Poisson distributed, plotted with Anscombe transform]

Without Anscombe transform: [Figure: data, assumed Poisson distributed, plotted without Anscombe transform]

Is this a legitimate way to search for outliers in my data? Can Anscombe transform be used in this way to wrangle the data? Is my data even applicable for this kind of analysis?

My code:

library(ggplot2)
library(data.table)
library(outliers)

parktimes <- c(99,5,0,1,10,99,99,1,1,3,1,1,2,5,2,2,2,5,10,5,2,2,0,1,1,1,5,3,5,5,
               1,0,0,5,1,0,0,2,2,0,5,10,1,1,1,5,5,3,10,1,1,1,1,0,10,2,10,7,10,7,
               3,3,13,1,3,1,1,1,4,4,1,2,3,1,1,1,1,1,1,2,1,1,2,3,0,7,8,3,3,3,5,4,
               25,0,10,0,10,6,3,0,0,1,2,1,0,0,0,0,0,0,3,1,0,1,2,1,0,1,5,5,5,3,0,
               0,0,0,2,1,3,0,1,5,5,5,2,0,2,0,5,15,3,4,3,4,2,5,1,10,10,2,0,1,1,1,
               0,0,1,0,10,5,15,1,10,0,0,2,1,5,1,1,2,2,3,1,1,1,1,4,4,1,3,3,1,3,1,
               2,1,0,1,2,2,5,1,2,1,3,5,1,1,1,1,5,4,5,2,15,15,2,5,2,5,8,2,8,5,5,2,
               0,1,3,2,1,1,1,1,1,1,1,1,10,3,1,8,10,10,12,5,5,3,6,4,2,1,3,2,0,0,1,
               3,1,1,1,1,2,1,3,1,1,2,1,1,3,1,1,1,3,2,1,1,2,2,1,4,1,1,1,1,2,1,2,3,
               4,1,2,1,2,10,1,0,0,3,3,10,1,4,0,2,5,5,1,4,0,5,1,1,1,3,0,1,5,1,1,1,
               1,1,1,5,5,5,5,5,10,20,1,1,1,0,0,0,0,1,0,2,0,2,2,2,0,1,1,1,2,2,2,0,
               1,0,1,2,1,5,0,0,10,1,2,1,2,2,3,2,3,1,1,2,5,2,1,5,5,2,10,2,4,0,5,0,
               1,1,5,1,2,5,1,1,3,4,1,6,6,5,2,10,10,10,60,7,1,15,10,0,5,15,1,0,2,
               0,0,0,2,1,2,3,3,2,2,3,3,2,3,1,3,5,1,2,1,3,10,1,1,1,1,5,3,1,6,12,5,
               7,6,5,2,0,3,1,5,10,30,45,45,30,45,0,0,0,0,5,5,0,3,5,2,5,10,10,2,5,
               10,2,1,30,5,2,2,7,1,1,2,4,5,5,1,1,1,5,2,2,2,2,1,5,0,1,3,5,5,1,2,
               15,10,0,1,10,8,10,25,5,10,5,12,20,7,12,2,5,2,10,3,10,5,5,5,5,5,7,
               3,7,3,6,9,7,1,1,10,10,1,1,1,1,2,1,15,30,1,10,5,20,1,10,1,35,10,0,
               5,25,35,10,1,5,5,10,20,5,5,5,10,10,15,2,2,1,1,1,1,1,3,5,5,5,1,1,5,
               10,10,15,15,25,20,5,15,5,0,5,5,2,5,3,10,2,5,5,1,15,8,4,6,5,15,20,
               20,20,15,15,15,30,15,10,5,5,10,10,10,10,5,5,0,10,1,5,1,2,0,2,2,5,
               10,15,3,15,3,4,3,2,1,3,4,5,4,2,10,1,1,1,1,5,1,10,5,5,10,5,1,5,7,
               10,10,5,10,5,1,2,15,10,1,10,10,15,10,10,5,2,2,2,5,5,10,5,5,2,5,5,
               2,5,10,10,20,5,1,2,2,5,2,5,1,1,15,10,20,15,4,15,15,5,15,5,0,5,1,0,
               0,5,6,7,1,3,2,3,2,0,10,15,10,10,3,30,10,30,5,10,20,10,0,1,10,1,2,
               2,1,1,0,1,10,0,10,15,5,5,10,5,8,4,10,10,3,3,5,5,1,4,0,15,2,10,10,
               2,2,10,2,5,10,1,1,1,1,1,2,2,1,1,1,2,1,1,2,2,8,4,5,1,3,5,10,1,2,1,
               2,1,0,1,0,8,10,3,15,0,0,0,1,2,0,1,0,5,2,10,5,2,10,5,1,1,0,2,5,1,1,
               1,3,2,3,2,2,6,9,9,9,8,2,9,10,5,10,1,15,10,4,5,5,5,1,7,1,10,2,2,8,
               2,2,7,1,1,10,2,8,10,2,5,5,4,3,5,5,8,6,8,4,2,10,15,4,8,3,6,5,5,6,0,
               1,10,15,10,3,5,1,8,10,7,1,1,2,5,10,10,15,0,2,5,5,5,10,3,5,1,4,1,1,
               14,24,5,5,15,3,0,5,0,5,5,6,0,1,2,1,1,4,1,10,2,5,1,1,5,8,5,10,19,0,
               3,5,2,5,0,2,2,5,1,2,2,5,1,2,2,1,5,2,2,1,1,5,15,1,1,1,5,1,1,7,5,3,
               5,1,10,1,1,2,4,1,1,2,4,2,1,0,1,2,1,10,5,10,3,15,1,1,15,5,10,1,1,
               1,10,20,20,5,1,10,15,1,10,5,1,5,5,5,5,5,20,20,5,1,5,5,10,5,5,20,
               5,15,15,10,2,0,0,3,2,5,1,2,1,0,3,0,5,1,1,1,5,1,1,5,10,10,0,1,1,1,
               1,5,5,10,5,5,1,8,10,10,10,2,3,5,3,15,3,5,0,0,0,1,1,1,1,0,1,1,1,1,
               1,1,1,1,0,1,2,1,1,1,1,0,1,1,1,10,15,10,10,10,20,5,3,1,7,7,5,20,1,
               2,5,5,5,5,0,7,1,5,1,1,1,1,1,1,5,1,3,1,3,2,2,5,0,45,5,10,10,5,10,5,
               1,2,5,2,5,2,1,1,5,2,15,20,10,35,5,5,5,5,10,20,15,15,1,2,5,5,2,2,3,
               5,1,1,10,10,1,1,1,0,2,3,7,2,1,2,2,1,2,3,4,2,1,28,20,1,5,5,8,2,0,0,
               3,8,1,3,2,15,15,15,8,4,20,0,2,2,5,1,1,5,7,5,0,5,1,15,2,2,12,10,6,
               15,0,2,4,5,5,10,1,1,1,1,2,6,2,1,0,1,3,3,5,3,6,8,2,60,90,15,3,10,1,
               5,3,1,6,1,2,2,7,3,3,15,25,10,5,10,8,7,1,1,1,5,3,5,1,2,5,0,1,2,1,2,
               1,1,1,1,5,2,25,20,0,0,4,1,5,1,15,10,1,1,3,1,1,5,6,5,1,14,15,6,15,
               8,7,1,4,8,5,2,1,0,1,1,1,2,6,3,5,5,2,8,4,1,10,5,4,8,3,3,3,1,3,2,1,
               2,3,1,2,6,3,4,6,2,8,1,5,5,1,2,6,1,3,1,2,0,1,5,3,1,3,5,3,5,7,2,5,
               15,2,2,5,1,3,5,7,10,5,5,10,10,10,5,2,10,7,20,2,5,10,5,2,2,4,3,5,
               2,1,10,2,5,20,5,20,5,1,0,0,2,2,1,5,30,99,10,1,5,10,10,5,2,10,1,5,
               3,2,10,4,1,5,5,2,10,5,1,2,10,4,5,3,2,2,1,0,2,55,0,3,10,3,20,5,20,
               5,5,3,5,5,5,3,1,5,10,10,5,1,10,0,2,5,1,2,20,5,2,10,5,5,8,1,5,10,2,
               5,1,3,1,2,3,5,1,1,5,5,20,5,5,15,1,5,1,5,1,5,99,99,20,99,99,99,99,
               2,2,2,1,2,3,1,2,2,1,2,1,2,1,1,2,2,2,1,2,1,1,1,1,1,1,1,1,4,1,1,1,
               2,2,3,2,3,2,1,2,3,2,2,2,2,5,2,5,5,3,2,3,2,3,3,5,2,5,5,1,1,1,1,3,2,
               2,3,3,2,10,5,1,3,3,0,2,10,5,2,2,3,2,5,3,2,15,5,7,10,1,5,5,2,2,3,2,
               2,10,10,15,2,5,15,5,10,6,3,5,2,5,5,5,8,4,4,5,5,4,2,2,5,2,5,5,0,5,
               2,5,5,0,0,0,5,10,5,10,1,5,5,1,1,3,20,20,0,0,3,0,2,1,2,1,1,2,1,1,8,
               2,2,5,5,0,3,20,6,1,2,4,1,15,2,4,5,5,2,5,10,5,1,1,1,3,2,1,2,3,4,6,
               5,10,5,5,2,10,10,10,10,10,10,0,10,10,5,10,10,5,5,5,10,10,10,5,1,1,
               3,10,5,5,1,1,0,0,2,10,10,5,5,5,2,2,5,2,10,5,10,1,10,3,2,1,3,2,3,3,
               5,1,1,2,6,3,5,5,10,5,3,5,5,10,5,4,5,3,3,1,2,1,3,5,1,1,1,1,1,2,2,5,
               6,2,4,2,2,2,5,10,2,2,3,3,2,1,2,2,4,2,1,5,10,5,1,1,3,0,5,3,5,5,1,2,
               2,5,3,1,10,2,5,3,10,10,3,10,5,2,3,10,0,2,3,2,1,0,10,2,0,1,2,4,2,2,
               5,2,7,0,0,5,7,7,5,1,5,10,5,1,3,4,6,5,2,15,5,4,10,3,2,10,3,3,4,10,
               2,8,5,0,2,1,1,3,3,1,1,1,1,1,1,2,1,3,1,1,10,2,1,1,0,1,0,10,30,5,15,
               5,5,10,5,5,5,5,1,0,0,0,7,1,5,5,2,1,2,5,20,30,15,15,1,0,0,0,0,2,5,
               0,0,0,3,0,0,2,5,0,0,4,0,1,2,3,0,4,3,1,1,3,20,5,5,10,10,15,15,10,5,
               3,1,4,10,10,2,10,2,1,5,5,2,2,2,1,1,1,1,1,3,2,2,3,1,7,1,1,3,1,1,3,
               3,2,5,2,2,5,5,2,1,3,1,1,1,2,5,5,1,10,2,3,5,1,5,10,0,5,5,0,0,3,3,1,
               1,1,15,3,15,2,2,5,1,5,0,1,1,2,2,1,4,5,1,3,2,10,3,5,7,10,3,3,3,4,3,
               2,2,0,0,1,1,4,1,3,1,1,3,5,1,10,15,3,3,1,1,5,5,2,10,2,5,5,7,5,8,7,
               6,4,5,4,4,2,8,10,9,15,8,5,0,0,2,5,0,5,1,3,2,5,20,10,30,10,30,15,
               10,15,15,10,10,10,10,5,15,1,1,2,0,1,4,5,5,0,2,5,4,1,2,0,0,1,2,1,5,
               6,1,1,3,1,1,1,1,3,5,10,5,5,2,5,0,1,3,0,3,5,5,15,10,10,0,5,10,5,2,
               10,5,2,10,5,2,5,10,5,1,20,5,15,5,5,5,5,5,5,5,10,10,5,5,5,5,5,10,5,
               0,0,10,10,5,5,1,25,5,1,1,5,1,2,1,1,1,2,3,10,1,30,10,20,10,20,5,15,
               10,10,15,25,15,1,0,7,2,1,0,3,3,4,15,5,15,10,3,10,5,3,2,1,1,3,1,3,
               25,0,10,5,7,5,20,10,18,20,5,2,1,1,1,1,1,1,2,2,5,2,2,5,5,10,5,10,10,
               3,2,1,1,8,5,2,2,5,5,5,1,5,5,2,15,0,0,2,10,5,1,1,2,0,5,1,5,5,5,2,10,
               5,0,5,5,1,4,1,0,4,0,3,4,1,1,0,0,3,5,1,2,1,10,5,5,2,2,3,0,20,2,5,1,0,
               3,1,5,5,15,5,5,5,2,0,3,3,0,0,5,5,5,1,2,3,1,10,10,1,1,3,1,0,5,0,10,5,
               10,10,10,0,2,3,2,0,10,2,15,2,6,2,10,5,2,3,10,3,5,3,3,5,3,5,4,3,10,5,
               5,5,10,2,4,5,6,8,5,5,4,2,15,4,15,5,10,5,5,2,1,1,1,2,3,2,3,4,5,0,10,
               15,5,5,1,3,15,1,10,3,1,10,5,5,5,3,7,8,1,10,3,3,0,0,7,15,15,5,3,15,
               2,10,1,7,5,20,2,10,5,1,1,1,2,1,5,15,15,5,1,5,7,9,3,2,5,5,15,10,20,
               0,20,25,5,15,10,2,3,2,2,5,2,1,5,5,6,6,1,1,3,1,1,3,3,10,2,20,20,5,5,
               4,0,30,20,5,15,0,10,10,1,6,3,1,2,2,10,2,1,1,1,0,10,2,2,5,5,4,5,16,
               2,1,10,30,15,5,3,2,10,10,1,3,1,3,2,2,10,2,1,3,1,1,1,1,3,3,5,7,5,3,
               10,5,1,10,2,2,1,1,5,1,2,3,2,2,2,5,1,1,1,10,2,1,1,1,3,1,6,1,3,5,1,
               3,10,10,0,0,0,0,0,15,10,10,15,1,7,3,5,5,1,5,10,6,2,4,2,2,1,1,4,2,
               1,2,4,1,3,3,1,1,1,2,1,2,2,2,4,1,1,1,2,2,1,2,1,2,4,4,2,1,8,3,1,3,2,
               5,5,2,2,4,3,3,1,1,1,2,1,2,2,1,2,3,2,2,5,0,0,0,3,5,1,1,1,1,2,2,5,5,
               5,0,4,1,1,5,10,5,5,3,1,3,3,4,5,1,3,2,3,3,3,2,3,2,4,5,3,5,2,5,5,6,1,
               3,7,4,30,3,1,1,3,15,10,2,1,5,1,1,2,1,3,1,1,2,3,1,1,1,1,1,2,1,1,10,
               2,2,2,2,5,1,25,30,10,3,15,5,5,30,20,20,40,35,20,10,5,0,5,2,15,20,
               2,7,10,2,2,1,15,5,0,20,10,0,10,15,1,3,1,0,1,2,1,0,3,5,2,4,7,6,7,4,
               2,2,1,2,2,2,2,6,1,8,6,5,2,5,4,2,5,2,3,3,1,2,1,1,3,2,3,15,2,2,1,4,
               1,2,1,1,1,2,1,2,1,1,2,2,1,2,1,1,1,1,1,2,10,2,5,10,20,10,5,10,10,5,
               20,15,10,5,20,20,15,10,25,15,20,15,10,15,2,15,5,5,3,1,5,1,5,2,1,0,
               5,4,1,2,1,3,5,5,5,5,10,8,1,5,10,5,5,2,10,2,2,10,1,5,5,1,1,10,5,2,
               5,1,3,2,5,10,10,5,10,1,10,3,15,1,10,5,2,3,5,10,3,15,30,5,20,1,2,2,
               1,3,7,8,10,5,7,5,9,6,5,8,9,7,6,5,5,7,6,2,3,10,10,15,5,1,2,5,2,1,3,
               10,1,5,1,10,1,5,1,2,15,5,1,15,1,5,5,10,15,5,2,10,0,0,5,6,0,1,2,0,3,
               0,1,5,7,2,5,1,2,1,10,2,2,2,5,5,10,5,0,5,2,10,1,1,3,10,3,1,4,2,0,1,
               5,1,8,5,5,1,3,5,5,2,1,5,5,5,5,0,5,0,13,10,2,9,2,0,0,5,5,5,5,5,0,1,
               0,2,1,5,4,2,5,4,1,1,5,1,1,15,10,5,0,15,15,0,0,4,5,2,15,5,15,3,3,
               10,10,5,3,7,13,0,0,2,4,1,2,4,1,5,3,8,10,10,5,10,2,5,10,7,10,8,2,5,
               7,6,7,5,2,5,1,2,1,8,4,10,5,15,10,5,3,1,5,2,5,1,2,5,1,1,5,2,1,5,0,
               10,20,5,5,2,2,10,5,2,0,1,1,2,1,1,1,1,1,1,1,1,2,1,3,1,1,5,2,3,1,2,
               0,1,1,5,1,5,2,2,2,5,5,5,15,15,5,10,5,5,15,5,10,5,10,5,7,5,1,5,7,5,
               10,1,2,3,2,1,2,1,3,5,3,5,3,2,4,5,2,1,5,5,20,5,10,10,10,10,5,3,5,2,
               10,4,1,3,5,5,4,7,5,3,5,2,2,10,4,0,8,2,4,3,15,5,2,8,3,10,5,20,2,0,
               0,10,1,1,1,1,1,1,0,0,2,0,10,20,2,10,2,1,3,2,2,5,3,4,1,5,3,1,1,7,2,
               4,5,4,5,5,5,10,1,1,3,5,5,0,0,1,1,1,5,0,0,0,0,1,1,2,0,3,0,10,1,2,1,
               1,10,0,2,2,5,1,5,3,5,1,3,3,10,0,0,0,5,5,1,2,1,1,2,3,10,10,5,4,1,5,
               5,2,3,1,1,5,1,2,25,0,5,5,2,3,1,1,2,1,2,1,5,5,5,5,15,5,5,1,3,2,5,2,
               4,2,10,1,7,10,20,5,10,5,1,3,10,2,20,10,15,1,10,1,5,1,3,2,5,6,3,10,
               3,15,7,5,10,1,1,1,1,1,1,4,1,10,0,0,0,0,0,2,0,0,2,0,0,0,10,5,2,2,3,
               3,4,1,2,2,10,8,1,3,1,4,15,5,1,5,0,2,0,3,2,3,0,1,5,2,1,0,1,3,1,10,0,
               3,3,1,1,1,5,1,1,1,1,1,1,3,1,3,2,10,0,10,2,10,1,1,1,1,1,1,1,0,3,0,1,
               3,0,1,4,3,5,1,10,5,2,5,10,2,2,3,15,10,10,5,10,5,2,5,5,10,2,1,2,0,5,
               5,2,2,2,2,2,10,10,10,3,10,2,1,1,2,3,1,5,2,1,1,3,4,1,2,1,3,2,1,1,2,
               1,2,0,1,3,5,1,3,3,2,1,2,3,2,5,3,2,3,1,3,8,1,4,2,2,4,5,11,1,6,2,10,
               3,0,0,0,20,10,15,5,15,7,7,10,3,5,2,3,1,0,0,0,0,5,1,3,2,1,1,1,2,1,2,
               2,5,2,1,1,2,1,2,0,0,3,0,0,0,2,2,5,5,5,1,60,15,2,0,3,5,5,1,2,10,2,0,
               2,15,5,1,20,3,0,10,0,5,10,0,0,10,0,0,5,0,5,2,2,10,1,1,5,1,5,2,5,2,
               15,20,15,5,5,5,15,5,2,10,20,1,1,2,1,1,5,1,5,3,3,1,3,15,6,15,10,10,
               15,20,10,1,1,1,3,3,4,4,15,1,10,5,5,4,0,1,2,2,2,2,3,2,3,5,2,1,1,2,
               3,2,5,15,4,3,1,5,0,1,2,1,3,0,1,5,1,1,0,5,0,0,0,10,5,5,5,5,10,0,1,
               1,2,15,10,30,1,1,0,2,3,2,4,5,10,3,10,1,1,1,7,3,1,3,3,3,10,5,3,2,7,
               0,5,2,0,30,20,10,10,10,10,10,10,10,10,10,5,5,5,5,10,2,5,5,2,20,5,
               30,15,10,5,6,5,20,1,10,10,1,1,5,5,1,5,5,10,15,15,5,10,10,5,3,3,5,
               10,5,0,5,5,1,5,5,15,20,5,5,5,1,15,5,20,1,2,10,1,2,0,1,5,5,10,1,5,
               1,1,1,1,1,2,2,10,10,3,5,0,3,1,1,1,0,1,3,1,1,5,0,10,5,0,0,3,3,5,0,
               0,1,10,5,5,3,10,10,10,2,35,20,25,15,5,5,2,2,5,2,5,0,3,3,1,30,10,
               15,5,20,5,10,10,20,15,5,10,5,5,15,20,15,5,0,1,4,10,3,4,26,5,10,10,
               1,5,0,0,5,5,5,5,10,30,2,2,5,1,3,3,1,1,1,3,1,3,7,3,15,20,0,15,5,25,
               3,25,0,30,0,5,1,1,2,1,1,5,10,5,0,0,20,1,0,15,5,5,15,15,15,15,15,10,
               10,15,10,30,30,20,20,5,5,1,4,4,5,5,10,2,0,5,1,1,15,15,5,4,1,1,3,3,
               1,0,15,0,10,20,15,5,4,0,0,2,1,0,2,0,2,1,1,2,2,1,0,5,4,3,3,5,5,2,1,
               5,4,2,10,2,2,10,3,3,5,10,1,0,10,5,0,10,5,10,5,10,10,60,30,30,99,0,
               2,1,0,1,1,2,1,2,1,5,1,1,1,5,5,5,1,0,1,0,0,0,0,3,3,10,2,5,2,2,1,5,3,
               6,2,3,7,5,3,1,1,1,1,1,5,5,5,5,7,2,5,5,10,2,2,5,5,5,10,5,5,5,5,5,5,
               10,15,5,5,5,5,0,2,10,0,2,5,0,1,10,2,1,1,2,4,5,1,2,2,0,5,2,2,3,3,1,
               1,10,0,3,0,1,10,12,3,2,6,9,3,5,2,1,1,1,3,4,5,10,5,10,15,20,6,5,5,
               5,1,5,15,5,5,10,8,3,15,12,0,5,2,5,5,3,5,4,1,1,3,1,5,2,10,20,1,15,
               15,10,3,1,3,2,0,5,0,1,0,1,2,2,1,1,0,1,10,1,5,1,1,1,4,0,5,1,1,15,10,
               1,5,5,5,1,10,0,10,2,1,99,99,99,99,99,5,1,10,30,3,5,5,10,10,0,10,0,
               4,1,12,5,1,4,1,3,0,15,3,10,5,1,2,1,1,1,2,1,0,1,1,3,5,2,25,15,20,1,
               5,2,10,3,3,4,1,3,2,1,5,3,10,1,10,5,1,25,5,20,10,20,15,15,10,10,18,
               0,5,1,0,5,2,10,5,5,2,5,5,3,1,3,2,0,2,1,5,99,99,99,99,99,99,99,99,
               99,99,2,5,1,3,5,5,0,2,5,7,10,2,15,3,30,20,2,1,0,1,0,1,2,5,4,1,1,1,
               2,2,0,2,2,2,2,2,1,3,10,20,15,10,2,3,5,10,5,0,10,10,10,15,1,1,9,2,
               1,7,5,5,5,3,2,2,1,2,1,1,5,1,20,2,5,15,5,5,3,5,2,3,15,1,5,3,5,0,5,5,
               10,5,7,1,1,1,3,20,1,3,0,5,1,1,1,15,30,5,35,15,5,5,5,2,2,1,1,15,1,
               4,3,2,3,1,5,3,1,3,3,2,10,1,5,1,5,1,2,7,30,20,15,5,30,10,10,5,10,10,
               10,5,5,0,5,10,10,10,10,10,5,15,10,15,15,15,10,15,20,15,20,20,5,5,
               20,10,10,5,1,0,2,5,2,5,5,1,2,2,2,10,1,2,7,2,15,15,15,5,15,5,10,1,
               20,2,1,99,0,2,0,5,2,5,1,10,5,5,5,1,5,2,2,5,5,5,3,5,1,0,5,15,7,2,4,
               5,5,10,2,10,10,10,3,3,10,5,5,15,5,10,10,2,5,20,5,5,1,5,10,15,1,3,
               2,1,3,1,1,1,1,1,1,1,2,1,1,1,1,2,1,1,1,2,2,1,1,1,1,1,3,3,1,5,7,10,
               2,5,10,15,2,5,2,2,3,4,3,2,5,4,10,5,3,2,2,2,5,1,1,5,2,5,5,10,5,15,
               1,1,1,1,15,2,5,2,10,3,5,2,1,6,5,1,5,5,1,3,5,3,1,4,5,3,5,4,1,8,5,1,
               5,5,9,5,5,9,4,3,4,2,5,2,1,5,10,10,5,1,10,1,5,1,1,3,2,1,5,3,3,5,1,
               5,1,2,2,0,7,7,2,0,1,3,10,1,2,1,1,5,5,1,5,1,1,2,0,5,15,5,15,5,5,15,
               2,2,1,1,10,1,5,10,1,1,1,1,15,1,4,1,1,1,2,1,10,1,5,15,5,10,15,3,1,
               1,1,0,5,5,5,0,5,7,1,7,9,2,1,6,5,10,2,2,5,2,8,1,1,1,1,2,5,10,1,10,
               1,7,5,4,5,5,5,10,10,15,5,0,10,15,99,99,99,99,5,1,1,2,5,1,5,1,5,5,
               10,10,5,10,5,5,10,2,15,0,1,0,7,5,0,1,0,0,5,5,5,3,10,5,3,1,10,15,3,
               6,6,1,3,2,0,15,2,20,10,0,1,0,2,5,15,5,2,1,1,5,5,1,5,1,20,15,15,1,
               1,2,1,3,0,5,3,0,0,5,6,3,5,6,4,1,2,4,1,10,5,6,3,7,10,5,10,10,5,2,5,
               1,1,5,1,2,5,2,5,2,2,2,5,1,8,1,1,1,1,1,4,7,0,3,3,1,3,2,1,6,1,0,2,1,
               0,5,1,1,6,1,5,1,3,3,3,3,7,2,10,4,3,5,5,7,3,5,3,6,1,5,1,4,4,3,2,1,
               1,2,1,2,15,18,5,0,1,5,0,3,5,0,0,0,1,1,1,3,0,0,1,2,0,2,20,2,4,2,2,
               34,0,1,0,4,10,0,7)

# seq_along() is the idiomatic (and empty-vector-safe) form of seq(1:length(x))
thesisdata <- data.table(id = seq_along(parktimes), 
                         parktime = parktimes)

Anscombe <- function(x) {

  # http://github.com/broxtronix/pymultiscale/blob/master/pymultiscale/anscombe.py

  # Compute the Anscombe variance stabilizing transform.

  # the input x is noisy Poisson-distributed data 
  # the output fx has variance approximately equal to 1.

  # Reference: Anscombe, F. J. (1948), "The transformation of Poisson,
  # binomial and negative-binomial data", Biometrika 35 (3-4): 246-254

  return (2.0 * sqrt(x + 3.0 / 8.0))
}


CalculatePoissonDist <- function(thesisdata, colnam) {

  # According to:
  # http://www.sqlservercentral.com/articles/scoring-outliers-in-non-normal-data-with-r

  # We're going to use the ppois() function to calculate an "outlier score" for 
  # every observation in our dataset. The intuitive way to think about this 
  # score is the "likelihood of observing a point this large". This is a 
  # somewhat loose interpretation of a p-value, but suitable for detecting 
  # outliers.
  # This function fails if input dataframe is not a data.table dataframe.


  # Calculate Poisson distribution for parktime or walktime. Creates two new
  # columns, Score (double) and Outlier (boolean). Explicitly prints results
  # and returns the inputted dataframe with updates.

  # Try Anscombe transform for the parameter column
  anscombe_col <- paste0("anscombe_", colnam)
  thesisdata[, (anscombe_col) := Anscombe(thesisdata[, get(colnam)])]

  # Calculate a "p-value" for outliers, based on the poisson probabilities.
  # Use get() to enable string column names in data.table syntax
  thesisdata[, Score := 1 - ppois(q = get(anscombe_col), 
                                  lambda = mean(get(anscombe_col)))]

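  # Apply a Bonferroni correction factor to the p-value, to control the long-run 
  # error rate. Divide by the number of observations tested (.N) rather than a
  # hard-coded 1000, which does not match this dataset.
  thesisdata[, Outlier := Score < 0.05 / .N]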

  # Add a Method column with all values "Poisson"
  thesisdata[, Method := "Poisson"]

  # Visualise the results
  p <- ggplot(thesisdata, aes(x = id, y = !!sym(colnam))) + 
    geom_point(aes(colour = Outlier), size = 3, alpha = 0.7) +
    scale_colour_manual(values = c("darkgrey", "red")) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
    theme_minimal()
  print(p)

  return(thesisdata) 
}

# Outliers in count data?
thesisdata <- CalculatePoissonDist(thesisdata, "parktime")
$\endgroup$
  • $\begingroup$ See ucslk.com/questions/13086. That will be more generally applicable to these data than attempting to model them as Poisson will be (because there appears to be no a priori reason to adopt a Poisson model--these times are not count data). $\endgroup$ – whuber Mar 26 at 10:30
  • $\begingroup$ @whuber sdat = Sort[data]; freq = Table[Count[sdat, i]/5222, {i, 0, 99}] $\endgroup$ – Carl Mar 26 at 19:45
$\begingroup$

The data is just too noisy to analyze by direct inspection, so I can understand why the question of outliers arose. However, calling a value an outlier requires that a logical, physical reason for that status be at least postulated. The only outliers here are the 99+ answers, which literally lie outside the range of the data. What is going on with the human responses can be seen using a more precise histogram.

[Figure: minute-by-minute histogram of the responses]
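A one-minute-bin tabulation like the one in the histogram can be sketched in R as below. The vector x is a small hypothetical sample, not the survey data, and share_round is an illustrative name for the fraction of in-range, non-zero answers that fall on a multiple of five minutes.

```r
# Hypothetical miniature sample of reported times in minutes (99 = capped answer)
x <- c(0, 1, 1, 2, 3, 5, 5, 5, 7, 10, 10, 14, 15, 15, 16, 20, 30, 99)

# Count every integer minute from 0 to 99, including minutes nobody reported
counts <- table(factor(x, levels = 0:99))

# Share of in-range, non-zero answers that land on a multiple of five minutes
round_values <- seq(5, 95, by = 5)
share_round <- sum(counts[as.character(round_values)]) / sum(x > 0 & x < 99)
```

With the real data one would pass parktimes instead of x; barplot(counts) then draws the minute-by-minute histogram.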

As the minute-by-minute histogram shows, the answers to the question of how long it takes to park are human time estimates, whose frequency spikes at certain values: 1, 5, 10, 15, 20, 25, 30, ... min. These are clock-face intervals. That is, I am postulating that people are more likely to say (approximately) 15 min than 14 or 16 min. Consequently, it is hard to find a distribution that fits the raw data. However, I applied a Gaussian kernel smooth to the data (in Mathematica) just to get some idea of its shape:

[Figure: Gaussian-kernel-smoothed data]

Following that, I generated magnitudes from -10 to 109 (range extended because of the smoothing) and then tried to find a distribution for them (FindDistribution routine).

[Figure: FindDistribution results for the smoothed data]

Now, without smoothing I got:

[Figure: FindDistribution results for the unsmoothed data]

For the unsmoothed data, if one ignores the mixture distributions, which are attempting (not very successfully) to model the noise, one is left with a geometric distribution or a negative binomial distribution.

After smoothing, the candidates are a gamma distribution or a beta distribution. I noticed that in the raw data the maximum value of 99 is populated several times, which is likely why the beta distribution was identified after smoothing.

Thinking physically about this problem, there are no whole-number wait times. That is, no one parks at exactly 1 min elapsed time; actual times are more likely something like 5341 milliseconds or 3 min 34.453 s. So a gamma-distributed wait time model might be more appropriate. The gamma distribution is related to the Poisson process: it is the continuous model for the waiting times in such a process. I would suggest you fit a gamma CDF to the observed CDF, as that will damp the noise without falsifying the model.

To create the CDF, truncate the 99+ entries so that the CDF data for fitting stops at 0.994064, which is $1-\dfrac{31}{5222}$, where 31 is the number of 99+ answers, and 5222 the total number of realizations.
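The truncation step can be sketched in R as follows. The vector x is a hypothetical miniature sample in which 99 stands for the capped "99+" answer; the names kept, ecdf_vals and cdf_top are illustrative only. The key point is that the cumulative proportions are still divided by the full sample size, so the curve stops below 1.

```r
# Hypothetical miniature sample; 99 stands for the capped "99+" answer
x <- c(0, 1, 1, 2, 3, 5, 5, 10, 15, 99)
n <- length(x)
n_capped <- sum(x >= 99)

# Distinct in-range values at which the empirical CDF is evaluated
kept <- sort(unique(x[x < 99]))

# Divide by n (all responses) so the capped answers keep their probability mass
ecdf_vals <- sapply(kept, function(v) sum(x <= v) / n)

# The CDF used for fitting therefore stops at 1 - n_capped / n
cdf_top <- 1 - n_capped / n
```

For the survey data the same arithmetic with 31 capped answers out of 5222 gives the stopping value quoted above.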

So, just for fun, I did that. The CDF gamma distribution is:

$$F(x) = \begin{cases} Q\left(a, 0, \frac{x}{b}\right) & x > 0 \\ 0 & \text{elsewhere} \end{cases}$$

where $Q(\cdot,\cdot,\cdot)$ is the generalized regularized incomplete gamma function. Be careful, as Mathematica might parametrize with $b$ where other implementations use $1/b$. The coefficients I got from ordinary least squares regression were $a=0.6618887062, b=6.679277804$, and the fit plot was this:

[Figure: fitted gamma CDF against the observed CDF]
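For readers working in R rather than Mathematica, a least squares fit of a gamma CDF to an empirical CDF might look like the sketch below. It is not the code used for the numbers above: the data are simulated with rgamma() under assumed parameters, the grid 0:60 is arbitrary, and sse, a_hat and b_hat are illustrative names. In R, pgamma() takes shape and scale, so a maps to shape and b to scale.

```r
set.seed(1)

# Simulated stand-in for continuous wait times (assumed shape 1.1, scale 4.6)
t_obs <- rgamma(5000, shape = 1.1, scale = 4.6)

# Empirical CDF evaluated on a grid of whole minutes
grid <- 0:60
F_obs <- ecdf(t_obs)(grid)

# Sum of squared differences between empirical and gamma CDF;
# parameters are optimized on the log scale to keep them positive
sse <- function(p) {
  sum((F_obs - pgamma(grid, shape = exp(p[1]), scale = exp(p[2])))^2)
}

fit <- optim(c(0, 0), sse)  # Nelder-Mead by default
a_hat <- exp(fit$par[1])
b_hat <- exp(fit$par[2])
```

With real survey answers, F_obs would be replaced by the truncated empirical CDF, and the grid would stop before the capped value.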

I note that it works a bit more realistically if I shift the data one minute to the right. In that case $a=1.113789864, b=4.648996063$. Then, as $a>1$, the gamma pdf assigns zero probability density to parking in zero time and has a peak at 0.529008630 min, as shown below. That is physical: human reaction time is not zero, so parking can happen within the first minute, which is <1 min but not zero. (Birthdays cause the same confusion: the first birthday is when the first year has finished.)

[Figure: fitted gamma density for the shifted data]

The fitted density is

$$\frac{b^{-a} t^{a-1} e^{-\frac{t}{b}}}{\Gamma (a)},$$ where $t$ is time in minutes, $a=1.11379$ is dimensionless, and $b=4.64900$ min. That is, $$0.190915 e^{-0.215100 t} t^{0.113790}.$$
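The numeric constants in that density can be checked in R with a few lines. This is only a verification sketch using the fitted values quoted above; const, rate and mode_t are illustrative names, and the evaluation points in t are arbitrary.

```r
# Fitted parameters quoted above (shifted data)
a <- 1.113789864
b <- 4.648996063

const  <- b^(-a) / gamma(a)  # leading coefficient of the density
rate   <- 1 / b              # coefficient of t in the exponent
mode_t <- (a - 1) * b        # location of the density peak when a > 1

# The closed form should agree with R's built-in gamma density
t <- c(0.5, 3, 10)
closed_form <- const * exp(-rate * t) * t^(a - 1)
agrees <- isTRUE(all.equal(closed_form, dgamma(t, shape = a, scale = b)))
```

The mode formula $(a-1)b$ reproduces the peak location given earlier for the shifted fit.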

BTW, the median wait estimated from the raw data is 3 min.

$\endgroup$
  • $\begingroup$ Thank you for your thorough answer. I can't, however, sort out your conclusion to my problem. Would you kindly elaborate how the CDF helps in detecting the outliers in my skewed data? Are all the plots you display made in Mathematica? $\endgroup$ – Vesanen Mar 25 at 19:14
  • 1
    $\begingroup$ The only outliers are the 99+ answers. Everything else is good data. The CDF was used for noise reduction. The "noise" in the data is due to human nature as in "give me a minute" or "wait a second." The choice of 5, 10, 15,... minutes is due to clock divisions which is how most people, even in this digital age, learn to tell time. Wait times have heavy tails to the right, that is not "outlier" activity, it is data. $\endgroup$ – Carl Mar 26 at 0:35
  • $\begingroup$ Yes, Mathematica. If you want the "notebook" i.e., the Mathematica code, I will figure out how to post it. $\endgroup$ – Carl Mar 26 at 1:02
  • $\begingroup$ Thank you again for your swift, illustrative answers. I would appreciate the Mathematica code, if you have the time to post it. $\endgroup$ – Vesanen Mar 26 at 9:28
  • $\begingroup$ OK, working on it, but accept the answer and vote for it. I up voted your question, so you have enough reputation points to up vote. $\endgroup$ – Carl Mar 26 at 19:14
