Exploratory Analysis

Basic Statistics & Outlier Detection

Basic Statistics

  • Numerical Variables
  • Categorical Variables
  • Variance Inflation Factor of Numerical Variables

The average event in New York is ticketed and not free. It costs about 23 dollars, has a 25-word description, lasts a little over an hour, and usually takes place on a Thursday or Friday night in Greenwich Village or the East Village. It has 10 public transportation options nearby, the closest within 87 meters (2 min walk), and is surrounded by about 900 businesses.

To detect multicollinearity, we use the variance inflation factor (VIF), a measure of collinearity among the predictor variables in a multiple regression. For each predictor, it is the ratio of the variance of that predictor's coefficient (beta) in the full model to the variance of the same coefficient if the predictor were fit alone. Equivalently, VIF = 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. As a rule of thumb, VIF values above 5 or 10 are often taken as a sign that multicollinearity may be a problem.
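As a minimal sketch of how a VIF can be computed, the snippet below regresses each numerical column on the remaining ones and applies VIF = 1 / (1 - R^2). The function name `variance_inflation_factors` and the NumPy-only approach are assumptions made for illustration; the original analysis may have used a library routine instead.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X, using VIF_j = 1 / (1 - R_j^2).

    R_j^2 is the R-squared from an ordinary least squares regression of
    column j on all other columns (plus an intercept).
    Hypothetical helper for illustration, not the authors' code.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix: intercept column plus the remaining predictors.
        A = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        # A perfectly collinear column gives r_squared == 1 and an infinite VIF.
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs
```

Calling `variance_inflation_factors(df[numeric_columns].to_numpy())` on a data frame (where `df` and `numeric_columns` are placeholder names) would return one value per numerical variable, to be compared against the 5 or 10 rule of thumb.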


Outlier Detection

Method 1: box plot visualization of the continuous variables
Method 2: flag any row with a value more than 3 standard deviations from that variable's mean (897 detected)
Method 3: multidimensional outlier detection using the local outlier factor (LOF) with 20 neighbors, flagging scores > 1.75 (627 detected)
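Methods 2 and 3 above can be sketched roughly as follows. This assumes scikit-learn's `LocalOutlierFactor` for the LOF scores. The function names are hypothetical, and the source does not say whether features were standardized before LOF, so the sketch leaves them as they are.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def zscore_outlier_rows(X, threshold=3.0):
    """Method 2 (sketch): flag rows where any column is more than
    `threshold` standard deviations from that column's mean."""
    X = np.asarray(X, dtype=float)
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return (np.abs(z) > threshold).any(axis=1)


def lof_outlier_rows(X, n_neighbors=20, threshold=1.75):
    """Method 3 (sketch): flag rows whose LOF score exceeds `threshold`.

    scikit-learn stores the negated score in `negative_outlier_factor_`,
    so the sign is flipped to get the usual LOF (about 1 for inliers,
    larger for outliers).
    """
    X = np.asarray(X, dtype=float)
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_
    return scores > threshold, scores
```

Both functions return a boolean mask with one entry per row, so the counts reported above would correspond to `mask.sum()` on the real data. The two masks need not agree: the first looks at one variable at a time, while LOF compares each row's local density with that of its neighbors across all variables.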