Exploratory Analysis
Basic Statistics & Outlier Detection
Basic Statistics
- Numerical Variables
- Categorical Variables
- Variance Inflation Factor of Numerical Variables
The average event in New York is ticketed and not free. It costs about 23 dollars, has a 25-word description, lasts a little over an hour, and usually takes place on a Thursday or Friday night in Greenwich Village or the East Village. It has 10 public transportation options nearby, the closest within 87 meters (2 min walk), and is surrounded by 900 businesses.
To detect multicollinearity, we used the variance inflation factor (VIF), a measure of collinearity among the predictor variables in a multiple regression. For each predictor, it is the ratio of the variance of that predictor's coefficient in the full model to the variance of the same coefficient if the predictor were fit alone. As a rule of thumb, VIF values above 5 or 10 are often taken as a sign that multicollinearity may be a problem.
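The VIF described above can also be written as VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes it with NumPy only. The function name `variance_inflation_factors` and the synthetic columns `a`, `b`, and `c` are illustrative assumptions, not the project's actual code or data.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X.

    Hypothetical helper for illustration; R_j^2 is obtained by regressing
    column j on the remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Intercept column so R^2 is measured around the mean of y.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic example (assumed data): c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

In this toy setup the two nearly identical columns should land well above the 5 to 10 rule of thumb, while the independent column stays close to 1.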
Explore Event Venues
It seems that having more businesses or transportation options around an event does not, on average, attract more participants. This contradicts our intuition that events held near the city center would generally be more popular.
Outlier Detection
Box plots for continuous variables.
Method 1: box plot visualization
Method 2: any row containing a value more than 3 standard deviations from that variable's mean (897 detected)
Method 3: multidimensional outlier detection using the Local Outlier Factor (LOF) with 20 neighbors, flagging scores > 1.75 (627 detected)
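Methods 2 and 3 above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the helper names `three_sigma_rows` and `lof_rows` and the synthetic matrix `X` are assumptions, and the sketch applies no feature scaling before LOF because the text does not say whether any was used. scikit-learn's `LocalOutlierFactor` stores the negated score in `negative_outlier_factor_`, so the sign is flipped before comparing against 1.75.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def three_sigma_rows(X, threshold=3.0):
    """Flag rows with any value more than `threshold` std devs from its column mean."""
    X = np.asarray(X, dtype=float)
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.any(np.abs(z) > threshold, axis=1)


def lof_rows(X, n_neighbors=20, cutoff=1.75):
    """Flag rows whose Local Outlier Factor score exceeds `cutoff`."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    # scikit-learn stores -LOF; inliers sit near 1 after the sign flip.
    scores = -lof.negative_outlier_factor_
    return scores > cutoff


# Synthetic data (assumed): 1000 ordinary rows, with row 0 replaced by an extreme point.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[0] = [12.0, 0.0, 0.0]

sigma_flags = three_sigma_rows(X)
lof_flags = lof_rows(X)
print(sigma_flags.sum(), lof_flags.sum())
```

The two methods answer different questions: the 3-sigma rule checks each variable separately, while LOF compares each row's local density with that of its neighbors across all variables at once, so the flagged sets need not coincide.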