Big Data
Hamish Burke | 2025-02-20
Related to: #bigData
Course Info
- Course Page
- 3 late days
- Not applied to test or group part of A3
People
- Dr Qi Chen (Coordinator)
- Dr Bach Nguyen
What is big data?
Huge data: so big it can't fit on one computer
The 5 V's of big data
Using Big data
- Predict Flu trends (Google flu trends)
- Took the 50 million most common search queries from 2004-2008
- Open question: all queries, or only those related to the flu?
- GFT then started to consistently overestimate flu trends
- Why?
- The number of people searching for symptoms is not the number of people actually getting the flu
- Combining GFT with two-week-old CDC data still provided a good model
Feature Manipulation
- Feature ranking
- dimensionality reduction
- subset selection (different to ranking?)
- feature construction/creation (what ember did)
- feature transformation
A feature is relevant if it makes it easier to separate the different classes
Diagnostic vs non-diagnostic features
- ???
The data density decreases exponentially with dimensionality
- Meaning fewer samples per region
- Less data to train the model
- Therefore having redundant/irrelevant features can reduce the performance of the model
- Redundant meaning you already have another feature conveying the same information
- E.g. a person's age AND birth year (direct correlation)
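A rough sketch of why density drops exponentially, assuming each feature is cut into a fixed number of equal bins so that the regions form a grid. The function name and the numbers (1000 samples, 10 bins) are made up for illustration.

```python
def samples_per_region(n_samples: int, bins_per_feature: int, n_features: int) -> float:
    """Average number of samples per grid cell when every feature is split into bins."""
    # The number of cells grows exponentially with the number of features.
    n_regions = bins_per_feature ** n_features
    return n_samples / n_regions


# Same 1000 samples, more and more features (hypothetical numbers).
for d in (1, 2, 3, 6):
    print(d, samples_per_region(1000, 10, d))
```

With the sample count fixed, every added feature divides the average per-region count by the number of bins, which is why extra redundant or irrelevant features cost data.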
Feature Selection
- Select relevant features
Terms
-
Classical
- select m features from the n original features, where $$m < n$$
-
Idealised
- Find minimally sized feature subset that is necessary and sufficient
- Improve classification accuracy/reduce complexity
- Approximating the original class distribution
Why?
- Save money (smaller model/faster)
- Help with visualisation and interpretation
Single Feature Ranking
- Use an algorithm to measure the importance of each feature individually
- Eg for decision trees
- How often a feature is used for splits can serve as a measure of its importance
- Can plot number of features used vs accuracy
- Find the number of features at which accuracy peaks
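A minimal sketch of ranking features one at a time. It does not count split frequency in a full decision tree as described above. Instead it swaps in a simpler stand-in: each feature is scored by the training accuracy of a one-split rule (a decision stump) built on that feature alone. The function names and the toy data are made up for illustration.

```python
def stump_accuracy(values, labels):
    """Best training accuracy of a single split (value <= threshold) on one feature."""
    best = 0.0
    for threshold in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        # Each non-empty side predicts its majority class; count correct predictions.
        correct = sum(
            max(side.count(c) for c in set(side)) for side in (left, right) if side
        )
        best = max(best, correct / len(labels))
    return best


def rank_features(rows, labels, names):
    """Score every feature individually, then sort from most to least useful."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = stump_accuracy(column, labels)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy data: "temperature" separates the classes, "shoe_size" does not.
rows = [(36.5, 42), (36.8, 38), (37.0, 44), (38.5, 41), (39.0, 39), (39.4, 43)]
labels = ["healthy", "healthy", "healthy", "flu", "flu", "flu"]
ranking = rank_features(rows, labels, ["temperature", "shoe_size"])
print(ranking)
```

Because every feature is scored in isolation, this kind of ranking cannot see redundancy: two features carrying the same information would both score well.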
Filter Approach
- No learning algorithm during feature selection
Pearson's Correlation
- Measures the linear relationship between two variables: with a positive correlation, if one variable increases, the other is likely to increase too
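A small sketch of Pearson's correlation written out by hand, using the age / birth year pair from earlier as a redundant-feature example. The data values and the reference year 2025 are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


ages = [20, 35, 50, 65]
birth_years = [2025 - a for a in ages]  # hypothetical reference year
print(pearson(ages, birth_years))
```

The result sits at the negative end of the range (age goes up as birth year goes down). A magnitude near 1 in either direction suggests one of the two features is redundant.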
Mutual Information
- A measure of the mutual dependence between the two variables
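A minimal sketch of mutual information for two discrete variables, estimated from counts and measured in bits. The function name and the toy lists are made up for illustration; continuous features would need binning or a different estimator first.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # Compare the joint probability with what independence would predict.
        mi += p_xy * log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


feature = [0, 0, 1, 1]
same_as_feature = [0, 0, 1, 1]  # fully dependent on the feature
unrelated = [0, 1, 0, 1]        # independent of the feature
print(mutual_information(feature, same_as_feature))
print(mutual_information(feature, unrelated))
```

Unlike Pearson's correlation, this score does not assume a linear relationship, so it can also flag dependence between a feature and a categorical class label.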