-
Big Data
Hamish Burke | 2025-02-20
Related to: #bigData
-
Course Info
-
3 late days
- Not applied to test or group part of A3
-
People
-
Dr Qi Chen (Coordinator)
-
-
What is big data?
Huge data; so big it can't fit on one computer
-
The 5 V's of Big Data: Volume, Velocity, Variety, Veracity, Value
-
Feature Manipulation
-
Feature ranking
-
dimensionality reduction
-
subset selection (differs from ranking: evaluates groups of features together, rather than scoring each feature on its own)
-
feature construction/creation (what ember did)
-
feature transformation
A feature is relevant if it makes it easier to separate the different classes (see the ranking sketch below)
Diagnostic vs non-diagnostic features
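A feature-ranking sketch using scikit-learn's mutual_info_classif (the dataset and scoring method are just for illustration; the course may use different ones):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Score each feature individually: higher mutual information
# with the class label = more "diagnostic" for separating classes
X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)

for idx, score in sorted(enumerate(scores), key=lambda p: p[1], reverse=True):
    print(f"feature {idx}: {score:.3f}")
```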
-
The Curse of Dimensionality
The data density decreases exponentially with dimensionality
-
Meaning fewer samples per region (see the density sketch below)
-
Less data to train the model
-
Therefore having redundant/irrelevant features can reduce the performance of the model
- Redundant meaning you already have another feature conveying the same information
- E.g. a person's age AND birth year (direct correlation)
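A quick numpy sketch of the density claim, assuming uniform data in the unit hypercube (numbers are illustrative): the fraction of points within a fixed distance of the centre collapses as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Fraction of uniform points in [0,1]^d within Euclidean
# distance 0.5 of the centre: shrinks rapidly with d
for d in (1, 2, 5, 10, 20):
    points = rng.random((n, d))
    inside = np.linalg.norm(points - 0.5, axis=1) <= 0.5
    print(f"d={d:2d}: fraction inside = {inside.mean():.4f}")
```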
-
Dimensionality Reduction
-
Manifold learning
- Tries to preserve structure
- We assume our data is lower-dimensional than the D features we have
- Preserve the geodesic distance
- Shortest path in a graph
- E.g. to fly from Auckland to London, the Euclidean distance would go straight through the Earth
- Whereas the geodesic distance follows the topology, around the Earth's surface (see the Isomap sketch below)
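A minimal geodesic-preserving example using scikit-learn's Isomap (Isomap isn't named in these notes; it approximates geodesic distances with shortest paths in a k-neighbour graph):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Swiss roll: 3-D points that actually lie on a rolled-up 2-D sheet
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Build a 10-neighbour graph, approximate geodesic distances via
# shortest paths in it, then embed into 2-D
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```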
-
Multidimensional Scaling (MDS)
-
Minimise the difference between pairwise distances (between instances) in the high-dimensional space vs the embedding space
-
Metric MDS
- Preserves magnitudes and orders of pairwise distances
- Assumes the triangle inequality
- Really sensitive to outliers
-
NonMetric MDS
- Preserves orders only
- Less sensitive to triangle-inequality violations and outliers
- Loses information (both variants are sketched below)
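A minimal sketch of both variants via scikit-learn's MDS; the metric flag switches between them (dataset and sample size are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:200]  # MDS works on all pairwise distances, so keep n small

# Metric MDS: tries to preserve the magnitudes of pairwise distances
X_metric = MDS(n_components=2, metric=True, random_state=0).fit_transform(X)

# Non-metric MDS: tries to preserve only their rank order
X_nonmetric = MDS(n_components=2, metric=False, random_state=0).fit_transform(X)
print(X_metric.shape, X_nonmetric.shape)
```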
-
Locally Linear Embedding (LLE)
-
- Select K neighbours for each datapoint
-
Compute weights to reconstruct each point from its neighbours
-
Find the embedding that minimises the reconstruction error (sketched below)
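A minimal sketch with scikit-learn's LocallyLinearEmbedding, which wraps all three steps (parameters are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# K neighbours -> reconstruction weights -> embedding that
# minimises the reconstruction error, as in the steps above
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
embedding = lle.fit_transform(X)
print(lle.reconstruction_error_)
```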
Hadoop MapReduce (MR)
- MR distributes the data over multiple commodity machines
- MR moves computation rather than data, because analysing big data on a single machine takes a lot of time
- MR's data-parallel programming model hides the complexity of distribution and fault tolerance
- Writes intermediate result to disk
- Reminds me of bit count compression algorithm (keeping track of counts of inputs in map)
- Groups value pairs with the same key (the shuffle step)
Combiner (mini-reducer)
Goal is to aggregate the key-value pairs: ([apple,1],[apple,1]) -> ([apple,2])
- A combiner only aggregates the pairs from a single machine (unlike the reducer, which sees all values for a key; see the sketch below)
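A toy single-process simulation of the word-count flow (function names here are made up; real Hadoop jobs implement Mapper/Reducer classes, usually in Java):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Mapper: emit (word, 1) for every word
    return [(word, 1) for word in document.split()]

def combine(pairs):
    # Combiner: pre-aggregate pairs produced on ONE machine
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def shuffle(pairs):
    # Shuffle: group values with the same key across all machines
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: final aggregation per key
    return {key: sum(values) for key, values in grouped.items()}

# Two "machines", each holding one split of the data
splits = ["apple banana apple", "banana apple cherry"]
combined = [combine(map_phase(split)) for split in splits]
result = reduce_phase(shuffle(chain.from_iterable(combined)))
print(result)  # {'apple': 3, 'banana': 2, 'cherry': 1}
```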
Apache Spark
- Cluster computing framework
- In memory computing (doesn't write intermediate result to disk)
- Much faster than Hadoop
- APIs for Scala, Java, Python, SQL, and R (see the PySpark sketch below)
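A minimal PySpark word count, assuming pyspark is installed and input.txt is a stand-in path:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

# reduceByKey aggregates locally per partition first (combiner-like),
# then across partitions; intermediate results stay in memory
counts = (sc.textFile("input.txt")  # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.collect())
sc.stop()
```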
Architecture
- Master/slave architecture
- Single driver (SparkContext)
- Cluster manager communicates between the driver and the workers
- Resource manager
- And any number of workers
- Each worker has an executor
- Which performs tasks, interacts with storage systems, and stores computation results (in memory)
spark-submit
./bin/spark-submit --master <url> --deploy-mode <mode> app.py [args]
Deploy modes:
- Cluster mode
- Launches the driver inside one of the worker nodes (example below)
- Client mode (default)
- Launches the driver locally as an external client
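For example, a hypothetical cluster-mode submission (host, port, and file names are made up):
./bin/spark-submit --master spark://203.0.113.7:7077 --deploy-mode cluster wordcount.py input.txt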