-
Big Data
Hamish Burke | 2025-02-20
Related to: #bigData
-
Course Info
-
3 late days
- Not applied to test or group part of A3
-
People
-
Dr Qi Chen (Coordinator)
-
-
What is big data?
Huge data; so big it can't fit on one computer
-
The 5 V's of Big Data: Volume, Velocity, Variety, Veracity, Value
-
Feature Manipulation
-
Feature ranking
-
dimensionality reduction
-
subset selection (differs from ranking: evaluates groups of features together, rather than scoring each feature on its own)
-
feature construction/creation (what ember did)
-
feature transformation
A feature is relevant if it makes it easier to separate the different classes (see the ranking sketch below)
Diagnostic vs non-diagnostic features
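A feature-ranking sketch using scikit-learn's mutual_info_classif (the dataset and scoring method are just for illustration; the course may use different ones):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Score each feature individually: higher mutual information
# with the class label = more "diagnostic" for separating classes
X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)

for idx, score in sorted(enumerate(scores), key=lambda p: p[1], reverse=True):
    print(f"feature {idx}: {score:.3f}")
```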
-
The Curse of Dimensionality
The data density decreases exponentially with dimensionality
-
Meaning fewer samples per region (see the density sketch below)
-
Less data to train the model
-
Therefore having redundant/irrelevant features can reduce the performance of the model
- Redundant meaning you already have another feature conveying the same information
- E.g. a person's age AND birth year (direct correlation)
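A quick numpy sketch of the density claim, assuming uniform data in the unit hypercube (numbers are illustrative): the fraction of points within a fixed distance of the centre collapses as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Fraction of uniform points in [0,1]^d within Euclidean
# distance 0.5 of the centre: shrinks rapidly with d
for d in (1, 2, 5, 10, 20):
    points = rng.random((n, d))
    inside = np.linalg.norm(points - 0.5, axis=1) <= 0.5
    print(f"d={d:2d}: fraction inside = {inside.mean():.4f}")
```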
-
Dimensionality Reduction
-
Manifold learning
- Tries to preserve structure
- We assume our data is lower-dimensional than the D features we have
- Preserve the geodesic distance
- Shortest path in a graph
- E.g. to fly from Auckland to London, the Euclidean distance would go straight through the Earth
- Whereas the geodesic distance follows the topology, around the Earth's surface (see the Isomap sketch below)
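A minimal geodesic-preserving example using scikit-learn's Isomap (Isomap isn't named in these notes; it approximates geodesic distances with shortest paths in a k-neighbour graph):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Swiss roll: 3-D points that actually lie on a rolled-up 2-D sheet
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Build a 10-neighbour graph, approximate geodesic distances via
# shortest paths in it, then embed into 2-D
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```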
-
Multidimensional Scaling (MDS)
-
Minimise the difference between pairwise distances (between instances) in the high-dimensional space vs the embedding space
-
Metric MDS
- Preserves magnitudes and orders of pairwise distances
- Assumes the triangle inequality
- Really sensitive to outliers
-
NonMetric MDS
- Preserves orders only
- Less sensitive to triangle-inequality violations and outliers
- Loses information (both variants are sketched below)
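A minimal sketch of both variants via scikit-learn's MDS; the metric flag switches between them (dataset and sample size are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:200]  # MDS works on all pairwise distances, so keep n small

# Metric MDS: tries to preserve the magnitudes of pairwise distances
X_metric = MDS(n_components=2, metric=True, random_state=0).fit_transform(X)

# Non-metric MDS: tries to preserve only their rank order
X_nonmetric = MDS(n_components=2, metric=False, random_state=0).fit_transform(X)
print(X_metric.shape, X_nonmetric.shape)
```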
-
Locally Linear Embedding (LLE)
-
- Select K neighbours for each datapoint
-
Compute weights to reconstruct each point from its neighbours
-
Find the embedding that minimises the reconstruction error (sketched below)
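A minimal sketch with scikit-learn's LocallyLinearEmbedding, which wraps all three steps (parameters are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# K neighbours -> reconstruction weights -> embedding that
# minimises the reconstruction error, as in the steps above
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
embedding = lle.fit_transform(X)
print(lle.reconstruction_error_)
```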
Hadoop MapReduce (MR)
- MR distributes the data over multiple commodity machines
- MR moves computation rather than data, because analysing big data on a single machine takes a lot of time
- MR's data-parallel programming model hides the complexity of distribution and fault tolerance
- Writes intermediate result to disk
- Reminds me of bit count compression algorithm (keeping track of counts of inputs in map)
- Groups value pairs with the same key (the shuffle step)
Combiner (mini-reducer)
Goal is to aggregate the key-value pairs: ([apple,1],[apple,1]) -> ([apple,2])
- A combiner only aggregates the pairs from a single machine (unlike the reducer, which sees all values for a key; see the sketch below)
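A toy single-process simulation of the word-count flow (function names here are made up; real Hadoop jobs implement Mapper/Reducer classes, usually in Java):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Mapper: emit (word, 1) for every word
    return [(word, 1) for word in document.split()]

def combine(pairs):
    # Combiner: pre-aggregate pairs produced on ONE machine
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def shuffle(pairs):
    # Shuffle: group values with the same key across all machines
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: final aggregation per key
    return {key: sum(values) for key, values in grouped.items()}

# Two "machines", each holding one split of the data
splits = ["apple banana apple", "banana apple cherry"]
combined = [combine(map_phase(split)) for split in splits]
result = reduce_phase(shuffle(chain.from_iterable(combined)))
print(result)  # {'apple': 3, 'banana': 2, 'cherry': 1}
```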
Apache Spark
- Cluster computing framework
- In memory computing (doesn't write intermediate result to disk)
- Much faster than Hadoop
- APIs for Scala, Java, Python, SQL, and R (see the PySpark sketch below)
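A minimal PySpark word count, assuming pyspark is installed and input.txt is a stand-in path:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

# reduceByKey aggregates locally per partition first (combiner-like),
# then across partitions; intermediate results stay in memory
counts = (sc.textFile("input.txt")  # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.collect())
sc.stop()
```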
Architecture
- Master/slave architecture
- Single driver (SparkContext)
- Cluster manager communicates between the driver and the workers
- Resource manager
- And any number of workers
- Each worker has an executor
- Which performs tasks, interacts with storage systems, and stores computation results (in memory)
spark-submit
./bin/spark-submit --master <url> --deploy-mode <mode> app.py [args]
Deploy modes:
- Cluster mode
- Launches the driver inside one of the worker nodes (example below)
- Client mode (default)
- Launches the driver locally as an external client
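For example, a hypothetical cluster-mode submission (host, port, and file names are made up):
./bin/spark-submit --master spark://203.0.113.7:7077 --deploy-mode cluster wordcount.py input.txt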