Search notes:
Apache Spark
Spark is a fast, general-purpose cluster computing system for Big Data.
High-level APIs in Scala, Java, Python, and R.
In contrast to Hadoop MapReduce, which writes intermediate results to disk, Spark keeps working data in memory. This is especially useful for iterative workloads such as machine learning algorithms.
Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the Cloud.
Spark accesses various data sources.
Originally developed in 2009 in UC Berkeley's AMPLab.
Fully open sourced in 2010 - now a top-level project at the Apache Software Foundation.
Documentation
http://spark.apache.org/documentation.html
https://cwiki.apache.org/confluence/display/SPARK
Concepts
RDD (Resilient Distributed Dataset) - an immutable, partitioned collection of records
Transformations - lazy operations that define a new RDD (e.g. map, filter)
Actions - operations that trigger execution and return a result (e.g. count, collect)
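The key point is that transformations are lazy: they only describe a computation, and nothing runs until an action forces evaluation. A rough sketch of this idea using plain Python generators (an analogy only, no Spark installation needed; the PySpark calls in the comments are just reference points):

```python
# Lazy "transformations" vs. eager "actions", illustrated with
# plain Python generators - an analogy, not actual Spark code.

data = range(1, 6)  # stand-in for a distributed dataset

# "Transformations": build a lazy pipeline; nothing is computed yet.
squared = (x * x for x in data)             # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)  # like .filter(lambda x: x % 2 == 0)

# "Action": forces evaluation of the whole pipeline at once.
result = sum(evens)                         # like .sum()
print(result)  # 4 + 16 = 20
```

In real Spark the same laziness lets the scheduler see the whole pipeline before running it, so it can pipeline steps and recompute lost partitions from the lineage of transformations.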
TODOs
MLlib: machine learning
GraphX: graph processing
Spark Streaming
See also
SQL Server: Big Data Clusters
Microsoft offers managed Spark through Azure HDInsight.
Compare also with Azure Databricks and Azure Synapse Analytics.
Links
http://spark.apache.org
http://github.com/apache/spark