2.1 Local Mode

Local mode runs Spark on a single machine.

It is normally used for testing; production environments do not use Local mode.

2.1.1 Unpack the Spark Package

Upload the package to /opt/software/ and unpack it into /opt/module/:

tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/module

Then, in /opt/module, copy the unpacked directory and name the copy spark-local:

cp -r spark-2.1.1-bin-hadoop2.7 spark-local

2.1.2 Run the Official `PI` Example

From the spark-local directory, run:

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[2] \
./examples/jars/spark-examples_2.11-2.1.1.jar 100

Note:

If your shell is zsh, put local[2] in quotes: 'local[2]'. zsh treats the square brackets as a filename pattern.
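
The reason for the quoting can be seen without Spark. In the sketch below, the scratch directory and the file name local2 are made up for the demonstration. Square brackets form a filename pattern, so an unquoted local[2] may be rewritten by the shell before spark-submit ever sees it. When nothing matches, bash passes the text through unchanged, whereas zsh by default stops with a "no matches found" error, which is why zsh users run into the problem.

```shell
# Scratch directory so the demonstration does not touch real files.
tmpdir=$(mktemp -d)
# A file whose name happens to match the pattern local[2].
touch "$tmpdir/local2"

# Unquoted: the shell expands the pattern to the matching file name.
unquoted=$(cd "$tmpdir" && echo local[2])
# Quoted: the text reaches the command exactly as written.
quoted=$(cd "$tmpdir" && echo 'local[2]')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -r "$tmpdir"
```

Quoting is harmless in every shell, so writing --master 'local[2]' everywhere is a reasonable habit.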

Explanation:

Use spark-submit to submit an application.
Syntax:
```
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
```
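- --master the master URL. Defaults to local[*], which runs on the local machine.
- --class the entry point of your application (e.g. org.apache.spark.examples.SparkPi)
- --deploy-mode whether to deploy your driver on a worker node (cluster mode) or locally as an external client (client mode) (default: client)
- --conf: an arbitrary Spark configuration property in key=value format. If the value contains spaces, wrap it in quotes: "key=value"
- application-jar: the packaged application jar, including its dependencies. The URL must be globally visible inside the cluster, for example an hdfs:// path on shared storage. With a file:// path, every node must have the same jar at that path
- application-arguments: arguments passed to the main() method
- --executor-memory 1G gives each executor 1G of memory
- --total-executor-cores 6 sets the number of CPU cores used by all executors together to 6
- --executor-cores the number of CPU cores used by each executor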
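
As a rough illustration of how these options fit together, the sketch below assembles a spark-submit command line and prints it instead of running it, so it can be tried on a machine without Spark. The master URL spark://hadoop201:7077 and the value 2 for --executor-cores are assumptions chosen for the example; the resource options mainly matter on a cluster rather than in Local mode.

```shell
# Dry run: collect the arguments, then print the command instead of executing it.
set -- \
  --class org.apache.spark.examples.SparkPi \
  --master spark://hadoop201:7077 \
  --deploy-mode client \
  --executor-memory 1G \
  --total-executor-cores 6 \
  --executor-cores 2 \
  ./examples/jars/spark-examples_2.11-2.1.1.jar \
  100

argc=$#                        # number of arguments that would be passed
cmd="bin/spark-submit $*"      # the full command line as one string
echo "$cmd"
```

To submit for real, run bin/spark-submit "$@" from the Spark directory in place of the final echo.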

About Master URLs

| Master URL | Meaning |
| --- | --- |
| `local` | Run Spark locally with one worker thread (i.e. no parallelism at all). |
| `local[K]` | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). |
| `local[*]` | Run Spark locally with as many worker threads as logical cores on your machine. |
| `spark://HOST:PORT` | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. |
| `mesos://HOST:PORT` | Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use `mesos://zk://...`. To submit with `--deploy-mode cluster`, the HOST:PORT should be configured to connect to the MesosClusterDispatcher. |
| `yarn` | Connect to a YARN cluster in `client` or `cluster` mode depending on the value of `--deploy-mode`. The cluster location will be found based on the `HADOOP_CONF_DIR` or `YARN_CONF_DIR` variable. |

Result

The example estimates PI with the Monte Carlo method.
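
For readers unfamiliar with the method, here is a minimal sketch of the Monte Carlo idea, written with awk rather than Spark. It is not the SparkPi source; the sample count 100000 and the seed 42 are arbitrary choices. Random points are drawn in the unit square, and the fraction that lands inside the quarter circle of radius 1 approximates PI/4.

```shell
# Estimate PI by sampling random points in the unit square and counting
# how many fall inside the quarter circle of radius 1.
estimate=$(awk -v n=100000 -v seed=42 'BEGIN {
  srand(seed)
  inside = 0
  for (i = 0; i < n; i++) {
    x = rand()
    y = rand()
    if (x * x + y * y <= 1) inside++
  }
  # inside / n approximates PI / 4, so scale by 4.
  printf "%.4f\n", 4 * inside / n
}')
echo "PI is roughly $estimate"
```

The estimate improves slowly as n grows, which is why the work is worth spreading over several threads or machines.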

Remark: the example can also be run with `run-example`:

bin/run-example SparkPi 100

2.1.3 Using spark-shell

spark-shell is the interactive command window that Spark provides (similar to the Scala REPL).

This example uses Spark in spark-shell to count how often each word occurs in a set of files.

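Step 1: Create two text files

mkdir input
cd input
touch 1.txt
touch 2.txt
cd ..

Enter some space-separated words into 1.txt and 2.txt. The final cd .. returns to the spark-local directory, so that bin/spark-shell and the relative path input/ in the later steps resolve.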
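
If you just want something to count, the two files can be filled from the command line. The words below are invented sample data, and the commands assume the current directory is spark-local.

```shell
# Create the input directory if it is missing, then write two short files.
mkdir -p input
printf 'hello spark\nhello world\n' > input/1.txt
printf 'hello scala\nspark shell\n' > input/2.txt

# Show what was written.
cat input/1.txt input/2.txt
```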

Step 2: Start spark-shell

bin/spark-shell

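Step 3: Check the process and view the running application in the web UI

Address: http://hadoop201:4040 (hadoop201 is the host on which spark-shell is running)

Step 4: Run the `wordcount` program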

sc.textFile("input/").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect

Step 5: Open `hadoop201:4040` to see how the job ran

2.1.4 Submission Flow

Simplified general Spark execution flow

2.1.5 wordcount Data Flow Analysis

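- `textFile("input")`: reads the data in the local input directory;
- `flatMap(_.split(" "))`: a flattening operation that splits each line on spaces and maps it to individual words;
- `map((_,1))`: operates on every element, turning each word into a (word, 1) tuple;
- `reduceByKey(_+_)`: aggregates the values by key, adding them together;
- `collect`: gathers the data to the Driver for display.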
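
The same data flow can be imitated with ordinary command-line tools, which may help in seeing what each stage contributes. This is only an analogy, not how Spark executes the job: the scratch directory and the sample words are made up, and each pipeline stage is annotated with the Spark call it stands in for.

```shell
# Self-contained sample input in a scratch directory.
workdir=$(mktemp -d)
printf 'hello spark\nhello world\n' > "$workdir/1.txt"
printf 'hello scala\nspark shell\n' > "$workdir/2.txt"

cat "$workdir"/*.txt |   # textFile: read every line of every file
  tr -s ' ' '\n' |       # flatMap(_.split(" ")): one word per line
  sed 's/$/ 1/' |        # map((_, 1)): pair each word with a 1
  awk '{ sum[$1] += $2 } END { for (w in sum) print w, sum[w] }' |  # reduceByKey(_ + _)
  sort > "$workdir/counts.txt"   # fixed order for display; collect itself does not sort

# collect: bring the final pairs back for display.
counts=$(cat "$workdir/counts.txt")
echo "$counts"
rm -r "$workdir"
```

With these sample files, hello is counted three times and spark twice, while every other word appears once.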
