Spark Commands


Contents

  1. spark-submit
    1.1 Syntax
    1.2 Options
    1.3 Common needs
      a. Logging
      b. JDK
      c. JVM options
      d. Classes
    1.4 Examples
      a. Submitting a Spark job
      b. Submitting a job as a specific user
      c. Checking the version

  2. spark-shell
    2.1 Syntax
    2.2 Options
    2.3 REPL commands
    2.4 Parameters
    2.5 Common needs

  3. spark-sql

 

1. spark-submit

  1. Syntax

  spark-submit [options] <application jar> [arguments to main]
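
  For instance, a minimal invocation might look like the following (the class name, jar path and arguments are placeholders). Everything before the jar is an option for spark-submit itself; everything after the jar is passed to the application's main method:

  spark-submit --master yarn --deploy-mode cluster --class com.example.Main /path/to/app.jar arg1 arg2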

  2. Options

--master
  local, standalone, or yarn (yarn-client / yarn-cluster)
  1. local: run locally
  2. standalone: Spark's own cluster manager; it also has a master and slaves, but it can only run Spark
  3. yarn: run on YARN
    a. yarn-client: for testing. The Driver runs on the local client, which makes debugging easy, but the Driver has no fault tolerance: once it dies, the job is gone. Equivalent to --master yarn --deploy-mode client, in which case deploy-mode need not be given separately.
    b. yarn-cluster: for production. The Driver runs inside YARN, and YARN restarts it if it dies, so there is fault tolerance, but running inside the cluster makes debugging harder. Equivalent to --master yarn --deploy-mode cluster, in which case deploy-mode need not be given separately.
  Examples: --master spark://datascienceresearch-01:7077
            --master yarn    // submit to YARN
--deploy-mode
  client or cluster
  1. client: the Driver runs on the machine that submits the job. If this is set, --master yarn must also be set.
  2. cluster: the Driver runs inside the ApplicationMaster on one of the cluster's worker nodes. If this is set, --master yarn must also be set.
--executor-memory
  Maximum memory per executor; it cannot exceed the usable memory of a single node.
  Example: --executor-memory 15G
--executor-cores
  Number of concurrent task threads per executor, i.e. the maximum number of tasks each executor can run in parallel.
  Example: --executor-cores 4
--driver-memory
  Memory used by the driver; it cannot exceed the usable memory of a single node.
  Example: --driver-memory 4G
--num-executors
  Number of executors to create.
  Example: --num-executors 2
--driver-java-options
  Extra JVM options passed to the driver.
--files
  Files (e.g. config files) to distribute to the working directory of each executor.
--jars
  Extra jars to put on the classpath; separate multiple jars with commas.
--class
  The application's main class.
  Example: --class com.GroupMain
--name
  Application name.
  Example: --name sparkpi
--queue
  Queue to submit to.
  Example: --queue spark
--version
  Show the version.
--conf
  Set a configuration property, for example:
  spark.default.parallelism               --conf spark.default.parallelism=1000
  spark.storage.memoryFraction            --conf spark.storage.memoryFraction=0.2
  spark.shuffle.memoryFraction            --conf spark.shuffle.memoryFraction=0.2
  spark.executor.memory                   --conf spark.executor.memory=3G
  spark.dynamicAllocation.maxExecutors    --conf spark.dynamicAllocation.maxExecutors=2
  spark.locality.wait.node                --conf spark.locality.wait.node=0
  spark.executor.userClassPathFirst       --conf spark.executor.userClassPathFirst=true
  spark.driver.userClassPathFirst         --conf spark.driver.userClassPathFirst=true
  spark.yarn.appMasterEnv.JAVA_HOME       --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/java/jdk1.8.0_162
  spark.executorEnv.JAVA_HOME             --conf spark.executorEnv.JAVA_HOME=/usr/java/jdk1.8.0_162
  3. Common needs
    a. Logging
    Set the driver's log4j configuration; the two forms below are equivalent (a combined submit sketch follows after item d):
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties"
    --driver-java-options "-Dlog4j.configuration=file:/absolute/path/to/your/log4j.properties"
    Set the executors' log4j configuration:
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties"
    b. JDK
    Set the JDK used by the driver (the YARN ApplicationMaster):
    --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/java/jdk1.8.0_162
    Set the JDK used by the executors:
    --conf spark.executorEnv.JAVA_HOME=/usr/java/jdk1.8.0_162
    c. JVM options
    // PermGen size (JDK 7 and earlier)
    --conf "spark.driver.extraJavaOptions=-XX:PermSize=128M -XX:MaxPermSize=256M"
    d. Classes
    // See which classes get loaded
    --driver-java-options "-verbose:class" 
    --conf "spark.executor.extraJavaOptions=-verbose:class"
  4. Examples

  a. Submitting a Spark job

spark-submit --class cn.ce.data.dim.LoadToDimEmployee --master yarn --deploy-mode cluster --files /etc/hive/conf/hive-site.xml hdfs://bigdatacluster/task/spark/taurus-1.0-SNAPSHOT.jar

/usr/bin/spark-submit --master yarn --deploy-mode client --queue adx --class com.opera.adx.job.Stat --conf spark.default.parallelism=160 --executor-cores 4 --num-executors 10 --executor-memory 4G --driver-memory 1G --jars hdfs://<ha-address>/lib/zero-allocation-hashing-0.8.jar,hdfs://<ha-address>/lib/twirl-api_2.11-1.1.1.jar,hdfs://<ha-address>/lib/tispark-core-1.2-jar-with-dependencies.jar ./adx_stat_2.11-0.3-SNAPSHOT.jar adx_request 20190619 day official -1

  b. Submitting a job as a specific user

Method 1: sudo -su username spark-submit --class com.MyClass

Method 2: pass --proxy-user xxx (see the sketch below)

Method 3: first su - username, then run spark-submit --class com.MyClass
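
A sketch of method 2 (the user name, class and jar are placeholders; impersonation must also be allowed by Hadoop's hadoop.proxyuser.* settings):

spark-submit --master yarn --deploy-mode cluster --proxy-user etl_user --class com.MyClass /path/to/app.jar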

  c. Checking the version

spark-submit --version

2. spark-shell

  1. Syntax

  spark-shell [options]

  2. Options

Parameter  Description  Example
--master  local or yarn. When run on yarn the applicationId is printed, and the Spark UI can be reached through that applicationId.  --master yarn
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client)
--class CLASS_NAME  Your application's main class (for Java / Scala apps)
--jars Third-party jars to add to the classpath; separate multiple jars with commas
--packages  Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts
--name NAME A name of your application
--repositories Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps
--files FILES Comma-separated list of files to be placed in the working directory of each executor
--conf PROP=VALUE Arbitrary Spark configuration property
--properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf
--driver-memory MEM  Memory for driver (e.g. 1000M, 2G) (Default: 1024M)
--driver-java-options Extra Java options to pass to the driver
--driver-library-path Extra library path entries to pass to the driver
--driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G)
--proxy-user NAME  User to impersonate when submitting the application. This argument does not work with --principal / --keytab.
--help, -h  Show this help message and exit.
--verbose, -v Print additional debug output.
--version Show the version
 Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1)
Spark standalone or Mesos with cluster deploy mode only:
--supervise    If given, restarts the driver on failure.
 --kill SUBMISSION_ID  If given, kills the driver specified
 --status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM   Total cores for all executors
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default")
--num-executors NUM  Number of executors to launch (Default: 2).  If dynamic allocation is enabled, the initial number of executors will be at least NUM.
--archives ARCHIVES  Comma separated list of archives to be extracted into the working directory of each executor
--principal PRINCIPAL Principal to be used to login to KDC, while running on secure HDFS
--keytab KEYTAB  The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically
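
For example, a typical interactive session on YARN might be launched like this (queue name and jar path are placeholders):

spark-shell --master yarn --queue default --num-executors 4 --executor-memory 4G --executor-cores 2 --jars /path/to/extra.jar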

  3. REPL commands

  Once spark-shell [options] has started, the following commands can be used inside the REPL (command, description):
:edit edit history
:help [command] print this summary or command-specific help
:history [num] show the history (optional num is commands to show)
:h? search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v] show the implicits in scope
:javap <path|class> disassemble a file or class name
:line place line(s) at the end of history
:load interpret lines in a file
:paste [-raw] [path] enter paste mode or paste a file
:power enable power user mode
:quit exit the interpreter
:replay [options] reset the repl and replay all previous commands
:require add a jar to the classpath
:reset [options] reset the repl to its initial state, forgetting all session entries
:save save replayable session to a file
:sh run a shell command (result is implicitly => List[String])
:settings update compiler options, if possible; see reset
:silent disable/enable automatic printing of results
:type [-v] display the type of an expression without evaluating it
:kind [-v] display the kind of expression's type
:warnings show the suppressed warnings from the most recent line which had any
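
For instance (the file path is a placeholder):

// print the type of an expression without evaluating it
:type spark.version
// interpret every line of a file in the current session
:load /path/to/init.scala
// leave the shell
:quit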
   

  4. Parameters

  List the available parameters with spark.sql("set -v").show(1000, truncate = false). They can be set as follows:

// Spark 2.0+
spark.conf.set("spark.executor.memory","4g")
// Spark < 2.0
import org.apache.spark.{SparkContext, SparkConf}
sc.stop()
val conf = new SparkConf().set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
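
To check what a session is actually using, the Spark 2.0+ RuntimeConfig API can be queried, e.g.:

// list every property explicitly set in this session
spark.conf.getAll.foreach(println)
// read a single property (SQL properties such as this one fall back to their built-in default)
spark.conf.get("spark.sql.shuffle.partitions")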

  5. Common needs

  -- Check the version

spark-shell --version

  -- Change the log level to DEBUG
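
Inside spark-shell this can be done for the current session without editing log4j.properties:

// valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
sc.setLogLevel("DEBUG")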

3. spark-sql

 

 

