---
title: Apache Spark and Apache Hadoop
titleSuffix: Configure Apache Spark and Apache Hadoop in Big Data Clusters
description: SQL Server Big Data Clusters allow Spark and HDFS solutions. Learn how to configure them.
author: rajmera3
ms.author: raajmera
ms.reviewer: mikeray
ms.date: 02/13/2020
ms.topic: conceptual
ms.prod: sql
ms.technology: big-data-cluster
---

# Configure Apache Spark and Apache Hadoop in Big Data Clusters

To configure Apache Spark and Apache Hadoop in Big Data Clusters, modify the cluster profile at deployment time.

## Supported Configurations

There are currently four configuration categories:

- `sql`
- `hdfs`
- `spark`
- `gateway`

Big Data Clusters define three services: `hdfs`, `spark`, and `sql`. Each service maps to the configuration category of the same name, and all gateway configurations go to the `gateway` category.

For example, all configurations in the `hdfs` service belong to the `hdfs` category. Note that all Hadoop (core-site), HDFS, and ZooKeeper configurations belong to the `hdfs` category, while all Livy, Spark, Yarn, and Hive metastore configurations belong to the `spark` category.

You can find all possible configurations for each category in the associated Apache documentation.
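As an illustration of the `sub-category.setting` naming, the following sketch of a profile patch sets a Spark property at the service level. The property `spark.driver.memory` and its value are illustrative choices, not taken from this article:

```json
{
  "op": "add",
  "path": "spec.services.spark.settings",
  "value": {
    "spark-defaults-conf.spark.driver.memory": "2g"
  }
}
```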

## Unsupported Configurations

The following configurations are unsupported and cannot be changed in the context of the Big Data Cluster.

| Category | Sub-category | File | Unsupported configurations |
|----------|--------------|------|----------------------------|
| **spark** | yarn-site | yarn-site.xml | yarn.log-aggregation-enable<br/>yarn.log.server.url<br/>yarn.nodemanager.pmem-check-enabled<br/>yarn.nodemanager.vmem-check-enabled<br/>yarn.nodemanager.aux-services<br/>yarn.resourcemanager.address<br/>yarn.nodemanager.address<br/>yarn.client.failover-no-ha-proxy-provider<br/>yarn.client.failover-proxy-provider<br/>yarn.http.policy<br/>yarn.nodemanager.linux-container-executor.secure-mode.use-pool-user<br/>yarn.nodemanager.linux-container-executor.secure-mode.pool-user-prefix<br/>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user<br/>yarn.acl.enable<br/>yarn.admin.acl<br/>yarn.resourcemanager.hostname<br/>yarn.resourcemanager.principal<br/>yarn.resourcemanager.keytab<br/>yarn.resourcemanager.webapp.spnego-keytab-file<br/>yarn.resourcemanager.webapp.spnego-principal<br/>yarn.nodemanager.principal<br/>yarn.nodemanager.keytab<br/>yarn.nodemanager.webapp.spnego-keytab-file<br/>yarn.nodemanager.webapp.spnego-principal<br/>yarn.resourcemanager.ha.enabled<br/>yarn.resourcemanager.cluster-id<br/>yarn.resourcemanager.zk-address<br/>yarn.resourcemanager.ha.rm-ids<br/>yarn.resourcemanager.hostname.* |
| | capacity-scheduler | capacity-scheduler.xml | yarn.scheduler.capacity.root.acl_submit_applications<br/>yarn.scheduler.capacity.root.acl_administer_queue<br/>yarn.scheduler.capacity.root.default.acl_application_max_priority |
| | yarn-env | yarn-env.sh | |
| | spark-defaults-conf | spark-defaults.conf | spark.yarn.archive<br/>spark.yarn.historyServer.address<br/>spark.eventLog.enabled<br/>spark.eventLog.dir<br/>spark.sql.warehouse.dir<br/>spark.sql.hive.metastore.version<br/>spark.sql.hive.metastore.jars<br/>spark.extraListeners<br/>spark.metrics.conf<br/>spark.ssl.enabled<br/>spark.authenticate<br/>spark.network.crypto.enabled<br/>spark.ssl.keyStore<br/>spark.ssl.keyStorePassword<br/>spark.ui.enabled |
| | spark-env | spark-env.sh | SPARK_NO_DAEMONIZE<br/>SPARK_DIST_CLASSPATH |
| | spark-history-server-conf | spark-history-server.conf | spark.history.fs.logDirectory<br/>spark.ui.proxyBase<br/>spark.history.fs.cleaner.enabled<br/>spark.ssl.enabled<br/>spark.authenticate<br/>spark.network.crypto.enabled<br/>spark.ssl.keyStore<br/>spark.ssl.keyStorePassword<br/>spark.history.kerberos.enabled<br/>spark.history.kerberos.principal<br/>spark.history.kerberos.keytab<br/>spark.ui.filters<br/>spark.acls.enable<br/>spark.history.ui.acls.enable<br/>spark.history.ui.admin.acls<br/>spark.history.ui.admin.acls.groups |
| | livy-conf | livy.conf | livy.keystore<br/>livy.keystore.password<br/>livy.spark.master<br/>livy.spark.deploy-mode<br/>livy.rsc.jars<br/>livy.repl.jars<br/>livy.rsc.pyspark.archives<br/>livy.rsc.sparkr.package<br/>livy.repl.enable-hive-context<br/>livy.superusers<br/>livy.server.auth.type<br/>livy.server.launch.kerberos.keytab<br/>livy.server.launch.kerberos.principal<br/>livy.server.auth.kerberos.principal<br/>livy.server.auth.kerberos.keytab<br/>livy.impersonation.enabled<br/>livy.server.access-control.enabled<br/>livy.server.access-control.* |
| | livy-env | livy-env.sh | LIVY_SERVER_JAVA_OPTS |
| | hive-site | hive-site.xml | javax.jdo.option.ConnectionURL<br/>javax.jdo.option.ConnectionDriverName<br/>javax.jdo.option.ConnectionUserName<br/>javax.jdo.option.ConnectionPassword<br/>hive.metastore.uris<br/>hive.metastore.pre.event.listeners<br/>hive.security.authorization.enabled<br/>hive.security.metastore.authenticator.manager<br/>hive.security.metastore.authorization.manager<br/>hive.metastore.use.SSL<br/>hive.metastore.keystore.path<br/>hive.metastore.keystore.password<br/>hive.metastore.truststore.path<br/>hive.metastore.truststore.password<br/>hive.metastore.kerberos.keytab.file<br/>hive.metastore.kerberos.principal<br/>hive.metastore.sasl.enabled<br/>hive.metastore.execute.setugi<br/>hive.cluster.delegation.token.store.class |
| | hive-env | hive-env.sh | |
| **hdfs** | core-site | core-site.xml | fs.defaultFS<br/>ha.zookeeper.quorum<br/>hadoop.tmp.dir<br/>hadoop.rpc.protection<br/>hadoop.security.auth_to_local<br/>hadoop.security.authentication<br/>hadoop.security.authorization<br/>hadoop.http.authentication.simple.anonymous.allowed<br/>hadoop.http.authentication.type<br/>hadoop.http.authentication.kerberos.principal<br/>hadoop.http.authentication.kerberos.keytab<br/>hadoop.http.filter.initializers<br/>hadoop.security.group.mapping.* |
| | hadoop-env | hadoop-env.sh | JAVA_HOME<br/>HADOOP_CLASSPATH |
| | mapred-env | mapred-env.sh | |
| | hdfs-site | hdfs-site.xml | dfs.namenode.name.dir<br/>dfs.datanode.data.dir<br/>dfs.namenode.acls.enabled<br/>dfs.namenode.datanode.registration.ip-hostname-check<br/>dfs.client.retry.policy.enabled<br/>dfs.permissions.enabled<br/>dfs.nameservices<br/>dfs.ha.namenodes.nmnode-0<br/>dfs.namenode.rpc-address.nmnode-0.*<br/>dfs.namenode.shared.edits.dir<br/>dfs.ha.automatic-failover.enabled<br/>dfs.ha.fencing.methods<br/>dfs.journalnode.edits.dir<br/>dfs.client.failover.proxy.provider.nmnode-0<br/>dfs.namenode.http-address<br/>dfs.namenode.httpS-address<br/>dfs.http.policy<br/>dfs.encrypt.data.transfer<br/>dfs.block.access.token.enable<br/>dfs.data.transfer.protection<br/>dfs.encrypt.data.transfer.cipher.suites<br/>dfs.https.port<br/>dfs.namenode.keytab.file<br/>dfs.namenode.kerberos.principal<br/>dfs.namenode.kerberos.internal.spnego.principal<br/>dfs.datanode.data.dir.perm<br/>dfs.datanode.address<br/>dfs.datanode.http.address<br/>dfs.datanode.ipc.address<br/>dfs.datanode.https.address<br/>dfs.datanode.keytab.file<br/>dfs.datanode.kerberos.principal<br/>dfs.journalnode.keytab.file<br/>dfs.journalnode.kerberos.principal<br/>dfs.journalnode.kerberos.internal.spnego.principal<br/>dfs.web.authentication.kerberos.keytab<br/>dfs.web.authentication.kerberos.principal<br/>dfs.webhdfs.enabled<br/>dfs.permissions.superusergroup |
| | hdfs-env | hdfs-env.sh | HADOOP_HEAPSIZE_MAX |
| | zoo-cfg | zoo.cfg | secureClientPort<br/>clientPort<br/>dataDir<br/>dataLogDir<br/>4lw.commands.whitelist |
| | zookeeper-java-env | java.env | ZK_LOG_DIR<br/>SERVER_JVMFLAGS |
| | zookeeper-log4j-properties | log4j.properties (zookeeper) | log4j.rootLogger<br/>log4j.appender.CONSOLE.* |
| **gateway** | gateway-site | gateway-site.xml | gateway.port<br/>gateway.path<br/>gateway.gateway.conf.dir<br/>gateway.hadoop.kerberos.secured<br/>java.security.krb5.conf<br/>java.security.auth.login.config<br/>gateway.websocket.feature.enabled<br/>gateway.scope.cookies.feature.enabled<br/>ssl.exclude.protocols<br/>ssl.include.ciphers |

## Configurations via Cluster Profile

The cluster profile contains resources and services. At deployment time, you can specify configurations in one of two ways:

- First, at the resource level.

  The following examples are patch files for the profile:

  ```json
  {
    "op": "add",
    "path": "spec.resources.zookeeper.spec.settings",
    "value": {
      "hdfs": {
        "zoo-cfg.syncLimit": "6"
      }
    }
  }
  ```

  Or:

  ```json
  {
    "op": "add",
    "path": "spec.resources.gateway.spec.settings",
    "value": {
      "gateway": {
        "gateway-site.gateway.httpclient.socketTimeout": "95s"
      }
    }
  }
  ```
- Second, at the service level. Assign multiple resources to a service, and specify configurations for the service.

  The following is an example of a patch file for the profile:

  ```json
  {
    "op": "add",
    "path": "spec.services.hdfs.settings",
    "value": {
      "core-site.hadoop.proxyuser.xyz.users": "*"
    }
  }
  ```

The `hdfs` service is defined as:

```json
{
  "spec": {
    "services": {
      "hdfs": {
        "resources": [
          "nmnode-0",
          "zookeeper",
          "storage-0",
          "sparkhead"
        ],
        "settings": {
          "hdfs-site.dfs.replication": "3"
        }
      }
    }
  }
}
```

> [!NOTE]
> Resource-level configurations override service-level configurations. One resource can be assigned to multiple services.
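For example, if the `hdfs` service sets a replication factor and a resource assigned to that service (here `storage-0`) sets the same key, the resource-level value takes effect for that resource. The replication values in this sketch are illustrative:

```json
{
  "op": "add",
  "path": "spec.services.hdfs.settings",
  "value": {
    "hdfs-site.dfs.replication": "3"
  }
}
```

```json
{
  "op": "add",
  "path": "spec.resources.storage-0.spec.settings",
  "value": {
    "hdfs": {
      "hdfs-site.dfs.replication": "2"
    }
  }
}
```

With the override rule above, `storage-0` uses `2` while the other resources of the `hdfs` service keep the service-level value `3`.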

## Limitations

Configurations can only be specified at the category level. To specify multiple configurations under the same sub-category, the common prefix cannot be extracted in the cluster profile. For example, the following patch is not supported:

```json
{
  "op": "add",
  "path": "spec.services.hdfs.settings.core-site.hadoop",
  "value": {
    "proxyuser.xyz.users": "*",
    "proxyuser.abc.users": "*"
  }
}
```
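Instead, specify each setting with its full `sub-category.setting` key at the category level, as in the service-level example earlier. This sketch restates the two proxyuser settings above in the supported form:

```json
{
  "op": "add",
  "path": "spec.services.hdfs.settings",
  "value": {
    "core-site.hadoop.proxyuser.xyz.users": "*",
    "core-site.hadoop.proxyuser.abc.users": "*"
  }
}
```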

## Next steps