---
title: Apache Spark and Apache Hadoop
titleSuffix: Configure Apache Spark and Apache Hadoop in Big Data Clusters
description: SQL Server Big Data Clusters allow Spark and HDFS solutions. Learn how to configure them.
author: rajmera3
ms.author: raajmera
ms.reviewer: mikeray
ms.date: 02/13/2020
ms.topic: conceptual
ms.prod: sql
ms.technology: big-data-cluster
---

# Configure Apache Spark and Apache Hadoop in Big Data Clusters

To configure Apache Spark and Apache Hadoop in Big Data Clusters, modify the cluster profile at deployment time.

## Supported Configurations

There are currently four configuration categories:

- `sql`
- `hdfs`
- `spark`
- `gateway`

Big Data Clusters define three services: `hdfs`, `spark`, and `sql`. Each service maps to the configuration category of the same name, and all gateway configurations go to the `gateway` category.

For example, all configurations in the `hdfs` service belong to the `hdfs` category. Note that all Hadoop (core-site), HDFS, and ZooKeeper configurations belong to the `hdfs` category, while all Livy, Spark, Yarn, and Hive metastore configurations belong to the `spark` category.

You can find all possible configurations for each category in the associated Apache documentation.
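As an illustration of the `sub-category.setting` naming, the following sketch of a profile patch sets a Spark property at the service level. The property `spark.driver.memory` and its value are illustrative choices, not taken from this article:

```json
{
  "op": "add",
  "path": "spec.services.spark.settings",
  "value": {
    "spark-defaults-conf.spark.driver.memory": "2g"
  }
}
```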

## Unsupported Configurations

The following configurations are unsupported and cannot be changed in the context of the Big Data Cluster.

| Category | Sub-category | File | Unsupported configurations |
|----------|--------------|------|----------------------------|
| **spark** | yarn-site | yarn-site.xml | yarn.log-aggregation-enable<br/>yarn.log.server.url<br/>yarn.nodemanager.pmem-check-enabled<br/>yarn.nodemanager.vmem-check-enabled<br/>yarn.nodemanager.aux-services<br/>yarn.resourcemanager.address<br/>yarn.nodemanager.address<br/>yarn.client.failover-no-ha-proxy-provider<br/>yarn.client.failover-proxy-provider<br/>yarn.http.policy<br/>yarn.nodemanager.linux-container-executor.secure-mode.use-pool-user<br/>yarn.nodemanager.linux-container-executor.secure-mode.pool-user-prefix<br/>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user<br/>yarn.acl.enable<br/>yarn.admin.acl<br/>yarn.resourcemanager.hostname<br/>yarn.resourcemanager.principal<br/>yarn.resourcemanager.keytab<br/>yarn.resourcemanager.webapp.spnego-keytab-file<br/>yarn.resourcemanager.webapp.spnego-principal<br/>yarn.nodemanager.principal<br/>yarn.nodemanager.keytab<br/>yarn.nodemanager.webapp.spnego-keytab-file<br/>yarn.nodemanager.webapp.spnego-principal<br/>yarn.resourcemanager.ha.enabled<br/>yarn.resourcemanager.cluster-id<br/>yarn.resourcemanager.zk-address<br/>yarn.resourcemanager.ha.rm-ids<br/>yarn.resourcemanager.hostname.* |
| | capacity-scheduler | capacity-scheduler.xml | yarn.scheduler.capacity.root.acl_submit_applications<br/>yarn.scheduler.capacity.root.acl_administer_queue<br/>yarn.scheduler.capacity.root.default.acl_application_max_priority |
| | yarn-env | yarn-env.sh | |
| | spark-defaults-conf | spark-defaults.conf | spark.yarn.archive<br/>spark.yarn.historyServer.address<br/>spark.eventLog.enabled<br/>spark.eventLog.dir<br/>spark.sql.warehouse.dir<br/>spark.sql.hive.metastore.version<br/>spark.sql.hive.metastore.jars<br/>spark.extraListeners<br/>spark.metrics.conf<br/>spark.ssl.enabled<br/>spark.authenticate<br/>spark.network.crypto.enabled<br/>spark.ssl.keyStore<br/>spark.ssl.keyStorePassword<br/>spark.ui.enabled |
| | spark-env | spark-env.sh | SPARK_NO_DAEMONIZE<br/>SPARK_DIST_CLASSPATH |
| | spark-history-server-conf | spark-history-server.conf | spark.history.fs.logDirectory<br/>spark.ui.proxyBase<br/>spark.history.fs.cleaner.enabled<br/>spark.ssl.enabled<br/>spark.authenticate<br/>spark.network.crypto.enabled<br/>spark.ssl.keyStore<br/>spark.ssl.keyStorePassword<br/>spark.history.kerberos.enabled<br/>spark.history.kerberos.principal<br/>spark.history.kerberos.keytab<br/>spark.ui.filters<br/>spark.acls.enable<br/>spark.history.ui.acls.enable<br/>spark.history.ui.admin.acls<br/>spark.history.ui.admin.acls.groups |
| | livy-conf | livy.conf | livy.keystore<br/>livy.keystore.password<br/>livy.spark.master<br/>livy.spark.deploy-mode<br/>livy.rsc.jars<br/>livy.repl.jars<br/>livy.rsc.pyspark.archives<br/>livy.rsc.sparkr.package<br/>livy.repl.enable-hive-context<br/>livy.superusers<br/>livy.server.auth.type<br/>livy.server.launch.kerberos.keytab<br/>livy.server.launch.kerberos.principal<br/>livy.server.auth.kerberos.principal<br/>livy.server.auth.kerberos.keytab<br/>livy.impersonation.enabled<br/>livy.server.access-control.enabled<br/>livy.server.access-control.* |
| | livy-env | livy-env.sh | LIVY_SERVER_JAVA_OPTS |
| | hive-site | hive-site.xml | javax.jdo.option.ConnectionURL<br/>javax.jdo.option.ConnectionDriverName<br/>javax.jdo.option.ConnectionUserName<br/>javax.jdo.option.ConnectionPassword<br/>hive.metastore.uris<br/>hive.metastore.pre.event.listeners<br/>hive.security.authorization.enabled<br/>hive.security.metastore.authenticator.manager<br/>hive.security.metastore.authorization.manager<br/>hive.metastore.use.SSL<br/>hive.metastore.keystore.path<br/>hive.metastore.keystore.password<br/>hive.metastore.truststore.path<br/>hive.metastore.truststore.password<br/>hive.metastore.kerberos.keytab.file<br/>hive.metastore.kerberos.principal<br/>hive.metastore.sasl.enabled<br/>hive.metastore.execute.setugi<br/>hive.cluster.delegation.token.store.class |
| | hive-env | hive-env.sh | |
| **hdfs** | core-site | core-site.xml | fs.defaultFS<br/>ha.zookeeper.quorum<br/>hadoop.tmp.dir<br/>hadoop.rpc.protection<br/>hadoop.security.auth_to_local<br/>hadoop.security.authentication<br/>hadoop.security.authorization<br/>hadoop.http.authentication.simple.anonymous.allowed<br/>hadoop.http.authentication.type<br/>hadoop.http.authentication.kerberos.principal<br/>hadoop.http.authentication.kerberos.keytab<br/>hadoop.http.filter.initializers<br/>hadoop.security.group.mapping.* |
| | hadoop-env | hadoop-env.sh | JAVA_HOME<br/>HADOOP_CLASSPATH |
| | mapred-env | mapred-env.sh | |
| | hdfs-site | hdfs-site.xml | dfs.namenode.name.dir<br/>dfs.datanode.data.dir<br/>dfs.namenode.acls.enabled<br/>dfs.namenode.datanode.registration.ip-hostname-check<br/>dfs.client.retry.policy.enabled<br/>dfs.permissions.enabled<br/>dfs.nameservices<br/>dfs.ha.namenodes.nmnode-0<br/>dfs.namenode.rpc-address.nmnode-0.*<br/>dfs.namenode.shared.edits.dir<br/>dfs.ha.automatic-failover.enabled<br/>dfs.ha.fencing.methods<br/>dfs.journalnode.edits.dir<br/>dfs.client.failover.proxy.provider.nmnode-0<br/>dfs.namenode.http-address<br/>dfs.namenode.httpS-address<br/>dfs.http.policy<br/>dfs.encrypt.data.transfer<br/>dfs.block.access.token.enable<br/>dfs.data.transfer.protection<br/>dfs.encrypt.data.transfer.cipher.suites<br/>dfs.https.port<br/>dfs.namenode.keytab.file<br/>dfs.namenode.kerberos.principal<br/>dfs.namenode.kerberos.internal.spnego.principal<br/>dfs.datanode.data.dir.perm<br/>dfs.datanode.address<br/>dfs.datanode.http.address<br/>dfs.datanode.ipc.address<br/>dfs.datanode.https.address<br/>dfs.datanode.keytab.file<br/>dfs.datanode.kerberos.principal<br/>dfs.journalnode.keytab.file<br/>dfs.journalnode.kerberos.principal<br/>dfs.journalnode.kerberos.internal.spnego.principal<br/>dfs.web.authentication.kerberos.keytab<br/>dfs.web.authentication.kerberos.principal<br/>dfs.webhdfs.enabled<br/>dfs.permissions.superusergroup |
| | hdfs-env | hdfs-env.sh | HADOOP_HEAPSIZE_MAX |
| | zoo-cfg | zoo.cfg | secureClientPort<br/>clientPort<br/>dataDir<br/>dataLogDir<br/>4lw.commands.whitelist |
| | zookeeper-java-env | java.env | ZK_LOG_DIR<br/>SERVER_JVMFLAGS |
| | zookeeper-log4j-properties | log4j.properties (zookeeper) | log4j.rootLogger<br/>log4j.appender.CONSOLE.* |
| **gateway** | gateway-site | gateway-site.xml | gateway.port<br/>gateway.path<br/>gateway.gateway.conf.dir<br/>gateway.hadoop.kerberos.secured<br/>java.security.krb5.conf<br/>java.security.auth.login.config<br/>gateway.websocket.feature.enabled<br/>gateway.scope.cookies.feature.enabled<br/>ssl.exclude.protocols<br/>ssl.include.ciphers |

## Configurations via Cluster Profile

The cluster profile contains resources and services. At deployment time, you can specify configurations in one of two ways:

- First, at the resource level.

  The following examples are patch files for the profile:

  ```json
  {
    "op": "add",
    "path": "spec.resources.zookeeper.spec.settings",
    "value": {
      "hdfs": {
        "zoo-cfg.syncLimit": "6"
      }
    }
  }
  ```

  Or:

  ```json
  {
    "op": "add",
    "path": "spec.resources.gateway.spec.settings",
    "value": {
      "gateway": {
        "gateway-site.gateway.httpclient.socketTimeout": "95s"
      }
    }
  }
  ```
- Second, at the service level. Assign multiple resources to a service, and specify configurations for the service.

  The following is an example of a patch file for the profile:

  ```json
  {
    "op": "add",
    "path": "spec.services.hdfs.settings",
    "value": {
      "core-site.hadoop.proxyuser.xyz.users": "*"
    }
  }
  ```

The `hdfs` service is defined as:

```json
{
  "spec": {
    "services": {
      "hdfs": {
        "resources": [
          "nmnode-0",
          "zookeeper",
          "storage-0",
          "sparkhead"
        ],
        "settings": {
          "hdfs-site.dfs.replication": "3"
        }
      }
    }
  }
}
```

> [!NOTE]
> Resource-level configurations override service-level configurations. One resource can be assigned to multiple services.
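For example, if the `hdfs` service sets a replication factor and a resource assigned to that service (here `storage-0`) sets the same key, the resource-level value takes effect for that resource. The replication values in this sketch are illustrative:

```json
{
  "op": "add",
  "path": "spec.services.hdfs.settings",
  "value": {
    "hdfs-site.dfs.replication": "3"
  }
}
```

```json
{
  "op": "add",
  "path": "spec.resources.storage-0.spec.settings",
  "value": {
    "hdfs": {
      "hdfs-site.dfs.replication": "2"
    }
  }
}
```

With the override rule above, `storage-0` uses `2` while the other resources of the `hdfs` service keep the service-level value `3`.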

## Limitations

Configurations can only be specified at the category level. To specify multiple configurations under the same sub-category, the common prefix cannot be extracted in the cluster profile. For example, the following patch is not supported:

```json
{
  "op": "add",
  "path": "spec.services.hdfs.settings.core-site.hadoop",
  "value": {
    "proxyuser.xyz.users": "*",
    "proxyuser.abc.users": "*"
  }
}
```
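Instead, specify each setting with its full `sub-category.setting` key at the category level, as in the service-level example earlier. This sketch restates the two proxyuser settings above in the supported form:

```json
{
  "op": "add",
  "path": "spec.services.hdfs.settings",
  "value": {
    "core-site.hadoop.proxyuser.xyz.users": "*",
    "core-site.hadoop.proxyuser.abc.users": "*"
  }
}
```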

## Next steps