
---
layout: global
displayTitle: Spark Security
title: Security
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

* This will become a table of contents (this text will be scraped).
{:toc}

# Spark Security: Things You Need To Know

Security features like authentication are not enabled by default. When deploying a cluster that is open to the internet
or an untrusted network, it's important to secure access to the cluster to prevent unauthorized applications
from running on the cluster.

Spark supports multiple deployments types and each one supports different levels of security. Not
all deployment types will be secure in all environments and none are secure by default. Be
sure to evaluate your environment, what Spark supports, and take the appropriate measure to secure
your Spark deployment.

There are many different types of security concerns. Spark does not necessarily protect against
all of them. Listed below are some of the things Spark supports. Also check the deployment
documentation for the type of deployment you are using for deployment-specific settings. Anything
not documented, Spark does not support.

# Spark RPC (Communication protocol between Spark processes)

## Authentication

Spark currently supports authentication for RPC channels using a shared secret. Authentication can
be turned on by setting the `spark.authenticate` configuration parameter.

The exact mechanism used to generate and distribute the shared secret is deployment-specific. Unless
specified below, the secret must be defined by setting the `spark.authenticate.secret` config
option. The same secret is shared by all Spark applications and daemons in that case, which limits
the security of these deployments, especially on multi-tenant clusters.
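
As a minimal sketch, on a deployment where the secret is distributed manually (the secret value,
application class, and jar name below are all illustrative), authentication could be enabled like this:

```bash
# Enable RPC authentication with a manually distributed shared secret.
# The same secret must be configured on all Spark applications and daemons.
spark-submit \
  --conf spark.authenticate=true \
  --conf spark.authenticate.secret=<shared-secret> \
  --class com.example.MyApp \
  my-app.jar
```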

The REST Submission Server and the MesosClusterDispatcher do not support authentication. You should
ensure that all network access to the REST API and MesosClusterDispatcher (ports 6066 and 7077,
respectively, by default) is restricted to hosts that are trusted to submit jobs.

### YARN

For Spark on YARN, Spark will automatically handle generating and
distributing the shared secret. Each application will use a unique shared secret. In
the case of YARN, this feature relies on YARN RPC encryption being enabled for the distribution of
secrets to be secure.

### Kubernetes

On Kubernetes, Spark will also automatically generate an authentication secret unique to each
application. The secret is propagated to executor pods using environment variables. This means
that any user that can list pods in the namespace where the Spark application is running can
also see their authentication secret. Access control rules should be properly set up by the
Kubernetes admin to ensure that Spark authentication is secure.

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.authenticate` | false | Whether Spark authenticates its internal connections. | 1.0.0 |
| `spark.authenticate.secret` | None | The secret key used for authentication. See above for when this configuration should be set. | 1.0.0 |

Alternatively, one can mount authentication secrets using files and Kubernetes secrets that
the user mounts into their pods.

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.authenticate.secret.file` | None | Path pointing to the secret key to use for securing connections. Ensure that the contents of the file have been securely generated. This file is loaded on both the driver and the executors unless other settings override this (see below). | 3.0.0 |
| `spark.authenticate.secret.driver.file` | The value of `spark.authenticate.secret.file` | When specified, overrides the location that the Spark driver reads to load the secret. Useful in client mode, when the location of the secret file may differ in the pod versus the node the driver is running in. When this is specified, `spark.authenticate.secret.executor.file` must be specified so that the driver and the executors can both use files to load the secret key. Ensure that the contents of the file on the driver are identical to the contents of the file on the executors. | 3.0.0 |
| `spark.authenticate.secret.executor.file` | The value of `spark.authenticate.secret.file` | When specified, overrides the location that the Spark executors read to load the secret. Useful in client mode, when the location of the secret file may differ in the pod versus the node the driver is running in. When this is specified, `spark.authenticate.secret.driver.file` must be specified so that the driver and the executors can both use files to load the secret key. Ensure that the contents of the file on the driver are identical to the contents of the file on the executors. | 3.0.0 |

Note that when using files, Spark will not mount these files into the containers for you. It is up
to you to ensure that the secret files are deployed securely into your containers and that the
driver's secret file agrees with the executors' secret file.
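
As an illustrative sketch (the secret name, mount path, and key-generation command are assumptions,
not prescribed values), a Kubernetes secret could be created, mounted into the driver and executor
pods via Spark's pod secret options, and referenced through the file-based config:

```bash
# Create a Kubernetes secret from a securely generated key (names illustrative).
kubectl create secret generic spark-auth \
  --from-literal=auth-secret="$(openssl rand -hex 32)"

# Mount the secret into the driver and executor pods, then point Spark at the file.
spark-submit \
  --conf spark.authenticate=true \
  --conf spark.kubernetes.driver.secrets.spark-auth=/mnt/secrets \
  --conf spark.kubernetes.executor.secrets.spark-auth=/mnt/secrets \
  --conf spark.authenticate.secret.file=/mnt/secrets/auth-secret \
  ...
```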

## Encryption

Spark supports AES-based encryption for RPC connections. For encryption to be enabled, RPC
authentication must also be enabled and properly configured. AES encryption uses the
Apache Commons Crypto library, and Spark's
configuration system allows access to that library's configuration for advanced users.
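
As a minimal sketch (secret distribution as described in the authentication section above),
AES-based encryption is layered on top of RPC authentication like this:

```bash
# RPC encryption requires RPC authentication to be enabled as well.
spark-submit \
  --conf spark.authenticate=true \
  --conf spark.network.crypto.enabled=true \
  --conf spark.network.crypto.keyLength=256 \
  ...
```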

There is also support for SASL-based encryption, although it should be considered deprecated. It
is still required when talking to shuffle services from Spark versions older than 2.2.0.

The following table describes the different options available for configuring this feature.

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.network.crypto.enabled` | false | Enable AES-based RPC encryption, including the new authentication protocol added in 2.2.0. | 2.2.0 |
| `spark.network.crypto.keyLength` | 128 | The length in bits of the encryption key to generate. Valid values are 128, 192 and 256. | 2.2.0 |
| `spark.network.crypto.keyFactoryAlgorithm` | PBKDF2WithHmacSHA1 | The key factory algorithm to use when generating encryption keys. Should be one of the algorithms supported by the `javax.crypto.SecretKeyFactory` class in the JRE being used. | 2.2.0 |
| `spark.network.crypto.config.*` | None | Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of the commons-crypto configuration without the `commons.crypto` prefix. | 2.2.0 |
| `spark.network.crypto.saslFallback` | true | Whether to fall back to SASL authentication if authentication fails using Spark's internal mechanism. This is useful when the application is connecting to old shuffle services that do not support the internal Spark authentication protocol. On the shuffle service side, disabling this feature will block older clients from authenticating. | 2.2.0 |
| `spark.authenticate.enableSaslEncryption` | false | Enable SASL-based encrypted communication. | 2.2.0 |
| `spark.network.sasl.serverAlwaysEncrypt` | false | Disable unencrypted connections for ports using SASL authentication. This will deny connections from clients that have authentication enabled, but do not request SASL-based encryption. | 1.4.0 |

# Local Storage Encryption

Spark supports encrypting temporary data written to local disks. This covers shuffle files, shuffle
spills and data blocks stored on disk (for both caching and broadcast variables). It does not cover
encrypting output data generated by applications with APIs such as `saveAsHadoopFile` or
`saveAsTable`. It also may not cover temporary files created explicitly by the user.
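
A minimal sketch of enabling it, together with the strongly recommended RPC encryption:

```bash
# Encrypt shuffle files, spills and on-disk blocks written to local storage.
spark-submit \
  --conf spark.io.encryption.enabled=true \
  --conf spark.io.encryption.keySizeBits=256 \
  --conf spark.authenticate=true \
  --conf spark.network.crypto.enabled=true \
  ...
```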

The following settings cover enabling encryption for data written to disk:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.io.encryption.enabled` | false | Enable local disk I/O encryption. Currently supported by all modes except Mesos. It's strongly recommended that RPC encryption be enabled when using this feature. | 2.1.0 |
| `spark.io.encryption.keySizeBits` | 128 | IO encryption key size in bits. Supported values are 128, 192 and 256. | 2.1.0 |
| `spark.io.encryption.keygen.algorithm` | HmacSHA1 | The algorithm to use when generating the IO encryption key. The supported algorithms are described in the KeyGenerator section of the Java Cryptography Architecture Standard Algorithm Name Documentation. | 2.1.0 |
| `spark.io.encryption.commons.config.*` | None | Configuration values for the commons-crypto library, such as which cipher implementations to use. The config name should be the name of the commons-crypto configuration without the `commons.crypto` prefix. | 2.1.0 |

# Web UI

## Authentication and Authorization

Enabling authentication for the Web UIs is done using javax servlet filters.
You will need a filter that implements the authentication method you want to deploy. Spark does not
provide any built-in authentication filters.

Spark also supports access control to the UI when an authentication filter is present. Each
application can be configured with its own separate access control lists (ACLs). Spark
differentiates between "view" permissions (who is allowed to see the application's UI), and "modify"
permissions (who can do things like kill jobs in a running application).

ACLs can be configured for either users or groups. Configuration entries accept comma-separated
lists as input, meaning multiple users or groups can be given the desired privileges. This can be
used if you run on a shared cluster and have a set of administrators or developers who need to
monitor applications they may not have started themselves. A wildcard (*) added to a specific ACL
means that all users will have the respective privilege. By default, only the user submitting the
application is added to the ACLs.
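
For instance (user and group names are illustrative), view and modify ACLs might be configured as
follows:

```bash
# Requires an authentication filter to be installed (see spark.ui.filters).
spark-submit \
  --conf spark.acls.enable=true \
  --conf spark.admin.acls=admin1 \
  --conf spark.ui.view.acls=user1,user2 \
  --conf spark.ui.view.acls.groups=analysts \
  --conf spark.modify.acls=user1 \
  ...
```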

Group membership is established by using a configurable group mapping provider. The mapper is
configured using the `spark.user.groups.mapping` config option, described in the table
below.

The following options control the authentication of Web UIs:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.ui.filters` | None | See the Spark UI configuration for how to configure filters. | 1.0.0 |
| `spark.acls.enable` | false | Whether UI ACLs should be enabled. If enabled, this checks to see if the user has access permissions to view or modify the application. Note this requires the user to be authenticated, so if no authentication filter is installed, this option does not do anything. | 1.1.0 |
| `spark.admin.acls` | None | Comma-separated list of users that have view and modify access to the Spark application. | 1.1.0 |
| `spark.admin.acls.groups` | None | Comma-separated list of groups that have view and modify access to the Spark application. | 2.0.0 |
| `spark.modify.acls` | None | Comma-separated list of users that have modify access to the Spark application. | 1.1.0 |
| `spark.modify.acls.groups` | None | Comma-separated list of groups that have modify access to the Spark application. | 2.0.0 |
| `spark.ui.view.acls` | None | Comma-separated list of users that have view access to the Spark application. | 1.0.0 |
| `spark.ui.view.acls.groups` | None | Comma-separated list of groups that have view access to the Spark application. | 2.0.0 |
| `spark.user.groups.mapping` | `org.apache.spark.security.ShellBasedGroupsMappingProvider` | The list of groups for a user is determined by a group mapping service defined by the trait `org.apache.spark.security.GroupMappingServiceProvider`, which can be configured by this property. By default, a Unix shell-based implementation is used, which collects this information from the host OS. Note: this implementation supports only Unix/Linux-based environments; Windows is currently not supported. However, a new platform/protocol can be supported by implementing the trait mentioned above. | 2.0.0 |

On YARN, the view and modify ACLs are provided to the YARN service when submitting applications, and
control who has the respective privileges via YARN interfaces.

## Spark History Server ACLs

Authentication for the SHS Web UI is enabled the same way as for regular applications, using
servlet filters.

To enable authorization in the SHS, a few extra options are used:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.history.ui.acls.enable` | false | Specifies whether ACLs should be checked to authorize users viewing the applications in the history server. If enabled, access control checks are performed regardless of what the individual applications had set for `spark.ui.acls.enable`. The application owner will always have authorization to view their own application, and any users specified via `spark.ui.view.acls` and groups specified via `spark.ui.view.acls.groups` when the application was run will also have authorization to view that application. If disabled, no access control checks are made for any application UIs available through the history server. | 1.0.1 |
| `spark.history.ui.admin.acls` | None | Comma-separated list of users that have view access to all the Spark applications in the history server. | 2.1.1 |
| `spark.history.ui.admin.acls.groups` | None | Comma-separated list of groups that have view access to all the Spark applications in the history server. | 2.1.1 |

The SHS uses the same options to configure the group mapping provider as regular applications.
In this case, the group mapping provider will apply to all UIs served by the SHS, and individual
application configurations will be ignored.

# SSL Configuration

Configuration for SSL is organized hierarchically. The user can configure the default SSL settings
which will be used for all the supported communication protocols unless they are overwritten by
protocol-specific settings. This way the user can easily provide the common settings for all the
protocols without disabling the ability to configure each one individually. The following table
describes the SSL configuration namespaces:

| Config Namespace | Component |
| --- | --- |
| `spark.ssl` | The default SSL configuration. These values will apply to all namespaces below, unless explicitly overridden at the namespace level. |
| `spark.ssl.ui` | Spark application Web UI |
| `spark.ssl.standalone` | Standalone Master / Worker Web UI |
| `spark.ssl.historyServer` | History Server Web UI |

The full breakdown of available SSL options can be found below. The `${ns}` placeholder should be
replaced with one of the above namespaces.

| Property Name | Default | Meaning |
| --- | --- | --- |
| `${ns}.enabled` | false | Enables SSL. When enabled, `${ns}.ssl.protocol` is required. |
| `${ns}.port` | None | The port where the SSL service will listen on. The port must be defined within a specific namespace configuration; the default namespace is ignored when reading this configuration. When not set, the SSL port will be derived from the non-SSL port for the same service. A value of "0" will make the service bind to an ephemeral port. |
| `${ns}.enabledAlgorithms` | None | A comma-separated list of ciphers. The specified ciphers must be supported by the JVM. The reference list can be found in the "JSSE Cipher Suite Names" section of the Java security guide. If not set, the default cipher suite for the JRE will be used. |
| `${ns}.keyPassword` | None | The password to the private key in the key store. |
| `${ns}.keyStore` | None | Path to the key store file. The path can be absolute or relative to the directory in which the process is started. |
| `${ns}.keyStorePassword` | None | Password to the key store. |
| `${ns}.keyStoreType` | JKS | The type of the key store. |
| `${ns}.protocol` | None | TLS protocol to use. The protocol must be supported by the JVM. The reference list of protocols can be found in the "Additional JSSE Standard Names" section of the Java security guide. |
| `${ns}.needClientAuth` | false | Whether to require client authentication. |
| `${ns}.trustStore` | None | Path to the trust store file. The path can be absolute or relative to the directory in which the process is started. |
| `${ns}.trustStorePassword` | None | Password for the trust store. |
| `${ns}.trustStoreType` | JKS | The type of the trust store. |

Spark also supports retrieving `${ns}.keyPassword`, `${ns}.keyStorePassword` and `${ns}.trustStorePassword` from
Hadoop Credential Providers. A user can store passwords in a credential file and make them accessible
to different components, for example:

```bash
hadoop credential create spark.ssl.keyPassword -value password \
    -provider jceks://hdfs@nn1.example.com:9001/user/backup/ssl.jceks
```

To configure the location of the credential provider, set the `hadoop.security.credential.provider.path`
config option in the Hadoop configuration used by Spark, like:

```xml
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs@nn1.example.com:9001/user/backup/ssl.jceks</value>
</property>
```

Or via SparkConf: `spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/ssl.jceks`.

## Preparing the key stores

Key stores can be generated by the `keytool` program; see the Java 8 reference documentation for
this tool. The most basic steps to configure the key stores and the trust store for a Spark
Standalone deployment are as follows (a sketch of these steps using `keytool` is shown after the
list):

* Generate a key pair for each node
* Export the public key of the key pair to a file on each node
* Import all exported public keys into a single trust store
* Distribute the trust store to the cluster nodes
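
A sketch of those four steps, with all aliases, file names and passwords illustrative:

```bash
# 1. Generate a key pair on each node.
keytool -genkeypair -alias node1 -keyalg RSA -keysize 2048 \
  -keystore node1-keystore.jks -storepass <keystore-password>

# 2. Export the public certificate of the key pair to a file.
keytool -exportcert -alias node1 -keystore node1-keystore.jks \
  -storepass <keystore-password> -file node1.cer

# 3. Import every node's exported certificate into a single trust store.
keytool -importcert -alias node1 -file node1.cer \
  -keystore truststore.jks -storepass <truststore-password> -noprompt

# 4. Distribute truststore.jks to all cluster nodes (e.g. via scp).
```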

### YARN mode

To provide a local trust store or key store file to drivers running in cluster mode, they can be
distributed with the application using the `--files` command line argument (or the equivalent
`spark.files` configuration). The files will be placed in the driver's working directory, so the TLS
configuration should just reference the file name with no absolute path.
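
As a sketch (paths and passwords illustrative), a local trust store could be shipped with the
application and referenced by file name only:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /local/path/truststore.jks \
  --conf spark.ssl.trustStore=truststore.jks \
  --conf spark.ssl.trustStorePassword=<password> \
  ...
```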

Distributing local key stores this way may require the files to be staged in HDFS (or other similar
distributed file system used by the cluster), so it's recommended that the underlying file system be
configured with security in mind (e.g. by enabling authentication and wire encryption).

### Standalone mode

The user needs to provide key stores and configuration options for master and workers. They have to
be set by attaching appropriate Java system properties in the `SPARK_MASTER_OPTS` and
`SPARK_WORKER_OPTS` environment variables, or just in `SPARK_DAEMON_JAVA_OPTS`.
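
For example (paths and passwords illustrative), the daemons could be configured in
`conf/spark-env.sh`:

```bash
# Applies to both the master and the workers; use SPARK_MASTER_OPTS /
# SPARK_WORKER_OPTS instead for per-daemon settings.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.ssl.enabled=true \
  -Dspark.ssl.keyStore=/path/to/keystore.jks \
  -Dspark.ssl.keyStorePassword=<password> \
  -Dspark.ssl.trustStore=/path/to/truststore.jks \
  -Dspark.ssl.trustStorePassword=<password>"
```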

The user may allow the executors to use the SSL settings inherited from the worker process. That
can be accomplished by setting `spark.ssl.useNodeLocalConf` to `true`. In that case, the settings
provided by the user on the client side are not used.

### Mesos mode

Mesos 1.3.0 and newer supports Secrets primitives as both file-based and environment-based
secrets. Spark allows the specification of file-based and environment-variable-based secrets with
`spark.mesos.driver.secret.filenames` and `spark.mesos.driver.secret.envkeys`, respectively.

Depending on the secret store backend, secrets can be passed by reference or by value with the
`spark.mesos.driver.secret.names` and `spark.mesos.driver.secret.values` configuration properties,
respectively.

Reference type secrets are served by the secret store and referred to by name, for example
/mysecret. Value type secrets are passed on the command line and translated into their
appropriate files or environment variables.

# HTTP Security Headers

Apache Spark can be configured to include HTTP headers to aid in preventing Cross Site Scripting
(XSS), Cross-Frame Scripting (XFS), MIME-Sniffing, and also to enforce HTTP Strict Transport
Security.

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.ui.xXssProtection` | `1; mode=block` | Value for the HTTP X-XSS-Protection response header. Choose from: `0` (disables XSS filtering); `1` (enables XSS filtering; if a cross-site scripting attack is detected, the browser will sanitize the page); `1; mode=block` (enables XSS filtering; the browser will prevent rendering of the page if an attack is detected). | 2.3.0 |
| `spark.ui.xContentTypeOptions.enabled` | true | When enabled, the X-Content-Type-Options HTTP response header will be set to "nosniff". | 2.3.0 |
| `spark.ui.strictTransportSecurity` | None | Value for the HTTP Strict Transport Security (HSTS) response header. Choose from the following and set expire-time accordingly: `max-age=<expire-time>`; `max-age=<expire-time>; includeSubDomains`; `max-age=<expire-time>; preload`. This option is only used when SSL/TLS is enabled. | 2.3.0 |
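
As an illustration (the max-age value is arbitrary), the headers could be enabled like this:

```bash
spark-submit \
  --conf spark.ui.xXssProtection="1; mode=block" \
  --conf spark.ui.xContentTypeOptions.enabled=true \
  --conf spark.ui.strictTransportSecurity="max-age=31536000; includeSubDomains" \
  ...
```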

# Configuring Ports for Network Security

Generally speaking, a Spark cluster and its services are not deployed on the public internet.
They are generally private services, and should only be accessible within the network of the
organization that deploys Spark. Access to the hosts and ports used by Spark services should
be limited to origin hosts that need to access the services.

Below are the primary ports that Spark uses for its communication and how to
configure those ports.

## Standalone mode only

| From | To | Default Port | Purpose | Configuration Setting | Notes |
| --- | --- | --- | --- | --- | --- |
| Browser | Standalone Master | 8080 | Web UI | `spark.master.ui.port` / `SPARK_MASTER_WEBUI_PORT` | Jetty-based. Standalone mode only. |
| Browser | Standalone Worker | 8081 | Web UI | `spark.worker.ui.port` / `SPARK_WORKER_WEBUI_PORT` | Jetty-based. Standalone mode only. |
| Driver / Standalone Worker | Standalone Master | 7077 | Submit job to cluster / Join cluster | `SPARK_MASTER_PORT` | Set to "0" to choose a port randomly. Standalone mode only. |
| External Service | Standalone Master | 6066 | Submit job to cluster via REST API | `spark.master.rest.port` | Use `spark.master.rest.enabled` to enable/disable this service. Standalone mode only. |
| Standalone Master | Standalone Worker | (random) | Schedule executors | `SPARK_WORKER_PORT` | Set to "0" to choose a port randomly. Standalone mode only. |

## All cluster managers

| From | To | Default Port | Purpose | Configuration Setting | Notes |
| --- | --- | --- | --- | --- | --- |
| Browser | Application | 4040 | Web UI | `spark.ui.port` | Jetty-based |
| Browser | History Server | 18080 | Web UI | `spark.history.ui.port` | Jetty-based |
| Executor / Standalone Master | Driver | (random) | Connect to application / Notify executor state changes | `spark.driver.port` | Set to "0" to choose a port randomly. |
| Executor / Driver | Executor / Driver | (random) | Block Manager port | `spark.blockManager.port` | Raw socket via ServerSocketChannel |

# Kerberos

Spark supports submitting applications in environments that use Kerberos for authentication.
In most cases, Spark relies on the credentials of the currently logged-in user when authenticating
to Kerberos-aware services. Such credentials can be obtained by logging in to the configured KDC
with tools like `kinit`.

When talking to Hadoop-based services, Spark needs to obtain delegation tokens so that non-local
processes can authenticate. Spark ships with support for HDFS and other Hadoop file systems, Hive
and HBase.

When using a Hadoop filesystem (such as HDFS or WebHDFS), Spark will acquire the relevant tokens
for the service hosting the user's home directory.

An HBase token will be obtained if HBase is in the application's classpath, and the HBase
configuration has Kerberos authentication turned on (`hbase.security.authentication=kerberos`).

Similarly, a Hive token will be obtained if Hive is in the classpath, and the configuration includes
URIs for remote metastore services (`hive.metastore.uris` is not empty).

If an application needs to interact with other secure Hadoop filesystems, their URIs need to be
explicitly provided to Spark at launch time. This is done by listing them in the
`spark.kerberos.access.hadoopFileSystems` property, described in the configuration section below.

Spark also supports custom delegation token providers using the Java Services
mechanism (see `java.util.ServiceLoader`). Implementations of
`org.apache.spark.security.HadoopDelegationTokenProvider` can be made available to Spark
by listing their names in the corresponding file in the jar's `META-INF/services` directory.
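
As an illustration (the provider class below is hypothetical), the services file inside the
application jar lists one fully-qualified implementation class name per line:

```
# File: META-INF/services/org.apache.spark.security.HadoopDelegationTokenProvider
com.example.security.MyServiceTokenProvider
```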

Delegation token support is currently only supported in YARN and Mesos modes. Consult the
deployment-specific page for more information.

The following options provide finer-grained control for this feature:

| Property Name | Default | Meaning | Since Version |
| --- | --- | --- | --- |
| `spark.security.credentials.${service}.enabled` | true | Controls whether to obtain credentials for services when security is enabled. By default, credentials for all supported services are retrieved when those services are configured, but it's possible to disable that behavior if it somehow conflicts with the application being run. | 2.3.0 |
| `spark.kerberos.access.hadoopFileSystems` | (none) | A comma-separated list of secure Hadoop filesystems your Spark application is going to access. For example, `spark.kerberos.access.hadoopFileSystems=hdfs://nn1.com:8032,hdfs://nn2.com:8032,webhdfs://nn3.com:50070`. The Spark application must have access to the filesystems listed and Kerberos must be properly configured to be able to access them (either in the same realm or in a trusted realm). Spark acquires security tokens for each of the filesystems so that the Spark application can access those remote Hadoop filesystems. | 3.0.0 |

Users can exclude Kerberos delegation token renewal at the resource scheduler. Currently this is
only supported on YARN. The configuration is covered in the Running Spark on YARN page.

## Long-Running Applications

Long-running applications may run into issues if their run time exceeds the maximum delegation
token lifetime configured in services it needs to access.

This feature is not available everywhere. In particular, it's only implemented
on YARN and Kubernetes (both client and cluster modes), and on Mesos when using client mode.

Spark supports automatically creating new tokens for these applications. There are two ways to
enable this functionality.

### Using a Keytab

By providing Spark with a principal and keytab (e.g. using `spark-submit` with `--principal`
and `--keytab` parameters), the application will maintain a valid Kerberos login that can be
used to retrieve delegation tokens indefinitely.
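
A minimal sketch (principal and keytab path are illustrative):

```bash
# Spark logs in from the keytab and keeps delegation tokens renewed.
spark-submit \
  --principal alice@EXAMPLE.COM \
  --keytab /path/to/alice.keytab \
  ...
```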

Note that when using a keytab in cluster mode, it will be copied over to the machine running the
Spark driver. In the case of YARN, this means using HDFS as a staging area for the keytab, so it's
strongly recommended that both YARN and HDFS be secured with encryption, at least.

### Using a ticket cache

By setting `spark.kerberos.renewal.credentials` to `ccache` in Spark's configuration, the local
Kerberos ticket cache will be used for authentication. Spark will keep the ticket renewed during its
renewable life, but after it expires a new ticket needs to be acquired (e.g. by running `kinit`).

It's up to the user to maintain an updated ticket cache that Spark can use.

The location of the ticket cache can be customized by setting the `KRB5CCNAME` environment
variable.
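
A sketch of this flow (the principal is illustrative):

```bash
# Obtain (or renew) a TGT in the local ticket cache, then submit.
kinit alice@EXAMPLE.COM
spark-submit \
  --conf spark.kerberos.renewal.credentials=ccache \
  ...
# After the ticket's renewable lifetime ends, run kinit again.
```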

## Secure Interaction with Kubernetes

When talking to Hadoop-based services behind Kerberos, it was noted that Spark needs to obtain delegation tokens
so that non-local processes can authenticate. These delegation tokens in Kubernetes are stored in Secrets that are
shared by the Driver and its Executors. As such, there are three ways of submitting a Kerberos job:

In all cases you must define the environment variable `HADOOP_CONF_DIR` or
`spark.kubernetes.hadoop.configMapName`.

It is also important to note that the KDC needs to be visible from inside the containers.

If a user wishes to use a remote `HADOOP_CONF` directory, which contains the Hadoop configuration files, this could be
achieved by setting `spark.kubernetes.hadoop.configMapName` to a pre-existing ConfigMap.

1. Submitting with a $kinit that stores a TGT in the Local Ticket Cache:

```bash
/usr/bin/kinit -kt <keytab_file> <username>/<krb5 realm>
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```
2. Submitting with a local Keytab and Principal:

```bash
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kerberos.keytab=<KEYTAB_FILE> \
    --conf spark.kerberos.principal=<PRINCIPAL> \
    --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```
3. Submitting with pre-populated secrets that contain the Delegation Token, already existing within the namespace:

```bash
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kubernetes.kerberos.tokenSecret.name=<SECRET_TOKEN_NAME> \
    --conf spark.kubernetes.kerberos.tokenSecret.itemKey=<SECRET_ITEM_KEY> \
    --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```

3b. Submitting like in (3), however specifying a pre-created krb5 ConfigMap and a pre-created `HADOOP_CONF_DIR` ConfigMap:

```bash
/opt/spark/bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://<KUBERNETES_MASTER_ENDPOINT> \
    --conf spark.executor.instances=1 \
    --conf spark.app.name=spark-hdfs \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kubernetes.kerberos.tokenSecret.name=<SECRET_TOKEN_NAME> \
    --conf spark.kubernetes.kerberos.tokenSecret.itemKey=<SECRET_ITEM_KEY> \
    --conf spark.kubernetes.hadoop.configMapName=<HCONF_CONFIG_MAP_NAME> \
    --conf spark.kubernetes.kerberos.krb5.configMapName=<KRB_CONFIG_MAP_NAME> \
    local:///opt/spark/examples/jars/spark-examples_<VERSION>.jar \
    <HDFS_FILE_LOCATION>
```

# Event Logging

If your applications are using event logging, the directory where the event logs go
(`spark.eventLog.dir`) should be manually created with proper permissions. To secure the log files,
the directory permissions should be set to `drwxrwxrwxt`. The owner and group of the directory
should correspond to the super user who is running the Spark History Server.

This will allow all users to write to the directory but will prevent unprivileged users from
reading, removing or renaming a file unless they own it. The event log files will be created by
Spark with permissions such that only the user and group have read and write access.
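
On HDFS, for example, the directory could be prepared like this (the path and owner are
illustrative; mode `1777` corresponds to `drwxrwxrwxt`):

```bash
hdfs dfs -mkdir -p /spark-events
hdfs dfs -chown spark:spark /spark-events
hdfs dfs -chmod 1777 /spark-events
```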

# Persisting driver logs in client mode

If your applications persist driver logs in client mode by enabling `spark.driver.log.persistToDfs.enabled`,
the directory where the driver logs go (`spark.driver.log.dfsDir`) should be manually created with proper
permissions. To secure the log files, the directory permissions should be set to `drwxrwxrwxt`. The owner
and group of the directory should correspond to the super user who is running the Spark History Server.

This will allow all users to write to the directory but will prevent unprivileged users from
reading, removing or renaming a file unless they own it. The driver log files will be created by
Spark with permissions such that only the user and group have read and write access.
