Data Classification Tutorial
The Apache Eagle data classification feature provides the ability to classify data with different levels of sensitivity. Currently this feature is available ONLY for applications monitoring HDFS, Hive, and HBase, e.g., the HdfsAuditLog, HiveQueryLog, and HBaseSecurityLog applications.
This page covers:
- Cluster Connection
- Data Classification
Cluster Connection
Here we give example configurations for HDFS, HBase, and Hive. Suppose the cluster to monitor is the Hortonworks sandbox. These settings are entered on the admin management page.
- HDFS

  - Basic case

    You may configure the default path so that Apache Hadoop clients can connect to the remote HDFS NameNode.

    classification.fs.defaultFS=hdfs://sandbox.hortonworks.com:8020
  - HA case

    Under HA mode, point fs.defaultFS at your nameservice and let the client know how it is configured (the backing NameNodes) and how to fail over between them.

    classification.fs.defaultFS=hdfs://nameservice1
    classification.dfs.nameservices=nameservice1
    classification.dfs.ha.namenodes.nameservice1=namenode1,namenode2
    classification.dfs.namenode.rpc-address.nameservice1.namenode1=hadoopnamenode01:8020
    classification.dfs.namenode.rpc-address.nameservice1.namenode2=hadoopnamenode02:8020
    classification.dfs.client.failover.proxy.provider.nameservice1=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
  - Kerberos-secured cluster

    For a Kerberos-secured cluster, get a keytab file and the principal from your admin, and configure "eagle.keytab.file" and "eagle.kerberos.principal" to authenticate Eagle's access.

    classification.eagle.keytab.file=/EAGLE-HOME/.keytab/eagle.keytab
    classification.eagle.kerberos.principal=eagle@SOMEWHERE.COM

    If you hit an exception about "invalid server principal name", check the DNS resolver and the data-transfer encryption properties, such as "dfs.encrypt.data.transfer", "dfs.encrypt.data.transfer.algorithm", "dfs.trustedchannel.resolver.class", and "dfs.datatransfer.client.encrypt".
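  To sanity-check these settings outside Eagle, a minimal Java sketch like the following connects with the same example values and lists the HDFS root; the class name is hypothetical, and the Kerberos lines apply only to a secured cluster.

  ```java
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.security.UserGroupInformation;

  public class HdfsConnectionCheck {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Same endpoint as classification.fs.defaultFS above.
          conf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020");
          // Kerberos-secured cluster only: log in with the Eagle keytab.
          conf.set("hadoop.security.authentication", "kerberos");
          UserGroupInformation.setConfiguration(conf);
          UserGroupInformation.loginUserFromKeytab(
                  "eagle@SOMEWHERE.COM", "/EAGLE-HOME/.keytab/eagle.keytab");
          // List the root directory to confirm the NameNode is reachable.
          try (FileSystem fs = FileSystem.get(conf)) {
              for (FileStatus status : fs.listStatus(new Path("/"))) {
                  System.out.println(status.getPath());
              }
          }
      }
  }
  ```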
- Hive

  - Basic case

    classification.accessType=metastoredb_jdbc
    classification.password=hive
    classification.user=hive
    classification.jdbcDriverClassName=com.mysql.jdbc.Driver
    classification.jdbcUrl=jdbc:mysql://sandbox.hortonworks.com/hive?createDatabaseIfNotExist=true
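  Since this access type reads the Hive metastore database directly over JDBC, you can validate the URL and credentials with a plain JDBC query. Below is a minimal sketch, assuming the standard MySQL metastore schema (the TBLS table); the class name is hypothetical.

  ```java
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class MetastoreConnectionCheck {
      public static void main(String[] args) throws Exception {
          // Same driver, URL, user, and password as the classification.* settings above.
          Class.forName("com.mysql.jdbc.Driver");
          try (Connection conn = DriverManager.getConnection(
                       "jdbc:mysql://sandbox.hortonworks.com/hive", "hive", "hive");
               Statement stmt = conn.createStatement();
               // TBLS holds the table list in the standard metastore schema.
               ResultSet rs = stmt.executeQuery("SELECT TBL_NAME FROM TBLS LIMIT 10")) {
              while (rs.next()) {
                  System.out.println(rs.getString("TBL_NAME"));
              }
          }
      }
  }
  ```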
- HBase

  - Basic case

    You need to set the "hbase.zookeeper.quorum" and "hbase.zookeeper.property.clientPort" properties (a connection check is sketched after the HBase cases).

    classification.hbase.zookeeper.property.clientPort=2181
    classification.hbase.zookeeper.quorum=localhost
  - Kerberos-secured cluster

    Depending on your environment, you may add or remove some of the following properties. Here is a reference configuration.

    classification.hbase.zookeeper.property.clientPort=2181
    classification.hbase.zookeeper.quorum=localhost
    classification.hbase.security.authentication=kerberos
    classification.hbase.master.kerberos.principal=hadoop/_HOST@EXAMPLE.COM
    classification.zookeeper.znode.parent=/hbase
    classification.eagle.keytab.file=/EAGLE-HOME/.keytab/eagle.keytab
    classification.eagle.kerberos.principal=eagle@EXAMPLE.COM
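  As referenced in the basic case above, here is a minimal connection check using the standard HBase 1.x client API; the class name is hypothetical, and it lists whatever tables the configured quorum exposes.

  ```java
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class HBaseConnectionCheck {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          // Same values as the classification.hbase.* settings above.
          conf.set("hbase.zookeeper.quorum", "localhost");
          conf.set("hbase.zookeeper.property.clientPort", "2181");
          // Listing table names confirms ZooKeeper and the HBase master are reachable.
          try (Connection connection = ConnectionFactory.createConnection(conf);
               Admin admin = connection.getAdmin()) {
              for (TableName table : admin.listTableNames()) {
                  System.out.println(table.getNameAsString());
              }
          }
      }
  }
  ```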
For any questions on the Kerberos configuration in Eagle, please check the FAQ first.
Data Classification
After the configuration is updated, we can go back to the classification page in the Eagle UI. Here we take HdfsAuditLog as an example to explain how to classify data and how to monitor sensitive data in Eagle.
Part 1: Sensitivity Edit

- Add a sensitivity mark to files/directories.

  - Basic: label sensitive files directly (recommended)

  - Advanced: import a JSON file/content (a sketch of the format follows this list)

- Remove a sensitivity mark from files/directories.

  - Basic: remove the label directly

  - Advanced: delete in batch
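For the advanced import option, the exact JSON field names depend on your Eagle version; a hypothetical sketch of the shape, assuming the filedir and sensitivityType fields of the HDFS sensitivity entity, is:

```json
[
  {"filedir": "/tmp/private", "sensitivityType": "PRIVATE"},
  {"filedir": "/apps/finance/reports", "sensitivityType": "PRIVATE"}
]
```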
Part 2: Monitor sensitive data
You can mark a particular folder/file as "PRIVATE". Once you have this information, you can create policies using this label.
For example, the following policy monitors all operations on resources with sensitivity type "PRIVATE".
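Here is a minimal sketch of such a policy as a Siddhi query, assuming the HDFS audit log stream is named hdfsAuditLogEventStream and carries a sensitivityType field:

```
from hdfsAuditLogEventStream[sensitivityType == 'PRIVATE']
select *
insert into outputStream;
```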