Wednesday, April 19, 2017

Securing Apache Hadoop Distributed File System (HDFS) - part I

Last year, I wrote a series of articles on securing Apache Kafka using Apache Ranger and Apache Sentry. In this series of posts I will look at how to secure the Apache Hadoop Distributed File System (HDFS) using Ranger and Sentry, such that only authorized users can access data stored in it. In this post we will look at a very basic way of installing Apache Hadoop and accessing some data stored in HDFS. Then we will look at how to authorize access to the data stored in HDFS using POSIX permissions and ACLs.

1) Installing Apache Hadoop

The first step is to download and extract Apache Hadoop. This tutorial uses version 2.7.3. The next step is to configure Apache Hadoop as a single-node cluster so that we can easily get it up and running on a local machine. You will need to follow the steps outlined in the single-node cluster setup guide to install ssh and pdsh. If you can't log in to localhost without a password ("ssh localhost"), then follow the instructions given there about setting up passphraseless ssh.

In addition, we want to run Apache Hadoop in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process. Edit 'etc/hadoop/core-site.xml' and add:
Next edit 'etc/hadoop/hdfs-site.xml' and add:
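
As a minimal sketch of the two edits, the following writes both files using the values from the Apache Hadoop single-node setup guide ("fs.defaultFS" set to "hdfs://localhost:9000" and "dfs.replication" set to "1"). These values are an assumption about a typical pseudo-distributed setup, and CONF_DIR is a hypothetical variable that defaults to a scratch directory so nothing is overwritten by accident:

```shell
# Assumption: CONF_DIR is a hypothetical variable naming the directory that
# holds the Hadoop config files. It defaults to a scratch directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
mkdir -p "$CONF_DIR"

# core-site.xml: default filesystem URI (value assumed from the setup guide).
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: one replica per block, since there is only one DataNode.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "Wrote core-site.xml and hdfs-site.xml to $CONF_DIR"
```

Running it with CONF_DIR=etc/hadoop from the Hadoop directory would write the real files, replacing whatever they currently contain.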

Make sure that the JAVA_HOME variable in 'etc/hadoop/' is correct, and then format the filesystem and start Hadoop via:
  • bin/hdfs namenode -format
  • sbin/
To confirm that everything is working correctly, you can open "http://localhost:50070" (the NameNode web UI) and check on the status of the cluster there. Once Hadoop has started, upload some data to HDFS and then access it:
  • bin/hadoop fs -mkdir /data
  • bin/hadoop fs -put LICENSE.txt /data
  • bin/hadoop fs -ls /data
  • bin/hadoop fs -cat /data/*
2) Securing HDFS using POSIX Permissions

We've seen how to access some data stored in HDFS via the command line. Now how can we create some authorization policies to restrict access to this data? The simplest way is to use standard POSIX permissions. If we list the /data directory, we see that LICENSE.txt has the permissions "-rw-r--r--", which means that other users can read it. Remove read access to the /data directory for everyone apart from the owner via:
  • bin/hadoop fs -chmod og-r /data
Now create a test user called "alice" on your system and try to access the LICENSE.txt file we uploaded above via:
  • sudo -u alice bin/hadoop fs -cat /data/*
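You will see an error that says "cat: Permission denied: user=alice, access=READ_EXECUTE". The error arises because expanding the glob "/data/*" requires listing the directory, which "alice" is no longer permitted to do. Note that the chmod command above only changes the directory. The file itself keeps "-rw-r--r--", so it can still be read via its full path unless you also remove read access on the file.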
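
To see what "og-r" does to the mode bits, here is a small local-filesystem analogue. It uses plain chmod and ls on a temporary directory, not HDFS itself, and assumes the directory starts out as "drwxr-xr-x", the usual default:

```shell
# Local-filesystem analogue only: plain chmod/ls, not HDFS commands.
dir="$(mktemp -d)"

chmod 755 "$dir"                          # assumed starting mode: drwxr-xr-x
before="$(ls -ld "$dir" | cut -c1-10)"    # keep only the 10 mode characters

chmod og-r "$dir"                         # same symbolic mode as used on /data
after="$(ls -ld "$dir" | cut -c1-10)"

echo "before: $before"                    # before: drwxr-xr-x
echo "after:  $after"                     # after:  drwx--x--x
rmdir "$dir"
```

The execute bit for group and others survives, so the directory can still be traversed but no longer listed.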

3) Securing HDFS using ACLs

Securing access to data stored in HDFS via POSIX permissions works fine, but it does not allow you, for example, to specify fine-grained permissions for users other than the file owner. What if we want to allow "alice" from the previous section to read the file but not "bob"? We can achieve this via Hadoop ACLs. To enable ACLs, add a property called "dfs.namenode.acls.enabled" with value "true" to 'etc/hadoop/hdfs-site.xml' and restart HDFS.
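
As a sketch, the property element looks like the output of the snippet below. It belongs inside the existing <configuration> element of hdfs-site.xml. The function name "acl_property" is a hypothetical helper used only to print the fragment:

```shell
# Hypothetical helper: prints the XML fragment that enables HDFS ACLs.
acl_property() {
cat <<'EOF'
  <property>
    <name>dfs.namenode.acls.enabled</name>
    <value>true</value>
  </property>
EOF
}

# Print the fragment so it can be pasted into hdfs-site.xml.
acl_property
```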

We can grant read access to 'alice' via:
  • bin/hadoop fs -setfacl -m user:alice:r-- /data/*
  • bin/hadoop fs -setfacl -m user:alice:r-x /data
To check the new ACLs associated with LICENSE.txt, run:
  • bin/hadoop fs -getfacl /data/LICENSE.txt
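
The output lists one ACL entry per line. As a rough sketch of how to pick out the entry for a given user, the snippet below filters a sample in that format with awk. The sample text is hypothetical (the owner and group names are placeholders, not output captured from a real cluster), and "sample_acl" and "named_user_perm" are made-up helper names:

```shell
# Hypothetical sample in the format printed by "hadoop fs -getfacl".
# Owner and group names are placeholders.
sample_acl() {
cat <<'EOF'
# file: /data/LICENSE.txt
# owner: hadoopuser
# group: supergroup
user::rw-
user:alice:r--
group::r--
mask::r--
other::r--
EOF
}

# Print the permission string of the named-user entry for $1, or nothing
# if the ACL on stdin has no entry for that user.
named_user_perm() {
  awk -F: -v u="$1" '$1 == "user" && $2 == u { print $3 }'
}

sample_acl | named_user_perm alice   # prints r--
sample_acl | named_user_perm bob     # prints nothing: no named entry for bob
```

On a running cluster, the real command output could be piped into "named_user_perm" in place of the sample.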
In addition to the owner, we now have the ACL "user:alice:r--". Now we can read the data as "alice", whereas the same command run as another user such as "bob" still fails. To avoid confusion with future blog posts on securing HDFS, we will now remove the ACLs we added via:
  • bin/hadoop fs -setfacl -b /data
  • bin/hadoop fs -setfacl -b /data/LICENSE.txt
