In enterprise environments you meet kerberos. Often the open source and Hadoop stack have weak support for this. While it is true that nearly all services now support Kerberos, their integration with systems outside the Hadoop stack is not so complete. Spark and SQOOP for example can integrate with a kerberos enabled HDFS cluster but what about a NTLM enabled SQL Server instance using Windows Authentication? Often on posts and forums simple say use a username and password credentials. This is not acceptable at an enterprise level, especially large international banks.
Recently I’ve architected an ingestion framework using Spark, specifically reading data in using Spark’s JDBC functionality from a NTLM, windows authenticated SQL Server. This doesn’t work out of the box. SQL Server supports kerberos via JDBC with the following url:
Unfortunately this doesn’t work in Spark Yarn modes, while the executors have a ticket and can write to HDFS then connection to SQL Server is not established with this ticket. So authentication fails.
This lead to much head scratching. First try was using the JdbcRDD, a DataFrame is essentially a RDD with a schema. Looking at the signature you can see it accepts a function to get a database connection.
So you can use the JdbcRDD, pass it a connection and within that connection function get a new ticket and authenticate.
While this works you lose the benefits of a Dataframe, more importantly you have lost the schema. We did look at mapping the columns types to those returned from the JDBC result set but this is tedious, so we had a rethink. Previous I had an issue with Spark JDBC 1.3 where the fetch size was not passed to the JDBC driver, Spark set a default of 50 records, which is to low when you’re trying to load nearly a billion risk points. To fix this we wrapped the Oracle driver with our own. Whenever we issued a query, rather than specify “oracle” in the connection url we swapped it for our own “oracle_wrapper”. Within this wrapper we simply set the fetch size and returned the connection from the original driver. Our driver was simply a proxy.
With this in mind we did the same for Kerberos. We wrap the SQL Server driver, within the connection method call we get a kerberos ticket and then call the real SQL Server driver to return the connection logged in with this ticket.
The code uses the UserGroupInformation from the Hadoop API to login from a keytab and the “doAs” call to return the connection.
The above code doesn’t quite work out of the box with Spark DataFrames, it accepts a java Properties from which the principal and keytab are extracted but can be used as base to work with Spark. I can’t show the real code we use but Spark Dataframes accepts a url. In this connection url you can add the principal and keytab file name, then parse them out in the connect method like so
For this to work with Spark need to provide the kerberos principal and keytab to Spark. You can do this via the “–keytab” and “–principal” flags during your Spark Submit. The wrapped JDBC driver and the SQL Server driver need to be on the classpath of the driver and executors. Set the “–driver-class-path”
Also you need a keytab setup to allows passwordless access.
Big shout out to Greg Caban and Piotr Gabryanczyk for actual doing the hard work
Here’s a link to Greg’s repo where he has a version. Not sure if it works, give him a poke if it doesn’t because we do have it running at one of my clients.
In theory this should also work with SQOOP. Make sure to distribute the drivers for SQOOP to all the mappers and the keytab.