Relative to the other big data solutions, Data Lake Analytics is one of the easier ones to get up and running on Azure, in my experience anyway. For a start, other than a background in databases and some experience on Azure, you only really need to know SQL and C#, two skills many BI developers in your typical organisation already have, unlike Hadoop where much more is needed. The language of Data Lake Analytics is U-SQL, a hybrid of C# and SQL.
So what is Azure Data Lake Analytics (ADLA)? Microsoft explains it this way:
In this quick intro to ADLA, I’ll show what it actually looks like, some typical usage and how it fits into the wider Azure platform. It is not a step-by-step tutorial, as there’s plenty of those (and in greater depth) on Azure docs and other blogs already, but rather a slightly deeper dive including integration with other services and apps.
It is worth noting Azure Data Lake Analytics is not the same as Azure Data Lake Store (ADLS): ADLS serves as the hyperscale storage layer while ADLA is the processing engine. This article by Blue Granite consulting provides some great background information, including how it compares to Hadoop. It is also especially worth noting that scripting is now available for both R and Python in ADLA, albeit in a somewhat limited capacity (no debugging yet guys).
To get started, we’ll shred some JSON, a simple but common task your typical developer can relate to. Below is the JSON (straight from the Azure docos).
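For reference, the kind of JSON we’ll be shredding is a small array of records along these lines (the field names here are illustrative only, not the exact sample from the docs):

```json
[
  { "id": 1, "name": "Anna", "skills": [ "SQL", "C#" ] },
  { "id": 2, "name": "Ben",  "skills": [ "Python" ] }
]
```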
The great thing about ADLA is that it truly lets you develop and test locally in Visual Studio. This is important as there’s a cost for each run on Azure (every time you process something). The idea is to code and test on your local machine (with your C: drive acting as the Data Lake storage) and change the configuration when you deploy to Azure. Like virtually everything else on Azure, you can also code and run the jobs directly in the Azure portal, though I suspect most developers prefer working in an IDE like Visual Studio (for a start, there is no debugging on Azure).
Server Explorer within Visual Studio, as seen below, acts like SSMS for those used to SQL Server and lets you both run queries and obtain important property information needed during development. Like SSMS, you can switch between environments (local or Azure) as well as services (Data Lakes, Hadoop, Streaming etc).
U-SQL, the native language of ADLA, is a hybrid between C# and SQL. The code below shows one way to shred the JSON shown earlier (provided by Microsoft). In this example, there is no visible C#, as that’s embedded in the DLLs referenced in the first few lines. When the code runs, it will flatten the JSON and output it to a CSV file. Note the ADLA account in the top left-hand corner: you can switch the execution environment between local and Azure.
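As a rough sketch, a script along these lines does the job. It leans on the JsonExtractor from the Microsoft.Analytics.Samples.Formats assembly (part of the U-SQL samples on GitHub, which you register in your ADLA catalog first); the file paths and column names are assumptions for illustration:

```usql
// Assemblies from the U-SQL samples -- register these in the catalog first.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

// Hypothetical input path and schema -- adjust to match your own JSON.
@people =
    EXTRACT id int,
            name string
    FROM "/input/people.json"
    USING new JsonExtractor();

// Write the flattened rows out as CSV.
OUTPUT @people
TO "/output/people.csv"
USING Outputters.Csv(outputHeader : true);
```

Run it against local storage first; only the account selection needs to change when you point it at Azure.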
The Azure Storage Explorer, as seen below, is one of the handiest applications Microsoft has developed for working with Azure. It is similar to Windows Explorer, except it is looking at your big data storage in the cloud, including Data Lakes, Blobs and Cosmos NoSQL DBs. It also holds critical information like access keys and connection strings, and lets you perform simple tasks like uploads/downloads. As well as exploring the cloud, it lets you connect to local emulators. The output destination for the code above can be one of these Azure storage containers or local storage.
With your code written, it’s time for execution. Once execution starts, this progress indicator pops up (both in Azure and Visual Studio). Other than looking pretty cool, it also provides a whole stack of information you’ll need, such as your input/output paths.
When the job completes, you can view the contents or export them elsewhere (in the cloud or locally).
If you go back to the Data Lake storage, either in Azure or Storage Explorer as seen below, you’ll see our output file.
The U-SQL code we saw was only one flavour; the C# there was already compiled into the DLL we referenced, meaning we can’t really see what it is doing, nor can we modify it. The more common alternative is to make use of a code-behind file (similar to a WPF application, except it is SQL and C# here rather than XAML and C#). The file looks like this below.
As for the code, the U-SQL file (to flatten a JSON file) would look like this. Notice the difference: most of the work is done in the C# code-behind.
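As a sketch of the shape such a script takes (the namespace, class name and paths here are hypothetical), the U-SQL side reduces to little more than an EXTRACT and an OUTPUT, with the heavy lifting delegated to a custom extractor defined in the code-behind:

```usql
// MyScript.usql -- the extractor class lives in MyScript.usql.cs (the code-behind).
@people =
    EXTRACT id int,
            name string
    FROM "/input/people.json"
    USING new MyApp.CustomJsonExtractor();   // hypothetical code-behind class

OUTPUT @people
TO "/output/people.csv"
USING Outputters.Csv();
```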
The C# would look something like this. Nothing special, though the parameters of the Extract method are interesting and new (IUnstructuredReader).
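A minimal sketch of such a code-behind, assuming a custom extractor that parses the whole file with Json.NET (the class and column names are hypothetical, and a production UDO would stream the input rather than read it in one go):

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.Analytics.Interfaces;   // IExtractor, IUnstructuredReader, IUpdatableRow
using Newtonsoft.Json.Linq;

namespace MyApp
{
    // Hypothetical code-behind extractor: turns a JSON array into rows.
    public class CustomJsonExtractor : IExtractor
    {
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            using (var reader = new StreamReader(input.BaseStream))
            {
                foreach (var item in JArray.Parse(reader.ReadToEnd()))
                {
                    // Column names must match the EXTRACT schema in the U-SQL script.
                    output.Set("id", (int)item["id"]);
                    output.Set("name", (string)item["name"]);
                    yield return output.AsReadOnly();
                }
            }
        }
    }
}
```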
I mentioned ADLA now supports both R and Python. I don’t do R, but Python is straightforward and didn’t require any special setup (though my PC had Python installed and configured already). I pulled this down straight from GitHub and it ran without issues. Contrast the embedded approach here for Python (and presumably for R) with the code-behind approach for C#: there is no code-behind for R or Python. The embedded approach is similar to Python/R scripting on SQL Server.
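The documented pattern embeds the Python script in the U-SQL as a string variable and hands it to Extension.Python.Reducer; the entry point must be a function called usqlml_main that takes and returns a pandas DataFrame. A rough sketch, where the column names and paths are assumptions:

```usql
REFERENCE ASSEMBLY [ExtPython];

DECLARE @pyScript = @"
def usqlml_main(df):
    # df arrives as a pandas DataFrame, one per reduce group
    df['name_upper'] = df['name'].str.upper()
    return df
";

@people =
    EXTRACT id int,
            name string
    FROM "/input/people.csv"
    USING Extractors.Csv();

@result =
    REDUCE @people ON id
    PRODUCE id int, name string, name_upper string
    USING new Extension.Python.Reducer(pyScript : @pyScript);

OUTPUT @result
TO "/output/people_upper.csv"
USING Outputters.Csv();
```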
While you can use Azure Data Lake Analytics as a standalone service and run these jobs manually ad hoc, there’s a whole bunch of Azure services that integrate with ADLA, both in Azure as well as on-prem, including SSIS.
We won’t go into the various ways of integrating in this post, but it’s still worth showing Azure Data Factory, which I just happen to love. Data Factory is similar to SSIS, though perhaps on steroids, and looks something like this below. It lets you create all sorts of pipelines that integrate with other services like Machine Learning, Spark, Hive and Cosmos DB.
So there you have it, a quick overview of Azure Data Lake Analytics. Pretty cool, hey?