Apache Spark

Apache Spark is an open-source cluster computing framework, originally developed at the University of California, Berkeley's AMPLab. Spark supports advanced analytics from several programming languages, including Scala, Python, Java, and R.


Many cloud providers offer big data solutions, and Apache Spark is available on several of them. Here is a tutorial for beginners created by Ray, the author of kindidata.com. It shows how SQL queries are executed on Spark using a Jupyter Notebook on HDInsight, a cloud-based big data service provided by Microsoft.


Running SQL queries on Spark on HDInsight – Ray (Kindidata.com) – YouTube video


Steps and Syntax:

1. Create the cluster and upload the data using Azure Storage Explorer
2. Launch Jupyter and pick a kernel:

PySpark (for applications written in Python)
Spark (for applications written in Scala)

3. Import:

from pyspark.sql.types import *

4. Load the sample data as an RDD of text lines:

hvacText = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

5. Create the schema

hvacSchema = StructType([
    StructField("date", StringType(), False),
    StructField("time", StringType(), False),
    StructField("targettemp", IntegerType(), False),
    StructField("actualtemp", IntegerType(), False),
    StructField("buildingID", StringType(), False)])

6. Parse the data in hvacText

hvac = hvacText.map(lambda s: s.split(",")).filter(lambda s: s[0] != "Date").map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))
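To see what the map/filter/map chain in step 6 does without a cluster, here is a minimal sketch that applies the same logic to a plain Python list. The sample lines and the column layout are assumptions made up for illustration (they are not read from the real HVAC.csv), and `parse_hvac` is a hypothetical helper name.

```python
# Hypothetical sample lines; the column layout is assumed from the
# indices used in step 6 (0, 1, 2, 3 and 6), not taken from HVAC.csv.
sample_lines = [
    "Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID",
    "6/1/13,0:00:01,66,58,13,20,4",
    "6/2/13,1:00:01,69,68,3,20,17",
]


def parse_hvac(lines):
    """Mimic the map -> filter -> map chain from step 6 on a plain list."""
    # First map: split every line on commas.
    fields = (line.split(",") for line in lines)
    # Filter: drop the header row, whose first field is "Date".
    data = (f for f in fields if f[0] != "Date")
    # Second map: keep five columns and cast the two temperatures to int.
    return [(str(f[0]), str(f[1]), int(f[2]), int(f[3]), str(f[6])) for f in data]


rows = parse_hvac(sample_lines)
for row in rows:
    print(row)
```

Each resulting tuple has the same five fields, in the same order, as the schema from step 5, which is what allows the data frame to be built from it in the next step.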

7. Create a data frame

hvacdf = sqlContext.createDataFrame(hvac,hvacSchema)

8. Register the data frame as a table to run queries against

hvacdf.registerTempTable("hvac")

9. Run SQL queries:


SELECT date, time, buildingID, actualtemp FROM hvac WHERE buildingID = '16'
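The filter in this query can also be tried without a cluster. The sketch below is not Spark: it uses Python's built-in sqlite3 module purely as a stand-in for the temporary table, and the three rows are invented for illustration. On the cluster the same query text would be run against the registered hvac table, for example through sqlContext.sql(...).

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the Spark temp table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hvac (date TEXT, time TEXT, targettemp INTEGER, "
    "actualtemp INTEGER, buildingID TEXT)"
)

# Hypothetical rows, made up for this example.
conn.executemany(
    "INSERT INTO hvac VALUES (?, ?, ?, ?, ?)",
    [
        ("6/1/13", "0:00:01", 66, 58, "4"),
        ("6/2/13", "1:00:01", 69, 68, "16"),
        ("6/3/13", "2:00:01", 70, 73, "16"),
    ],
)

# buildingID is stored as a string, so it is compared with the string '16'.
query = "SELECT date, time, buildingID, actualtemp FROM hvac WHERE buildingID = '16'"
result = conn.execute(query).fetchall()
for row in result:
    print(row)

conn.close()
```

Only the rows whose buildingID equals the string '16' come back. The comparison is against a string rather than a number because the schema in step 5 declares buildingID as a StringType.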

