Apache Spark

Apache Spark is an open-source cluster-computing framework, originally developed at the University of California, Berkeley's AMPLab. Spark supports advanced analytics in several programming languages.

Many cloud providers offer big data solutions built on Apache Spark. Here is a tutorial for beginners by Ray, the author of kindidata.com. It shows how SQL queries are executed on Spark using a Jupyter Notebook on HDInsight, a cloud-based big data service provided by Microsoft.


Running SQL queries on Spark on HDInsight – Ray (Kindidata.com) – YouTube video


Steps and Syntax:

1. Create a cluster and upload the data using Azure Storage Explorer.
2. Launch Jupyter and pick a kernel:

PySpark (for applications written in Python)
Spark (for applications written in Scala)

3. Import the PySpark SQL types:

from pyspark.sql.types import *

4. Load the sample data:

hvacText = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

5. Create the schema:

hvacSchema = StructType([
    StructField("date", StringType(), False),
    StructField("time", StringType(), False),
    StructField("targettemp", IntegerType(), False),
    StructField("actualtemp", IntegerType(), False),
    StructField("buildingID", StringType(), False)
])

6. Parse the data in hvacText:

hvac = hvacText.map(lambda s: s.split(",")) \
               .filter(lambda s: s[0] != "Date") \
               .map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))
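The same split/filter/map logic can be sketched locally with plain Python lists, which makes the column indexing easier to see. The sample rows below are made up, but they assume the HVAC.csv layout in which BuildingID is the seventh column, which is why the code picks s[6]:

```python
# Illustrative sketch only: plain Python stand-in for the Spark pipeline,
# using a fabricated header and data row in the assumed HVAC.csv layout.
rows = [
    "Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID",  # header
    "6/1/13,0:00:01,66,58,13,20,4",
]
parsed = [
    (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6]))
    for s in (line.split(",") for line in rows)
    if s[0] != "Date"  # drop the header row, as the filter() step does
]
print(parsed)  # [('6/1/13', '0:00:01', 66, 58, '4')]
```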

7. Create a data frame:

hvacdf = sqlContext.createDataFrame(hvac, hvacSchema)

8. Register the data frame as a table to run queries against:

hvacdf.registerTempTable("hvac")

9. Run SQL queries:

%%sql
SELECT * FROM hvac

%%sql
SELECT date, time, buildingID, actualtemp FROM hvac WHERE buildingID = "16"
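Outside of the cluster, the shape of that query can be sanity-checked against an in-memory SQLite table using Python's built-in sqlite3 module. The rows below are fabricated for illustration and are not the real sensor data:

```python
# Sanity-check the query outside Spark with stdlib sqlite3
# (illustrative only: made-up sample rows, same column layout as the hvac table).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hvac (date TEXT, time TEXT, targettemp INTEGER, "
    "actualtemp INTEGER, buildingID TEXT)"
)
conn.executemany(
    "INSERT INTO hvac VALUES (?, ?, ?, ?, ?)",
    [
        ("6/1/13", "0:00:01", 66, 58, "16"),
        ("6/1/13", "0:00:02", 70, 68, "4"),
    ],
)
rows = conn.execute(
    "SELECT date, time, buildingID, actualtemp FROM hvac "
    "WHERE buildingID = '16'"
).fetchall()
print(rows)  # [('6/1/13', '0:00:01', '16', 58)]
```

Only the row for building "16" comes back, confirming the filter does what the %%sql version intends.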


