Spark SQL Data Sources#
| Data Source | Description |
|---|---|
| JSON Datasets | Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. |
| Hive Tables | Hive support comes bundled with the Spark library as HiveContext, which inherits from SQLContext. |
| Parquet Files | Parquet is a columnar format supported by many other data processing systems. |
Spark SQL JSON Datasets#
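A minimal sketch of loading a JSON dataset, assuming a file named employee.json with one JSON object per line (a hypothetical sample appears in the Parquet section below). Spark SQL infers the schema at read time, which printSchema() makes visible:

$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val employee = sqlContext.read.json("employee.json")
scala> employee.printSchema()
scala> employee.show()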
Spark SQL Hive Tables#
Start the Spark Shell#
$ spark-shell
Create HiveContext Object#
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Create Table using HiveQL#
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
Load Data into Table using HiveQL#
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
Select Fields from the Table#
scala> val result = sqlContext.sql("FROM employee SELECT id, name, age")
scala> result.show()
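With the hypothetical sample data above, show() would print something along these lines:

+----+-------+---+
|  id|   name|age|
+----+-------+---+
|1201| satish| 25|
|1202|krishna| 28|
|1203|  amith| 39|
+----+-------+---+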
Spark SQL Parquet Files#
Parquet is a columnar format that offers several advantages:

- Columnar storage limits I/O operations.
- Columnar storage can fetch only the specific columns you need to access.
- Columnar storage consumes less space.
- Columnar storage gives better-summarized data and follows type-specific encoding.
The examples in this section use a Parquet file named employee.parquet, generated from a JSON input file as follows.
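Assuming an employee.json with one JSON object per line (hypothetical sample, mirroring the records used above):

{"id":1201,"name":"satish","age":25}
{"id":1202,"name":"krishna","age":28}
{"id":1203,"name":"amith","age":39}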
$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val employee = sqlContext.read.json("employee.json")
scala> employee.write.parquet("employee.parquet")
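Spark writes the Parquet output as a directory of part files rather than a single file, which you can verify from the shell: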
$ cd employee.parquet/
$ ls
part-00000-86c92cba-7c98-45fc-9278-099c34214f4b-c000.snappy.parquet _SUCCESS
Open Spark Shell#
$ spark-shell
Create SQLContext Object#
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read Input from Parquet File#
scala> val parqfile = sqlContext.read.parquet("employee.parquet")
Register the DataFrame as a Temporary Table#
scala> parqfile.registerTempTable("employee")
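The temporary table is scoped to this SQLContext and disappears when the session ends; once registered, the DataFrame can be queried with ordinary SQL.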
Select Query on DataFrame#
scala> val allrecords = sqlContext.sql("SELECT * FROM employee")
scala> allrecords.show()