Spark SQL Data Sources
| Data Source | Description |
|---|---|
| JSON Datasets | Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. |
| Hive Tables | Hive support is bundled with Spark as HiveContext, which inherits from SQLContext. |
| Parquet Files | Parquet is a columnar format supported by many data processing systems. |
Spark SQL JSON Datasets
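Spark SQL infers the schema of a JSON dataset at read time, so no DDL is needed. A minimal sketch for spark-shell (where `sc` is predefined), assuming an `employee.json` file with one JSON object per line:

```scala
// Read a JSON dataset; Spark SQL infers the schema automatically.
// Assumes employee.json contains records like:
//   {"id":1201, "name":"satish", "age":25}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val employee = sqlContext.read.json("employee.json")

employee.printSchema()   // show the inferred schema (age, id, name)
employee.show()          // display the records as a DataFrame
```

The inferred DataFrame can then be queried with SQL or the DataFrame API, as shown in the sections below.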
Spark SQL Hive Tables
Start the Spark Shell
$ spark-shell
Create SQLContext Object
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
Create Table using HiveQL
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
Load Data into Table using HiveQL
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO TABLE employee")
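The LOAD DATA step above assumes a local text file whose layout matches the DDL: fields terminated by `,` and lines terminated by `\n`. Hypothetical contents for `employee.txt` might look like:

```
1201,satish,25
1202,krishna,28
1203,amith,39
```

Each line becomes one row of the `employee` table, with the three comma-separated values mapped to `id`, `name`, and `age`.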
Select Fields from the Table
scala> val result = sqlContext.sql("SELECT id, name, age FROM employee")
scala> result.show()
Spark SQL Parquet Files
Parquet is a columnar storage format. Its advantages include:
- Columnar storage limits I/O operations to the data a query actually needs.
- Columnar storage can fetch only the specific columns you need to access.
- Columnar storage consumes less space.
- Columnar storage enables efficient summarization and uses type-specific encoding.
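Column pruning can be observed directly in Spark SQL: selecting a subset of columns from a Parquet file scans only those column chunks on disk. A sketch for spark-shell, assuming an `employee.parquet` file like the one created in this section:

```scala
// Parquet stores each column separately, so this scan reads only
// the "name" column chunks and skips "id" and "age" entirely.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val names = sqlContext.read.parquet("employee.parquet").select("name")
names.show()
```

With a row-oriented format such as the delimited text file used earlier, the same query would have to read every full record before discarding the unused fields.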
Create employee.parquet from a JSON File
$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val employee = sqlContext.read.json("employee.json")
scala> employee.write.parquet("employee.parquet")
$ cd employee.parquet/
$ ls
part-00000-86c92cba-7c98-45fc-9278-099c34214f4b-c000.snappy.parquet _SUCCESS
Open Spark Shell
$ spark-shell
Create SQLContext Object
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Read Input from Parquet File
scala> val parqfile = sqlContext.read.parquet("employee.parquet")
Register the DataFrame as a Temporary Table
scala> parqfile.registerTempTable("employee")
Select Query on DataFrame
scala> val allrecords = sqlContext.sql("SELECT * FROM employee")
scala> allrecords.show()
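The same query can also be expressed with the DataFrame API instead of SQL. A sketch continuing the spark-shell session above, using the `parqfile` DataFrame that was registered as the temporary table:

```scala
// Equivalent of SELECT * FROM employee, via the DataFrame API
parqfile.select("*").show()

// Select specific columns and filter rows, without writing SQL
parqfile.select("id", "name").filter(parqfile("age") > 25).show()
```

Both styles produce the same execution plan; the DataFrame API is often preferred in program code because column references are checked at analysis time rather than embedded in SQL strings.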