Spark SQL数据加载和保存实战

水莹儿 · 发表于 2018-10-20 11:03:08

　　一：前置知识详解：
　　Spark SQL重要是操作DataFrame，DataFrame本身提供了save和load的操作，
　　Load：可以创建DataFrame，
　　Save：把DataFrame中的数据保存到文件或者说与具体的格式来指明我们要读取的文件的类型以及与具体的格式来指出我们要输出的文件是什么类型。
　　二：Spark SQL读写数据代码实战：

http://common.cnblogs.com/images/copycode.gif

import org.apache.spark.SparkConf;import org.apache.spark.api.java.JavaRDD;import org.apache.spark.api.java.JavaSparkContext;import org.apache.spark.api.java.function.Function;import org.apache.spark.sql.*;import org.apache.spark.sql.types.DataTypes;import org.apache.spark.sql.types.StructField;import org.apache.spark.sql.types.StructType;import java.util.ArrayList;import java.util.List;public class SparkSQLLoadSaveOps { public static void main(String[] args) {　　
      SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLLoadSaveOps");
　　
      JavaSparkContext sc = new JavaSparkContext(conf);
　　
      SQLContext = new SQLContext(sc);       /**
　　
      * read()是DataFrameReader类型，load可以将数据读取出来       */
　　
      DataFrame peopleDF = sqlContext.read().format("json").load("E:\\Spark\\Sparkinstanll_package\\Big_Data_Software\\spark-1.6.0-bin-hadoop2.6\\examples\\src\\main\\resources\\people.json");       /**
　　
      * 直接对DataFrame进行操作
　　
      * Json: 是一种自解释的格式，读取Json的时候怎么判断其是什么格式？
　　
      * 通过扫描整个Json。扫描之后才会知道元数据       */
　　
      //通过mode来指定输出文件的是append。创建新文件来追加文件
　　
peopleDF.select("name").write().mode(SaveMode.Append).save("E:\\personNames");
　　
}
　　
}

http://common.cnblogs.com/images/copycode.gif

　　读取过程源码分析如下：
　　1. read方法返回DataFrameReader，用于读取数据。

http://common.cnblogs.com/images/copycode.gif

`DataFrameReader` that can be used to read data in as a `DataFrame`. * {{{ * sqlContext.read.parquet("/path/to/file.parquet") * sqlContext.read.schema(schema).json("/path/to/file.json") * }}} *　　
* @group genericdata * @since 1.4.0
　　
*/@Experimental//创建DataFrameReader实例，获得了DataFrameReader引用def read: DataFrameReader = new DataFrameReader(this)

http://common.cnblogs.com/images/copycode.gif

2. 然后再调用DataFrameReader类中的format，指出读取文件的格式。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Specifies the input data source format.
　　
*
　　
* @since 1.4.0 */def format(source: String): DataFrameReader = { this.source = source this}

http://common.cnblogs.com/images/copycode.gif

3. 通过DtaFrameReader中load方法通过路径把传入过来的输入变成DataFrame。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Loads input in as a `DataFrame`, for data sources that require a path (e.g. data backed by
　　
* a local or distributed file system).
　　
*
　　
* @since 1.4.0 */// TODO: Remove this one in Spark 2.0.def load(path: String): DataFrame = {
　　
option("path", path).load()
　　
}

http://common.cnblogs.com/images/copycode.gif

　　至此，数据的读取工作就完成了，下面就对DataFrame进行操作。
　　下面就是写操作！！！
　　1. 调用DataFrame中select函数进行对列筛选

http://common.cnblogs.com/images/copycode.gif

/**　　
* Selects a set of columns. This is a variant of `select` that can only select
　　
* existing columns using column names (i.e. cannot construct expressions).
　　
*
　　
* {{{
　　
* // The following two are equivalent:
　　
* df.select("colA", "colB")
　　
* df.select($"colA", $"colB")
　　
* }}}
　　
* @group dfops
　　
* @since 1.3.0 */@scala.annotation.varargs
　　
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)

http://common.cnblogs.com/images/copycode.gif

2. 然后通过write将结果写入到外部存储系统中。

http://common.cnblogs.com/images/copycode.gif

/**　　
* :: Experimental ::
　　
* Interface for saving the content of the `DataFrame` out into external storage.
　　
*
　　
* @group output
　　
* @since 1.4.0 */@Experimental
　　
def write: DataFrameWriter = new DataFrameWriter(this)

http://common.cnblogs.com/images/copycode.gif

3. 在保持文件的时候mode指定追加文件的方式

http://common.cnblogs.com/images/copycode.gif

/**　　
* Specifies the behavior when data or table already exists. Options include:
　　
// Overwrite是覆盖
　　
* - `SaveMode.Overwrite`: overwrite the existing data.
　　
//创建新的文件，然后追加
　　
* - `SaveMode.Append`: append the data.
　　
* - `SaveMode.Ignore`: ignore the operation (i.e. no-op).
　　
* - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime.
　　
*
　　
* @since 1.4.0 */def mode(saveMode: SaveMode): DataFrameWriter = { this.mode = saveMode this}

http://common.cnblogs.com/images/copycode.gif

4. 最后，save()方法触发action，将文件输出到指定文件中。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Saves the content of the `DataFrame` at the specified path.
　　
*
　　
* @since 1.4.0 */def save(path: String): Unit = { this.extraOptions += ("path" -> path)
　　
save()
　　
}

http://common.cnblogs.com/images/copycode.gif

　　三：Spark SQL读写整个流程图如下：

http://images2015.cnblogs.com/blog/908351/201704/908351-20170420212546493-1985800116.png

　　四：对于流程中部分函数源码详解：
　　DataFrameReader.Load（）
　　1. Load（）返回DataFrame类型的数据集合，使用的数据是从默认的路径读取。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Returns the dataset stored at path as a DataFrame,
　　
* using the default data source configured by spark.sql.sources.default.
　　
*
　　
* @group genericdata
　　
* @deprecated As of 1.4.0, replaced by `read().load(path)`. This will be removed in Spark 2.0. */@deprecated("Use read.load(path). This will be removed in Spark 2.0.", "1.4.0")
　　
def load(path: String): DataFrame = {//此时的read就是DataFrameReader read.load(path)
　　
}

http://common.cnblogs.com/images/copycode.gif

2. 追踪load源码进去，源码如下：　　在DataFrameReader中的方法。Load()通过路径把输入传进来变成一个DataFrame。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Loads input in as a `DataFrame`, for data sources that require a path (e.g. data backed by
　　
* a local or distributed file system).
　　
*
　　
* @since 1.4.0 */// TODO: Remove this one in Spark 2.0.def load(path: String): DataFrame = {
　　
option("path", path).load()
　　
}

http://common.cnblogs.com/images/copycode.gif

3. 追踪load源码如下：

http://common.cnblogs.com/images/copycode.gif

/**　　
* Loads input in as a `DataFrame`, for data sources that don't require a path (e.g. external
　　
* key-value stores).
　　
*
　　
* @since 1.4.0 */def load(): DataFrame = {//对传入的Source进行解析
　　
val resolved = ResolvedDataSource(
　　
sqlContext,
　　
userSpecifiedSchema = userSpecifiedSchema,
　　
partitionColumns = Array.empty[String],
　　
provider = source,
　　
options = extraOptions.toMap)
　　
DataFrame(sqlContext, LogicalRelation(resolved.relation))
　　
}

http://common.cnblogs.com/images/copycode.gif

　　DataFrameReader.format()
　　1. Format：具体指定文件格式，这就获得一个巨大的启示是：如果是Json文件格式可以保持为Parquet等此类操作。
　　Spark SQL在读取文件的时候可以指定读取文件的类型。例如，Json,Parquet.

http://common.cnblogs.com/images/copycode.gif

/**　　
* Specifies the input data source format.Built-in options include “parquet”,”json”,etc.
　　
*
　　
* @since 1.4.0 */def format(source: String): DataFrameReader = { this.source = source //FileType
　　
this}

http://common.cnblogs.com/images/copycode.gif

　　DataFrame.write()
　　1. 创建DataFrameWriter实例

http://common.cnblogs.com/images/copycode.gif

/**　　
* :: Experimental ::
　　
* Interface for saving the content of the `DataFrame` out into external storage.
　　
*
　　
* @group output
　　
* @since 1.4.0 */@Experimental
　　
def write: DataFrameWriter = new DataFrameWriter(this)

http://common.cnblogs.com/images/copycode.gif

2. 追踪DataFrameWriter源码如下：　　以DataFrame的方式向外部存储系统中写入数据。

http://common.cnblogs.com/images/copycode.gif

/**　　
* :: Experimental ::
　　
* Interface used to write a `DataFrame` to external storage systems (e.g. file systems,
　　
* key-value stores, etc). Use `DataFrame`.`write` to access this.
　　
*
　　
* @since 1.4.0 */@Experimentalfinal class DataFrameWriter private[sql](df: DataFrame) {

http://common.cnblogs.com/images/copycode.gif

　　DataFrameWriter.mode()
　　1. Overwrite是覆盖，之前写的数据全都被覆盖了。
　　Append:是追加，对于普通文件是在一个文件中进行追加，但是对于parquet格式的文件则创建新的文件进行追加。

http://common.cnblogs.com/images/copycode.gif

**　　
* Specifies the behavior when data or table already exists. Options include: * - `SaveMode.Overwrite`: overwrite the existing data. * - `SaveMode.Append`: append the data. * - `SaveMode.Ignore`: ignore the operation (i.e. no-op).//默认操作
　　
* - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. *
　　
* @since 1.4.0
　　
*/def mode(saveMode: SaveMode): DataFrameWriter = { this.mode = saveMode this}

http://common.cnblogs.com/images/copycode.gif

2. 通过模式匹配接收外部参数

http://common.cnblogs.com/images/copycode.gif

/**　　
* Specifies the behavior when data or table already exists. Options include:
　　
* - `overwrite`: overwrite the existing data.
　　
* - `append`: append the data.
　　
* - `ignore`: ignore the operation (i.e. no-op).
　　
* - `error`: default option, throw an exception at runtime.
　　
*
　　
* @since 1.4.0 */def mode(saveMode: String): DataFrameWriter = {  this.mode = saveMode.toLowerCase match { case "overwrite" => SaveMode.Overwrite case "append" => SaveMode.Append case "ignore" => SaveMode.Ignore case "error" | "default" => SaveMode.ErrorIfExists case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. " +
　　
   "Accepted modes are 'overwrite', 'append', 'ignore', 'error'.")
　　
  }  this}

http://common.cnblogs.com/images/copycode.gif

　　
1. save将结果保存传入的路径。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Saves the content of the `DataFrame` at the specified path.
　　
*
　　
* @since 1.4.0 */def save(path: String): Unit = { this.extraOptions += ("path" -> path)
　　
save()
　　
}

http://common.cnblogs.com/images/copycode.gif

2. 追踪save方法。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Saves the content of the `DataFrame` as the specified table.
　　
*
　　
* @since 1.4.0 */def save(): Unit = {
　　
ResolvedDataSource(
　　
df.sqlContext,
　　
source,
　　
partitioningColumns.map(_.toArray).getOrElse(Array.empty[String]),
　　
mode,
　　
extraOptions.toMap,
　　
df)
　　
}

http://common.cnblogs.com/images/copycode.gif

3.  其中source是SQLConf的defaultDataSourceNameprivate var source: String = df.sqlContext.conf.defaultDataSourceName　　其中DEFAULT_DATA_SOURCE_NAME默认参数是parquet。
// This is used to set the default data sourceval DEFAULT_DATA_SOURCE_NAME = stringConf("spark.sql.sources.default",　　
  defaultValue = Some("org.apache.spark.sql.parquet"),
　　
  doc = "The default data source to use in input/output.")
　　DataFrame.Scala中部分函数详解：
　　1. toDF函数是将RDD转换成DataFrame

http://common.cnblogs.com/images/copycode.gif

**　　
* Returns the object itself. * @group basic * @since 1.3.0
　　
*/// This is declared with parentheses to prevent the Scala compiler from treating// `rdd.toDF("1")` as invoking this toDF and then apply on the returned DataFrame.def toDF(): DataFrame = this

http://common.cnblogs.com/images/copycode.gif

2. show()方法：将结果显示出来

http://common.cnblogs.com/images/copycode.gif

/**　　
* Displays the `DataFrame` in a tabular form. For example:
　　
* {{{
　　
* year  month AVG('Adj Close) MAX('Adj Close)
　　
* 1980  12 0.503218       0.595103
　　
* 1981  01 0.523289       0.570307
　　
* 1982  02 0.436504       0.475256
　　
* 1983  03 0.410516       0.442194
　　
* 1984  04 0.450090       0.483521
　　
* }}}
　　
* @param numRows Number of rows to show
　　
* @param truncate Whether truncate long strings. If true, strings more than 20 characters will
　　
*             be truncated and all cells will be aligned right
　　
*
　　
* @group action
　　
* @since 1.5.0 */// scalastyle:off printlndef show(numRows: Int, truncate: Boolean): Unit = println(showString(numRows, truncate))// scalastyle:on println

http://common.cnblogs.com/images/copycode.gif

　　追踪showString源码如下：showString中触发action收集数据。

http://common.cnblogs.com/images/copycode.gif

/**　　
* Compose the string representing rows for output
　　
* @param _numRows Number of rows to show
　　
* @param truncate Whether truncate long strings and align cells right */private[sql] def showString(_numRows: Int, truncate: Boolean = true): String = {
　　
  val numRows = _numRows.max(0)
　　
  val sb = new StringBuilder
　　
  val takeResult = take(numRows + 1)
　　
  val hasMoreData = takeResult.length > numRows
　　
  val data = takeResult.take(numRows)
　　
  val numCols = schema.fieldNames.length

http://common.cnblogs.com/images/copycode.gif

账号		自动登录	找回密码
密码			立即注册

大疆运维招人啦，

C++ :try 语句块和异常处理

C++的多态

Red Hat RHCE 8 (EX294) Cert Guide

Java/C++ 区别：看完这一篇，就够用！

别再用过时库了！这 13 个顶级 C++ 库才是

c++ size_t 和 int 的区别

[经验分享] Spark SQL数据加载和保存实战

浏览过的版块

扫码加入运维网微信交流群