A Spark Schema defines the data types of a DataFrame; you can print it by calling the printSchema method. By default, Spark automatically infers the data types of the data it loads.
1. Schema
StructType was already mentioned when we introduced DataType; in what follows we mainly use it to build a Schema. Put simply, a Schema is the set of data types defined up front; the data is then loaded to fill it.
2. Building a Schema with StructType & StructField
StructType and StructField are defined as case classes:
case class StructType(fields: Array[StructField])
case class StructField(
name: String,
dataType: DataType,
nullable: Boolean = true,
metadata: Metadata = Metadata.empty)
Below is a simple example of using StructType to build the Schema of a DataFrame.
import org.apache.spark.sql.types.{IntegerType,StringType,StructType,StructField}
import org.apache.spark.sql.{Row, SparkSession}
val simpleData = Seq(Row("James","","Smith","36636","M",3000),
Row("Michael","Rose","","40288","M",4000),
Row("Robert","","Williams","42114","M",4000),
Row("Maria","Anne","Jones","39192","F",4000),
Row("Jen","Mary","Brown","","F",-1)
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("id", StringType, true),
StructField("gender", StringType, true),
StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(simpleData),simpleSchema)
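An explicit schema is just as useful when reading files, since it lets Spark skip type inference. A minimal sketch, assuming a hypothetical CSV file people.csv whose columns match simpleSchema in order:

```scala
// Sketch: apply the explicit schema when reading a CSV file.
// "people.csv" is a hypothetical path; its columns must line up with simpleSchema.
val dfFromFile = spark.read
  .schema(simpleSchema)        // use our declared types instead of inferring them
  .option("header", "true")
  .csv("people.csv")
```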
3. Printing the Schema with printSchema
Call the DataFrame's built-in printSchema method to output the Schema; in the spark-shell it is displayed as a tree.
df.printSchema()
df.show()
The printed output looks like this:
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| James| | Smith|36636| M| 3000|
| Michael| Rose| |40288| M| 4000|
| Robert| |Williams|42114| M| 4000|
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| | F| -1|
+---------+----------+--------+-----+------+------+
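Recent Spark versions can also build the same schema from a DDL-style string via StructType.fromDDL, which is often more compact than nesting StructField calls. A minimal sketch (the equivalence to simpleSchema above is the point, not a required style):

```scala
import org.apache.spark.sql.types.StructType

// Sketch: the same flat schema expressed as a DDL string.
val ddlSchema = StructType.fromDDL(
  "firstname STRING, middlename STRING, lastname STRING, " +
  "id STRING, gender STRING, salary INT")
```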
4. Building a Nested Struct
Nested Structs come up frequently in real-world work.
The example below uses name as an illustration: each person has a firstname, middlename, and lastname.
val structureData = Seq(
Row(Row("James","","Smith"),"36636","M",3100),
Row(Row("Michael","Rose",""),"40288","M",4300),
Row(Row("Robert","","Williams"),"42114","M",1400),
Row(Row("Maria","Anne","Jones"),"39192","F",5500),
Row(Row("Jen","Mary","Brown"),"","F",-1)
)
val structureSchema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("id",StringType)
.add("gender",StringType)
.add("salary",IntegerType)
val df2 = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),structureSchema)
df2.printSchema()
df2.show()
Printing the Schema shows the nesting:
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)
+--------------------+-----+------+------+
| name| id|gender|salary|
+--------------------+-----+------+------+
| [James, , Smith]|36636| M| 3100|
| [Michael, Rose, ]|40288| M| 4300|
| [Robert, , Willi...|42114| M| 1400|
| [Maria, Anne, Jo...|39192| F| 5500|
| [Jen, Mary, Brown]| | F| -1|
+--------------------+-----+------+------+
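Nested fields can be addressed with dot notation, so you can flatten the inner name struct back into plain columns. A short sketch against df2 above:

```scala
// Select nested fields with dot notation; the inner struct is flattened.
df2.select("name.firstname", "name.lastname", "salary").show()
```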
5. Checking for a Field
Once a DataFrame has many columns, checking one by one whether it contains a given column is impractical; you can test for its presence with the following methods.
println(df.schema.fieldNames.contains("firstname"))
println(df.schema.contains(StructField("firstname",StringType,true)))
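These checks are handy for guarding a query against a missing column instead of letting it fail. A minimal sketch using the contains check from above:

```scala
// Sketch: only select the column if the schema actually contains it.
if (df.schema.fieldNames.contains("firstname")) {
  df.select("firstname").show()
} else {
  println("column firstname not found")
}
```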