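This series has not been updated in a long time. In one word: lazy. In two words: very lazy. This post is a simple one; the main API it covers is the drop function.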
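import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Sample rows: first name, middle name, last name, id, location, salary
val structureData = Seq(
  Row("James", "", "Smith", "36636", "NewYork", 3100),
  Row("Michael", "Rose", "", "40288", "California", 4300),
  Row("Robert", "", "Williams", "42114", "Florida", 1400),
  Row("Maria", "Anne", "Jones", "39192", "Florida", 5500),
  Row("Jen", "Mary", "Brown", "34561", "NewYork", 3000)
)

// Schema matching the rows above
val structureSchema = new StructType()
  .add("firstname", StringType)
  .add("middlename", StringType)
  .add("lastname", StringType)
  .add("id", StringType)
  .add("location", StringType)
  .add("salary", IntegerType)

// `spark` is the SparkSession (predefined in spark-shell)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(structureData), structureSchema)
df.printSchema()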
Speaking of the drop API, let's first look at its overloaded forms:
1) drop(colName : scala.Predef.String) : org.apache.spark.sql.DataFrame
2) drop(colNames : scala.Predef.String*) : org.apache.spark.sql.DataFrame
3) drop(col : org.apache.spark.sql.Column) : org.apache.spark.sql.DataFrame
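The first form takes a single string, which is the name of the DataFrame column to remove.
The second differs from the first in that it accepts several names, so multiple columns can be dropped in one call.
The third form takes a Column object as its argument rather than a name.
First, an example of dropping a single column: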
import org.apache.spark.sql.functions.col  // needed for col(...)

val df2 = df.drop("firstname")             // first signature: column name
df2.printSchema()
df.drop(df("firstname")).printSchema()     // third signature: Column taken from df
df.drop(col("firstname")).printSchema()    // third signature: Column built with col(...)
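Two details are easy to miss in the single-column example: drop returns a new DataFrame and leaves the original untouched, and dropping by a name that is not in the schema is a no-op rather than an error. Below is a minimal sketch of both, assuming Spark is on the classpath; the local session, the object name DropBasicsSketch, and the tiny people DataFrame are made up for illustration and are not part of the example above.

```scala
import org.apache.spark.sql.SparkSession

object DropBasicsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for this illustration
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-basics-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical small DataFrame standing in for the one in the post
    val people = Seq(("James", "Smith", 3100), ("Maria", "Jones", 5500))
      .toDF("firstname", "lastname", "salary")

    // drop returns a new DataFrame; `people` itself keeps all its columns
    val trimmed = people.drop("firstname")
    println(trimmed.columns.mkString(","))   // lastname,salary
    println(people.columns.mkString(","))    // firstname,lastname,salary

    // Dropping a name that is not in the schema changes nothing
    val unchanged = people.drop("no_such_column")
    println(unchanged.columns.mkString(",")) // firstname,lastname,salary

    spark.stop()
  }
}
```

The no-op behaviour is convenient in pipelines where a column may or may not be present, but it also means a typo in a column name passes silently, so checking columns afterwards can be worthwhile.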
Dropping multiple columns:
// Referring to more than one column
df.drop("firstname", "middlename", "lastname")
  .printSchema()
// Using an array/sequence of column names
val cols = Seq("firstname", "middlename", "lastname")
df.drop(cols: _*)
  .printSchema()
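The varargs form only accepts column names. Among the three overloads listed earlier, the Column form takes a single column, so removing several Column objects needs a little extra work. One possible approach is to fold over the columns and drop them one at a time. The helper dropAll, the object name, and the sample data below are hypothetical names for this sketch, not part of Spark.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DropColumnsSketch {
  // Hypothetical helper: applies drop(col: Column) once per column
  def dropAll(df: DataFrame, columns: Seq[Column]): DataFrame =
    columns.foldLeft(df)((acc, c) => acc.drop(c))

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for this illustration
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical small DataFrame standing in for the one in the post
    val people = Seq(("James", "", "Smith", 3100), ("Maria", "Anne", "Jones", 5500))
      .toDF("firstname", "middlename", "lastname", "salary")

    val result = dropAll(people, Seq(col("firstname"), col("middlename")))
    println(result.columns.mkString(","))  // lastname,salary

    spark.stop()
  }
}
```

When the columns are known by name, the drop(cols: _*) form is simpler; the fold is mainly useful when the code already holds Column objects.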