Environment
Scala 2.12.x, Spark 3.2
Issue
When processing a DataFrame with a UDF, an implicit type conversion happens silently instead of raising an error.
Code
package org.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

/**
 * Hello world!
 */
object App {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("Java Spark SQL basic example")
      .config("spark.master", "local")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    val simpleData = Seq(
      ("James", "Sales", "NY", 90000, 34, 10000),
      ("Michael", "Sales", "NY", 86000, 56, 20000),
      ("Robert", "Sales", "CA", 81000, 30, 23000),
      ("Maria", "Finance", "CA", 90000, 24, 23000),
      ("Raman", "Finance", "CA", 99000, 40, 24000),
      ("Scott", "Finance", "NY", 83000, 36, 19000),
      ("Jen", "Finance", "NY", 79000, 53, 15000),
      ("Jeff", "Marketing", "CA", 80000, 25, 18000),
      ("Kumar", "Marketing", "NY", 91000, 50, 21000)
    )
    val df = simpleData.toDF("employee_name", "department", "state", "salary", "age", "bonus")
    // df.show()
    df.printSchema()

    // The UDF is declared to take a String.
    def test(st: String): String = {
      st + "test"
    }
    val testudf = udf(test _)

    // Pass the Int column "age" to the String UDF: no error is raised.
    val df1 = df.withColumn("test", testudf(col("age")))
    df1.printSchema()
    df1.show()

    // Plain column arithmetic: Int column + String column.
    val df2 = df.withColumn("test_state", col("age") + col("state"))
    df2.printSchema()
    df2.show()

    // test
    //
    // def test(st: String): String = {
    //   st + "test"
    // }
    //
    // println(test(50))  // does not compile in plain Scala: type mismatch
  }
}
The output is as follows:
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- state: string (nullable = true)
|-- salary: integer (nullable = false)
|-- age: integer (nullable = false)
|-- bonus: integer (nullable = false)
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- state: string (nullable = true)
|-- salary: integer (nullable = false)
|-- age: integer (nullable = false)
|-- bonus: integer (nullable = false)
|-- test: string (nullable = true)
+-------------+----------+-----+------+---+-----+------+
|employee_name|department|state|salary|age|bonus| test|
+-------------+----------+-----+------+---+-----+------+
| James| Sales| NY| 90000| 34|10000|34test|
| Michael| Sales| NY| 86000| 56|20000|56test|
| Robert| Sales| CA| 81000| 30|23000|30test|
| Maria| Finance| CA| 90000| 24|23000|24test|
| Raman| Finance| CA| 99000| 40|24000|40test|
| Scott| Finance| NY| 83000| 36|19000|36test|
| Jen| Finance| NY| 79000| 53|15000|53test|
| Jeff| Marketing| CA| 80000| 25|18000|25test|
| Kumar| Marketing| NY| 91000| 50|21000|50test|
+-------------+----------+-----+------+---+-----+------+
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- state: string (nullable = true)
|-- salary: integer (nullable = false)
|-- age: integer (nullable = false)
|-- bonus: integer (nullable = false)
|-- test_state: double (nullable = true)
+-------------+----------+-----+------+---+-----+----------+
|employee_name|department|state|salary|age|bonus|test_state|
+-------------+----------+-----+------+---+-----+----------+
| James| Sales| NY| 90000| 34|10000| null|
| Michael| Sales| NY| 86000| 56|20000| null|
| Robert| Sales| CA| 81000| 30|23000| null|
| Maria| Finance| CA| 90000| 24|23000| null|
| Raman| Finance| CA| 99000| 40|24000| null|
| Scott| Finance| NY| 83000| 36|19000| null|
| Jen| Finance| NY| 79000| 53|15000| null|
| Jeff| Marketing| CA| 80000| 25|18000| null|
| Kumar| Marketing| NY| 91000| 50|21000| null|
+-------------+----------+-----+------+---+-----+----------+
From the output above, the test column really is the result of an int + string operation: the UDF is declared to take a String, yet passing the Int column age still produces a normal result.
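To see what Spark actually does with the Int argument, one option is to print the extended query plan and check whether the analyzer inserts a cast in front of the UDF call. This is only a sketch; it reuses df and testudf from the code above, and explain(true) just dumps the parsed/analyzed/optimized/physical plans for inspection.

// Inspect the plans for the UDF call on the Int column.
df.withColumn("test", testudf(col("age"))).explain(true)

// Same check for the plain int + string column expression.
df.withColumn("test_state", col("age") + col("state")).explain(true)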
Hypothesis: it is related to Scala 2.12.
Verification: the test function written in the comment at the end of the code does not compile on its own; using the conversion directly in plain Scala fails, so this does not look like a Scala-level implicit conversion.
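For reference, this is roughly what that commented-out check looks like in a plain Scala 2.12 REPL (error message paraphrased):

scala> def test(st: String): String = st + "test"
scala> test(50)
// error: type mismatch;
//  found   : Int(50)
//  required: String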
Hypothesis: it is related to the Spark 3 upgrade.
Verification: a colleague previously hit a similar case on Spark 2.4, where it failed outright with an error, but the same thing works fine after switching to 3.2.
TODO
Look into whether this is caused by changes in the Spark 3 SQL module.
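In the meantime, a way to avoid depending on the silent conversion at all is to make the types explicit on either side of the UDF call. The snippet below is only illustrative (dfExplicitCast, intUdf and dfTypedUdf are made-up names), reusing df, test and testudf from the code above:

// Option 1: cast the column explicitly before calling the UDF,
// so the int-to-string conversion is visible in the code.
val dfExplicitCast = df.withColumn("test", testudf(col("age").cast("string")))

// Option 2: declare the UDF with the type the column actually has.
val intUdf = udf((age: Int) => age + "test")
val dfTypedUdf = df.withColumn("test", intUdf(col("age")))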