Parse XML with PySpark in Databricks
It's kind of a trick title, but here's the answer: don't. Just don't do it. Python is no good here - you might as well drop into Scala for this one [edit: foreach/foreachBatch should actually be pretty good here - I'll add a sample later].
My issue was that I needed to parse XML coming in through an Event Hub stream - that is, not from a file. The library Databricks wrote for XML parsing (spark-xml) is optimized for reading directly from files, so this is a little trickier than you'd think.
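For context, spark-xml's front door is a file-based DataFrame reader: point it at a path and a rowTag and it does the rest. A rough sketch of that usual usage (the rowTag and path here are made-up placeholders, not from my pipeline):

%scala
// the usual spark-xml pattern: read XML files straight from storage
// (rowTag and path are placeholders for illustration only)
val usersDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "user")
  .load("/mnt/raw/users.xml")

Which is great, but not much help when the XML shows up as a string column in a stream instead of as files sitting in storage.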
I found this little snippet online somewhere.
%scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import com.databricks.spark.xml.XmlReader
// cast the binary stream body to the XML-containing string
val stream = spark.read
  .format("delta")
  .load("xml.delta")
  .selectExpr("CAST(body AS STRING)")
  .as[String]

// hand the XML strings to XmlReader as an RDD; it infers a schema and returns a DataFrame
val df = (new XmlReader()).xmlRdd(sqlContext, stream.rdd) // <-- This is the magic line
  .withColumn("user", explode(col("users")))
  .withColumn("action", explode(col("user.actions")))
  .withColumn("action_name", col("action._actionName"))
As noted above, the magic line is really (new XmlReader()).xmlRdd(sqlContext, stream.rdd), which dumps the XML strings into an RDD and hands that straight to the XmlReader, which parses it into a DataFrame. Fantastic. Forgive any atrocious Scala.
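And since the edit note up top mentions foreach/foreachBatch: here's a rough sketch of what that could look like. Inside foreachBatch each micro-batch is a plain DataFrame again, so the same XmlReader-on-an-RDD trick applies. The delta path and body column are borrowed from the snippet above; the output path and checkpoint location are placeholders I made up, so treat this as a sketch rather than a tested pipeline.

%scala
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.DataFrame
import spark.implicits._

// read the same delta landing table, but as a stream this time
val rawStream = spark.readStream
  .format("delta")
  .load("xml.delta")
  .selectExpr("CAST(body AS STRING) AS body")

// inside foreachBatch each micro-batch is an ordinary DataFrame,
// so the XmlReader/RDD trick from above works unchanged
val parseBatch = (batch: DataFrame, batchId: Long) => {
  val parsed = new XmlReader().xmlRdd(spark.sqlContext, batch.select("body").as[String].rdd)
  parsed.write.format("delta").mode("append").save("xml_parsed.delta") // placeholder output path
}

rawStream.writeStream
  .foreachBatch(parseBatch)
  .option("checkpointLocation", "/tmp/xml_stream_checkpoint") // placeholder
  .start()

From there, the same explode / withColumn chain from the batch snippet can run on parsed before the write.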