2017-10-19

Spark SQL UNION - ORDER BY column not in SELECT

I'm doing a UNION of two temp tables and trying to order the result by a column, but Spark complains that the column I'm ordering by cannot be resolved. Is this a bug, or am I missing something?

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructType}

    lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()

    val oldOrders = Seq(
      Seq("old_order_id1", "old_order_name1", "true"),
      Seq("old_order_id2", "old_order_name2", "true")
    )

    val newOrders = Seq(
      Seq("new_order_id1", "new_order_name1", "false"),
      Seq("new_order_id2", "new_order_name2", "false")
    )

    val schema = new StructType()
      .add("id", StringType)
      .add("name", StringType)
      .add("is_old", StringType)

    val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
    val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)

    // register the temp tables that the queries below refer to
    oldOrdersDF.createOrReplaceTempView("old_orders")
    newOrdersDF.createOrReplaceTempView("new_orders")

    //ordering by column not in select works if I'm not doing UNION
    spark.sql(
      """
        |SELECT oo.id, oo.name FROM old_orders oo
        |ORDER BY oo.is_old
      """.stripMargin).show()

    //ordering by column not in select doesn't work as I'm doing a UNION
    spark.sql(
      """
        |SELECT oo.id, oo.name FROM old_orders oo
        |UNION
        |SELECT no.id, no.name FROM new_orders no
        |ORDER BY oo.is_old
      """.stripMargin).show()

The output of the code above is:

|   id|   name| 

org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
   +- Union
      :- Project [id#121, name#122]
      :  +- SubqueryAlias oo
      :     +- SubqueryAlias old_orders
      :        +- LogicalRDD [id#121, name#122, is_old#123]
      +- Project [id#131, name#132]
         +- SubqueryAlias no
            +- SubqueryAlias new_orders
               +- LogicalRDD [id#131, name#132, is_old#133]

So ordering by a column that is not in the SELECT clause works if I'm not doing a UNION, and it fails if I do a UNION of two tables.


// Spark SQL's syntax is very similar to standard SQL, but under the hood of Spark it is all about RDDs/DataFrames.
// The UNION produces a new DataFrame that contains only the selected columns (id and name), and the ORDER BY is applied to that result: in the plan above, Sort sits on top of Distinct/Union.
// So after the UNION we cannot refer to fields of the original tables/DataFrames, such as oo.is_old, unless we selected them.
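
The same behaviour can be reproduced with the DataFrame API, which may make the explanation easier to see. This is only a minimal sketch: the sample rows and the names `unioned` and `attempt` are made up for illustration, and it assumes a local Spark 2.x or later session.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder.master("local[*]").appName("union-order-by-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the same three columns as in the question.
val oldOrdersDF = Seq(("old_order_id1", "old_order_name1", "true")).toDF("id", "name", "is_old")
val newOrdersDF = Seq(("new_order_id1", "new_order_name1", "false")).toDF("id", "name", "is_old")

// SQL UNION corresponds to union (which behaves like UNION ALL) followed by distinct.
val unioned = oldOrdersDF.select("id", "name")
  .union(newOrdersDF.select("id", "name"))
  .distinct()

// Only the selected columns survive the union.
println(unioned.columns.mkString(", ")) // id, name

// Sorting by a column that is not in the union's output is rejected during analysis,
// so the failure is captured here instead of letting the exception propagate.
val attempt = Try(unioned.orderBy("is_old"))
println(attempt.isFailure)
```

Wrapping the call in `Try` is just a convenience for the sketch; in real code you would normally let the `AnalysisException` surface.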

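// How to fix: select is_old in both branches so the ORDER BY can see it,
// sort inside the subquery, then drop the column in the outer SELECT.
    spark.sql(
      """
        |SELECT id, name
        |FROM (
        | SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
        | UNION
        | SELECT no.id, no.name, no.is_old FROM new_orders no
        | ORDER BY oo.is_old
        |) t
      """.stripMargin).show()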
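
For comparison, here is a rough DataFrame API equivalent of the fix, as a sketch rather than a definitive implementation. It assumes the same three-column schema as the question; the sample rows and the name `sorted` are illustrative only. The idea is the same: keep `is_old` through the union, sort on it, and only then project it away.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("union-order-by-fix").getOrCreate()
import spark.implicits._

// Sample data mirroring the question's old_orders and new_orders tables.
val oldOrdersDF = Seq(
  ("old_order_id1", "old_order_name1", "true"),
  ("old_order_id2", "old_order_name2", "true")
).toDF("id", "name", "is_old")
val newOrdersDF = Seq(
  ("new_order_id1", "new_order_name1", "false"),
  ("new_order_id2", "new_order_name2", "false")
).toDF("id", "name", "is_old")

val sorted = oldOrdersDF.select("id", "name", "is_old")
  .union(newOrdersDF.select("id", "name", "is_old")) // union alone behaves like UNION ALL
  .distinct()                                        // together with union: SQL UNION
  .orderBy("is_old")                                 // is_old is still part of the output here
  .select("id", "name")                              // drop the sort column afterwards

sorted.show()
```

One thing to keep in mind with this approach (and with the SQL version): the deduplication now runs over three columns instead of two, so if the same `(id, name)` pair could appear with different `is_old` values, the result may contain more rows than the original two-column UNION would.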
