電資女孩的研究生活: [PySpark] From Pandas to Apache Spark’s DataFrame

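>>> from pyspark.sql import SQLContext
>>> # sc is the SparkContext already provided by the shell or notebook
>>> sqlCtx = SQLContext(sc)
>>> # pandas_df is an existing pandas DataFrame
>>> spark_df = sqlCtx.createDataFrame(pandas_df)
>>> spark_df.show()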

16/06/09 19:24:46 WARN TaskSetManager: Stage 0 contains a task of very large size (8851 KB). The maximum recommended task size is 100 KB.
+-----+---------+-------------------+-----+---------+----+-----+------------+-------------+
|Store|DayOfWeek|               Date|Sales|Customers|Open|Promo|StateHoliday|SchoolHoliday|
+-----+---------+-------------------+-----+---------+----+-----+------------+-------------+
|    1|        5|1438300800000000000| 5263|      555|   1|    1|           0|            1|
|    2|        5|1438300800000000000| 6064|      625|   1|    1|           0|            1|
|    3|        5|1438300800000000000| 8314|      821|   1|    1|           0|            1|
|    4|        5|1438300800000000000|13995|     1498|   1|    1|           0|            1|
|    5|        5|1438300800000000000| 4822|      559|   1|    1|           0|            1|
|    6|        5|1438300800000000000| 5651|      589|   1|    1|           0|            1|
|    7|        5|1438300800000000000|15344|     1414|   1|    1|           0|            1|
|    8|        5|1438300800000000000| 8492|      833|   1|    1|           0|            1|
|    9|        5|1438300800000000000| 8565|      687|   1|    1|           0|            1|
|   10|        5|1438300800000000000| 7185|      681|   1|    1|           0|            1|
|   11|        5|1438300800000000000|10457|     1236|   1|    1|           0|            1|
|   12|        5|1438300800000000000| 8959|      962|   1|    1|           0|            1|
|   13|        5|1438300800000000000| 8821|      568|   1|    1|           0|            0|
|   14|        5|1438300800000000000| 6544|      710|   1|    1|           0|            1|
|   15|        5|1438300800000000000| 9191|      766|   1|    1|           0|            1|
|   16|        5|1438300800000000000|10231|      979|   1|    1|           0|            1|
|   17|        5|1438300800000000000| 8430|      946|   1|    1|           0|            1|
|   18|        5|1438300800000000000|10071|      936|   1|    1|           0|            1|
|   19|        5|1438300800000000000| 8234|      718|   1|    1|           0|            1|
|   20|        5|1438300800000000000| 9593|      974|   1|    1|           0|            0|
+-----+---------+-------------------+-----+---------+----+-----+------------+-------------+
only showing top 20 rows
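
The Date column in the output above is a 19-digit integer rather than a readable date. This is most likely nanoseconds since the Unix epoch, which is how pandas stores datetime64[ns] values internally (that the original pandas column had that dtype is an assumption). The sketch below decodes one such value using only the standard library. The helper name ns_to_date is hypothetical and used for illustration only.

```python
from datetime import datetime, timezone

def ns_to_date(ns_since_epoch):
    """Convert integer nanoseconds since the Unix epoch to a UTC date string."""
    # Integer division drops the sub-second part without float rounding.
    seconds = ns_since_epoch // 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")

# Value copied from the Date column in the output above.
print(ns_to_date(1438300800000000000))  # 2015-07-31
```

2015-07-31 was a Friday, which agrees with DayOfWeek = 5 in every row shown. If the pandas column is indeed datetime64[ns], one possible workaround is to convert it to strings on the pandas side (for example with Series.dt.strftime) before calling createDataFrame. Whether that is needed depends on the Spark version in use.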

[References]

Introducing DataFrames in Apache Spark for Large Scale Data Science
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

From Pandas to Apache Spark's DataFrame
https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html

Converting Pandas DataFrames to Spark DataFrame in Zeppelin
http://stackoverflow.com/questions/32966344/converting-pandas-dataframes-to-spark-dataframe-in-zeppelin