Tuesday, May 17, 2016

[PySpark] Getting started with PySpark

[hadoop@master01 spark-1.6.0]$ cd /opt/spark-1.6.0/python/
[hadoop@master01 python]$ ls
docs  lib  pyspark  run-tests  run-tests.py  test_support
[hadoop@master01 python]$ pyspark

Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
16/05/17 20:10:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)
SparkContext available as sc, HiveContext available as sqlContext.



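Word count example

The session below reads a text file from HDFS, drops the empty lines, splits the rest into words, and lists the ten most frequent words with their counts.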

>>> lines = sc.textFile('hdfs://master01:9000/opt/hadoop-2.7.1/input/text34mb.txt')
>>> lines_nonempty = lines.filter( lambda x: len(x) > 0 )
>>> lines_nonempty.count()
662761                                                                         
>>>
>>> words = lines_nonempty.flatMap(lambda x: x.split())
>>> wordcounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x,y:x+y).map(lambda x:(x[1],x[0])).sortByKey(False)
>>> wordcounts.take(10)                                                        
[(319239, u'the'), (204299, u'of'), (158585, u'and'), (149022, u'to'), (113795, u'a'), (94854, u'in'), (78748, u'I'), (65001, u'that'), (52567, u'his'), (52506, u'was')]
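
To make each step of the pipeline above easier to follow, here is a rough plain-Python sketch of the same logic that runs without a Spark cluster. It is only an illustration, not how Spark executes the job: the function name `word_counts` and the `sample` lines are made up for this example, and `collections.Counter` stands in for the `map` + `reduceByKey` pair.

```python
from collections import Counter

def word_counts(lines):
    """Plain-Python stand-in for the PySpark word count (hypothetical helper)."""
    # filter(lambda x: len(x) > 0): drop empty lines
    nonempty = [line for line in lines if len(line) > 0]
    # flatMap(lambda x: x.split()): one flat list of whitespace-separated words
    words = [word for line in nonempty for word in line.split()]
    # map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y): count each word
    counts = Counter(words)
    # map(lambda x: (x[1], x[0])).sortByKey(False): (count, word), highest count first
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: pair[0], reverse=True)

# Made-up sample input, standing in for the lines of the HDFS file
sample = ["the cat and the dog", "", "the end"]
print(word_counts(sample)[:1])
```

As in the Spark version, the pairs are swapped to (count, word) so that the sort key is the count rather than the word.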



[Reference]
Getting started with PySpark - Part 1
http://www.mccarroll.net/blog/pyspark/
