Tuesday, May 17, 2016

[PySpark] Getting started with PySpark


[hadoop@master01 spark-1.6.0]$ cd /opt/spark-1.6.0/python/
[hadoop@master01 python]$ ls
docs  lib  pyspark  run-tests  run-tests.py  test_support
[hadoop@master01 python]$ pyspark

Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
16/05/17 20:10:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)
SparkContext available as sc, HiveContext available as sqlContext.
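
Both contexts are created automatically by the shell, so jobs can be run immediately. A quick sanity check (not part of the original session) could be:

sc.version                           # should report 1.6.0
sc.parallelize(range(10)).sum()      # 45
sqlContext                           # the pre-created HiveContext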



Word count example

>>> lines = sc.textFile('hdfs://master01:9000/opt/hadoop-2.7.1/input/text34mb.txt')
>>> lines_nonempty = lines.filter( lambda x: len(x) > 0 )
>>> lines_nonempty.count()
662761                                                                         
>>>
>>> words = lines_nonempty.flatMap(lambda x: x.split())
>>> wordcounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x,y:x+y).map(lambda x:(x[1],x[0])).sortByKey(False)
>>> wordcounts.take(10)                                                        
[(319239, u'the'), (204299, u'of'), (158585, u'and'), (149022, u'to'), (113795, u'a'), (94854, u'in'), (78748, u'I'), (65001, u'that'), (52567, u'his'), (52506, u'was')]
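
The chained one-liner above can also be spelled out step by step (equivalent code, not from the original session):

pairs   = words.map(lambda w: (w, 1))                   # pair each word with a count of 1
counts  = pairs.reduceByKey(lambda a, b: a + b)         # sum the counts per word
swapped = counts.map(lambda kv: (kv[1], kv[0]))         # (count, word) so we can sort by count
top10   = swapped.sortByKey(ascending=False).take(10)   # ten most frequent words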



[Reference]
Getting started with PySpark - Part 1
http://www.mccarroll.net/blog/pyspark/

Monday, May 9, 2016

[Spark] Collaborative Filtering, alternating least squares (ALS) practice


Collaborative Filtering - spark.mllib
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering

In the following example we load rating data. Each row consists of a user, a product and a rating. We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction.

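The code itself is not reproduced in the post; a minimal PySpark sketch matching the description and the linked spark.mllib example (the input path is an assumption) is:

from pyspark.mllib.recommendation import ALS, Rating

# Load and parse the rating data; each line is "user,product,rating"
data = sc.textFile("data/mllib/als/test.data")   # assumed sample data path
ratings = data.map(lambda l: l.split(',')) \
              .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model with the default (explicit-rating) ALS.train()
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model by computing the Mean Squared Error of the rating predictions
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print("Mean Squared Error = " + str(MSE))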






Result :
Mean Squared Error = 5.491294660658085E-6
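
If the trained model needs to be reused later it can be persisted and reloaded; the HDFS path below is hypothetical:

from pyspark.mllib.recommendation import MatrixFactorizationModel

model.save(sc, "hdfs:///user/spark/models/als")                                 # hypothetical path
sameModel = MatrixFactorizationModel.load(sc, "hdfs:///user/spark/models/als")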



-------------------------------------------------------------------------------------------------------

ERROR: TaskSchedulerImpl: Initial job has not accepted any resources

This message usually means that no workers are registered with the master, or that the job requests more cores/memory than the cluster has available. See:
http://www.datastax.com/dev/blog/common-spark-troubleshooting
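
Besides making sure the workers are registered and alive, a common fix is to lower the resources the application asks for so that they fit what the cluster offers. A hedged sketch with made-up values:

from pyspark import SparkConf, SparkContext

# Hypothetical limits; choose values that fit the workers shown in the master's web UI
conf = (SparkConf()
        .setAppName("resource-capped app")
        .setMaster("spark://master01:7077")
        .set("spark.executor.memory", "512m")   # memory requested per executor
        .set("spark.cores.max", "2"))           # total cores the application may use
sc = SparkContext(conf=conf)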






-------------------------------------------------------------------------------------------------------

ALS
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$

ALS.scala
https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala

Movie Recommendations with MLlib
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html

Dataset - MovieLens 1M Dataset
http://grouplens.org/datasets/movielens/



Friday, May 6, 2016

[Spark1.6.0] ERROR SparkContext: Error initializing SparkContext


[hadoop@master01 spark-1.6.0]$ spark-shell
16/05/06 16:46:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
16/05/06 16:46:54 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Log directory hdfs:///user/spark/eventlog does not exist.

    at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:101)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
    at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
    at $line3.$read$$iwC$$iwC.<init>(<console>:15)
    at $line3.$read$$iwC.<init>(<console>:24)
    at $line3.$read.<init>(<console>:26)
    at $line3.$read$.<init>(<console>:30)
    at $line3.$read$.<clinit>(<console>)
    at $line3.$eval$.<init>(<console>:7)
    at $line3.$eval$.<clinit>(<console>)



Solution: create the event-log directory that spark.eventLog.dir points to:
hdfs dfs -mkdir -p /user/spark/eventlog

[Spark1.6.0] Install Scala & Spark


Download and install Scala 2.11.8


Configure the Scala environment variables

---------------------------------------------------------------------------------------
sudo gedit ~/.bashrc

#scala
export SCALA_HOME=/opt/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin

source ~/.bashrc




---------------------------------------------------------------------------------------
Test the Scala installation:
[hadoop@master01 lib]$ scala
Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91).
Type in expressions for evaluation. Or try :help.

scala> 1+1
res0: Int = 2
---------------------------------------------------------------------------------------



Download and install Spark 1.6.0 (pre-built for Hadoop 2.6)
Configure the Spark environment variables
---------------------------------------------------------------------------------------
sudo gedit ~/.bashrc

#Spark
export SPARK_HOME=/opt/spark-1.6.0
export PATH=$PATH:$SPARK_HOME/bin

source ~/.bashrc
---------------------------------------------------------------------------------------


 
In /opt/spark-1.6.0/conf, create spark-env.sh from its template and edit it:
cp spark-env.sh.template spark-env.sh
sudo gedit spark-env.sh

export SCALA_HOME=/opt/scala-2.11.8
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export SPARK_MASTER_IP=master01
export SPARK_WORKER_MEMORY=1024m


The following properties go in spark-defaults.conf (created the same way from spark-defaults.conf.template), not in spark-env.sh:

cp spark-defaults.conf.template spark-defaults.conf
sudo gedit spark-defaults.conf

spark.master            spark://master01:7077
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///user/spark/eventlog



Check whether any Spark daemons are running:

ps aux | grep spark
hadoop     969  0.0  0.0 112644   952 pts/0    R+   21:21   0:00 grep --color=auto spark

Only the grep process itself appears, so the standalone master and workers have not been started yet; they are launched with the scripts under $SPARK_HOME/sbin (e.g. sbin/start-all.sh).



---------------------------------------------------------------------------------------
// Monte Carlo estimation of Pi (the "Pi estimation" example from the Spark docs)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val NUM_SAMPLES = 100000  // number of random points; any reasonably large value works

// In spark-shell a SparkContext already exists as sc; constructing one like this
// is only needed in a standalone application.
val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

// Count how many random points in the unit square fall inside the unit circle.
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

---------------------------------------------------------------------------------------
Word count example

scala> val textFile = sc.textFile("hdfs://master01:9000/opt/hadoop-2.7.1/input/text34mb.txt")
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:27

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[12] at reduceByKey at <console>:29

scala> wordCounts.collect()
res0: Array[(String, Int)] = Array(('lopin',1), (Ah!,99), (houres,,36), (Committee,),1), (bone,40), (fleein',1), (“Head.”,1), (delinquents.,2), (Malwa,1), (routing*,2), ('farthest,1), (Dollours,2), (Feldkirch,,3), ((1754-1831),,1), (nothin,1), (untruthfulness.,1), (signal.,6), (langwidge,3), (drad;*,1), (meets,,3), (Lost.,3), (Papists,,6), (accompts,,2), (Goodbye!,1), (Galliard,4), ((1563-1631),1), (Anthonio,,40), (God-forsaken,4), (rightly-,1), (fowl,30), (coat;,3), (husky,5), (Carpenter,4), (precious*,1), (ampullaria,1), (afterward,64), (armes*,,2), (entend*,1), (provisioned,,1), (wicked?,3), (Francaise,1), (Herefords,2), (Souls.",1), (/Loci,2), (speak:,9), (half-crowns,1), (Thunder.,18), (Halkar;,2), (HISTORIES.,1), (feats;,1), (robin,1), (fixed-I,1), (undeterred,2), (fastenings,4), ...

 

[Hadoop2.7.1] Can't run Datanode


Error: java.io.IOException: Incompatible clusterIDs

This happens when the DataNode's stored clusterID no longer matches the NameNode's, typically because the NameNode was reformatted while old DataNode data was left behind.

Solution:
\rm -r /opt/hadoop-2.7.1/tmp/
hadoop namenode -format

Note that this wipes the existing HDFS data. After that, HDFS can be started again (e.g. with start-dfs.sh).



Reference
http://blog.chinaunix.net/uid-20682147-id-4214553.html

Thursday, May 5, 2016

[Hadoop2.7.1] Wordcount


hadoop fs -mkdir -p /opt/hadoop-2.7.1/input

hadoop fs -copyFromLocal /opt/hadoop-2.7.1/text/text34mb.txt /opt/hadoop-2.7.1/input

hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /opt/hadoop-2.7.1/input/text34mb.txt /opt/hadoop-2.7.1/output




[hadoop@master01 lib]$ hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /opt/hadoop-2.7.1/input/text34mb.txt /opt/hadoop-2.7.1/output
16/05/05 16:30:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/05/05 16:30:43 INFO input.FileInputFormat: Total input paths to process : 1
16/05/05 16:30:44 INFO mapreduce.JobSubmitter: number of splits:1
16/05/05 16:30:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462429858916_0001
16/05/05 16:30:45 INFO impl.YarnClientImpl: Submitted application application_1462429858916_0001
16/05/05 16:30:45 INFO mapreduce.Job: The url to track the job: http://master01:8088/proxy/application_1462429858916_0001/
16/05/05 16:30:45 INFO mapreduce.Job: Running job: job_1462429858916_0001
16/05/05 16:30:53 INFO mapreduce.Job: Job job_1462429858916_0001 running in uber mode : false
16/05/05 16:30:53 INFO mapreduce.Job:  map 0% reduce 0%
16/05/05 16:31:04 INFO mapreduce.Job:  map 42% reduce 0%
16/05/05 16:31:09 INFO mapreduce.Job:  map 67% reduce 0%
16/05/05 16:31:11 INFO mapreduce.Job:  map 100% reduce 0%
16/05/05 16:31:19 INFO mapreduce.Job:  map 100% reduce 100%
16/05/05 16:31:19 INFO mapreduce.Job: Job job_1462429858916_0001 completed successfully
16/05/05 16:31:19 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=9917184
        FILE: Number of bytes written=15106616
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=35926297
        HDFS: Number of bytes written=3103134
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=15003
        Total time spent by all reduces in occupied slots (ms)=4504
        Total time spent by all map tasks (ms)=15003
        Total time spent by all reduce tasks (ms)=4504
        Total vcore-seconds taken by all map tasks=15003
        Total vcore-seconds taken by all reduce tasks=4504
        Total megabyte-seconds taken by all map tasks=15363072
        Total megabyte-seconds taken by all reduce tasks=4612096
    Map-Reduce Framework
        Map input records=788346
        Map output records=6185757
        Map output bytes=59289268
        Map output materialized bytes=4958589
        Input split bytes=121
        Combine input records=6185757
        Combine output records=328274
        Reduce input groups=272380
        Reduce shuffle bytes=4958589
        Reduce input records=328274
        Reduce output records=272380
        Spilled Records=984822
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=209
        CPU time spent (ms)=11810
        Physical memory (bytes) snapshot=327483392
        Virtual memory (bytes) snapshot=4164567040
        Total committed heap usage (bytes)=219676672
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=35926176
    File Output Format Counters
        Bytes Written=3103134



---------------------------------------------------------------------------------------

Delete the output directory before re-running the job (MapReduce will not overwrite an existing output path):
hdfs dfs -rm -r /opt/hadoop-2.7.1/output

---------------------------------------------------------------------------------------
[hadoop@master01 lib]$ ls /opt/hadoop-2.7.1/share/hadoop/mapreduce/
hadoop-mapreduce-client-app-2.7.1.jar
hadoop-mapreduce-client-common-2.7.1.jar
hadoop-mapreduce-client-core-2.7.1.jar
hadoop-mapreduce-client-hs-2.7.1.jar
hadoop-mapreduce-client-hs-plugins-2.7.1.jar
hadoop-mapreduce-client-jobclient-2.7.1.jar
hadoop-mapreduce-client-jobclient-2.7.1-tests.jar
hadoop-mapreduce-client-shuffle-2.7.1.jar
hadoop-mapreduce-examples-2.7.1.jar
lib
lib-examples
sources
---------------------------------------------------------------------------------------


Reference
http://kurthung1224.pixnet.net/blog/post/175503049
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0

[CentOS] How to Share Your Computer’s Files With a Virtual Machine


Reference

http://www.howtogeek.com/189974/how-to-share-your-computers-files-with-a-virtual-machine/

Create a mount point and mount the VirtualBox shared folder (the shared folder must be defined in the VM settings, here named C_DRIVE, and the Guest Additions installed in the guest):
sudo mkdir /c
sudo mount -t vboxsf C_DRIVE /c