Getting started with PySpark
[hadoop@master01 spark-1.6.0]$ cd /opt/spark-1.6.0/python/
[hadoop@master01 python]$ ls
docs lib pyspark run-tests run-tests.py test_support
[hadoop@master01 python]$ pyspark
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
16/05/17 20:10:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)
SparkContext available as sc, HiveContext available as sqlContext.
Word count example
The pipeline below filters out empty lines, splits each line into words, maps every word to (word, 1), sums the counts with reduceByKey, then swaps each pair to (count, word) and sorts by key in descending order so the most frequent words come first.
>>> lines = sc.textFile('hdfs://master01:9000/opt/hadoop-2.7.1/input/text34mb.txt')
>>> lines_nonempty = lines.filter( lambda x: len(x) > 0 )
>>> lines_nonempty.count()
662761
>>>
>>> words = lines_nonempty.flatMap(lambda x: x.split())
>>> wordcounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x,y:x+y).map(lambda x:(x[1],x[0])).sortByKey(False)
>>> wordcounts.take(10)
[(319239, u'the'), (204299, u'of'), (158585, u'and'), (149022, u'to'), (113795, u'a'), (94854, u'in'), (78748, u'I'), (65001, u'that'), (52567, u'his'), (52506, u'was')]
[Reference]
Getting started with PySpark - Part 1
http://www.mccarroll.net/blog/pyspark/
Tuesday, May 17, 2016
Monday, May 9, 2016
[Spark] Collaborative Filtering, alternating least squares (ALS) practice
Collaborative Filtering - spark.mllib
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering
In the following example we load rating data. Each row consists of a user, a product and a rating. We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction.
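The post doesn't reproduce the code itself, only the result below. As a reference, here is a sketch that closely follows the Scala example in the linked spark.mllib guide; data/mllib/als/test.data is the sample file shipped with the Spark distribution, and any comma-separated user,product,rating file will do.

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse the data: each line is "user,product,rating"
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match { case Array(user, product, rate) =>
  Rating(user.toInt, product.toInt, rate.toDouble)
})

// Build the recommendation model using alternating least squares (explicit ratings)
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model by predicting every (user, product) pair in the training data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map { case Rating(user, product, rate) =>
  ((user, product), rate)
}
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  val err = r1 - r2
  err * err
}.mean()
println("Mean Squared Error = " + MSE)

ALS.train() builds an explicit-feedback model; ALS.trainImplicit() is the variant for implicit ratings.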
Result :
Mean Squared Error = 5.491294660658085E-6

-------------------------------------------------------------------------------------------------------
ERROR : TaskSchedulerImpl: Initial job has not accepted any resources
http://www.datastax.com/dev/blog/common-spark-troubleshooting
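This message usually means that no workers are registered with the master, or that the application is requesting more cores or memory than the registered workers can offer. The master web UI (http://master01:8080 in this setup) shows what is actually available; if resources are the issue, the shell can be started with smaller requests, for example (values purely illustrative):
spark-shell --master spark://master01:7077 --executor-memory 512m --total-executor-cores 2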
-------------------------------------------------------------------------------------------------------
ALS
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$
ALS.scala
https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
Movie Recommendations with MLlib
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
Dataset - MovieLens 1M Dataset
http://grouplens.org/datasets/movielens/
Friday, May 6, 2016
[Spark1.6.0] ERROR SparkContext: Error initializing SparkContext
[hadoop@master01 spark-1.6.0]$ spark-shell
16/05/06 16:46:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
16/05/06 16:46:54 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Log directory hdfs:///user/spark/eventlog does not exist.
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:101)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
at $line3.$read$$iwC$$iwC.<init>(<console>:15)
at $line3.$read$$iwC.<init>(<console>:24)
at $line3.$read.<init>(<console>:26)
at $line3.$read$.<init>(<console>:30)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>(<console>)
Solution :
hdfs dfs -mkdir -p /user/spark/eventlog
[Spark1.6.0] Install Scala & Spark
Download and install Scala 2.11.8
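The download step isn't shown in the post. Assuming the scala-2.11.8.tgz tarball has already been fetched from scala-lang.org into the current directory (and that the hadoop user from the prompts above should own it), unpacking it to /opt gives the path used below:
sudo tar -xzf scala-2.11.8.tgz -C /opt/
sudo chown -R hadoop:hadoop /opt/scala-2.11.8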
Set the Scala environment variables
---------------------------------------------------------------------------------------
sudo gedit ~/.bashrc
#scala
export SCALA_HOME=/opt/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
source ~/.bashrc
---------------------------------------------------------------------------------------
Test the Scala installation
[hadoop@master01 lib]$ scala
Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91).
Type in expressions for evaluation. Or try :help.
scala> 1+1
res0: Int = 2
---------------------------------------------------------------------------------------
Download and install Spark 1.6.0 (pre-built for Hadoop 2.6)
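Likewise, assuming the pre-built spark-1.6.0-bin-hadoop2.6.tgz package has been downloaded from the Apache Spark site:
sudo tar -xzf spark-1.6.0-bin-hadoop2.6.tgz -C /opt/
sudo mv /opt/spark-1.6.0-bin-hadoop2.6 /opt/spark-1.6.0
sudo chown -R hadoop:hadoop /opt/spark-1.6.0
The rename matches the /opt/spark-1.6.0 path used throughout this post.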
Set the Spark environment variables and configuration
---------------------------------------------------------------------------------------
sudo gedit ~/.bashrc
#Spark
export SPARK_HOME=/opt/spark-1.6.0
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bashrc
---------------------------------------------------------------------------------------
cd /opt/spark-1.6.0/conf
cp spark-env.sh.template spark-env.sh
sudo gedit spark-env.sh
export SCALA_HOME=/opt/scala-2.11.8
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export SPARK_MASTER_IP=master01
export SPARK_WORKER_MEMORY=1024m
The following are Spark properties rather than shell variables, so they belong in spark-defaults.conf, not spark-env.sh:
cp spark-defaults.conf.template spark-defaults.conf
sudo gedit spark-defaults.conf
spark.master             spark://master01:7077
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs:///user/spark/eventlog
ps aux | grep spark
hadoop 969 0.0 0.0 112644 952 pts/0 R+ 21:21 0:00 grep --color=auto spark
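Nothing is listed yet because the standalone daemons haven't been started. They are launched with the scripts under sbin/ (run on the master; this assumes conf/slaves lists the worker hosts and passwordless SSH is set up):
/opt/spark-1.6.0/sbin/start-all.sh
After that, jps should show a Master process on master01 and a Worker on each slave, and the master web UI is reachable at http://master01:8080.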
---------------------------------------------------------------------------------------
Pi estimation example (written for a standalone application; inside spark-shell the sc variable already exists, so skip creating a new SparkContext):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

// Throw NUM_SAMPLES random points at the unit square; the fraction that lands inside the unit circle approximates pi/4
val NUM_SAMPLES = 100000  // any reasonably large sample size works
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
---------------------------------------------------------------------------------------
Word count example
scala> val textFile = sc.textFile("hdfs://master01:9000/opt/hadoop-2.7.1/input/text34mb.txt")
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:27
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[12] at reduceByKey at <console>:29
scala> wordCounts.collect()
res0: Array[(String, Int)] = Array(('lopin',1), (Ah!,99), (houres,,36), (Committee,),1), (bone,40), (fleein',1), (�Head.�,1), (delinquents.,2), (Malwa,1), (routing*,2), ('farthest,1), (Dollours,2), (Feldkirch,,3), ((1754-1831),,1), (nothin,1), (untruthfulness.,1), (signal.,6), (langwidge,3), (drad;*,1), (meets,,3), (Lost.,3), (Papists,,6), (accompts,,2), (Goodbye!,1), (Galliard,4), ((1563-1631),1), (Anthonio,,40), (God-forsaken,4), (rightly-,1), (fowl,30), (coat;,3), (husky,5), (Carpenter,4), (precious*,1), (ampullaria,1), (afterward,64), (armes*,,2), (entend*,1), (provisioned,,1), (wicked?,3), (Francaise,1), (Herefords,2), (Souls.",1), (/Loci,2), (speak:,9), (half-crowns,1), (Thunder.,18), (Halkar;,2), (HISTORIES.,1), (feats;,1), (robin,1), (fixed-I,1), (undeterred,2), (fastenings,4), ...
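To mirror the PySpark top-10 query above, the counts can be swapped and sorted by frequency (a small addition that isn't in the original post):
scala> wordCounts.map(_.swap).sortByKey(ascending = false).take(10)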
[Hadoop2.7.1] Can't run Datanode
Error : java.io.IOException: Incompatible clusterIDs
Solution :
\rm -r /opt/hadoop-2.7.1/tmp/
hadoop namenode -format
After that, you can start the DataNode again. Note that reformatting the NameNode wipes everything stored in HDFS, so only do this on a cluster whose data you can afford to lose.
Reference
http://blog.chinaunix.net/uid-20682147-id-4214553.html
Thursday, May 5, 2016
[Hadoop2.7.1] Wordcount
hadoop fs -mkdir -p /opt/hadoop-2.7.1/input
hadoop fs -copyFromLocal /opt/hadoop-2.7.1/text/text34mb.txt /opt/hadoop-2.7.1/input
hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /opt/hadoop-2.7.1/input/text34mb.txt /opt/hadoop-2.7.1/output
[hadoop@master01 lib]$ hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /opt/hadoop-2.7.1/input/text34mb.txt /opt/hadoop-2.7.1/output
16/05/05 16:30:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/05/05 16:30:43 INFO input.FileInputFormat: Total input paths to process : 1
16/05/05 16:30:44 INFO mapreduce.JobSubmitter: number of splits:1
16/05/05 16:30:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462429858916_0001
16/05/05 16:30:45 INFO impl.YarnClientImpl: Submitted application application_1462429858916_0001
16/05/05 16:30:45 INFO mapreduce.Job: The url to track the job: http://master01:8088/proxy/application_1462429858916_0001/
16/05/05 16:30:45 INFO mapreduce.Job: Running job: job_1462429858916_0001
16/05/05 16:30:53 INFO mapreduce.Job: Job job_1462429858916_0001 running in uber mode : false
16/05/05 16:30:53 INFO mapreduce.Job: map 0% reduce 0%
16/05/05 16:31:04 INFO mapreduce.Job: map 42% reduce 0%
16/05/05 16:31:09 INFO mapreduce.Job: map 67% reduce 0%
16/05/05 16:31:11 INFO mapreduce.Job: map 100% reduce 0%
16/05/05 16:31:19 INFO mapreduce.Job: map 100% reduce 100%
16/05/05 16:31:19 INFO mapreduce.Job: Job job_1462429858916_0001 completed successfully
16/05/05 16:31:19 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=9917184
FILE: Number of bytes written=15106616
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=35926297
HDFS: Number of bytes written=3103134
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=15003
Total time spent by all reduces in occupied slots (ms)=4504
Total time spent by all map tasks (ms)=15003
Total time spent by all reduce tasks (ms)=4504
Total vcore-seconds taken by all map tasks=15003
Total vcore-seconds taken by all reduce tasks=4504
Total megabyte-seconds taken by all map tasks=15363072
Total megabyte-seconds taken by all reduce tasks=4612096
Map-Reduce Framework
Map input records=788346
Map output records=6185757
Map output bytes=59289268
Map output materialized bytes=4958589
Input split bytes=121
Combine input records=6185757
Combine output records=328274
Reduce input groups=272380
Reduce shuffle bytes=4958589
Reduce input records=328274
Reduce output records=272380
Spilled Records=984822
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=209
CPU time spent (ms)=11810
Physical memory (bytes) snapshot=327483392
Virtual memory (bytes) snapshot=4164567040
Total committed heap usage (bytes)=219676672
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=35926176
File Output Format Counters
Bytes Written=3103134
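---------------------------------------------------------------------------------------
View the result
Once the job completes, the output can be inspected directly on HDFS (the reducer output file is normally named part-r-00000):
hdfs dfs -cat /opt/hadoop-2.7.1/output/part-r-00000 | head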
---------------------------------------------------------------------------------------
Delete the file
hdfs dfs -rm -r /opt/hadoop-2.7.1/output
---------------------------------------------------------------------------------------
[hadoop@master01 lib]$ ls /opt/hadoop-2.7.1/share/hadoop/mapreduce/
hadoop-mapreduce-client-app-2.7.1.jar
hadoop-mapreduce-client-common-2.7.1.jar
hadoop-mapreduce-client-core-2.7.1.jar
hadoop-mapreduce-client-hs-2.7.1.jar
hadoop-mapreduce-client-hs-plugins-2.7.1.jar
hadoop-mapreduce-client-jobclient-2.7.1.jar
hadoop-mapreduce-client-jobclient-2.7.1-tests.jar
hadoop-mapreduce-client-shuffle-2.7.1.jar
hadoop-mapreduce-examples-2.7.1.jar
lib
lib-examples
sources
---------------------------------------------------------------------------------------
Reference
http://kurthung1224.pixnet.net/blog/post/175503049
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0
[CentOS] How to Share Your Computer’s Files With a Virtual Machine
Reference
http://www.howtogeek.com/189974/how-to-share-your-computers-files-with-a-virtual-machine/
sudo mkdir /c
sudo mount -t vboxsf C_DRIVE /c
C_DRIVE must match the shared-folder name configured in the VirtualBox settings, and the VirtualBox Guest Additions must be installed in the guest for the vboxsf filesystem to be available.