First, go to the Spark directory
cd /opt/spark/
Start the Spark standalone cluster (master and workers)
sbin/start-all.sh
Open spark-shell
bin/spark-shell
Define the path to the file we want to read
val path = "/in/123.txt"
Read the file in; sc is the SparkContext that spark-shell creates for you
val file = sc.textFile(path)
file is now an RDD; use collect to see what is inside it
file.collect
val line1 = file.flatMap(_.split(" "))
line1.collect
val line2 = line1.filter(_ != "")
line2.collect
val line3 = line2.map(s=> (s,1))
line3.collect
val line4 = line3.reduceByKey(_ + _)
line4.collect
line4.take(10)
line4.take(10).foreach(println)
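To make the four steps above (flatMap, filter, map, reduceByKey) easier to follow, here is a rough sketch using plain Scala collections, so it runs without Spark. The object name WordCountSteps and the sample lines are made up for illustration, and groupBy plus a sum only stands in for reduceByKey, which exists on pair RDDs but not on ordinary collections.

```scala
object WordCountSteps {
  // Mirrors line1..line4 from the spark-shell session, on an ordinary Seq.
  def count(lines: Seq[String]): Map[String, Int] = {
    // flatMap: split every line on a single space and flatten into one list
    val tokens = lines.flatMap(_.split(" "))
    // filter: drop the empty strings that consecutive spaces leave behind
    val words = tokens.filter(_ != "")
    // map: pair every word with the count 1
    val pairs = words.map(s => (s, 1))
    // stand-in for reduceByKey(_ + _): group by word, then sum the 1s
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical input; note the double space in the second line
    val sample = Seq("to be or", "not  to be")
    count(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```

The double space is there on purpose: split(" ") turns it into an empty token, which is why the filter step is needed.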
The one-line version from the official site
val wordCounts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
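collect() and take(10) return the pairs in no particular order. If you want the most frequent words first, one option is to sort by count before taking the top few. Below is a small sketch in plain Scala; the object name TopWords and the counts are made up for illustration. On the RDD itself, something like wordCounts.sortBy(_._2, ascending = false).take(10) should play the same role.

```scala
object TopWords {
  // Return the n most frequent words; ties are broken alphabetically.
  def top(counts: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
    counts.sortBy { case (word, count) => (-count, word) }.take(n)

  def main(args: Array[String]): Unit = {
    // Hypothetical counts standing in for wordCounts.collect()
    val counts = Seq(("spark", 3), ("rdd", 5), ("the", 9), ("shell", 3))
    top(counts, 3).foreach(println)
  }
}
```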
Check how the jobs ran in the web UI
http://[node IP]:4040/jobs/
[References]
https://spark.apache.org/docs/latest/quick-start.html
http://kurthung1224.pixnet.net/blog/post/275207950
Monday, May 4, 2015
[Hadoop] Word count example: hands-on tutorial
Hadoop on Cloudera QuickStart VM, test example 01: wordcount
mkdir temp
cd temp
ls -ltr
echo "this is huiming and you can call me juiming or killniu i am good at statistical modeling and data analysis" > wordcount.txt
hdfs dfs -mkdir /user/cloudera/input
hdfs dfs -ls /user/cloudera/input
hdfs dfs -put /home/cloudera/temp/wordcount.txt /user/cloudera/input
hdfs dfs -ls /user/cloudera/input
The wordcount.txt you just created should be listed
ls -ltr /usr/lib/hadoop-mapreduce/
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/cloudera/input/wordcount.txt /user/cloudera/output
hdfs dfs -ls /user/cloudera/output
hdfs dfs -cat /user/cloudera/output/part-r-00000
Finally, the word counts are printed
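For a rough idea of what part-r-00000 should look like, the sketch below counts the same sentence in plain Scala and prints one word per line, followed by a tab and its count, sorted by word. This assumes the bundled wordcount example splits on whitespace and writes its keys in sorted order. The object name HadoopStyleOutput is made up for illustration, and nothing here touches HDFS.

```scala
object HadoopStyleOutput {
  // Count whitespace-separated words and render one "word<TAB>count" line
  // per distinct word, sorted by word.
  def render(text: String): Seq[String] =
    text.split("\\s+").filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.length) }
      .toSeq.sortBy(_._1)
      .map { case (word, count) => word + "\t" + count }

  def main(args: Array[String]): Unit = {
    // Same sentence that was written to wordcount.txt earlier
    val text = "this is huiming and you can call me juiming or killniu " +
      "i am good at statistical modeling and data analysis"
    render(text).foreach(println)
  }
}
```

Every word should show a count of 1 except "and", which appears twice.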
