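Sunday, December 25, 2016
Preparing to Become a Data Analyst (in progress)
Notes on preparing for a data analyst role. I am still working through this material myself.
1. SQL - Codecademy
I took a database systems course in my junior year but have forgotten all of it. A senior schoolmate who works as a data analyst in Berlin recommended this crash course.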
https://www.codecademy.com/learn/all
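The basic syntax is split into many small modules, and every command is explained in detail, so it is very easy to learn.
Recommended order: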
Learn SQL
SQL: Analyzing Business Metrics
2. Python - coursera
https://www.coursera.org/specializations/python
Recommended courses:
Programming for Everybody (Getting Started with Python)
Data Processing Using Python (用Python玩转数据)
3. Machine Learning - coursera
https://www.coursera.org/learn/machine-learning
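The best known is Andrew Ng's machine learning course, which suits beginners.
National Taiwan University's machine learning course explains the concepts in more detail, but it is more advanced and takes longer to digest.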
How to Get a Data Analyst Job in 9 months
https://www.datascienceweekly.org/articles/how-to-get-a-data-analyst-job-in-9-months
How to Start a Career in Analytics for Free? (the author landed a job at Accenture Analytics)
https://akshaykher.wordpress.com/2015/08/18/how-to-start-a-career-in-analytics-for-free-3/
Monday, June 20, 2016
[Hadoop] ERROR : Name node is in safe mode
SafeModeException : Name node is in safe mode
Solution :
hdfs dfsadmin -safemode leave
Thursday, June 9, 2016
[PySpark] From Pandas to Apache Spark’s DataFrame
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> spark_df = sqlCtx.createDataFrame(pandas_df)
>>> spark_df.show()
16/06/09 19:24:46 WARN TaskSetManager: Stage 0 contains a task of very large size (8851 KB). The maximum recommended task size is 100 KB.
+-----+---------+-------------------+-----+---------+----+-----+------------+-------------+
|Store|DayOfWeek| Date|Sales|Customers|Open|Promo|StateHoliday|SchoolHoliday|
+-----+---------+-------------------+-----+---------+----+-----+------------+-------------+
| 1| 5|1438300800000000000| 5263| 555| 1| 1| 0| 1|
| 2| 5|1438300800000000000| 6064| 625| 1| 1| 0| 1|
| 3| 5|1438300800000000000| 8314| 821| 1| 1| 0| 1|
| 4| 5|1438300800000000000|13995| 1498| 1| 1| 0| 1|
| 5| 5|1438300800000000000| 4822| 559| 1| 1| 0| 1|
| 6| 5|1438300800000000000| 5651| 589| 1| 1| 0| 1|
| 7| 5|1438300800000000000|15344| 1414| 1| 1| 0| 1|
| 8| 5|1438300800000000000| 8492| 833| 1| 1| 0| 1|
| 9| 5|1438300800000000000| 8565| 687| 1| 1| 0| 1|
| 10| 5|1438300800000000000| 7185| 681| 1| 1| 0| 1|
| 11| 5|1438300800000000000|10457| 1236| 1| 1| 0| 1|
| 12| 5|1438300800000000000| 8959| 962| 1| 1| 0| 1|
| 13| 5|1438300800000000000| 8821| 568| 1| 1| 0| 0|
| 14| 5|1438300800000000000| 6544| 710| 1| 1| 0| 1|
| 15| 5|1438300800000000000| 9191| 766| 1| 1| 0| 1|
| 16| 5|1438300800000000000|10231| 979| 1| 1| 0| 1|
| 17| 5|1438300800000000000| 8430| 946| 1| 1| 0| 1|
| 18| 5|1438300800000000000|10071| 936| 1| 1| 0| 1|
| 19| 5|1438300800000000000| 8234| 718| 1| 1| 0| 1|
| 20| 5|1438300800000000000| 9593| 974| 1| 1| 0| 0|
+-----+---------+-------------------+-----+---------+----+-----+------------+-------------+
only showing top 20 rows
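The Date column above shows values such as 1438300800000000000 instead of readable dates. One plausible reading, which is an assumption and not something the post states, is that these are the pandas timestamps expressed as nanoseconds since the Unix epoch. A minimal plain-Python sketch of decoding one value under that assumption:

```python
from datetime import datetime, timezone

# Value copied from the Date column in the output above.
raw = 1438300800000000000

# Assumption: the integer is nanoseconds since 1970-01-01 UTC.
seconds = raw // 1_000_000_000
decoded = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(decoded.date().isoformat())
print(decoded.isoweekday())  # 1 = Monday ... 7 = Sunday
```

If the assumption holds, the decoded weekday should agree with the DayOfWeek value (5) shown in the same rows, which is a quick sanity check before converting the column to a proper date type.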
[Reference]
Introducing DataFrames in Apache Spark for Large Scale Data Science
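Wednesday, May 25, 2016
[VirtualBox] Resizing the disk of a CentOS VM
The VM's disk ran out of space, and I did not want to rebuild and reinstall it, so I decided to enlarge the disk allocation instead.
Environment: VirtualBox running on Windows 7
First, run cmd as administrator and change to the VirtualBox installation directory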
cd C:\Program Files\Oracle\VirtualBox
VBoxManage modifyhd "<virtual disk name>.vdi" --resize <new size in MB>
[ex] VBoxManage modifyhd "D:\VirtualBox\VirtualBox VMs\CentOS 7 D" --resize 12288
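The --resize value is given in MB, so the 12288 in the example corresponds to 12 GB. A small sketch of that arithmetic follows; resize_command and the disk file name are hypothetical and only serve the illustration:

```python
def resize_command(vdi_path, size_gb):
    """Build the VBoxManage argument list for resizing a disk to size_gb GB.

    resize_command is a hypothetical helper, not part of VirtualBox.
    """
    size_mb = size_gb * 1024  # --resize takes the new size in MB
    return ["VBoxManage", "modifyhd", vdi_path, "--resize", str(size_mb)]

# Hypothetical disk file name, for illustration only.
cmd = resize_command("example-disk.vdi", 12)
print(" ".join(cmd))
```

The helper only builds the command line. It does not run VBoxManage, so it is safe to try outside the host machine.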
Next, log in to the VM and run the following commands as root
df
fdisk -l
fdisk /dev/sda
d
2
n
p
2
<return>
<return>
w
reboot
pvresize /dev/sda2
pvscan
lvextend -l +100%FREE /dev/mapper/centos-root
xfs_growfs /dev/mapper/centos-root
df
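Finally, thanks to my geeky junior schoolmate for his help. He warned me up front that there was only an 87% chance of success.
It destroyed one of my VMs, but it finally worked on the second one! (confetti)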
http://syuanme.blogspot.tw/
[Reference]
VirtualBox: Increasing a Virtual Machine's Disk Space (CentOS)
http://ims.tw/archives/1017
VirtualBox: Increase Size of RHEL/Fedora/CentOS/Scientific Guest File System
https://blog.jyore.com/2013/06/virtualbox-increase-size-of-rhelfedoracentosscientificos-guest-file-system/
Friday, May 20, 2016
[PySpark] MLlib Regression example
[hadoop@master01 spark-1.6.0]$ cd /opt/spark-1.6.0/python/
[hadoop@master01 python]$ pyspark
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
16/05/19 20:10:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql.types import *
>>> from pyspark.sql import Row
>>> rdd = sc.textFile('file:/opt/data/Sacramentorealestatetransactions.csv')
>>> rdd = rdd.map(lambda line: line.split(","))
Now we can see that each line has been broken into a list of fields, which is what we want. However, we'll want to remove the header before we convert to a DataFrame, since there's not a straightforward way (that I know of) to tell Spark to interpret that header as a list of column names.
>>> header = rdd.first()
>>> rdd = rdd.filter(lambda line:line != header)
Now we can see that the header has been removed.
>>> df = rdd.map(lambda line: Row(street = line[0], city = line[1], zip=line[2], beds=line[4], baths=line[5], sqft=line[6], price=line[9])).toDF()
16/05/19 20:11:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/05/19 20:11:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/05/19 20:11:08 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/05/19 20:11:08 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/05/19 20:11:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/05/19 20:11:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
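The split, header removal and column selection by index can be sketched in plain Python. The sample lines are hypothetical: they follow the column positions used in the Row(...) call, borrow a few values from the output shown in this post, and use placeholder header names and "-" fillers for the unused columns.

```python
# Hypothetical sample in the column layout the Row(...) call assumes:
# index 0 street, 1 city, 2 zip, 4 beds, 5 baths, 6 sqft, 9 price.
# Header names and "-" fillers are placeholders, not the real file.
lines = [
    "street,city,zip,c3,beds,baths,sqft,c7,c8,price",
    "2796 BRANCH ST,SACRAMENTO,95815,-,2,1,796,-,-,68880",
    "3132 CLAY ST,SACRAMENTO,95815,-,2,1,800,-,-,78000",
]

# Equivalent of rdd.map(lambda line: line.split(","))
records = [line.split(",") for line in lines]

# Equivalent of rdd.first() followed by rdd.filter(lambda line: line != header)
header = records[0]
data = [rec for rec in records if rec != header]

# Equivalent of building one Row per record by column index
rows = [
    {"street": r[0], "city": r[1], "zip": r[2],
     "beds": r[4], "baths": r[5], "sqft": r[6], "price": r[9]}
    for r in data
]

print(rows[0])
```

Note that every value is still a string after the split, so numeric columns such as price and sqft are text until they are cast.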
>>>
>>>
>>>
>>> favorite_zip = df[df.zip == 95815]
>>> favorite_zip.show(5)
+-----+----+----------+------+----+----------------+-----+
|baths|beds| city| price|sqft| street| zip|
+-----+----+----------+------+----+----------------+-----+
| 1| 2|SACRAMENTO| 68880| 796| 2796 BRANCH ST|95815|
| 1| 2|SACRAMENTO| 69307| 852|2805 JANETTE WAY|95815|
| 1| 1|SACRAMENTO|106852| 871| 2930 LA ROSA RD|95815|
| 1| 2|SACRAMENTO| 78000| 800| 3132 CLAY ST|95815|
| 2| 4|SACRAMENTO| 89000|1316| 483 ARCADE BLVD|95815|
+-----+----+----------+------+----+----------------+-----+
only showing top 5 rows
>>>
>>>
>>> import pyspark.mllib
>>> import pyspark.mllib.regression
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.sql.functions import *
Let's remove those rows that have suspicious 0 values for any of the features we want to use for prediction
>>> df = df.select('price','baths','beds','sqft')
>>> df = df[df.baths > 0]
>>> df = df[df.beds > 0]
>>> df = df[df.sqft > 0]
>>> df.describe(['baths','beds','price','sqft']).show()
+-------+------------------+------------------+------------------+------------------+
|summary| baths| beds| price| sqft|
+-------+------------------+------------------+------------------+------------------+
| count| 814| 814| 814| 814|
| mean|1.9606879606879606|3.2444717444717446| 229448.3697788698|1591.1461916461917|
| stddev|0.6698038253879438|0.8521372615281976|119825.57606009026| 663.8419297942894|
| min| 1| 1| 100000| 1000|
| max| 5| 8| 99000| 998|
+-------+------------------+------------------+------------------+------------------+
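In the summary above, min is larger than max for price and sqft (for example min 100000 versus max 99000). A likely explanation, offered here as an assumption, is that the columns came from split text and are still strings, so min and max compare them character by character. Plain Python shows the same effect:

```python
# Values taken from the min/max rows of the summary above.
prices = ["100000", "99000"]

# Strings compare character by character, so "1..." sorts before "9...".
print(min(prices), max(prices))

# Converting to numbers first gives the expected ordering.
as_numbers = [int(p) for p in prices]
print(min(as_numbers), max(as_numbers))
```

Casting the columns to a numeric type before calling describe() would be one way to get numeric min and max values.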
Labeled Points and Scaling Data
>>>
>>> temp = df.map(lambda line:LabeledPoint(line[0],[line[1:]]))
>>> temp.take(5)
[LabeledPoint(59222.0, [1.0,2.0,836.0]), LabeledPoint(68212.0, [1.0,3.0,1167.0]), LabeledPoint(68880.0, [1.0,2.0,796.0]), LabeledPoint(69307.0, [1.0,2.0,852.0]), LabeledPoint(81900.0, [1.0,2.0,797.0])]
>>>
>>>
>>>
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.feature import StandardScaler
>>>
>>> features = df.map(lambda row: row[1:])
>>> features.take(5)
[(u'1', u'2', u'836'), (u'1', u'3', u'1167'), (u'1', u'2', u'796'), (u'1', u'2', u'852'), (u'1', u'2', u'797')]
>>>
>>>
>>>
>>> standardizer = StandardScaler()
>>> model = standardizer.fit(features)
>>> features_transform = model.transform(features)
>>>
>>> features_transform.take(5)
[DenseVector([1.493, 2.347, 1.2593]), DenseVector([1.493, 3.5206, 1.7579]), DenseVector([1.493, 2.347, 1.1991]), DenseVector([1.493, 2.347, 1.2834]), DenseVector([1.493, 2.347, 1.2006])]
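The scaled values above are consistent with each feature being divided by its standard deviation, with no mean subtraction. That reading is inferred from the numbers in this post rather than stated by it. A quick check in plain Python, using the stddev row of the summary table and the first raw feature row:

```python
# Standard deviations copied from the describe() output (baths, beds, sqft).
stddev = [0.6698038253879438, 0.8521372615281976, 663.8419297942894]

# First raw feature row from features.take(5): (baths, beds, sqft).
raw = [1.0, 2.0, 836.0]

# Scale to unit standard deviation without centering on the mean.
scaled = [x / s for x, s in zip(raw, stddev)]

print([round(v, 4) for v in scaled])
```

The result can be compared against the first DenseVector in the output. Because the mean is not subtracted, a value of 0 would stay 0 after scaling.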
>>>
>>>
>>> lab = df.map(lambda row: row[0])
>>> lab.take(5)
[u'59222', u'68212', u'68880', u'69307', u'81900']
>>>
>>> transformedData = lab.zip(features_transform)
>>> transformedData.take(5)
[(u'59222', DenseVector([1.493, 2.347, 1.2593])), (u'68212', DenseVector([1.493, 3.5206, 1.7579])), (u'68880', DenseVector([1.493, 2.347, 1.1991])), (u'69307', DenseVector([1.493, 2.347, 1.2834])), (u'81900', DenseVector([1.493, 2.347, 1.2006]))]
>>>
>>>
>>> transformedData = transformedData.map(lambda row: LabeledPoint(row[0],[row[1]]))
>>> transformedData.take(5)
[LabeledPoint(59222.0, [1.49297445326,2.34703972035,1.25933593899]), LabeledPoint(68212.0, [1.49297445326,3.52055958053,1.7579486134]), LabeledPoint(68880.0, [1.49297445326,2.34703972035,1.19908063091]), LabeledPoint(69307.0, [1.49297445326,2.34703972035,1.28343806223]), LabeledPoint(81900.0, [1.49297445326,2.34703972035,1.20058701361])]
>>>
>>>
>>> trainingData, testingData = transformedData.randomSplit([.8,.2],seed=1234)
>>> from pyspark.mllib.regression import LinearRegressionWithSGD
>>> linearModel = LinearRegressionWithSGD.train(trainingData,1000,.2)
16/05/19 20:13:49 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/05/19 20:13:49 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
>>> linearModel.weights
DenseVector([15098.627, 3792.023, 70216.8097])
>>>
>>>
>>> testingData.take(10)
[LabeledPoint(100309.0, [2.98594890652,3.52055958053,1.36930187625]), LabeledPoint(124100.0, [2.98594890652,3.52055958053,2.41171870613]), LabeledPoint(148750.0, [2.98594890652,4.69407944071,2.21739533756]), LabeledPoint(150000.0, [1.49297445326,1.17351986018,1.14485085363]), LabeledPoint(161500.0, [2.98594890652,4.69407944071,2.3906293483]), LabeledPoint(166357.0, [1.49297445326,4.69407944071,2.94497818269]), LabeledPoint(168000.0, [2.98594890652,3.52055958053,2.22492725107]), LabeledPoint(178480.0, [2.98594890652,3.52055958053,1.78506350204]), LabeledPoint(181872.0, [1.49297445326,3.52055958053,1.73535287287]), LabeledPoint(182587.0, [4.47892335978,4.69407944071,2.78831438167])]
>>>
>>>
>>> linearModel.predict([1.49297445326,3.52055958053,1.73535287287])
157742.84989605084
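For a linear model, the prediction is the dot product of the weights and the feature vector, plus an intercept. The sketch below recomputes the prediction by hand from the printed weights. It assumes the model was trained without an intercept, which the post does not say explicitly, and the printed weights are rounded, so only approximate agreement is expected.

```python
# Weights as printed by linearModel.weights (rounded in the output above).
weights = [15098.627, 3792.023, 70216.8097]

# Scaled feature vector passed to linearModel.predict above.
features = [1.49297445326, 3.52055958053, 1.73535287287]

# Assumption: no intercept term was fitted.
intercept = 0.0

prediction = sum(w * x for w, x in zip(weights, features)) + intercept
print(prediction)
```

The hand-computed value lands close to the 157742.85 returned by the model, which supports the no-intercept assumption.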
>>>
>>>
>>> from pyspark.mllib.evaluation import RegressionMetrics
>>> prediObserRDDin = trainingData.map(lambda row: (float(linearModel.predict(row.features[0])),row.label))
>>> metrics = RegressionMetrics(prediObserRDDin)
>>>
>>>
>>> metrics.r2
0.4969184679643588
>>>
>>>
>>> prediObserRDDout = testingData.map(lambda row: (float(linearModel.predict(row.features[0])),row.label))
>>> metrics = RegressionMetrics(prediObserRDDout)
>>>
>>>
>>> metrics.rootMeanSquaredError
94895.10434498572
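Both metrics can be computed by hand from (prediction, observation) pairs. The sketch below uses the textbook definitions on a few hypothetical pairs. It illustrates what the numbers mean and is not a claim about how RegressionMetrics is implemented internally.

```python
import math

# Hypothetical (prediction, observation) pairs, for illustration only.
pairs = [(2.0, 1.0), (4.0, 5.0), (6.0, 6.0)]

n = len(pairs)
mean_obs = sum(obs for _, obs in pairs) / n

# Sum of squared residuals and total sum of squares.
ss_res = sum((obs - pred) ** 2 for pred, obs in pairs)
ss_tot = sum((obs - mean_obs) ** 2 for _, obs in pairs)

rmse = math.sqrt(ss_res / n)  # root mean squared error
r2 = 1.0 - ss_res / ss_tot    # coefficient of determination

print(rmse, r2)
```

RMSE is in the same unit as the label, so the value reported above is an error in price units, while R2 is unitless.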
[Reference]
http://www.techpoweredmath.com/spark-dataframes-mllib-tutorial/
Tuesday, May 17, 2016
[PySpark] Getting started with PySpark
Getting started with PySpark
[hadoop@master01 spark-1.6.0]$ cd /opt/spark-1.6.0/python/
[hadoop@master01 python]$ ls
docs lib pyspark run-tests run-tests.py test_support
[hadoop@master01 python]$ pyspark
Python 2.7.5 (default, Nov 20 2015, 02:00:19)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
16/05/17 20:10:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)
SparkContext available as sc, HiveContext available as sqlContext.
Word count example
>>> lines = sc.textFile('hdfs://master01:9000/opt/hadoop-2.7.1/input/text34mb.txt')
>>> lines_nonempty = lines.filter( lambda x: len(x) > 0 )
>>> lines_nonempty.count()
662761
>>>
>>> words = lines_nonempty.flatMap(lambda x: x.split())
>>> wordcounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x,y:x+y).map(lambda x:(x[1],x[0])).sortByKey(False)
>>> wordcounts.take(10)
[(319239, u'the'), (204299, u'of'), (158585, u'and'), (149022, u'to'), (113795, u'a'), (94854, u'in'), (78748, u'I'), (65001, u'that'), (52567, u'his'), (52506, u'was')]
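The same pipeline can be followed step by step in plain Python on a tiny input. The sample lines are hypothetical and stand in for the HDFS text file:

```python
from collections import Counter

# Hypothetical input lines standing in for the HDFS text file.
lines = ["the cat and the hat", "", "the end"]

# filter: drop empty lines
lines_nonempty = [x for x in lines if len(x) > 0]

# flatMap: split every line on whitespace and flatten the result
words = [w for line in lines_nonempty for w in line.split()]

# map + reduceByKey: count occurrences per word
counts = Counter(words)

# Swap to (count, word) and sort descending; ties are broken by word here.
wordcounts = sorted(((c, w) for w, c in counts.items()), reverse=True)

print(len(lines_nonempty))
print(wordcounts[:3])
```

Swapping each pair to (count, word) is what lets the sort order the results by frequency, which is the purpose of the final map before sortByKey(False) in the Spark version.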
[Reference]
Getting started with PySpark - Part 1
http://www.mccarroll.net/blog/pyspark/
Monday, May 9, 2016
[Spark] Collaborative Filtering, alternating least squares (ALS) practice
Collaborative Filtering - spark.mllib
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#collaborative-filtering
In the following example we load rating data. Each row consists of a user, a product and a rating. We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction.
Result :
Mean Squared Error = 5.491294660658085E-6
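The evaluation step, the Mean Squared Error of rating predictions, can be sketched in plain Python. The ratings and predictions below are hypothetical and are not the data or the model output behind the result above. The sketch only shows how actual and predicted ratings are paired by (user, product) and averaged.

```python
# Hypothetical ratings keyed by (user, product); not the data used in the post.
ratings = {(1, 10): 5.0, (1, 20): 1.0, (2, 10): 4.0}

# Hypothetical model output for the same (user, product) pairs.
predictions = {(1, 10): 4.5, (1, 20): 1.5, (2, 10): 4.0}

# Pair each actual rating with its prediction, like a join on the key.
rates_and_preds = [(ratings[k], predictions[k]) for k in ratings if k in predictions]

# Mean of the squared differences.
mse = sum((r - p) ** 2 for r, p in rates_and_preds) / len(rates_and_preds)
print("Mean Squared Error = " + str(mse))
```

An error measured on the same ratings the model was trained on tends to be optimistic, so a held-out set gives a more honest number.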
-------------------------------------------------------------------------------------------------------
ERROR : taskSchedulerImpl: Initial job has not accepted any resources
http://www.datastax.com/dev/blog/common-spark-troubleshooting
-------------------------------------------------------------------------------------------------------
ALS
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$
ALS.scala
https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
Movie Recommendations with MLlib
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
Dataset - MovieLens 1M Dataset
http://grouplens.org/datasets/movielens/
Friday, May 6, 2016
[Spark1.6.0] ERROR SparkContext: Error initializing SparkContext
[hadoop@master01 spark-1.6.0]$ spark-shell
16/05/06 16:46:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
16/05/06 16:46:54 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Log directory hdfs:///user/spark/eventlog does not exist.
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:101)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:549)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
at $line3.$read$$iwC$$iwC.<init>(<console>:15)
at $line3.$read$$iwC.<init>(<console>:24)
at $line3.$read.<init>(<console>:26)
at $line3.$read$.<init>(<console>:30)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>(<console>)
Solution :
hdfs dfs -mkdir -p /user/spark/eventlog
[Spark1.6.0] Install Scala & Spark
Download and install Scala 2.11.8
Configure Scala
---------------------------------------------------------------------------------------
sudo gedit ~/.bashrc
#scala
export SCALA_HOME=/opt/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
source ~/.bashrc
---------------------------------------------------------------------------------------
test
[hadoop@master01 lib]$ scala
Welcome to Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91).
Type in expressions for evaluation. Or try :help.
scala> 1+1
res0: Int = 2
---------------------------------------------------------------------------------------
Download and install Spark 1.6.0 on Hadoop 2.6
Configure Spark
---------------------------------------------------------------------------------------
sudo gedit ~/.bashrc
#Spark
export SPARK_HOME=/opt/spark-1.6.0
export PATH=$PATH:$SPARK_HOME/bin
source ~/.bashrc
---------------------------------------------------------------------------------------
cp spark-env.sh.template spark-env.sh
sudo gedit spark-env.sh
export SCALA_HOME=/opt/scala-2.11.8
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export SPARK_MASTER_IP=master01
export SPARK_WORKER_MEMORY=1024m
Then set the following properties in spark-defaults.conf:
spark.master spark://master01:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///user/spark/eventlog
ps aux | grep spark
hadoop 969 0.0 0.0 112644 952 pts/0 R+ 21:21 0:00 grep --color=auto spark
---------------------------------------------------------------------------------------
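import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
val NUM_SAMPLES = 100000 // number of random points; value chosen for illustration
val count = sc.parallelize(1 to NUM_SAMPLES).map{i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES) // fraction inside the quarter circle, times 4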
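The snippet above counts random points that fall inside the unit quarter circle, which is the usual Monte Carlo way of estimating pi. A single-machine sketch of the same idea in plain Python follows; estimate_pi and the sample count are names and values chosen for this illustration.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi; estimate_pi is a name chosen for this sketch."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.random()
        y = rng.random()
        # Count points falling inside the unit quarter circle.
        if x * x + y * y < 1:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / num_samples

pi_estimate = estimate_pi(100000)
print(pi_estimate)
```

The estimate improves slowly with the number of samples, which is why the Spark version spreads the sampling across workers with parallelize.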
---------------------------------------------------------------------------------------
Word count example
scala> val textFile = sc.textFile("hdfs://master01:9000/opt/hadoop-2.7.1/input/text34mb.txt")
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:27
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[12] at reduceByKey at <console>:29
scala> wordCounts.collect()
res0: Array[(String, Int)] = Array(('lopin',1), (Ah!,99), (houres,,36), (Committee,),1), (bone,40), (fleein',1), (�Head.�,1), (delinquents.,2), (Malwa,1), (routing*,2), ('farthest,1), (Dollours,2), (Feldkirch,,3), ((1754-1831),,1), (nothin,1), (untruthfulness.,1), (signal.,6), (langwidge,3), (drad;*,1), (meets,,3), (Lost.,3), (Papists,,6), (accompts,,2), (Goodbye!,1), (Galliard,4), ((1563-1631),1), (Anthonio,,40), (God-forsaken,4), (rightly-,1), (fowl,30), (coat;,3), (husky,5), (Carpenter,4), (precious*,1), (ampullaria,1), (afterward,64), (armes*,,2), (entend*,1), (provisioned,,1), (wicked?,3), (Francaise,1), (Herefords,2), (Souls.",1), (/Loci,2), (speak:,9), (half-crowns,1), (Thunder.,18), (Halkar;,2), (HISTORIES.,1), (feats;,1), (robin,1), (fixed-I,1), (undeterred,2), (fastenings,4), ...
[Hadoop2.7.1] Can't run Datanode
Error : java.io.IOException: Incompatible clusterIDs
Solution :
\rm -r /opt/hadoop-2.7.1/tmp/
hadoop namenode -format
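After that, you can start the cluster again. Note that formatting the NameNode erases the existing HDFS metadata, so data already stored in HDFS is lost.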
Reference
http://blog.chinaunix.net/uid-20682147-id-4214553.html
Thursday, May 5, 2016
[Hadoop2.7.1] Wordcount
hadoop fs -mkdir -p /opt/hadoop-2.7.1/input
hadoop fs -copyFromLocal /opt/hadoop-2.7.1/text/text34mb.txt /opt/hadoop-2.7.1/input
hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /opt/hadoop-2.7.1/input/text34mb.txt /opt/hadoop-2.7.1/output
[hadoop@master01 lib]$ hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /opt/hadoop-2.7.1/input/text34mb.txt /opt/hadoop-2.7.1/output
16/05/05 16:30:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/05/05 16:30:43 INFO input.FileInputFormat: Total input paths to process : 1
16/05/05 16:30:44 INFO mapreduce.JobSubmitter: number of splits:1
16/05/05 16:30:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462429858916_0001
16/05/05 16:30:45 INFO impl.YarnClientImpl: Submitted application application_1462429858916_0001
16/05/05 16:30:45 INFO mapreduce.Job: The url to track the job: http://master01:8088/proxy/application_1462429858916_0001/
16/05/05 16:30:45 INFO mapreduce.Job: Running job: job_1462429858916_0001
16/05/05 16:30:53 INFO mapreduce.Job: Job job_1462429858916_0001 running in uber mode : false
16/05/05 16:30:53 INFO mapreduce.Job: map 0% reduce 0%
16/05/05 16:31:04 INFO mapreduce.Job: map 42% reduce 0%
16/05/05 16:31:09 INFO mapreduce.Job: map 67% reduce 0%
16/05/05 16:31:11 INFO mapreduce.Job: map 100% reduce 0%
16/05/05 16:31:19 INFO mapreduce.Job: map 100% reduce 100%
16/05/05 16:31:19 INFO mapreduce.Job: Job job_1462429858916_0001 completed successfully
16/05/05 16:31:19 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=9917184
FILE: Number of bytes written=15106616
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=35926297
HDFS: Number of bytes written=3103134
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=15003
Total time spent by all reduces in occupied slots (ms)=4504
Total time spent by all map tasks (ms)=15003
Total time spent by all reduce tasks (ms)=4504
Total vcore-seconds taken by all map tasks=15003
Total vcore-seconds taken by all reduce tasks=4504
Total megabyte-seconds taken by all map tasks=15363072
Total megabyte-seconds taken by all reduce tasks=4612096
Map-Reduce Framework
Map input records=788346
Map output records=6185757
Map output bytes=59289268
Map output materialized bytes=4958589
Input split bytes=121
Combine input records=6185757
Combine output records=328274
Reduce input groups=272380
Reduce shuffle bytes=4958589
Reduce input records=328274
Reduce output records=272380
Spilled Records=984822
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=209
CPU time spent (ms)=11810
Physical memory (bytes) snapshot=327483392
Virtual memory (bytes) snapshot=4164567040
Total committed heap usage (bytes)=219676672
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=35926176
File Output Format Counters
Bytes Written=3103134
---------------------------------------------------------------------------------------
Delete the output directory
hdfs dfs -rm -r /opt/hadoop-2.7.1/output
---------------------------------------------------------------------------------------
[hadoop@master01 lib]$ ls /opt/hadoop-2.7.1/share/hadoop/mapreduce/
hadoop-mapreduce-client-app-2.7.1.jar
hadoop-mapreduce-client-common-2.7.1.jar
hadoop-mapreduce-client-core-2.7.1.jar
hadoop-mapreduce-client-hs-2.7.1.jar
hadoop-mapreduce-client-hs-plugins-2.7.1.jar
hadoop-mapreduce-client-jobclient-2.7.1.jar
hadoop-mapreduce-client-jobclient-2.7.1-tests.jar
hadoop-mapreduce-client-shuffle-2.7.1.jar
hadoop-mapreduce-examples-2.7.1.jar
lib
lib-examples
sources
---------------------------------------------------------------------------------------
Reference
http://kurthung1224.pixnet.net/blog/post/175503049
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0