Simulate a NameNode crash, for example by deleting the entire contents of the name directory, then recover the NameNode from the Secondary NameNode, capturing screenshots of the experiment along the way.
If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
The NameNode's metadata consists mainly of two files: fsimage holds the checkpointed namespace, and edits records subsequent changes; together they can be used to recover the NameNode, as described below:
The NameNode persists its namespace using two files: fsimage, which is the latest checkpoint of the namespace, and edits, a journal (log) of changes to the namespace since the checkpoint.
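On disk these live under the name directory's current/ subdirectory. A quick illustration (an aside, not part of the original run; judging from the steps below, dfs.name.dir on this cluster is /opt/HDP/meta):
ls /opt/HDP/meta/current
# typical contents on a Hadoop 1.x namenode: VERSION, edits, fsimage, fstime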
If the NameNode goes down, it can be recovered from the Secondary NameNode with the following steps.
Observations:
fs.checkpoint.dir: files exist on both the secondary namenode and the namenode
dfs.data.dir: empty on both the namenode and the secondary namenode
dfs.name.dir: stored on the namenode; empty on the secondary namenode
dfs.name.edits.dir: stored on the namenode; empty on the secondary namenode
This shows that two of the parameters belong to the namenode (nn):
dfs.name.dir
dfs.name.edits.dir
and two belong to the secondary namenode (snn):
fs.checkpoint.edits.dir
fs.checkpoint.dir
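These observations can be reproduced with a few ls commands (a sketch only; the paths are inferred from the commands below: dfs.name.dir=/opt/HDP/meta, fs.checkpoint.dir=/opt/HDP/ckpoint, dfs.data.dir=/opt/HDP/data):
ssh nn1 ls /opt/HDP/meta/current      # dfs.name.dir on the namenode
ssh nn2 ls /opt/HDP/meta/current      # dfs.name.dir on the secondary namenode
ssh nn1 ls /opt/HDP/ckpoint/current   # fs.checkpoint.dir on the namenode
ssh nn2 ls /opt/HDP/ckpoint/current   # fs.checkpoint.dir on the secondary namenode
ssh dn1 ls /opt/HDP/data              # dfs.data.dir lives on the datanodes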
The actual test steps are as follows:
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop secondarynamenode
Simulate total loss of the metadata:
rm -rf /opt/HDP/meta/*
Re-create the directories on the namenode:
mkdir -p /home/hadoop/tmp /opt/HDP/meta /opt/HDP/meta/edits /opt/HDP/ckpoint/image/edits
ssh dn1 chmod 755 /opt/HDP/data
ssh dn2 chmod 755 /opt/HDP/data
Copy the snn's checkpoint dir back to the nn:
scp -pr nn2:/opt/HDP/ckpoint/* /opt/HDP/ckpoint/.
mkdir -p /opt/HDP/ckpoint/image/edits
Temporarily add the following to $conf/core-site.xml:
Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.
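That description is the one shipped for fs.checkpoint.dir. A minimal sketch of the property block to add (the value is an assumption taken from the /opt/HDP/ckpoint path used elsewhere in this post; paste the <property> element inside <configuration>):
# print the snippet to paste into $conf/core-site.xml
cat <<'EOF'
<property>
  <name>fs.checkpoint.dir</name>
  <value>/opt/HDP/ckpoint</value>
</property>
EOF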
hadoop namenode -importCheckpoint
13/10/29 22:55:15 INFO ipc.Server: IPC Server listener on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 1 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 2 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 4 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 5 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 6 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 7 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 8 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 3 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 0 on 8020: starting
13/10/29 22:55:15 INFO ipc.Server: IPC Server handler 9 on 8020: starting
13/10/29 23:21:52 INFO logs: Aliases are enabled
13/10/29 23:22:10 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks is only 0 but the threshold is 0.9500 and the total blocks 3. Safe mode will be turned off automatically.
13/10/29 23:22:10 INFO ipc.Server: IPC Server handler 2 on 8020, call rollEditLog() from 192.168.35.131:45427: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks is only 0 but the threshold is 0.9500 and the total blocks 3. Safe mode will be turned off automatically.
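The SafeModeException above just means the freshly imported namespace is still waiting for the datanodes to report their blocks. As an aside (not part of the original run), safe-mode status can be checked, and left manually if it never clears, with dfsadmin:
hadoop dfsadmin -safemode get      # shows whether safe mode is ON or OFF
hadoop dfsadmin -safemode leave    # force the namenode out of safe mode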
[Ctrl + C] to stop the foreground importCheckpoint, then:
[hadoop@nn1 meta]$ hadoop-daemon.sh start namenode
[hadoop@nn2 meta]$ hadoop-daemon.sh start secondarynamenode
[hadoop@nn1 meta]$ jps
10104 Jps
9786 NameNode
9964 JobTracker
[hadoop@nn1 meta]$ hadoop dfs -ls testjay
Found 1 items
-rw-r--r--   2 hadoop supergroup   16 2013-10-29 22:14 /user/hadoop/testjay/test128m
[hadoop@nn1 meta]$ hadoop dfs -cat testjay/test128m
12312dsofjosdjf
[hadoop@nn1 meta]$
[hadoop@nn1 meta]$ hadoop dfs -cat testjay/test128m
12312dsofjosdjf
[hadoop@nn1 meta]$ cd
[hadoop@nn1 ~]$ dd if=/dev/zero of=testfile bs=10240k count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.011087 seconds, 946 MB/s
[hadoop@nn1 ~]$ hadoop fs -put testfile testjay/testfile
tail -f hadoop-hadoop-secondarynamenode-nn2.log
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL nn1:50070putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383058879000:1383058699786&newChecksum=319867456f25e3b20d3553248b09788b
2013-10-29 23:01:20,205 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383058879000:1383058699786&newChecksum=319867456f25e3b20d3553248b09788b
2013-10-29 23:01:20,262 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 989
[hadoop@nn1 ~]$ hadoop fs -rm testjay/testfile
Moved to trash: hdfs://nn1:8020/user/hadoop/testjay/testfile
tail -f hadoop-hadoop-secondarynamenode-nn2.log
2013-10-29 23:04:20,298 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 0 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
2013-10-29 23:04:20,299 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,299 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,330 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?getimage=1
2013-10-29 23:04:20,339 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file fsimage size 989 bytes.
2013-10-29 23:04:20,339 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?getedit=1
2013-10-29 23:04:20,344 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file edits size 976 bytes.
2013-10-29 23:04:20,344 INFO org.apache.hadoop.hdfs.util.GSet: Computing capacity for map BlocksMap
2013-10-29 23:04:20,344 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit
2013-10-29 23:04:20,345 INFO org.apache.hadoop.hdfs.util.GSet: 2.0% max memory = 1013645312
2013-10-29 23:04:20,345 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^21 = 2097152 entries
2013-10-29 23:04:20,345 INFO org.apache.hadoop.hdfs.util.GSet: recommended=2097152, actual=2097152
2013-10-29 23:04:20,575 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop
2013-10-29 23:04:20,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2013-10-29 23:04:20,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2013-10-29 23:04:20,577 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2013-10-29 23:04:20,577 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2013-10-29 23:04:20,577 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: dfs.namenode.edits.toleration.length = -1
2013-10-29 23:04:20,578 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2013-10-29 23:04:20,581 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 11
2013-10-29 23:04:20,594 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
2013-10-29 23:04:20,594 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Start loading edits file /opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,606 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: EOF of /opt/HDP/ckpoint/image/edits/current/edits, reached end of edit log Number of transactions found: 10. Bytes read: 976
2013-10-29 23:04:20,606 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Edits file /opt/HDP/ckpoint/image/edits/current/edits of size 976 edits # 10 loaded in 0 seconds.
2013-10-29 23:04:20,607 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 0 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
2013-10-29 23:04:20,612 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=976, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,612 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 976, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file /opt/HDP/ckpoint/current/fsimage of size 1625 bytes saved in 0 seconds.
2013-10-29 23:04:20,712 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: closing edit log: position=4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,713 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: close success: truncate to 4, editlog=/opt/HDP/ckpoint/image/edits/current/edits
2013-10-29 23:04:20,728 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL nn1:50070putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383059060000:1383058880239&newChecksum=25aedb3e4690896fc583ba7bab176cd7
2013-10-29 23:04:20,728 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://nn1:50070/getimage?putimage=1&port=50090&machine=nn2&token=-41:62854137:0:1383059060000:1383058880239&newChecksum=25aedb3e4690896fc583ba7bab176cd7
2013-10-29 23:04:20,782 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 1625
Takeaways from this NameNode recovery: the datanodes do not need to be restarted; only the nn and snn have to be stopped.
Recovering directly onto the snn (promoting it to be the namenode) should be a similar procedure, but the following parameters must be changed (the original nn address replaced with the snn's IP/hostname):
fs.default.name
dfs.http.address
dfs.https.address
mapred.job.tracker
mapred.job.tracker.http.address
and the following parameter must point to the new snn:
dfs.secondary.http.address
The contents of $conf/masters must also be changed to the new snn; a rough sketch of these edits follows.
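A rough sketch of those edits on the new master, assuming nn2 is promoted to namenode, that the hostnames appear literally as nn1/nn2 in the config files, and using new-snn as a hypothetical placeholder for whichever host takes over the secondary role:
# fs.default.name, dfs.http.address, dfs.https.address, mapred.job.tracker, mapred.job.tracker.http.address
sed -i 's/nn1/nn2/g' $conf/core-site.xml $conf/hdfs-site.xml $conf/mapred-site.xml
# dfs.secondary.http.address (50090 is the usual default port) and the masters file
sed -i 's/nn2:50090/new-snn:50090/' $conf/hdfs-site.xml
echo new-snn > $conf/masters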