
# Deeply Exploring the Data Reading Process of the Hadoop Distributed File System (HDFS)

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. The opening ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hadoop distributed file system (HDFS) yes Hadoop Data storage facilities at the bottom of big data ecology . Because it has the ability of massive data distributed storage , Calculate the carrying capacity of large throughput data for different batch processing services , Make its comprehensive complexity much higher than other data storage systems .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" So right. Hadoop distributed file system (HDFS) In depth study of , Understand its architecture features 、 Read and write flow 、 Partition mode 、 High availability ideas 、 Data storage planning and other knowledge , It's good for learning big data technology , Especially in the face of development and production environment , Be able to know in your mind .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" This article focuses on reading from the client HDFS From the perspective of data , adopt Hadoop Source code tracking means , Layer by layer , Go deeper Hadoop Inside the mechanism , Make the reading process clear .","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. 
## 3. Hadoop Source Code Analysis

With the brief description above, we now have an overall picture of how a client reads HDFS file data. In this section we start tracing from the source code and analyze the internal mechanism of HDFS data access in depth.

**(I) How the namenode proxy is generated**

Why start with the generation of the namenode proxy?
The reason is that understanding the relationship between the client and the namenode first gives you a clear thread to follow when we examine the data reading process later.

(1) Let's start with a configuration item in hdfs-site.xml:

```xml
<property>
  <name>dfs.client.failover.proxy.provider.fszx</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

This configuration declares that the provider of the namenode proxy is ConfiguredFailoverProxyProvider. What is a namenode proxy? It is essentially the client-side network communication object that connects to the namenode service, used for communication between the client and the namenode server.

(2) Next, let's look at the inheritance structure of ConfiguredFailoverProxyProvider:

![ConfiguredFailoverProxyProvider inheritance diagram](https://static001.geekbang.org/infoq/02/02946eeea933fbff9bc896ba1b7566c1.png)

The figure above shows the inheritance relationships of ConfiguredFailoverProxyProvider. The topmost interface is FailoverProxyProvider, which contains the following piece of code:

```java
/**
 * Get the proxy object which should be used until the next failover event
 * occurs.
 * @return the proxy object to invoke methods upon
 */
public ProxyInfo<T> getProxy();
```

This method returns a ProxyInfo, which is the namenode proxy object. The full process by which the client obtains a ProxyInfo is quite involved and even uses dynamic proxies, but in essence it is through this interface that the namenode proxy is obtained.
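As a hedged sketch of how the client maps the configuration above to the provider class it will instantiate (the real logic lives in NameNodeProxiesClient and handles defaults, validation, and HA checks omitted here; the nameservice ID "fszx" is the one from the sample configuration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.FailoverProxyProvider;

public class ResolveProxyProvider {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Key = fixed prefix + nameservice ID ("fszx" in the sample hdfs-site.xml).
    String key = "dfs.client.failover.proxy.provider.fszx";
    // Resolve the class named in the configuration, requiring that it
    // implement the FailoverProxyProvider interface discussed above.
    Class<? extends FailoverProxyProvider> clazz =
        conf.getClass(key, null, FailoverProxyProvider.class);
    System.out.println("namenode proxy provider: " + clazz);
  }
}
```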
:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1c/1cfd58ea29192a69ff092a46a1207c3f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"namonode Create process class diagrams ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Upper figure ProxyInfo Namely namenode Proxy class , Subclasses of inheritance NNProxyInfo Is to specify a highly available proxy class .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4) So it took so much effort to find out namenode agent , What's its function ?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" This needs to focus on a very important object DFSClient 了 , It's all clients to HDFS The starting point of the I / O stream , As shown in the figure below :","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6a/6aa0c2ef2c01321a146e7515707a443a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"DFSClient Initialization process class diagram ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The solid line above represents the actual calling process , The dotted line represents the indirect relationship between objects . We can see DFSClient It's a key role , It consists of distributed file system objects (DistributeFileSystem) initialization , And it is called in initialization NameNodeProxiesClient A series of operations , High availability NNproxyInfo objects creating , That is to say namenode agent , And ultimately as DFSClient A member of the object , In the process of creating data stream, etc .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"( Two ) Read file stream in-depth source code exploration ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) First of all, the method is the same , Cut through the entrance and find one first . 
**(II) Deep dive into the read stream source code**

(1) As before, we cut in from an entry point. Consider a simple scenario that downloads a file from HDFS to the local disk; here is the code snippet:

```java
……
// Open the HDFS file input stream
input = fileSystem.open(new Path(hdfs_file_path));
// Create the local file output stream
output = new FileOutputStream(local_file_path);
// Copy the stream in a loop, buffer by buffer, via the IOUtils utility
IOUtils.copyBytes(input, output, 4096, true);
……
```

Let's look at the stream copy method inside IOUtils:

```java
/**
 * Copies from one stream to another.
 *
 * @param in InputStream to read from
 * @param out OutputStream to write to
 * @param buffSize the size of the buffer
 */
public static void copyBytes(InputStream in, OutputStream out, int buffSize)
    throws IOException {
  PrintStream ps = out instanceof PrintStream ? (PrintStream) out : null;
  byte buf[] = new byte[buffSize];
  int bytesRead = in.read(buf);
  while (bytesRead >= 0) {
    out.write(buf, 0, bytesRead);
    if ((ps != null) && ps.checkError()) {
      throw new IOException("Unable to write to output stream.");
    }
    bytesRead = in.read(buf);
  }
}
```

This code is a standard loop that reads from the HDFS InputStream and writes the data to the local file OutputStream. Our target for the deep dive is exactly that HDFS InputStream: how it is created and used.
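For completeness, here is a hedged, self-contained version of the download scenario above; the cluster URI and both file paths are placeholders you would substitute:

```java
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDownload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://fszx"), conf);
         InputStream in = fs.open(new Path("/data/demo.log"));       // HDFS source (placeholder)
         OutputStream out = new FileOutputStream("/tmp/demo.log")) { // local target (placeholder)
      // 4096-byte buffer; close=false because try-with-resources closes the streams.
      IOUtils.copyBytes(in, out, 4096, false);
    }
  }
}
```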
(2) Next we analyze how the InputStream is produced, as shown below:

![InputStream open-process class diagram](https://static001.geekbang.org/infoq/ec/ec8a36797060fa31ba13f7fc985e9d40.png)

In the figure, solid lines represent actual call flows and dotted lines represent indirect relationships between objects. The internal structure of the code is extremely complex; the diagram is deliberately kept as simple as possible so that we can quickly grasp the principle.

Let me briefly walk through the process:

In the first step, DistributedFileSystem calls the open method of its DFSClient object, which creates a DFSInputStream object. DFSInputStream implements the logic that actually reads data blocks (LocatedBlock) and interacts with datanodes; it is the real core class.

In the second step, while creating the DFSInputStream, DFSClient must pass in the set of data blocks (LocatedBlocks) it obtained for the stream by calling the namenode proxy.

In the third step, DFSClient creates the decorator class HdfsDataInputStream, which wraps the DFSInputStream. Through its parent decorator class FSDataInputStream, it is finally returned to DistributedFileSystem for client developers to use.
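A small hedged probe (the URI and path are placeholders) makes this decorator structure visible at runtime: the stream returned by open() is an HdfsDataInputStream, and the stream it wraps is the DFSInputStream doing the real work.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

public class InspectOpenedStream {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://fszx"), conf);
         FSDataInputStream in = fs.open(new Path("/data/demo.log"))) {
      // The decorator returned to client code:
      System.out.println(in.getClass()); // ...HdfsDataInputStream on HDFS
      if (in instanceof HdfsDataInputStream) {
        // The wrapped stream is the DFSInputStream that reads the blocks.
        System.out.println(((HdfsDataInputStream) in).getWrappedStream().getClass());
      }
    }
  }
}
```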
(3) Finally, let's go deep into the source of the block reading mechanism, as shown below:

![DFSInputStream data reading class diagram](https://static001.geekbang.org/infoq/c1/c1e86a454da4a48215a319d3523b4486.png)

Again, solid lines represent actual call flows and dotted lines represent indirect relationships between objects. The actual code logic is considerably more complex; this diagram, too, is kept as simple as possible to aid understanding.

As before, a brief walkthrough of the process:

In the first step, the FSDataInputStream decorator accepts the client's read call and delegates it to the read(...) method of the DFSInputStream object.

In the second step, DFSInputStream calls its own blockSeekTo(long offset) method. Based on the offset of the data being read, it determines whether a new data block (LocatedBlock) must be targeted; after finding that new block in the block set (LocatedBlocks), it looks for the best datanode. This is Hadoop's so-called proximity principle: first check whether the local datanode holds a replica, and otherwise pick the nearest replica by network distance.

In the third step, a BlockReader object is constructed for the located block replica (LocatedBlock); it is what really reads the block data. BlockReader has several implementations, and BlockReaderFactory.build selects the concrete one according to which option is best. BlockReaderLocal, together with the legacy BlockReaderLocalLegacy (based on HDFS-2246), is the preferred scheme; this is the short-circuit read path, which amounts to reading the data directly from the local file system. If short-circuit reading is unavailable, for security or other reasons, the client tries the UNIX domain socket optimization. Failing that as well, it falls back to BlockReaderRemote, which establishes a TCP socket connection to the datanode. The inner workings of BlockReader are well worth studying in depth; next time I will write an article dedicated to the BlockReader mechanism.
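Which of these BlockReader paths actually served a read can be observed from client-side counters. Below is a hedged probe (placeholder URI and path) using the read statistics exposed by HdfsDataInputStream:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

public class ReadPathProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://fszx"), conf);
         HdfsDataInputStream in =
             (HdfsDataInputStream) fs.open(new Path("/data/demo.log"))) {
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) {
        // Drain the file so the statistics cover every block.
      }
      System.out.println("total bytes:         " + in.getReadStatistics().getTotalBytesRead());
      System.out.println("local bytes:         " + in.getReadStatistics().getTotalLocalBytesRead());
      System.out.println("short-circuit bytes: " + in.getReadStatistics().getTotalShortCircuitBytesRead());
    }
  }
}
```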
## 4. Closing

That is all for the read process. Next I will do a similarly in-depth exploration and analysis of the "[Hadoop Distributed File System (HDFS) data writing process](https://xie.infoq.cn/article/4438a7b40a9d832e9eb7c4e67)". Looking forward to your attention.

**Author: Fang Shun, founder of Xi'an Guardian Stone Information Technology, devoted to the technical growth of engineers in the big data field**

[Visit my Zhihu column for more on big data](https://www.zhihu.com/column/c_151487501)

![Official account: Guardian Stone on Data](https://static001.geekbang.org/infoq/c1/c18eee3b3b896897a333305bc70cfc88.jpeg)

Copyright notice
This article was created by [InfoQ]; please include the original link when reposting. Thanks.
https://cdmana.com/2020/12/20201225123339493Z.html
