[Aj. NesT teaches Big Data] Hands-on Workshop: Building Big Data Infrastructure for Big Data Engineers

[Aj. NesT teaches Big Data] Hands-on Workshop: Building Big Data Infrastructure for Big Data Engineers and Data Visualization for Data Analysts

Train the Trainers Program: Big Data Architecture and Data Visualization Workshops

PART 1: Introduction to Big Data Technology
PART 2: Data Visualization with Tableau Workshop for Data Analysts
PART 3: Big Data Infrastructure Workshop for Data Engineers (Hadoop, HDFS, and MapReduce)

PART 1: INTRODUCTION TO BIG DATA TECHNOLOGY 

PART 2: DATA VISUALIZATION WITH TABLEAU WORKSHOP FOR DATA ANALYSTS

Tableau Presenting Analytics for Data Visualization (1)

Tableau Presenting Analytics for Data Visualization (2)

——— Detailed version, easier to understand ———

[Aj. NesT teaches Tableau] EP.1 Tableau Presenting Analytics for Data Visualization

[Aj. NesT teaches Tableau] EP.2 Tableau Presenting Analytics for Data Visualization

PART 3: BIG DATA INFRASTRUCTURE FOR DATA ENGINEERS 

Hadoop Training

Registration of Amazon Web Services and Hadoop Setup for Big Data Engineers

Importing Data to HDFS in Hadoop for Big Data Engineers

Writing MapReduce with Java Programming for Processing in Hadoop for Big Data Engineers

——— Detailed version, easier to understand ———

[Aj. NesT teaches Big Data] EP.0 Using Amazon Web Services (AWS) for Setting Up a Server

[Aj. NesT teaches Big Data] EP.1 Hadoop Setup (Preparing the Environment for Big Data) for Data Engineers

[Aj. NesT teaches Big Data] EP.2 Importing Data to HDFS for Data Engineers

[Aj. NesT teaches Big Data] EP.3 Writing MapReduce with Java Programming for Processing in Hadoop

——————————————————————————————

Big Data Laboratory

Download Material https://docs.google.com/document/d/1XaoOpEzbKZzEHdTpvOJfeuFNZtXS0oMNiBUMgysoVlk/edit?usp=sharing

Machines

– Amazon Web Services (AWS) => EC2

– A computer for remoting into EC2 via the PuTTY software

Lab 1: Preparing Environment for Big Data (Hadoop Setup)  

1.1) Install Ubuntu (Cloud Server or Virtual Machine)

1.2) Update Ubuntu

$ sudo apt-get update

1.3) Install ssh and Create ssh-key

$ sudo apt-get install -y openssh-server //install the openssh server

$ ssh-keygen -t dsa -P '' //create ssh public and private keys (newer systems may require -t rsa, as DSA is deprecated)

$ cat .ssh/id_dsa.pub >> .ssh/authorized_keys //append the public key to authorized_keys

$ ssh localhost //check that ssh works

$ exit //exit from the ssh connection

1.4) Installing Java

$ sudo apt-get install -y openjdk-7-jdk //install java

$ java -version //check java version

1.5) Download and Extract Hadoop

$ wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz  

//download hadoop via url (this lab: version 2.8.0)

$ tar -xvf hadoop-2.8.0.tar.gz  //extract hadoop files

$ sudo mv ./hadoop-2.8.0 /usr/local/hadoop //move to /usr/local/hadoop

1.6) Installing Hadoop

$ nano ~/.bashrc //edit .bashrc (the file lives in the home directory)

—.bashrc file (~/.bashrc)—

//Append the following lines at the bottom of the file

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin


export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

// Exit nano => press Ctrl+X -> y -> Enter

//Apply the environment variables

$ source ~/.bashrc
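A quick way to confirm the variables took effect after sourcing — a minimal sketch whose exports simply mirror the lines added to .bashrc above:

```shell
# Re-declare the variables exactly as .bashrc does, then verify them
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

echo "$HADOOP_HOME"    # prints: /usr/local/hadoop

# Confirm the Hadoop bin directory is now on the PATH
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) echo "PATH ok";;
  *) echo "PATH missing Hadoop bin";;
esac
```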

$ sudo mkdir /var/log/hadoop //create hadoop directory
$ sudo chown -R ubuntu:ubuntu /var/log/hadoop //change file owner and group


//Edit Hadoop shell script

$ cd /usr/local/hadoop/etc/hadoop/ //go to the directory containing hadoop-env.sh
$ nano hadoop-env.sh //edit the hadoop-env.sh file

 

—hadoop-env.sh (/usr/local/hadoop/etc/hadoop)—
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 //set path java


export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
export HADOOP_LOG_DIR=/var/log/hadoop  //log location to another directory

export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

//Edit Yarn shell script

$ nano yarn-env.sh //open yarn-env.sh file

 

—yarn-env.sh (/usr/local/hadoop/etc/hadoop)—

//Insert the following line above the YARN_OPTS line

export YARN_LOG_DIR=/var/log/hadoop //log location to another directory

 

1.7) Configuring Hadoop

//Edit Hadoop – core-site.xml (/usr/local/hadoop/etc/hadoop/)

$ nano core-site.xml //open core-site.xml file

—core-site.xml (/usr/local/hadoop/etc/hadoop)—

//Config Hadoop – core-site.xml

<configuration>
       <property>
               <name>fs.defaultFS</name>
               <value>hdfs://172.31.21.22:9000</value> //private ip address
       </property>
</configuration>

//Creating directory for namenode and datanode

$ sudo mkdir -p /var/hadoop_data/namenode
$ sudo mkdir -p /var/hadoop_data/datanode
$ sudo chown ubuntu:ubuntu -R /var/hadoop_data

//Edit hdfs-site.xml

$ nano hdfs-site.xml //open hdfs-site.xml file
—hdfs-site.xml (/usr/local/hadoop/etc/hadoop)—

//Config Hadoop – hdfs-site.xml
<configuration>
       <property>
               <name>dfs.replication</name>
               <value>1</value>
       </property>
       <property>
               <name>dfs.namenode.name.dir</name>
               <value>file:/var/hadoop_data/namenode</value>
       </property>
       <property>
               <name>dfs.datanode.data.dir</name>
               <value>file:/var/hadoop_data/datanode</value>
       </property>
</configuration>

//Edit yarn-site.xml

$ nano yarn-site.xml //open yarn-site.xml file


—yarn-site.xml (/usr/local/hadoop/etc/hadoop)—

//Config Hadoop – yarn-site.xml
<configuration>
      <property>
               <name>yarn.resourcemanager.hostname</name>
               <value>172.31.21.22</value> //private ip address
      </property>
      <property>
               <name>yarn.resourcemanager.scheduler.address</name>
               <value>172.31.21.22:8030</value> //private ip address
      </property>
      <property>
               <name>yarn.resourcemanager.resource-tracker.address</name>
               <value>172.31.21.22:8031</value> //private ip address
      </property>
      <property>
               <name>yarn.resourcemanager.address</name>
               <value>172.31.21.22:8032</value> //private ip address
      </property>
      <property>
               <name>yarn.resourcemanager.admin.address</name>
               <value>172.31.21.22:8033</value> //private ip address
      </property>
      <property>
               <name>yarn.resourcemanager.webapp.address</name>
               <value>172.31.21.22:8088</value> //private ip address
      </property>
      <property>
               <name>yarn.nodemanager.aux-services</name>
               <value>mapreduce_shuffle</value>
      </property>
      <property>
               <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
               <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
</configuration>

//Edit mapred-site.xml

$ cp mapred-site.xml.template mapred-site.xml
$ nano mapred-site.xml

 

—mapred-site.xml (/usr/local/hadoop/etc/hadoop)—

//Config Hadoop – mapred-site.xml
<configuration>
       <property>
               <name>mapreduce.jobhistory.webapp.address</name>
               <value>localhost:19888</value> //host:port (no http:// prefix)
       </property>
</configuration>

1.8) Formatting the Namenode

//Working directory: /usr/local/hadoop/etc/hadoop

$ hdfs namenode -format

1.9) Starting Hadoop

//Starting Namenode and Datanode

$ start-dfs.sh

$ jps

//Starting Yarn

$ start-yarn.sh

$ jps

1.10) Accessing Hadoop Web Console (Interface)

On your local computer (the one remoting into EC2) -> open a browser

Type the URL -> 54.148.21.22:50070 //public ip address of the EC2 instance

Ports

HDFS: NameNode web UI 50070, DataNodes 50075, SecondaryNameNode 50090

Yarn: ResourceManager web UI 8088

1.11) Stopping Hadoop

$ stop-yarn.sh

$ stop-dfs.sh

 

Lab 2: Importing Data to HDFS

2.1) Creating Hadoop HDFS Directories and Importing file to Hadoop  

//Start HDFS

$ start-dfs.sh

$ jps //check

//Start Yarn

$ start-yarn.sh

$ jps //check

//Make directory for input and output

//Working directory: /var/hadoop_data/
$ hdfs dfs -mkdir /inputs
$ hdfs dfs -mkdir /outputs
$ cd

//Creating input data

$ nano input_data.txt //(or import file) create data file

—input_data.txt (/)—

//Insert data in input_data.txt for processing (inputs)

//Importing the file to Hadoop (local file system -> HDFS)

$ hdfs dfs -copyFromLocal ./input_data.txt /inputs/input_data.txt
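As a concrete reference — the actual contents of input_data.txt are up to you — a small sample file (hypothetical contents) can be generated and checked before the copyFromLocal step:

```shell
# Generate a small sample input file (hypothetical contents) for the word count lab
cat > input_data.txt <<'EOF'
big data hadoop
hadoop mapreduce
big data
EOF

# Quick sanity check: the file should have 3 lines
wc -l input_data.txt    # prints: 3 input_data.txt
```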

2.2) Traversing, Retrieving Data from HDFS (Check inputs in HDFS)  
//Review data in Hadoop HDFS

$ hdfs dfs -ls /inputs //show inputs file
$ hdfs dfs -cat /inputs/input_data.txt //show data in input_data.txt file

//Additionally: deleting data from HDFS

Folder in HDFS -> $ hdfs dfs -rm -r /outputs/tests //delete the tests folder (hadoop fs -rmr is deprecated)
Files in HDFS -> $ hdfs dfs -rm /inputs/test01.txt //(or .csv) delete the test01.txt file

//Check via the HDFS web interface in a browser

On your local computer -> open a browser

Type the URL -> 54.148.21.22:50070 //public ip address

Click Menu -> Utilities -> Browse the file system

Lab 3: Writing MapReduce for Processing in Hadoop
Continues from Lab 2 (the input data must already be in HDFS)

3.1) Writing MapReduce

//Create a java file for programming

//Working directory: /
$ nano WordCount.java

//Java Programming

—WordCount.java—

import java.io.IOException;
import java.util.StringTokenizer;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

 public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable>{

   private final static IntWritable one = new IntWritable(1);
   private Text word = new Text();

   public void map(Object key, Text value, Context context
                   ) throws IOException, InterruptedException {
     StringTokenizer itr = new StringTokenizer(value.toString());
     while (itr.hasMoreTokens()) {
       word.set(itr.nextToken());
       context.write(word, one);
     }
   }
 }

 public static class IntSumReducer
      extends Reducer<Text,IntWritable,Text,IntWritable> {
   private IntWritable result = new IntWritable();

   public void reduce(Text key, Iterable<IntWritable> values,
                      Context context
                      ) throws IOException, InterruptedException {
     int sum = 0;
     for (IntWritable val : values) {
       sum += val.get();
     }
     result.set(sum);
     context.write(key, result);
   }
 }

 public static void main(String[] args) throws Exception {
   Configuration conf = new Configuration();
   Job job = Job.getInstance(conf, "word count");
   job.setJarByClass(WordCount.class);
   job.setMapperClass(TokenizerMapper.class);
   job.setCombinerClass(IntSumReducer.class);
   job.setReducerClass(IntSumReducer.class);
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(IntWritable.class);
   FileInputFormat.addInputPath(job, new Path(args[0]));
   FileOutputFormat.setOutputPath(job, new Path(args[1]));
   System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}
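Before packaging the job, the MapReduce logic above can be sanity-checked locally with standard Unix tools — a rough equivalent of the tokenizer-based count, assuming whitespace-separated words (the sample input here is hypothetical):

```shell
# Rough local equivalent of the WordCount job:
# map     = split each line into whitespace-separated tokens
# shuffle = sort so identical words become adjacent
# reduce  = count occurrences of each word
printf 'big data hadoop\nhadoop mapreduce\n' \
  | tr -s ' \t' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2"\t"$1}'
# prints:
# big     1
# data    1
# hadoop  2
# mapreduce       1
```

The word counts should match what the Hadoop job later writes to part-r-00000, up to ordering and tokenizer details.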

3.2) Packaging the MapReduce Job and Deploying to the Hadoop Runtime Environment

//Create a directory for the compiled Java classes

//Working directory: /
$ mkdir wordcount_classes

//Compile the Java file (type as one single line; note there must be no spaces after the colons in the classpath)
$ javac -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-2.8.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.0.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java

//See files in wordcount_classes
$ ls ./wordcount_classes/

//Create jar file
$ jar -cvf ./wordcount.jar -C wordcount_classes/ .

//Execute yarn  

$ yarn jar ./wordcount.jar WordCount /inputs/* /outputs/wordcount_output_dir01

//Review the result (show display)
$ hdfs dfs -cat /outputs/wordcount_output_dir01/part-r-00000  

//Check the output files via the HDFS web interface in a browser

On your local computer -> open a browser

Type the URL -> 54.148.21.22:50070 //public ip address

Click Menu -> Utilities -> Browse the file system

In the inputs and outputs folders -> when downloading a file, replace the private IP in the download URL with the public IP

Aj. NesT The Series
at GlurGeek.Com
Lecturer, blogger, traveler, and software developer who loves reading new articles all the time, loves finding ways to create inspiration, dreams of building a CREATIVE PRODUCT that can change the world for the better, wants to photograph travel destinations around the world, teaches, enjoys exchanging knowledge, and writes WEBSITES, MOBILE APPS, GAMES, ETC. that benefit this world.
