Bourne's Blog - A Full-stack & Web3 Developer

「big data era」

Word counting by spark cluster

1 Create project 1.1 pom.xml Create a Maven project in IDEA, and put the following content in pom.xml: ...

Spark Tasks

Task 1 A bookstore sold the following numbers of books per day over the last N days; calculate the average daily sales of each book. Day 1: (“spark”,2), (“hadoop”,6), Day 2: (“hadoop”,4),(“spark”...
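The task reduces to summing each book's counts and dividing by the number of days it appears in; in Spark this is the usual `mapValues`/`reduceByKey` pattern. A plain-Python sketch of the same logic (the Day 2 figures are truncated in the excerpt, so the data below is illustrative, not the post's actual numbers):

```python
# Plain-Python sketch of the "average daily sales" task; in Spark this is
# typically mapValues(lambda v: (v, 1)) -> reduceByKey -> mapValues(total/days).
days = [
    [("spark", 2), ("hadoop", 6)],   # Day 1 (from the excerpt)
    [("hadoop", 4), ("spark", 4)],   # Day 2 (illustrative values)
]

totals = {}  # book -> (total_sold, days_counted)
for day in days:
    for book, sold in day:
        t, n = totals.get(book, (0, 0))
        totals[book] = (t + sold, n + 1)

averages = {book: t / n for book, (t, n) in totals.items()}
print(averages)  # {'spark': 3.0, 'hadoop': 5.0}
```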

Flume Practice

A simple example: create a configuration file in conf named ‘example.conf’: # example.conf: A single-node Flume configuration # Name ...
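The excerpt's first comment line matches the canonical single-node example from the Flume user guide; for reference, that configuration (agent name `a1`, netcat source to logger sink over a memory channel) looks like this:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```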

Sqoop2 Practice

1 Installation and configuration 1.1 Extract the tarball to /opt/module and set SQOOP_HOME in /etc/profile: [root@hadoop001 sqoop-1.99.7-bin-hadoop200]# echo export SQOOP_HOME=`pwd` export SQOOP_HO...
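The profile entries described above amount to something like the following; the install path is inferred from the shell prompt in the excerpt and may differ on your system:

```shell
# Sketch of the /etc/profile additions for Sqoop2 (path assumed
# from the [root@hadoop001 sqoop-1.99.7-bin-hadoop200] prompt above).
export SQOOP_HOME=/opt/module/sqoop-1.99.7-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin
echo "$SQOOP_HOME"
```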

Clickhouse -- Brown University Benchmark

1. Clickhouse - Brown University Benchmark This chapter follows the instructions of the Brown University Benchmark. 1.1 Download data wget is too slow, so I use axel with 5 threads to download the da...

Spark Running Mode

graph LR; A1[Running Mode] --> local[fa:fa-laptop local]; local --> l1[local 1: executors number = 1]; local --> ln[local N: executors number = N]; local --> lm[local *: executors numbe...

Spark RDD - PySpark Word Count

Spark RDD - PySpark Word Count 1. Prepare spark context from pyspark import SparkContext sc = SparkContext("local","PySpark Word Count Example") /usr/local/lib/python3.6/site-packages/pysp...
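A PySpark word count follows the classic flatMap → map → reduceByKey shape; the same logic in plain Python (a sketch for a quick sanity check, not the post's code, with made-up sample text):

```python
# Plain-Python mirror of the PySpark word-count pipeline:
# flatMap (split lines) -> map (pair with 1) -> reduceByKey (sum).
lines = ["hello spark", "hello pyspark"]          # illustrative input

words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map to (word, 1)

def reduce_by_key(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value      # reduceByKey(add)
    return counts

print(reduce_by_key(pairs))  # {'hello': 2, 'spark': 1, 'pyspark': 1}
```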

Spark RDD - Spark Shell Word Count

Spark RDD - Spark Shell: several ways to do word count and find the top 10 most-used words in an article. 1. Upload the news file to HDFS: [root@hadoop001 ~]# hdfs dfs -put news.txt /dir1 [root@hado...
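Ranking the top 10 words, as that post does in the Spark shell, is a count-sort-take pipeline; a plain-Python equivalent using collections.Counter (sample text assumed, not the post's news.txt):

```python
from collections import Counter

# Plain-Python equivalent of the Spark-shell top-10 ranking:
# split -> count -> sort descending by count -> take 10.
text = "the quick brown fox jumps over the lazy dog the fox"
top10 = Counter(text.split()).most_common(10)
print(top10[0])  # ('the', 3)
```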

Spark RDD Usage - Part 1

Spark RDD Usage - Part 1 Overall An RDD is Spark's abstraction model for distributed data, encapsulating all distributed data entities held in memory and on disk. The concept was proposed to let developers perform in-memory computation on large clusters in a fault-tolerant way. RDD stands for Resilient Distributed Dataset, the most basic data abstraction in Spark, representing an immutable, read-only, partitioned...

Azkaban Usage - Part 1

Azkaban is a distributed workflow manager, commonly used to manage dependencies between Hadoop jobs. 1. Install Download the latest Azkaban (4.0 at the time of writing) from https://github.com/azkaban/azkaban/re...