Bourne's Blog - A Full-stack & Web3 Developer

「big data era」

Word counting by spark cluster

1 Create project 1.1 pom.xml Create a Maven project in IDEA, and put the following content in pom.xml: ...

Spark Tasks

Task 1 A bookstore sold the following numbers of books per day over the last N days; calculate the average daily sales of each book. Day 1: (“spark”,2), (“hadoop”,6), Day 2: (“hadoop”,4),(“spark”...
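The task reduces to summing each book's counts and dividing by the number of days it appears in; in Spark this is the usual `mapValues`/`reduceByKey` pattern. A plain-Python sketch of the same logic (the Day 2 figures are truncated in the excerpt, so the data below is illustrative, not the post's actual numbers):

```python
# Plain-Python sketch of the "average daily sales" task; in Spark this is
# typically mapValues(lambda v: (v, 1)) -> reduceByKey -> mapValues(total/days).
days = [
    [("spark", 2), ("hadoop", 6)],   # Day 1 (from the excerpt)
    [("hadoop", 4), ("spark", 4)],   # Day 2 (illustrative values)
]

totals = {}  # book -> (total_sold, days_counted)
for day in days:
    for book, sold in day:
        t, n = totals.get(book, (0, 0))
        totals[book] = (t + sold, n + 1)

averages = {book: t / n for book, (t, n) in totals.items()}
print(averages)  # {'spark': 3.0, 'hadoop': 5.0}
```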

Flume Practice

A simple example: create a configuration file in conf named ‘example.conf’: # example.conf: A single-node Flume configuration # Name ...
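The excerpt's first comment line matches the canonical single-node example from the Flume user guide; for reference, that configuration (agent name `a1`, netcat source to logger sink over a memory channel) looks like this:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```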

Sqoop2 Practice

1 Installation and configuration 1.1 Extract the tarball to /opt/module and set SQOOP_HOME in /etc/profile: [root@hadoop001 sqoop-1.99.7-bin-hadoop200]# echo export SQOOP_HOME=`pwd` export SQOOP_HO...
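The profile entries described above amount to something like the following; the install path is inferred from the shell prompt in the excerpt and may differ on your system:

```shell
# Sketch of the /etc/profile additions for Sqoop2 (path assumed
# from the [root@hadoop001 sqoop-1.99.7-bin-hadoop200] prompt above).
export SQOOP_HOME=/opt/module/sqoop-1.99.7-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin
echo "$SQOOP_HOME"
```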

Clickhouse -- Brown University Benchmark

1. Clickhouse - Brown University Benchmark This chapter follows the instructions of the Brown University Benchmark. 1.1 Download data wget is too slow, so I use axel with 5 threads to download the da...

Spark Running Mode

graph LR; A1[Running Mode] --> local[fa:fa-laptop local]; local --> l1[local 1: executors number = 1]; local --> ln[local N: executors number = N]; local --> lm[local *: executors numbe...

Spark RDD - PySpark Word Count

Spark RDD - PySpark Word Count 1. Prepare spark context from pyspark import SparkContext sc = SparkContext("local","PySpark Word Count Example") /usr/local/lib/python3.6/site-packages/pysp...
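A PySpark word count follows the classic flatMap → map → reduceByKey shape; the same logic in plain Python (a sketch for a quick sanity check, not the post's code, with made-up sample text):

```python
# Plain-Python mirror of the PySpark word-count pipeline:
# flatMap (split lines) -> map (pair with 1) -> reduceByKey (sum).
lines = ["hello spark", "hello pyspark"]          # illustrative input

words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map to (word, 1)

def reduce_by_key(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value      # reduceByKey(add)
    return counts

print(reduce_by_key(pairs))  # {'hello': 2, 'spark': 1, 'pyspark': 1}
```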

Spark RDD - Spark Shell Word Count

Spark RDD - Spark Shell: several ways to do word count and find the top 10 most-used words in an article. 1. Upload the news file to HDFS: [root@hadoop001 ~]# hdfs dfs -put news.txt /dir1 [root@hado...
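Ranking the top 10 words, as that post does in the Spark shell, is a count-sort-take pipeline; a plain-Python equivalent using collections.Counter (sample text assumed, not the post's news.txt):

```python
from collections import Counter

# Plain-Python equivalent of the Spark-shell top-10 ranking:
# split -> count -> sort descending by count -> take 10.
text = "the quick brown fox jumps over the lazy dog the fox"
top10 = Counter(text.split()).most_common(10)
print(top10[0])  # ('the', 3)
```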

Spark RDD Usage - Part 1

Spark RDD Usage - Part 1 Overall An RDD is Spark's abstraction model for distributed data, encapsulating all distributed data entities held in memory and on disk. The concept was proposed to let developers perform in-memory computation on large clusters in a fault-tolerant way. RDD stands for Resilient Distributed Dataset, the most basic data abstraction in Spark, representing an immutable, read-only, partitioned...

Azkaban Usage - Part 1

Azkaban is a distributed workflow manager, commonly used to manage dependencies between Hadoop jobs. 1. Install Download the latest Azkaban (4.0 at the time of writing) from https://github.com/azkaban/azkaban/re...