Data Wrangling on 4Chan Dataset with Spark on EMR

Matthias Funke
3 min read · Nov 29, 2020

This post walks through data engineering with a concrete example. The dataset was produced by Antonis Papasavva et al., and a PySpark notebook is provided.

The dataset is about 27 GB in zstd-compressed format. I followed the instructions provided with the dataset and copied the uncompressed file to S3.
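If you already have the uncompressed file on disk, the copy to S3 can be as simple as a boto3 call. The bucket, key, and file names below are placeholders, not the ones I actually used:

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="pol_dataset.ndjson",        # placeholder local file name
    Bucket="my-4chan-dataset",            # placeholder bucket
    Key="raw/pol_dataset.ndjson",
)
```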

Next, I wanted to convert the dataset to a format suitable for analysis with popular cloud data analysis tools. I chose to focus on Spark and Athena.
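As a first wrangling step, Spark can read the newline-delimited JSON straight from S3. Each line of the file is a whole thread containing a posts array, so a sketch of the flattening might look like this (the S3 path and field names are assumptions to be checked against the actual schema):

```python
from pyspark.sql import functions as F

# Read the newline-delimited JSON from S3 (placeholder path).
raw = spark.read.json("s3://my-4chan-dataset/raw/")

# Explode the per-thread "posts" array so every row is a single post.
posts = raw.select(F.explode("posts").alias("post")).select("post.*")
posts.printSchema()
```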

Spark and Athena have a lot in common: both handle large datasets, both work with more or less the same file formats (e.g. Parquet), and their SQL dialects use very similar syntax.

The main difference is that Spark is more suitable for data wrangling. I prefer to split up individual steps in a notebook and be able to explain them step by step, rather than having a gigantic SQL query trying to do the same thing with many CTEs (common table expressions).
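For example, deriving date columns from the raw UNIX timestamps reads much more naturally as two small notebook steps than as nested subqueries. A sketch, with column names as assumptions:

```python
from pyspark.sql import functions as F

# Step 1: parse the UNIX timestamp (the "time" field) into a proper date.
with_dates = posts.withColumn("post_date", F.to_date(F.from_unixtime("time")))

# Step 2: derive year/month/day columns that will drive the partitioning later.
by_day = (
    with_dates
    .withColumn("year", F.year("post_date"))
    .withColumn("month", F.month("post_date"))
    .withColumn("day", F.dayofmonth("post_date"))
)
```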

For Spark, I used an EMR cluster and attached a notebook to it.

The idea is that the EMR cluster is only needed to do the data wrangling, and once the data is correctly partitioned and saved on S3, we can terminate the cluster and do the rest of the analysis in Athena.
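A sketch of that final write, assuming the year/month/day columns derived above and a placeholder S3 path:

```python
# Write the wrangled posts back to S3 as partitioned Parquet;
# after this step the EMR cluster is no longer needed.
(
    by_day
    .repartition("year", "month", "day")
    .write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("s3://my-4chan-dataset/parquet/posts/")
)
```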

Athena uses Glue as its metastore. With Spark on EMR connected to the Glue Data Catalog, I did not have to use the Glue crawlers: everything from my data wrangling showed up straight away in the Glue catalog, and the tables were ready to query in Athena.
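One way to get a table registered is to use saveAsTable instead of a plain .parquet() write; with the cluster pointed at the Glue Data Catalog, the table becomes visible to Athena right away. Database and table names here are placeholders:

```python
# Create a database in the Glue catalog and register the partitioned table.
spark.sql("CREATE DATABASE IF NOT EXISTS chan")

(
    by_day
    .write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .option("path", "s3://my-4chan-dataset/parquet/posts/")
    .saveAsTable("chan.posts")
)
```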

Athena has the advantage of being “pay as you go”: you pay per GB of data scanned. Again, partitioning and columnar data formats help reduce costs. For example, consider a simple query restricted to the January 2018 data.
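Something along these lines, restricted to the January 2018 partitions; the table and database names are illustrative, and the query can be run from the Athena console or submitted through boto3 as sketched here:

```python
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    # Illustrative query: count posts in the January 2018 partitions only.
    QueryString="""
        SELECT COUNT(*) AS posts
        FROM chan.posts
        WHERE year = 2018 AND month = 1
    """,
    QueryExecutionContext={"Database": "chan"},
    ResultConfiguration={"OutputLocation": "s3://my-4chan-dataset/athena-results/"},
)
```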

In the JSON format, the query scans 2.8 GB. (At least it only reads 31 files, one for each day of January 2018, thanks to the partitioning.)

The same query scans only about 2.4 MB (roughly 1/1000th of that) using the columnar, binary Parquet format. It also runs much faster, although the difference will only become noticeable once we work with much larger datasets.

Finally, I also documented my dataset in Alation (where I work as a Consultant). For that, I converted the PDF with the data dictionary to a CSV and uploaded it to the Athena data source I had set up in Alation. Now my colleagues can also work on this dataset and know what each column means. They can also find my queries and share their own.
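The PDF-to-CSV conversion can be done with any PDF table extractor; here is one possible sketch using tabula-py, with placeholder file names:

```python
import pandas as pd
import tabula

# Extract every table from the data dictionary PDF and write them out as one CSV.
tables = tabula.read_pdf("data_dictionary.pdf", pages="all")
pd.concat(tables).to_csv("data_dictionary.csv", index=False)
```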

If you want to see the code, head over here:


Matthias Funke

Matthias is a data professional and amateur in the literal sense. Favorite topics include crypto and aviation.