Sol 25
Common Crawl Dataset on HuggingFace
Forgot I had a blog.
We finally finished processing our CommonCrawl data on a Hetzner dedicated server with Apache Spark. Although this took a little longer than we’d hoped, the results aren’t too shabby.
Check out the dataset here: BigBanyanTree CommonCrawl Data
Also check out the related code here: BigBanyanTree GitHub
In total, we extracted 315M (yes, that’s M for million) rows of data from 1% random samples of 7 years’ worth of CommonCrawl data dumps, spanning 2018 all the way through 2024. To put it in terabytes, we processed approximately 31.5 TB of data to get those 315M rows (any relation between 315 and 31.5 is coincidental). By production-grade data engineering standards this is quite minuscule, but the process is what matters here. Given the resources, our process can be scaled ad infinitum, should some kind soul give us the funds to do it.
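For a flavor of what the sampling step might look like, here’s a minimal PySpark sketch. This is not our actual pipeline (that lives in the GitHub repo above); the local `warc.paths` file and output path are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-sample").getOrCreate()

# Each CommonCrawl dump publishes a warc.paths file listing its WARC segments.
# Assumed here: it has already been downloaded and gunzipped next to this script.
paths = spark.read.text("warc.paths").withColumnRenamed("value", "warc_path")

# Draw a 1% random sample of segments, seeded for reproducibility.
sample = paths.sample(fraction=0.01, seed=42)

# Persist the sampled segment list for a downstream extraction job to consume.
sample.write.mode("overwrite").text("sampled_warc_paths")
```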
In other news (gosh I sound like a reporter), I discovered a couple of interesting libraries and AI startups. Without going into much detail, I’ll just list them out here, take your pick:
- mutable.ai - Chat with Git repos, and auto-generate wikis complete with diagrams and flowcharts
- cuDF - Enables pandas-like analysis coupled with CUDA to make data go brrrr (see the sketch after this list)
- RAPIDS - Extensive data analysis ecosystem with CUDA support
- continue.dev - Open-source AI assistant with plugins for both VSCode and JetBrains IDEs
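To give a taste of the cuDF item above, here’s a toy sketch of pandas-style analysis running on the GPU. The file name and columns are made up for the example.

```python
import cudf

# Read a CSV straight into GPU memory; the file and columns are hypothetical.
df = cudf.read_csv("requests.csv")  # assumed columns: host, latency_ms

# Familiar pandas-style groupby/aggregation, executed on the GPU.
per_host = df.groupby("host")["latency_ms"].mean().sort_values(ascending=False)
print(per_host.head(10))
```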
Lastly, I really really want to learn CUDA programming, but can’t find any great projects to get me going. I guess the quest will have to continue a bit longer.