Sol 25
Common Crawl Dataset on HuggingFace
Forgot I had a blog.
We finally finished processing our CommonCrawl data on a Hetzner dedicated server with Apache Spark. Although this took a little longer than we’d hoped, the results aren’t too shabby.
Check out the dataset here: BigBanyanTree CommonCrawl Data
Also check out the related code here: BigBanyanTree GitHub
In total, we extracted 315M (yes, that’s M for million) rows of data from 1% random samples of 7 years’ worth of CommonCrawl data dumps, spanning 2018 all the way through 2024. To put it in terabytes, we processed approximately 31.5 TB of data to get those 315M rows (any relation between 315 and 31.5 is coincidental). By production-grade data engineering standards this is quite minuscule, but the process is what matters here. Given the resources, our process can be scaled ad infinitum, should some kind soul give us the funds to do it.
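For a flavor of what the sampling step might look like, here’s a minimal PySpark sketch. This is not our actual pipeline (that lives in the GitHub repo above); the local `warc.paths` file and output path are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-sample").getOrCreate()

# Each CommonCrawl dump publishes a warc.paths file listing its WARC segments.
# Assumed here: it has already been downloaded and gunzipped next to this script.
paths = spark.read.text("warc.paths").withColumnRenamed("value", "warc_path")

# Draw a 1% random sample of segments, seeded for reproducibility.
sample = paths.sample(fraction=0.01, seed=42)

# Persist the sampled segment list for a downstream extraction job to consume.
sample.write.mode("overwrite").text("sampled_warc_paths")
```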
In other news (gosh I sound like a reporter), I discovered a couple of interesting libraries and AI startups. Without going into much detail, I’ll just list them out here, take your pick:
- mutable.ai - Chat with Git repos, and auto-generate wikis complete with diagrams and flowcharts
- cuDF - Enables pandas-like analysis coupled with CUDA to make data go brrrr (see the sketch after this list)
- RAPIDS - Extensive data analysis ecosystem with CUDA support
- continue.dev - Open-source AI assistant with plugins for both VSCode and JetBrains IDEs
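To give a taste of the cuDF item above, here’s a toy sketch of pandas-style analysis running on the GPU. The file name and columns are made up for the example.

```python
import cudf

# Read a CSV straight into GPU memory; the file and columns are hypothetical.
df = cudf.read_csv("requests.csv")  # assumed columns: host, latency_ms

# Familiar pandas-style groupby/aggregation, executed on the GPU.
per_host = df.groupby("host")["latency_ms"].mean().sort_values(ascending=False)
print(per_host.head(10))
```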
Lastly, I really really want to learn CUDA programming, but can’t find any great projects to get me going. I guess the quest will have to continue a bit longer.