PLACEHOLDER - Stack Exchange data dump downloader and transformer

NOTE: This app will remain a placeholder with general scaffolding for downloading and processing until the new data dump format is out. It's not possible to work on the exact downloader and transformer until the new data dump format and process are out.

At the time of writing, it's still unclear whether the data dump will be exactly the same as it is today, just in a new place, or whether it will be both in a different place and formatted differently. This particularly applies to Stack Overflow. On archive.org, Stack Overflow's data dump is split across several .7z files (with MSO still being a monolith), but the current system seems to suggest there's only one download for MSO + SO. Whether that's a breaking change compared to today's data dump is unclear, and something we'll have to find out as we go. Parts of the system do, however, assume that .7z and XML files are still used. This is meant to minimise the time between the release of the data dump and the restoration of the community's ability to effectively archive our own data.

About

With Stack Exchange, Inc.'s recent data dump restrictions, the official SE data dump on archive.org is dead, and single-click downloads of the entire data dump are too.

This program is meant to auto-download and auto-transform the data dump into one of (hopefully) several more convenient formats, and to contribute to data archival efforts.

Download

Download from GitHub. Instructions for setting up and running are listed in the README.

Platform

Downloader: Theoretically anything able to use Python 3.10 and newer; only verified on Linux Mint 21.3 with Python 3.10.12 at the time of writing

Transformer: Theoretically anything able to compile C++20; Docker will be supported in the future to simplify the dependencies, which means anything able to run Docker will also be able to run the transformer

Contact

Use GitHub to report any issues. General questions can be posted on GitHub Discussions.

License

The code is under the MIT license; see the LICENSE file.

The data downloaded and produced is under various versions of CC BY-SA, as per Stack Exchange's licensing rules, possibly in addition to whatever extra rules they try to impose on the data dump.

Code

The code is available on GitHub, and is split into two components:

The downloader: Python-based. When the data dump is out, this will deal with account creation and data dump download on a per-site basis.
The transformer: C++-based. Takes the data dump as input, and outputs it in an alternate format. At the time of writing (which is prior to the release of the data dump), no formats have actually been implemented.
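
To illustrate the kind of input-to-output step the transformer is meant to perform, here is a minimal Python sketch. It is not the project's C++ code. It assumes the dump keeps the XML layout used on archive.org, where each record is a `<row>` element whose attributes hold the fields, and that the .7z archive has already been extracted. The function names `iter_rows` and `rows_to_json_lines` and the sample data are hypothetical.

```python
import io
import json
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield each <row> element's attributes as a dict, streaming the input."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            # Copy the attributes before clearing the element.
            yield dict(elem.attrib)
            elem.clear()  # drop the attributes of the processed row


def rows_to_json_lines(source, sink):
    """Write one JSON object per row to sink and return the row count."""
    count = 0
    for row in iter_rows(source):
        sink.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# Hypothetical sample in the assumed layout (real dumps have many more fields).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" Title="Example question" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" />
</posts>"""

out = io.StringIO()
written = rows_to_json_lines(io.BytesIO(SAMPLE), out)
print(written)
print(out.getvalue(), end="")
```

Streaming with `iterparse` avoids loading a whole file into memory, which matters for the larger sites. All values stay as strings here; converting them to proper types is left out of the sketch.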

I'll only be implementing JSON and maybe SQLite myself, but the system is (theoretically) set up to allow for any arbitrary format to be used. Pull requests adding formats are welcome after the data dump is out and we know the exact format and structure it's in. I'd rather not have to mass-refactor transformers because it turns out the data dump input format is completely different. There's also more groundwork to be done before any transformers can be implemented, particularly with type and output management.
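
One way to picture "any arbitrary format" is a registry that maps a format name to a writer function. The Python sketch below is only an illustration of that idea, not how the C++ transformer is actually structured. `FORMATS`, `register_format`, `write_json` and `transform` are all hypothetical names.

```python
import io
import json
from typing import Callable, Dict, Iterable, TextIO

Row = Dict[str, str]
Writer = Callable[[Iterable[Row], TextIO], None]

# Hypothetical registry: format name -> function that writes rows to a sink.
FORMATS: Dict[str, Writer] = {}


def register_format(name: str) -> Callable[[Writer], Writer]:
    """Decorator that adds a writer function to the registry under `name`."""
    def decorator(func: Writer) -> Writer:
        FORMATS[name] = func
        return func
    return decorator


@register_format("json")
def write_json(rows: Iterable[Row], sink: TextIO) -> None:
    """Write all rows as a single JSON array."""
    json.dump(list(rows), sink, ensure_ascii=False)


def transform(rows: Iterable[Row], fmt: str, sink: TextIO) -> None:
    """Look up the writer for `fmt` and run it, rejecting unknown formats."""
    try:
        writer = FORMATS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    writer(rows, sink)


buffer = io.StringIO()
transform([{"Id": "1", "Score": "5"}], "json", buffer)
print(buffer.getvalue())
```

With this shape, adding a format means adding one decorated function, and the parsing side stays untouched if the input format changes.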

edited Jul 14, 2024 at 21:26

asked Jul 14, 2024 at 18:05

Zoe - Save the data dump

