Redditor “Stuck_in_the_Matrix” has posted a torrent of what he claims is a dataset of every publicly available comment on Reddit.
That’s 1.7 billion comments in total, each with data about its author, subreddit, position in the comment tree, and score. “This dataset is over 1 terabyte uncompressed, so this would be best for larger research projects,” the compiler wrote.
The redditor first posted about the dataset on the subreddit r/datasets (of course) on July 3 and, with some help from other users, had set up a torrent by July 4. A smaller dataset, comprising just a month’s worth of comments, is also available as a torrent.
What could you do with all that data? “Give me 5 good data scientists and we can find the holy grail of karma!” said user “kill-init.”
Reddit user “mattrepl,” who identified themselves as a PhD student in machine learning and community dynamics, suggested that the dataset could be used to develop models of the flow of online conversations or the spread of Internet memes — a topic that sociologists have paid increasing attention to over the last few years. It could also be used to predict which subreddits or comment threads a user might participate in, which could help develop better recommendation systems.
All of that data is available through Reddit’s API, but according to other redditors in r/datasets, gathering it all would have been a dauntingly tedious task. “I’ve played with Reddit’s API some and have written crawlers to get data by user, sub, thread, etc. But it becomes prohibitive to get all the data if you have to continuously make requests for relatively small amounts of data and then piece them together,” wrote user “rePAN6517” in a comment.
Others are openly skeptical of the dataset. One commenter, “lost_file,” claimed, “Reddit has a policy for the amount of requests you can make per second. This dataset would have taken at least a year to compile. Something is fishy.”
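To see how a per-second request limit could translate into a compile time like the one “lost_file” describes, here is a back-of-the-envelope sketch in Python. Only the 1.7 billion total comes from the dataset’s description; the rate limit and the number of comments returned per request are placeholder assumptions for illustration, not Reddit’s actual policy.

```python
SECONDS_PER_DAY = 86_400


def days_to_fetch(total_items: int, items_per_request: int,
                  requests_per_second: float) -> float:
    """Estimate the days needed to fetch total_items at a fixed request rate."""
    if total_items < 0 or items_per_request <= 0 or requests_per_second <= 0:
        raise ValueError("inputs must be positive")
    # Ceiling division: a partly filled final batch still costs one request.
    requests_needed = -(-total_items // items_per_request)
    return requests_needed / requests_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    TOTAL_COMMENTS = 1_700_000_000  # figure reported for the dataset
    # Hypothetical scenarios: (comments per request, requests per second).
    for batch, rate in [(1, 1.0), (100, 1.0), (100, 0.5)]:
        days = days_to_fetch(TOTAL_COMMENTS, batch, rate)
        print(f"{batch} per request at {rate} req/s: {days:,.0f} days")
```

Under these assumed numbers, the estimate swings by orders of magnitude depending on how many comments each request returns, so the “at least a year” claim depends heavily on details the commenter does not spell out.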
As of the time of publication, “Stuck_in_the_Matrix” hasn’t responded to those questions.
Top image: Getty Images.