Discuss Scratch
- Discussion Forums
- » Advanced Topics
- » VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
- i_eat_coffee
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
hello people of scratch
it is I, the infamous coffee eater, who has returned from my short break as i struggled to continue eating coffee, because it melted since it's so hot outside
anyway, i've been working on a scratch DB clone, and I've got about 0.000104858 Terabytes* worth of forum post data
no user data, just forum posts
also i forgot what scratchdb was, but i think it was a database of scratch info
so i did that, but with the forums
and i need you guys to help me by letting me know exactly which endpoints you need, because i have no idea what scratchdb had before
(i can't see the docs because they were replaced by a hiatus message on the website)
once i code them i'll publish the website
also you'll have to wait a bit because i want approval from the scratch team before sharing it but still
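For scale, the asterisked figure above works out to roughly 100 MiB, assuming it is read as decimal terabytes; a quick sanity check in Python:

# Back-of-envelope check of the figure above (assumption: decimal terabytes).
data_tb = 0.000104858            # the quoted amount of forum post data, in TB
data_bytes = data_tb * 10**12    # 1 TB = 10^12 bytes
data_mib = data_bytes / 2**20    # convert to mebibytes
print(f"{data_bytes:,.0f} bytes ~= {data_mib:.1f} MiB")  # about 104,858,000 bytes, i.e. ~100 MiB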
- davidtheplatform
Scratcher
500+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
I'm working on a similar thing, but I think I have more stuff done: I have most topics scraped and ~1m posts (out of ~6m total). Also, some of the endpoints work. Anyways, we could work together, possibly.
relevant emails:
Do you have the v3 docs for scratchdb (/v3/docs/)? Archive.org/archive.is couldn't save them unfortunately.

I still have the docker images somewhere so I’ll be able to keep an archive somewhere at some point, however currently they aren’t hosted anywhere. Once I get everything back online to try to do some recovery I’ll see what I can do about publishing a bit of an archive.
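For a rough sense of scale of the numbers above (~6m posts total), here is a hedged back-of-envelope estimate of how long one full, throttled pass over the forums might take; the posts-per-page figure and the request delay are assumptions, not numbers from this thread:

# Rough estimate only. Assumptions (not from this thread): ~20 posts rendered per
# forum page, and a conservative one request every 2 seconds.
total_posts = 6_000_000          # "~6m total" posts mentioned above
posts_per_page = 20              # assumed Scratch forum page size
seconds_per_request = 2          # assumed polite delay between requests

pages = total_posts / posts_per_page
days = pages * seconds_per_request / 86_400
print(f"{pages:,.0f} pages -> about {days:.1f} days of continuous, throttled scraping")
# ~300,000 pages -> about 6.9 days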
- Jeffalo
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
hey all! please try to consolidate your efforts and be courteous with your scraping!!! i want a scratchdb alternative as much as you do, but multiple people mass scraping the most expensive and fragile part of the website is really not helping anyone.
tiny reminder that scratch is a free service. these scrapers play a large part in disrupting the service for everyone else. let’s please be careful with this.
- dynamicsofscratch
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
My proposal to coffee for a ScratchDB clone is this:
Instead of creating multiple ScratchDB clones, which would hurt the Scratch website (resulting in high downtime, slow load times, etc.), create a project which everyone can join and contribute what they can to make a new ScratchDB, and then abandon it (ScratchDB). This would mean only one scraper is on the website, reducing clogging of the website.
Voyager, a project by @josueart, aims to replace ScratchDB; we are yet to make a prototype, as we don't have much experience with web-scraping. I am actively trying to learn it and make a working prototype which scrapes the entire forums and organizes it in a database. coffee is very much welcome to join our project, and as @Jeffalo said:
(#4)
hey all! please try to consolidate your efforts and be courteous with your scraping!!! i want a scratchdb alternative as much as you do, but multiple people mass scraping the most expensive and fragile part of the website is really not helping anyone.
tiny reminder that scratch is a free service. these scrapers play a large part in disrupting the service for everyone else. let’s please be careful with this.
- i_eat_coffee
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
hey all! please try to consolidate your efforts and be courteous with your scraping!!! i want a scratchdb alternative as much as you do, but multiple people mass scraping the most expensive and fragile part of the website is really not helping anyone.
tiny reminder that scratch is a free service. these scrapers play a large part in disrupting the service for everyone else. let’s please be careful with this.
^^ some time ago I spoke with one of the scratch website engineers, they said it's okay to send requests to the forums as long as it's really really slow to ensure that the website is unaffected
for the record, i started gathering data in february: https://scratch.mit.edu/discuss/post/7793231/
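A minimal sketch of what "really really slow" could look like in practice, assuming plain HTTP fetches of public topic pages with a long fixed delay; the delay value, topic ids, and contact string are illustrative assumptions, not anything the Scratch Team has endorsed:

import time
import requests  # third-party: pip install requests

DELAY_SECONDS = 10  # assumed delay; the point is to wait a long time between requests
HEADERS = {"User-Agent": "forum-archive-bot (contact: example@example.com)"}  # placeholder contact

def fetch_topic_page(topic_id: int, page: int) -> str | None:
    """Fetch one public forum topic page, or None if it is unavailable."""
    url = f"https://scratch.mit.edu/discuss/topic/{topic_id}/?page={page}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    if resp.status_code != 200:
        return None  # skip deleted/closed/hidden topics instead of retrying aggressively
    return resp.text

for topic_id in (100000, 100001):  # hypothetical topic ids
    html = fetch_topic_page(topic_id, page=1)
    # ... parse and store `html` here ...
    time.sleep(DELAY_SECONDS)  # the important part: always wait between requests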
- i_eat_coffee
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
how the DB works (at the moment):
it's a huge database with the following data format for each post (they are not categorised in any way, just stored as-is)
"POST ID": {
    POST ID,
    DATE POSTED,
    TOPIC ID,
    author: {
        USERNAME,
        PICTURE,
        ID
    },
    POST CONTENT (HTML)
}
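As a concrete illustration of that shape, one stored record might look roughly like the following; all values are invented and the key names are just one plausible reading of the placeholders above:

import json

# Hypothetical example record in the format sketched above (all values made up).
record = {
    "1234567": {                               # "POST ID" used as the key
        "id": 1234567,                         # POST ID
        "posted": "2024-02-01T12:34:56Z",      # DATE POSTED
        "topic_id": 654321,                    # TOPIC ID
        "author": {
            "username": "example_user",        # USERNAME
            "picture": "https://example.com/avatar.png",  # PICTURE (placeholder URL)
            "id": 98765,                       # ID
        },
        "content": "<p>hello world</p>",       # POST CONTENT (HTML)
    }
}
print(json.dumps(record, indent=2))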
- dynamicsofscratch
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
(#6)
some time ago I spoke with one of the scratch website engineers, they said it's okay to send requests to the forums as long as it's really really slow to ensure that the website is unaffected
for the record, i started gathering data in february: https://scratch.mit.edu/discuss/post/7793231/

any scraper right now affects the website, as the website is really fragile and needs maintenance (which they cancelled…)
- ajskateboarder
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
any scraper right now affects the website, as the website is really fragile and needs maintenance (which they cancelled…)

Surely a scraper where the last post scraped was made over half a year ago couldn't affect the website in comparison to existing user traffic, right? I do agree though that any scraping efforts should be throttled intensively, just as i_eat_coffee's scraper is doing.
- josueart
Scratcher
500+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
Instead of creating multiple ScratchDB clones, which would hurt the Scratch website (resulting in high downtime, slow load times, etc.), create a project which everyone can join and contribute what they can to make a new ScratchDB, and then abandon it (ScratchDB). This would mean only one scraper is on the website, reducing clogging of the website.

I love this idea, but how would we coordinate? Maybe this topic could help if OP redirected the theme.
- i_eat_coffee
Scratcher
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
I love this idea, but how would we coordinate? Maybe this topic could help if OP redirected the theme.

that's kind of… the point of the current website (link)
people contribute by manually scraping scratch posts via the website, reducing the load on scratch servers
- josueart
Scratcher
500+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
that's kind of… the point of the current website (link)
people contribute by manually scraping scratch posts via the website, reducing the load on scratch servers

This is probably not a good idea, though. I don't think people will be able to index every single topic on Scratch, and this process could easily be automated.
We could reduce the load by limiting the workers (Scratch's recommendation for the API is 10 req/s, but the forums probably wouldn't handle this, so 5 req/s?).
In the end, the fragility is caused by how old and outdated the forums are.
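A hedged sketch of one way to cap several workers at roughly 5 requests per second overall, using a shared limiter; the rate and the worker/page numbers are assumptions taken from the discussion above, not measured limits:

import threading
import time

class RateLimiter:
    """Hand workers evenly spaced request slots (~per_second requests/second overall)."""

    def __init__(self, per_second: float):
        self.interval = 1.0 / per_second
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self) -> None:
        with self.lock:
            now = time.monotonic()
            self.next_slot = max(self.next_slot, now) + self.interval
            delay = self.next_slot - self.interval - now
        if delay > 0:
            time.sleep(delay)

limiter = RateLimiter(per_second=5)  # the 5 req/s figure is the guess from the post above

def worker(pages: list[int]) -> None:
    for page in pages:
        limiter.wait()
        # ... fetch and store one forum page here ...
        print(f"fetched page {page}")

threads = [threading.Thread(target=worker, args=([i, i + 1],)) for i in (0, 2, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()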

- cosmosaura
Scratch Team
1000+ posts
VERY IMPORTANT: SCRATCHDB CLONE RESEARCH
Topic closed on request from OP. If you need it re-opened, though, you can report this and ask. 
