We need a replacement for ScratchDB.

dynamicsofscratch

Redstone1080 wrote:
(#9)
Pufferfish_Test wrote:
have fun finding >6TB (iirc) of storage.
I use a third of that for storing my games, and you can actually go on Amazon and get 8TB hard drives (dunno about SSDs)

sabrent makes 8tb nvme ssds. also, they are NOT CHEAP @ $1199

King of the page:)

Last edited by dynamicsofscratch (March 28, 2024 11:38:05)

all toasters toast toast, but what happens when there are no longer toasters being produced? will their technology simply become obsolete, with humans moving onto bigger, better things? will toast become a distant memory, written in textbooks of the future as foods us simpler generations ate? who's to say! society is constantly moving, changing, evolving, ideas being built upon, improved upon, theories being proven or disproven. are we but a blip on the timeline? sure, our names may not be remembered, but that's not the point. you can make a change. you can make a difference. you can make the world better, even if you don't know yet. and the first step is to go for it. even if you are afraid of failure. going back to the example of toasters, do you know off the top of your head who invented them? no? have you used one? probably. so, even if you don't remember my name, if I was able to help answer your question, that is enough. if I was able to help you, even in the slightest way, this could push you to continue with scratch and not give up after the program crashes, and maybe one day learn other programming languages and change the world. everything is a cause and effect reaction, new inventions lead to the technology of the future, and even as the generations of the past are slowly forgotten, their influence lives on to this day, affecting how the world eventually turned out and how it will be for generations to come.

and, without toasters, we wouldn't have toast.

Regards
dynamicsofscratch

Anything above that grey line is a signature!
Also, anything can be put in your signature (also referred to as a siggy), including ads, but you cannot do anything that violates the community guidelines, as you will be reported and could be banned/muted.
Computer enthusiast, coder, designer and an offline veteran.
700th post

dynamicsofscratch

I mean, a collaborative, combined effort to make a new DB would theoretically be possible. Here are my pros and cons of making a new DB:

Pros

It would reduce the load on ScratchDB, hence improving speed for ScratchDB users.
It would (most likely) be open-source, so that issues can be located and the code updated to fix them.
It could be faster, more efficient and more reliable than ScratchDB.
Experimental versions of it could show off new features, and the DB could be modified for certain use cases too.

Cons

While reducing the load on ScratchDB, it would increase the load on the main Scratch API (could be worked around though?).
Unspotted vulnerabilities could mean that the DB is more prone to hacking (though this isn't really a con, as anything could be hacked).
Increased clutter, as everyone would just create their own if making another one was possible and documented.
Rather than creating a new DB, we could make an improved version of the existing DB, as that would prevent users from flocking from one to the other.
Speaking of flocking, users would flock to the new DB and increase the load on it, meaning people would keep switching between DBs.

gilbert_given_189

Since ScratchDB is down (according to Ocular), I'm necro-bumping this topic with an old idea of mine for making a replacement for ScratchDB.

How about collaborating to make a new DB in a much more literal sense? What if we make a decentralized API, where every computer could participate in storing and scraping the site? Each computer (which from this point on I will call a peer) would scrape the site, and then either keep the data for itself or share it with its neighbors for them to store. (Or rather, wait until a neighbor wants it.)

There are advantages with using this approach:

Unlike centralized databases, decentralized databases don't need very much storage per machine, since the storage is spread between peers. In fact, if one peer has newer data than the others, the others could either update their copies for redundancy or delete them, since they know somebody already has it stored.
We wouldn't need that much server-grade equipment. Sure, we would still need some for the peers that are busy handling queries, but since the work is distributed, any hardware with sufficient space could do the usual scraping and storing tasks. Doubling the peers means halving the requests to the Scratch API for each peer (though not the overall network traffic). Unless Brooks's law applies.
Since everyone, everywhere could be a peer, we would have something like a CDN. This means the route between a peer and someone who wants to access the database could be significantly shorter, which could make queries faster.
On a large enough network, this database will functionally never have an outage, unless some big disaster strikes and half of the peers get compromised in one way or another. (Or you know, if the database is deprecated.)
Upgrading storage, and to some extent performance, would be much easier. On a centralized database, expanding those would be a decision of the server operator. For decentralized databases, that goes to the people volunteering or about to volunteer as a peer.
Since this API involves a lot of people, it encourages the peer source code to be open-source. It's not advantageous for one entity to control something that's driven by the people, for the people, without any cost. Gee, that made me sound like a communist.

However, there are problems as well.

If not controlled properly, the peers would load the Scratch API way, way more than a single computer. This could result in a full-on DDoS (and yes, I'm using the term correctly). Because of this, peers must agree on how much their request rate should be throttled, which post they should scrape next, and to some extent when to do it. All so that they don't kill the Scratch servers.
The software would be much more complex. Every time a user queries something, if a peer hasn't got it in its database, it has to ask its neighbors whether they have it. As such, there would be a lot of chatter between peers. This could slow down queries, though not significantly if the API is coded properly, since in a well-structured network the number of hops between one peer and another can grow only logarithmically with the number of peers.
More chatter and a larger network mean a larger attack surface. Some peer could send in a completely fraudulent or even malicious response, and if they're lucky, no other peer would notice it. And since the peers are so interconnected, this unwelcome response could spread quickly and grow out of control. We could add some security measures to prevent such malice, but those only reduce the risk, not eradicate it.
We would need a lot of computers to make this work. Two peers wouldn't be enough; in fact, storage-wise it might be even worse overall if we also keep some data for redundancy. This also means we have to invite a lot of people to join as peers.
Being a peer is not without responsibilities. If a new feature is added to the peer software, everyone on the network must upgrade their peers to that version, lest the amount of work become uneven between those that upgraded and those that haven't. And that is if the software is backwards compatible with older versions.
Since every peer stores database entries, if one peer had a power nap, broke, or even left the network, whatever entries that peer has would become inaccessible. This also makes end-of-life preservation very difficult, since that would involve asking every peer (optimally even ones that left the network) to retrieve all their entries and store them in a big archive somewhere. (This is not really a major problem though if we implement some redundancy as well.)
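
To make the "peers must agree on which post they should scrape next" point a bit more concrete, here is a minimal sketch of one possible way to do it without a coordinator: rendezvous (highest-random-weight) hashing. Every peer that knows the same peer list computes the same owners for a topic, so nobody scrapes the same thing by accident. The peer names, the `copies` parameter and the idea of keying on a topic ID are all assumptions for illustration, not something from an existing implementation.

```python
import hashlib


def score(peer_id: str, topic_id: int) -> int:
    """Deterministic pseudo-random weight for a (peer, topic) pair."""
    digest = hashlib.sha256(f"{peer_id}:{topic_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owners(topic_id: int, peers: list[str], copies: int = 1) -> list[str]:
    """Return the peers responsible for scraping and storing a topic.

    Peers are ranked by their score for this topic; the top `copies`
    peers own it. The peer name is used as a tie-breaker so the result
    never depends on the order of the input list.
    """
    ranked = sorted(peers, key=lambda p: (score(p, topic_id), p), reverse=True)
    return ranked[:copies]


# Hypothetical peer names, purely for illustration.
peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
print(owners(12345, peers, copies=2))
```

A nice property of this scheme is that when a peer that does not own a topic leaves, the owners of that topic stay the same, so only the entries of the departed peer need to be reassigned.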

I need to balance the list more.

Now, how would we implement this?

For this one, I'll let someone else figure it out, since there are a lot of peer-to-peer architectures that could work. However, without going deep into the decentralization rabbit hole (and it is deep), here's how I could see it being implemented.

One easy (though more centralized) method is having a central server or two manage all the queries and every peer's connections and tasks. To be fair, it's not like the peers are only connected to the central server (that would make the network completely centralized); instead, the central server only acts as a controller of the peers. We could also have multiple central servers, thus making a sort of decentralized network atop another decentralized network.

A better approach is to also allocate some peers as database servers, which process all the queries as well as scraping and storing data. Everyone else could just choose a nearby server to send queries to. Ideally every peer should be a database server, but since finding one peer in the wild would be a challenge, having some is good enough. (Besides, not all of us can afford a domain name, or worse, a static IP address.)

Task management would be tricky on a decentralized network, since no peer can directly control the others. However, one way I could think of doing this is having a regular audit of the network traffic, either throughout the entire network or localized within a specific place. From the audit, the peers would then adjust their request rate to make sure it doesn't hurt the Scratch servers too much. I'm pretty sure there are other alternatives for this, but this is what I can think of that works.
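
As a rough sketch of the audit idea, assuming the network agrees on one global request budget and the audit reports either the number of active peers or the total request rate it observed: each peer can then work out how long to wait between its own requests. The budget of 10 requests per second below is a made-up number for illustration, not a real Scratch API limit.

```python
def per_peer_interval(global_budget_rps: float, active_peers: int) -> float:
    """Seconds each peer should wait between requests so that all peers
    together stay at or below the agreed global budget."""
    if global_budget_rps <= 0 or active_peers < 1:
        raise ValueError("budget must be positive and there must be a peer")
    return active_peers / global_budget_rps


def adjust_interval(current_interval: float,
                    observed_total_rps: float,
                    global_budget_rps: float) -> float:
    """Rescale this peer's delay after an audit.

    If the audit saw twice the budget, every peer doubles its delay,
    which brings the network total back to the budget.
    """
    if observed_total_rps <= 0:
        return current_interval  # nothing observed, keep the current pace
    return current_interval * (observed_total_rps / global_budget_rps)


# Hypothetical budget: 10 requests per second for the whole network.
print(per_peer_interval(10, 20))   # delay per peer with 20 peers
print(per_peer_interval(10, 40))   # doubling the peers doubles the delay
print(adjust_interval(2.0, 20, 10))
```

This only works if peers are honest about following the result, which ties back to the trust problem mentioned earlier.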

Does this sound like much effort? Well, if it weren't, somebody would've already implemented one. Is this idea even practical? Does it make sense to have a large, potentially world-wide network of computers, loading another, solo one that isn't even theirs? Probably not. But it is interesting enough to ponder the possibility of such a database, even if it is too infeasible to materialize (or moralize for that matter).

Ugh, big block of text once again. I need to stop doing that.

Last edited by gilbert_given_189 (June 18, 2024 17:40:17)

If you see a line above this text, it means that below this text is my signature.
This place is just a memory to me, I may return occasionally but I'm busy.
I guess I'm an ATer now.

I think I may have seasoned my posts a bit too much.
Also, my posts are getting lengthy lately. Whoops.

Colored Pencil is supposed to color the siggy, but Scratch says it's too big.

There is nothing here…

don don pan pan
dondo pan pan

davidtheplatform

gilbert_given_189 wrote:
snip

I think reliability is a pretty big problem.
Let’s say we want all data to be duplicated 3x[1]. Smaller peers mean we need more peers in total to maintain enough duplication: if each peer holds 10% of the data, 30 peers would be the minimum. Bigger peers mean fewer peers are needed in total, but they require more expensive hardware per peer, so fewer people will be able to host them. Are there 30 active users who could run peers? I think it’s unlikely.
[1] so if one peer goes down, we can still verify that two sources agree
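
The arithmetic behind that minimum can be written down as a tiny helper. This is only a lower bound: it assumes the data is spread perfectly evenly and every peer is always online, and the function name and parameters are made up for illustration.

```python
def min_peers(copies: int, percent_per_peer: int) -> int:
    """Smallest number of peers that can hold `copies` full copies of the
    data when each peer stores `percent_per_peer` percent of it."""
    if copies < 1 or not 1 <= percent_per_peer <= 100:
        raise ValueError("copies must be >= 1 and percent between 1 and 100")
    total_percent = copies * 100
    # Ceiling division with integers, to avoid floating-point rounding.
    return -(-total_percent // percent_per_peer)


print(min_peers(3, 10))   # the case from the post: 3 copies, 10% per peer
print(min_peers(3, 100))  # every peer holds everything
```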

I’m currently working on a ScratchDB clone, and the forum data is actually quite small: 36 MB for ~500k records. There will probably end up being ~10M records in total (~1 GB), so maybe each peer could have most or all of the data, in which case the problem shifts to the number of requests. IDK how many requests ScratchDB got, so I can’t speculate on that. Does anyone know, or can someone ask Lefty, about usage stats?
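
For anyone who wants to redo the size estimate with their own numbers, here is a small sketch of the extrapolation. It assumes the average record size stays the same as in the sample, which may not hold once longer or older posts are included, so treat the result as a ballpark.

```python
def estimate_total_mb(sample_mb: float, sample_records: int,
                      total_records: int) -> float:
    """Linearly extrapolate storage from a sample to the full record count."""
    if sample_records < 1:
        raise ValueError("sample must contain at least one record")
    return sample_mb * total_records / sample_records


# Figures from the post: 36 MB for ~500k records, ~10M records expected.
print(estimate_total_mb(36, 500_000, 10_000_000))
```

With the figures from the post this comes out at 720 MB, which is in line with the rough ~1 GB guess.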

Generation 4: the first time you see this copy and paste it on top of your sig in the scratch forums and increase generation by 1. Social experiment.

EDawg2011

I found some important info at Lefty's website (A lot if not all of the essential info is bolded, and I tried to underline the most important pieces of information.):

ScratchDB Hiatus

Unfortunately in May 2024 ScratchDB's server experienced an unexpected power loss at just the right time to cause database corruption. Since this server was meant as a temporary transition, backups were not routinely made. To add to the issue, the server currently is powered down in an area where it is difficult to restore power for the foreseeable future. Unless some contacts are made, the earliest power will be restored to this system is mid-August 2024.

With this information, I unfortunately have to announce that the ScratchDB project will be on an indefinite hiatus, meaning service for v2 or v3 will not be restored, possibly forever unless certain circumstances change.

ScratchDB has been running since September of 2019, and since then has gone through a lot of changes. It started simply being a forum historical archive, then adding forum search, and then a database with projects and users. Its biggest change was the addition of rankings and then rankings over time. When I first started this project fairly early in high school, I had plenty of time to develop it and improve it with user feedback, but unfortunately, now being in my second year of college, it is rather difficult to maintain.

Some Notes

I often read feedback asking why I couldn't just run it in (idk if I can say this on Scratch.) or (idk if I can say this on Scratch.), and some people tried to replicate it there. The unfortunate truth is ScratchDB required a lot of processing power and orchestration, to the point of where it took up some 64GB of RAM to run with database caches and different temporary states. In total, the databases weren't too big though. Excluding project JSON information, the system was somewhere under 100GB, although this information had to constantly be updated which meant the system constantly had to re-check for updates. Storage was never an issue; for the past few years my servers have been running multiple drives over 8TB each. However, the difficulty comes down to stability and uptime.

The server ScratchDB was running on was a Dell R420 with 4x8TB and 128GB of RAM. This system was quite overkill for the project, but I've always had an interest in maintaining server equipment like this. I also didn't use this server exclusively for ScratchDB; I run a lot of home automation systems as well as other personal projects on it. I planned on moving ScratchDB to the new server that this current landing page is running on, which is an HP DL360 G9 in the FMT2 datacenter in (REDACTED). However, due to power difficulties, I wasn't able to make the transition in time.

ScratchDB was always a learning experience and a hobby, much like me trying to maintain these servers. Unfortunately, with learning comes making mistakes, some of those being quite difficult to fix. Database corruption due to power loss happened because I didn't follow certain good practices when I was setting up the temporary system. I didn't really think that would happen to me, but it did. Perhaps some bad luck was involved, but as some have speculated, the project couldn't keep running forever. ScratchDB has been an amazing learning experience for me as well as people who used the project.

When I first got the idea for ScratchDB, I simply wanted to see what forum posts got deleted to try to identify some biases in how moderation was done, but the project very quickly evolved into a project that much of the community used with a much more expanded set of uses. I never expected it to get this big, especially since it was just a database and an API with no front end to directly interface with it, but projects such as (REDACTED) and Ocular came in and made good use of the data. It even grew to the point of where I heard that some ST members were using its search functionality. Unfortunately on the flip side I was also banned from Scratch… not too sure what that's about.

I'm thankful for being able to run the project for so long, even though there were some rough patches in the middle there and it got pretty messy in the end. This project has been on my resume for quite a long time and possibly helped me get into college and possibly helped me get my internships the past two years.

If you would like to get in contact with me, I can be reached at (REDACTED), or on other external chat platforms. I may not respond since I have quite a lot going on currently, but I will hopefully eventually get to responding at some point.

The Future

ScratchDB is one of the few sources that has historical forum and user data, and I never wanted that information to go into the dark. I will keep trying to restore the database so I can get a dump of them. Once I am able to recover a decent amount of information, I will provide download links to certain chunks of information, such as follower counts and all non-deleted forum posts and topics.

In order to not violate certain privacy laws and practices, I will not be providing the ability to obtain deleted forum posts or forum posts' historical edits since some of those may contain personal information or information people want deleted. I believe in the right to have your data removed from the internet, however I am unable to accommodate deletion requests after the archive is made.

In order to ensure relevancy and accuracy, I will likely re-sweep the forums, so in order to be excluded from any database dump (even though something like the (I don't think I can say this on Scratch, so I censored it.) already has it) your posts will need to be modified on your side.

Conclusion

As I'm writing this I am sitting in my apartment in (I might be censoring things too much, but I don't want to dox anyone.) for my current summer internship. I never believed I would make it to where I am today, and ScratchDB definitely helped me get there. From getting me into (REDACTED), to getting me a freshman internship at (REDACTED), I'm not too sure if the project directly helped or just gave me the skills I needed to get here. But thank you for using ScratchDB, and hopefully you've learned something from it as well.

I don't want this to be the last project I do involving large sums of data, so whenever I start something new I'll update this page with what's going on. I may also post a more detailed write-up of how ScratchDB works. But for now I need to focus on my internship and my school work.

Scratch on!

-Lefty

Updated 19 June 2024, 17:15 PST

Please note that I may sometimes make a mistake and give wrong information.

Can you please put this at/near the top of your signature and tell people that tag spam isn't allowed and it manipulates the algorithm, to start a chain and spread the word? - Thanks, @EDawg2011.

But then I had a very good idea. I used F5. See, using F5 gave me a whole new perspective and I was able to see a chest I couldn't have seen before.

(Highlight text + down arrow + shift to see the rest of my signature.)

Help find out who ate @cheddargirl's signature! l JOIN WORLD DOMINATION INC. TODAY! l Donate your soul! l me when i accidentally spread misinformation l Platformer Skibidi

<0-0::sensing>//This is Charles; he protects my signature from evil kumquats.
when I'm spawned::events hat//This is the code Charles' brain runs on.
forever
if <[100] > (distance to [an evil kumquat v])> then
delete the evil kumquat::control
end
end

Be moist.

Mryellowdoggy

I've been working on a web-scraper for Scratch (particularly for the forums currently) and it will probably be able to get all the data it needs in a few days.

I have a cool blog about Scratch stuff (check it out!): https://www.mryellowdog.com/

when green flag clicked
if <addicted to scratch> then
go to [outside v]
if <touching [grass v] ?> then
Make a scratch project
end
end

Waakul

I have a server; I can host it if someone has the source code.

Sid72020123

I can give direct access to SUI's database to fetch the username data. So, the new ScratchDB replacement can index those…

Sid72020123

gilbert_given_189

davidtheplatform wrote:
gilbert_given_189 wrote:
snip
I think reliability is a pretty big problem.
Let’s say we want all data to be duplicated 3x[1]. Smaller peers mean we need more peers in total to maintain enough duplication: if each peer holds 10% of the data, 30 peers would be the minimum. Bigger peers mean fewer peers are needed in total, but they require more expensive hardware per peer, so fewer people will be able to host them. Are there 30 active users who could run peers? I think it’s unlikely.
[1] so if one peer goes down, we can still verify that two sources agree

At least show the point you want to reply to…
I did mention requiring a lot of peers, but I didn't expect the problem to be that critical, especially since I can only count half a dozen Scratchers who might be candidates for peers. (Make it about 12 if we count potentially interested peers who are okay with not being active all the time.)
Oh well.

Waakul

I contacted Lefty; he said the problem isn't with ScratchDB itself but with its server, which just stopped working. The server was pretty weak, so he decided to move ScratchDB to a server in his friend's datacenter, but since the old one stopped working he can't get the data out of it. He's trying to fix it somehow and recover the data. ScratchDB used to glitch and go down every once in a while, which is why he wanted to move to a new server.

Hello,

Thanks for reaching out! Unfortunately, the issue isn’t with ScratchDB itself, but rather the system it was running on. I’m not able to turn it on easily right now. I had planned to move it to my new server in the datacenter, but I can’t recover the data until I get the original server turned back on. One thing that made ScratchDB a bit easier to run was that I could do it entirely for free on some hardware I had lying around, and now in the datacenter thanks to a friend. So, unfortunately, using something like GCP or AWS would be out of my budget. Thanks for the suggestions, though! I’d love to get it back online in some way, but I’m pretty busy right now.

Best,
Lefty

Last edited by Waakul (June 21, 2024 17:43:31)

davidtheplatform

gilbert_given_189 wrote:
davidtheplatform wrote:
gilbert_given_189 wrote:
snip
I think reliability is a pretty big problem.
Let’s say we want all data to be duplicated 3x[1]. Smaller peers mean we need more peers in total to maintain enough duplication: if each peer holds 10% of the data, 30 peers would be the minimum. Bigger peers mean fewer peers are needed in total, but they require more expensive hardware per peer, so fewer people will be able to host them. Are there 30 active users who could run peers? I think it’s unlikely.
[1] so if one peer goes down, we can still verify that two sources agree
At least show the point you want to reply to…
I did mention requiring a lot of peers, but I didn't expect the problem to be that critical, especially since I can only count half a dozen Scratchers who might be candidates for peers. (Make it about 12 if we count potentially interested peers who are okay with not being active all the time.)
Oh well.

A small number of peers that each have all the data could still work, and would probably be more reliable than a single server. In that case the exact number of peers doesn’t matter as long as it’s more than 0.
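
With a handful of full replicas, the "two sources agree" check from the footnote could look roughly like this. It is only a sketch: the responses are plain strings standing in for whatever a peer would return, and the quorum of 2 is an assumption taken from the 3x duplication example.

```python
from collections import Counter
from typing import Optional


def agreed_value(responses: list[str], quorum: int = 2) -> Optional[str]:
    """Return the answer that enough replicas agree on, or None.

    An answer is accepted only if at least `quorum` replicas returned it
    and it is a strict majority of the responses received, so a tie
    between two different answers is never accepted.
    """
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    if count >= quorum and count > len(responses) / 2:
        return value
    return None


# Three replicas answer; one of them is stale or malicious.
print(agreed_value(["post v2", "post v2", "post v1"]))
# Only one replica is reachable, so nothing can be verified.
print(agreed_value(["post v2"]))
```

If fewer than two replicas answer, the caller gets `None` and has to decide whether to retry or to serve unverified data.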

Mryellowdoggy

Sid72020123 wrote:
I can give a direct access to the SUI's DataBase to fetch the usernames data. So, the new ScratchDB replacement can index those…

that would be helpful since it's not an easy task to get all of the users

Last edited by Mryellowdoggy (June 21, 2024 21:59:08)
