Discuss Scratch
- Discussion Forums
- » Advanced Topics
- » Getting a list of topics with specific requirements
- imfh
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
Words cannot express my gratitude!
It's no problem! I enjoy writing programs to do things like this.
Last edited by imfh (Oct. 28, 2020 00:46:20)
- imfh
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
In order to run the program, you'll need Python 3.6+ and the requests module. If you don't have Python 3.6+, I can change it pretty easily to work with an earlier version. The requests module can be installed using pip. Basically, open the command line and type: "py -m pip install --user requests" (--user can be left off if you have admin and want it for all users).
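If you want to double check your setup before running anything, a quick sanity check like this should work (just a sketch, not part of the program):

# Optional check: confirm Python 3.6+ and the requests module are available
import sys

if sys.version_info < (3, 6):
    sys.exit("Python 3.6+ is needed (the scripts use f-strings).")

try:
    import requests  # noqa: F401 - only checking that it can be imported
except ImportError:
    sys.exit('requests is missing; install it with "py -m pip install --user requests".')

print("Setup looks good; you can run part 1 and then part 2.")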
There are two parts to the program. If you run part 1 and then part 2, it will (as currently setup) make a file called “DB_List.csv” which contains a list of every topic on the first 10 pages of suggestions.
Part 1 gets each page of the Suggestions forum mobile view, saves the html to a zip file, and copies the important info into a csv file. You can adjust the start and end pages using START_PAGE and END_PAGE at the top of the file.
Part 2 reads the csv file of Part 1 to get the topicid and replies. It then gets each topic from ScratchDB, saves the json to a zip file, and copies the important info into another csv file. You can also adjust which topics it gets from ScratchDB, but I have it set to just get everything from part 1.
Both files also have delay options to try and be nicer to the servers. You should be able to lower them if you want it to go faster.
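To show how the two parts connect, here is a rough sketch with a made-up CSV line in the format part 1 writes and part 2 reads (tab-separated fields: replies, open (1/0), topic id, title):

# Hypothetical line from MobileList.csv, just for illustration
line = "2573\t1\t343602\tThe Official List of Rejected Suggestions\n"

# Part 2 only needs the first three fields
replies, is_open, topicid = line.split()[0:3]
print(replies, is_open, topicid)  # 2573 1 343602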
You might want to copy the code after clicking Quote, since the code box removes extra newlines. It should still work the same without them, but the extra newlines make everything more readable.
file 1:
""" The exported CSV data is in this order: replies, open? (1/0), topic id, title The items are seperated by tabs and newlines. All tabs and newlines are removed from the title. START_PAGE - The first page of mobile view to retrieve END_PAGE - The last page to retrieve OUTPUT_FILE - The output CSV path ZIP_PATH - Where to save the output zip HEADER - Let Scratch know where the requests are from (sort of) URL - Path to the mobile view REQUEST_DELAY - Base delay to wait ADAPTIVE_DELAY - Multiplied by server response time PATTERN - Regex used to parse the mobile view html """ import re import time from html import unescape from zipfile import ZipFile import requests # Can be installed from pip START_PAGE = 1 END_PAGE = 10 OUTPUT_FILE = "MobileList.csv" ZIP_PATH = f"MobilePages{START_PAGE}-{END_PAGE}.zip" HEADER = {'User-Agent': 'SuggestionsIndexing'} URL = "https://scratch.mit.edu/discuss/m/1/?page=" # delay = request delay + (adaptive delay * response time) REQUEST_DELAY = 2 ADAPTIVE_DELAY = 1 PATTERN = (r"(closed \"\>)?(?# Closed Topic)" r"\s+\<a class\=\"item\" href\=" r"\"\/discuss\/m\/topic\/(\d+)(?# Topic Id)" r"\/\"\>\s+\<strong\>(.+?)(?# Topic Title)" r"\<\/strong\>\s+\<span\>(\d+)(?# Topic Replies)" r" replies\<\/span\>\s+\<\/a\>") # Example segment of HTML # <a class="item" href="/discuss/m/topic/343602/"> # <strong>The Official List of Rejected Suggestions</strong> # <span>2573 replies</span> # </a> def main(): """Run the program""" with open(OUTPUT_FILE, 'a', encoding='utf-8') as output: with ZipFile(ZIP_PATH, 'a') as zipfile: for number in range(START_PAGE, END_PAGE + 1): # Download the page text, delay = get_page(number, zipfile) # Parse the page data = parse_page(text) # Save the page to the file output.write(data) # Output status info print(f"Read page {number} in {delay:.2f}s.", f"Delay: {REQUEST_DELAY + ADAPTIVE_DELAY * delay:.2f}") # Wait for a delay time.sleep(REQUEST_DELAY) time.sleep(ADAPTIVE_DELAY * delay) def get_page(number, zipfile): """Gets a page from the URL and saves it to the zip""" resp = requests.get(URL + str(number), headers=HEADER) zipfile.writestr(f"p{number}.html", resp.content) return unescape(resp.text), resp.elapsed.total_seconds() def parse_page(text): """Extracts data from a page into csv format""" result = [] for match in re.finditer(PATTERN, text): # Get data from each group closed, topic, title, replies = match.groups() # Format the data closed = '0' if closed else '1' title = re.sub(r"\t*\n*", "", title) # remove \t and \n # Add the data to the result list result.append('\t'.join((replies, closed, topic, title))) return '\n'.join(result) + '\n' # def get_dummy(_): # """Get a saved page for testing purposes""" # with open("page1.html", 'r') as file: # return file.read(), 0 if __name__ == "__main__": main()
file 2:
""" This program gets information about a list of topic ids from ScratchDB. Consolidated information is saved in a json, and all retrieved info (including the first 50 posts) is saved to a zip. START_TOPIC - The first topic in the list to read END_TOPIC - The last topic in the list to read. Also stops if it runs out of topics. INPUT_LIST - The path to the topic_list zip OUTPUT_FILE - An output CSV of consolidated useful information OUTPUT_ZIP - A data retrieved from ScratchDB is saved here so it doesn't have to be retrieved again for more info INPUT_ZIP - Used if reading data retrieved from ScratchDB previously rather than getting it from the server again HEADER - Let ScratchDB know where the requests come from (sort of). URL - The url to ScratchDB REQUEST_DELAY - Delay between requests, in seconds ADAPTIVE_DELAY - Additional delay, multiplied by server work time """ import json import math import re import time from zipfile import ZipFile import requests # Can be installed from pip START_TOPIC = 0 # Skip this many END_TOPIC = float('inf') # Stop after this many INPUT_LIST = "MobileList.csv" OUTPUT_FILE = "DB_List.csv" OUTPUT_ZIP = f"DB_data.zip" INPUT_ZIP = "DB_data.zip" # zip made by this program HEADER = {'User-Agent': 'SuggestionIndexing'} URL = "https://scratchdb.lefty.one/v2/forum/topic/" REQUEST_DELAY = 0 ADAPTIVE_DELAY = 2 MAX_TIME = 15 # Max time before timeout TIMEOUT_DELAY = 5 # Time to wait before retrying def main(): """Run the program""" # Download the pages data = get_topics() # Read the pages from a zip # data = read_topics() # Save result to a CSV parse_topics(data) # Format result to bbcode? # parse_topics2(data) def get_topics(): """An iterator to get topics from ScratchDB""" with open(INPUT_LIST, 'r', encoding="utf-8") as topics: with ZipFile(OUTPUT_ZIP, 'w') as zipfile: for i, line in enumerate(topics): # Skip the first topics if i < START_TOPIC - 1: continue # Stop early if i > END_TOPIC - 1: break replies, _, topicid = line.split()[0:3] # Guess the DB page number page = math.ceil(int(replies) / 50) - 1 if page < 0: continue # 0 replies # Get the page topic, content = get_topic( topicid, math.ceil(int(replies) / 50) - 1) topic['replies'] = replies # Ignore empty topics if len(topic['posts']) == 0: continue # Save it to the zipfile zipfile.writestr(f"{replies}-{topicid}.json", content) # Yield the topic from the iterator yield topic def get_topic(topicid, page, previous=None): """Gets the db topic, writes it to the zip, and adds replies""" # To get the oldest post on a topic, we have to get the # last page. To guess the last page, take the number of # posts / 50. If the guess is too high and no posts are # returned, try the previous page. If the guess is possibly # too low and 50 posts are returned, try the next page. 
# Grab the page resp = request_timeout(f"{URL}{topicid}/{page}") topic = resp.json() # Output status info delay = topic['query_time'] print(f"Retrieved topic {topicid}/{page} in {delay}ms.", f"Delay: {REQUEST_DELAY + ADAPTIVE_DELAY * delay / 1000:.2f}") # Wait for a delay time.sleep(REQUEST_DELAY) time.sleep(ADAPTIVE_DELAY * delay / 1000) # Check if the guessed page was too high if len(topic['posts']) == 0: return previous or (page > 1 and get_topic(topicid, page - 1)) # Check if the guessed page was too low if len(topic['posts']) == 50: # Posts may have been added return get_topic(topicid, page + 1, (topic, resp.content)) return topic, resp.content def request_timeout(url): """Sends a request and handles timeouts""" while True: try: return requests.get(url, headers=HEADER, timeout=MAX_TIME) except requests.exceptions.Timeout: print(f"Timeout for '{url}' Delay: {TIMEOUT_DELAY}") time.sleep(TIMEOUT_DELAY) def read_topics(): """An iterator to read topics from the zip""" with ZipFile(INPUT_ZIP, 'r') as zipfile: # Sort the files by replies for file in sorted( zipfile.namelist(), key=lambda name: int(name.split('-')[0]), reverse=True): yield json.loads(zipfile.read(file)) def parse_topics(topics): """Parses the db topics and saves them to a CSV""" with open(OUTPUT_FILE, 'w', encoding='UTF-8') as output: for topic in topics: # Get the OP post = topic['posts'][-1] while not post['username']: # Post 3712180 is weird? del topic['posts'][-1] # Not a perfect solution post = topic['posts'][-1] # Items in result are saved to a line of the CSV result = ( # Topic Id str(topic['id']), # Number of posts (added above) str(topic['replies']), # Time of the OP format_date(post['time']['posted']), # Username of the OP post['username'], # Title without newlines or tabs re.sub(r"\t*\n*", "", topic['title']) ) output.write('\t'.join(result) + '\n') def parse_topics2(topics): """Quickly typed up, formats the data with bbcode""" with open(OUTPUT_FILE, 'w', encoding='UTF-8') as output: for topic in topics: # Get the data topicid = str(topic['id']) title = topic['title'] username = "" # topic['posts'][0]['username'] date = format_date(topic['posts'][0]['time']['posted']) # Write bbcode to file output.write( f"[quote][b][url=scratch.mit.edu/discuss/topic/" f"{topicid}]{title}[/url][/b]\n" f"By @{username} on {date}[/quote]\n" ) def format_date(date_str, new_format="%b. %d, %Y %H:%M:%S", old_format="%Y-%m-%dT%H:%M:%S.000Z"): """Returns a reformatted date string""" if date_str is None: return "0" date = time.strptime(date_str, old_format) return time.strftime(new_format, date) if __name__ == "__main__": main()
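The page guessing in get_topic is just the reply count divided by 50, rounded up, minus one. A few made-up reply counts show which ScratchDB page it tries first:

# Hypothetical reply counts, showing the first page get_topic requests
import math

for replies in (1, 50, 51, 2573):
    page = math.ceil(replies / 50) - 1
    print(f"{replies} replies -> start at page {page}")
# 1 replies -> start at page 0
# 50 replies -> start at page 0
# 51 replies -> start at page 1
# 2573 replies -> start at page 51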
- imfh
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
You can get the date of the last post using format_date() inside parse_topics(topics). The result = ( … ) tuple is what gets written to the CSV, so add the formatted date to it.
Be aware that if the topic has more than 50 posts, it will give the date of the 50th post, so you'll need to make an exception for topics with more than 50 posts. It's possible to get the most recent post for those too if you want, but it'll take a little more rewriting.
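Something like this is what I mean, assuming ScratchDB returns the posts newest first, so posts[0] is the most recent post on the page it retrieved (only exact for topics with 50 or fewer posts). The result tuple in parse_topics() would become:

# Sketch only: add the last post's date as an extra CSV column
last_post = topic['posts'][0]

result = (
    # Topic Id
    str(topic['id']),
    # Number of posts (added above)
    str(topic['replies']),
    # Time of the OP
    format_date(post['time']['posted']),
    # Time of the last post (new column)
    format_date(last_post['time']['posted']),
    # Username of the OP
    post['username'],
    # Title without newlines or tabs
    re.sub(r"\t*\n*", "", topic['title'])
)
output.write('\t'.join(result) + '\n')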
I might make a GitHub for this later so I can put up changes.
- EpicGhoul993
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
(Offtopic)
The topic names look weird: Getting a list of topics with specific requirements by fdreerf (New Posts)
- Ihatr
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
(Offtopic)
RTL Override.
The topic names look weird: Getting a list of topics with specific requirements by fdreerf (New Posts)
I guess fdreerf wanted to test it?
- mybearworld
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
I guess it's fixed xD
(Offtopic)
RTL Override.
The topic names look weird: Getting a list of topics with specific requirements by fdreerf (New Posts)
I guess fdreerf wanted to test it?
I love how the BBCode says [etouq/]…
Also, won't that hit the character limit?
- EpicGhoul993
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
It will? Still a character after all.
I guess it's fixed xD
(Offtopic)
RTL Override.
The topic names look weird: Getting a list of topics with specific requirements by fdreerf (New Posts)
I guess fdreerf wanted to test it?
I love how the BBCode says [etouq/]…
Also, won't that hit the character limit?
- Zerofile
-
Scratcher
100+ posts
Getting a list of topics with specific requirements
Yeah, it's called U+202E, and it just messes up formatting.
- mybearworld
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
No, I'm talking about getting every script with 39< posts.
It will? Still a character after all.
I guess it's fixed xD
(Offtopic)
RTL Override.
The topic names look weird: Getting a list of topics with specific requirements by fdreerf (New Posts)
I guess fdreerf wanted to test it?
I love how the BBCode says [etouq/]…
Also, won't that hit the character limit?
- mybearworld
-
Scratcher
1000+ posts
Getting a list of topics with specific requirements
Yeah, it's called U+202E, and it just messes up formatting.
Yes, and also you can revert the character by using the character but that messes stuff REALLY up.
print(ascii("Yes, and also you can revert the character by using the character but that messes stuff REALLY up. "))
'Yes, and also you can revert the character \u202eby using the character \u202dbut that messes stuff REALLY up. '