Discuss Scratch

imfh
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

fdreerf wrote:

Words cannot express my gratitude!
It’s no problem! I enjoy writing programs to do things like this.

Last edited by imfh (Oct. 28, 2020 00:46:20)

imfh
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

In order to run the program, you'll need Python 3.6+ and the requests module. If you don't have Python 3.6+, I can change it pretty easily to work with an earlier version. The requests module can be installed using pip. Basically, open the command line and type "py -m pip install --user requests" (--user can be left off if you have admin rights and want it installed for all users).
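If you're not sure whether your setup is ready, here's a quick optional check you can run first (this is just a sketch, not part of the two files below; run it with the same interpreter you'll use for the scripts):

# Optional environment check: verifies the interpreter version and that requests imports
import sys

assert sys.version_info >= (3, 6), "Python 3.6+ is needed (the scripts use f-strings)"

try:
    import requests
    print("requests", requests.__version__, "is installed")
except ImportError:
    print("requests is missing; install it with: py -m pip install --user requests")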

There are two parts to the program. If you run part 1 and then part 2, it will (as currently set up) make a file called “DB_List.csv” which contains a list of every topic on the first 10 pages of Suggestions.

Part 1 gets each page of the Suggestions forum mobile view, saves the html to a zip file, and copies the important info into a csv file. You can adjust the start and end pages using START_PAGE and END_PAGE at the top of the file.

Part 2 reads the csv file from part 1 to get the topic id and reply count for each topic. It then gets each topic from ScratchDB, saves the json to a zip file, and copies the important info into another csv file. You can also adjust which topics it gets from ScratchDB, but I have it set to just get everything from part 1.

Both files also have delay options to try and be nicer to the servers. You should be able to lower them if you want it to go faster.
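For reference, the delay works like this (a tiny sketch of the same formula both files use, with file 1's default values; response_time comes from the server response):

import time

REQUEST_DELAY = 2    # base delay in seconds
ADAPTIVE_DELAY = 1   # multiplied by the server's response time

def polite_sleep(response_time):
    """Wait the base delay plus extra time proportional to how slow the server was."""
    time.sleep(REQUEST_DELAY + ADAPTIVE_DELAY * response_time)

polite_sleep(0.8)  # a 0.8s response means a 2.8s wait before the next request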

You might want to copy the code after clicking quote, since the code box removes extra newlines. It should still work the same either way, but the newlines make everything more readable.

file 1:
"""
The exported CSV data is in this order:
    replies, open? (1/0), topic id, title
The items are separated by tabs and newlines.
All tabs and newlines are removed from the title.
START_PAGE - The first page of mobile view to retrieve
END_PAGE - The last page to retrieve
OUTPUT_FILE - The output CSV path
ZIP_PATH - Where to save the output zip
HEADER - Let Scratch know where the requests are from (sort of)
URL - Path to the mobile view
REQUEST_DELAY - Base delay to wait
ADAPTIVE_DELAY - Multiplied by server response time
PATTERN - Regex used to parse the mobile view html
"""
import re
import time
from html import unescape
from zipfile import ZipFile
import requests  # Can be installed from pip
START_PAGE = 1
END_PAGE = 10
OUTPUT_FILE = "MobileList.csv"
ZIP_PATH = f"MobilePages{START_PAGE}-{END_PAGE}.zip"
HEADER = {'User-Agent': 'SuggestionsIndexing'}
URL = "https://scratch.mit.edu/discuss/m/1/?page="
# delay = request delay + (adaptive delay * response time)
REQUEST_DELAY = 2
ADAPTIVE_DELAY = 1
PATTERN = (r"(closed \"\>)?(?# Closed Topic)"
           r"\s+\<a class\=\"item\" href\="
           r"\"\/discuss\/m\/topic\/(\d+)(?# Topic Id)"
           r"\/\"\>\s+\<strong\>(.+?)(?# Topic Title)"
           r"\<\/strong\>\s+\<span\>(\d+)(?# Topic Replies)"
           r" replies\<\/span\>\s+\<\/a\>")
# Example segment of HTML
#         <a class="item" href="/discuss/m/topic/343602/">
#             <strong>The Official List of Rejected Suggestions</strong>
#             <span>2573 replies</span>
#         </a>
def main():
    """Run the program"""
    with open(OUTPUT_FILE, 'a', encoding='utf-8') as output:
        with ZipFile(ZIP_PATH, 'a') as zipfile:
            for number in range(START_PAGE, END_PAGE + 1):
                # Download the page
                text, delay = get_page(number, zipfile)
                # Parse the page
                data = parse_page(text)
                # Save the page to the file
                output.write(data)
                # Output status info
                print(f"Read page {number} in {delay:.2f}s.",
                      f"Delay: {REQUEST_DELAY + ADAPTIVE_DELAY * delay:.2f}")
                # Wait for a delay
                time.sleep(REQUEST_DELAY)
                time.sleep(ADAPTIVE_DELAY * delay)
def get_page(number, zipfile):
    """Gets a page from the URL and saves it to the zip"""
    resp = requests.get(URL + str(number), headers=HEADER)
    zipfile.writestr(f"p{number}.html", resp.content)
    return unescape(resp.text), resp.elapsed.total_seconds()
def parse_page(text):
    """Extracts data from a page into csv format"""
    result = []
    for match in re.finditer(PATTERN, text):
        # Get data from each group
        closed, topic, title, replies = match.groups()
        # Format the data
        is_open = '0' if closed else '1'  # open? flag: 1 = open, 0 = closed
        title = re.sub(r"\t*\n*", "", title)  # remove \t and \n
        # Add the data to the result list
        result.append('\t'.join((replies, is_open, topic, title)))
    return '\n'.join(result) + '\n'
# def get_dummy(_):
#     """Get a saved page for testing purposes"""
#     with open("page1.html", 'r') as file:
#         return file.read(), 0
if __name__ == "__main__":
    main()

file 2:
"""
This program gets information about a list of topic ids from
ScratchDB. Consolidated information is saved in a json, and all
retrieved info (including the first 50 posts) is saved to a zip.
START_TOPIC - The first topic in the list to read
END_TOPIC - The last topic in the list to read. Also stops if
    it runs out of topics.
INPUT_LIST - The path to the topic_list zip
OUTPUT_FILE - An output CSV of consolidated useful information
OUTPUT_ZIP - A data retrieved from ScratchDB is saved here so it
    doesn't have to be retrieved again for more info
INPUT_ZIP - Used if reading data retrieved from ScratchDB previously
    rather than getting it from the server again
HEADER - Let ScratchDB know where the requests come from (sort of).
URL - The url to ScratchDB
REQUEST_DELAY - Delay between requests, in seconds
ADAPTIVE_DELAY - Additional delay, multiplied by server work time
"""
import json
import math
import re
import time
from zipfile import ZipFile
import requests  # Can be installed from pip
START_TOPIC = 0  # Skip this many
END_TOPIC = float('inf')  # Stop after this many
INPUT_LIST = "MobileList.csv"
OUTPUT_FILE = "DB_List.csv"
OUTPUT_ZIP = f"DB_data.zip"
INPUT_ZIP = "DB_data.zip"  # zip made by this program
HEADER = {'User-Agent': 'SuggestionIndexing'}
URL = "https://scratchdb.lefty.one/v2/forum/topic/"
REQUEST_DELAY = 0
ADAPTIVE_DELAY = 2
MAX_TIME = 15  # Max time before timeout
TIMEOUT_DELAY = 5  # Time to wait before retrying
def main():
    """Run the program"""
    # Download the pages
    data = get_topics()
    # Read the pages from a zip
    # data = read_topics()
    # Save result to a CSV
    parse_topics(data)
    # Format result to bbcode?
    # parse_topics2(data)
def get_topics():
    """An iterator to get topics from ScratchDB"""
    with open(INPUT_LIST, 'r', encoding="utf-8") as topics:
        with ZipFile(OUTPUT_ZIP, 'w') as zipfile:
            for i, line in enumerate(topics):
                # Skip the first topics
                if i < START_TOPIC - 1:
                    continue
                # Stop early
                if i > END_TOPIC - 1:
                    break
                replies, _, topicid = line.split()[0:3]
                # Guess the DB page number
                page = math.ceil(int(replies) / 50) - 1
                if page < 0:
                    continue  # 0 replies
                # Get the page
                topic, content = get_topic(topicid, page)
                topic['replies'] = replies
                # Ignore empty topics
                if len(topic['posts']) == 0:
                    continue
                # Save it to the zipfile
                zipfile.writestr(f"{replies}-{topicid}.json", content)
                # Yield the topic from the iterator
                yield topic
def get_topic(topicid, page, previous=None):
    """Gets the db topic, writes it to the zip, and adds replies"""
    # To get the oldest post on a topic, we have to get the
    # last page. To guess the last page, take the number of
    # posts / 50. If the guess is too high and no posts are
    # returned, try the previous page. If the guess is possibly
    # too low and 50 posts are returned, try the next page.
    # Grab the page
    resp = request_timeout(f"{URL}{topicid}/{page}")
    topic = resp.json()
    # Output status info
    delay = topic['query_time']
    print(f"Retrieved topic {topicid}/{page} in {delay}ms.",
          f"Delay: {REQUEST_DELAY + ADAPTIVE_DELAY * delay / 1000:.2f}")
    # Wait for a delay
    time.sleep(REQUEST_DELAY)
    time.sleep(ADAPTIVE_DELAY * delay / 1000)
    # Check if the guessed page was too high
    if len(topic['posts']) == 0:
        # Pages are 0-indexed here, so fall back as long as page > 0
        return previous or (page > 0 and get_topic(topicid, page - 1))
    # Check if the guessed page was too low
    if len(topic['posts']) == 50:
        # Posts may have been added
        return get_topic(topicid, page + 1, (topic, resp.content))
    return topic, resp.content
def request_timeout(url):
    """Sends a request and handles timeouts"""
    while True:
        try:
            return requests.get(url, headers=HEADER, timeout=MAX_TIME)
        except requests.exceptions.Timeout:
            print(f"Timeout for '{url}' Delay: {TIMEOUT_DELAY}")
            time.sleep(TIMEOUT_DELAY)
def read_topics():
    """An iterator to read topics from the zip"""
    with ZipFile(INPUT_ZIP, 'r') as zipfile:
        # Sort the files by replies
        for file in sorted(
                zipfile.namelist(),
                key=lambda name: int(name.split('-')[0]),
                reverse=True):
            yield json.loads(zipfile.read(file))
def parse_topics(topics):
    """Parses the db topics and saves them to a CSV"""
    with open(OUTPUT_FILE, 'w', encoding='UTF-8') as output:
        for topic in topics:
            # Get the OP
            post = topic['posts'][-1]
            while not post['username']:  # Post 3712180 is weird?
                del topic['posts'][-1]  # Not a perfect solution
                post = topic['posts'][-1]
            # Items in result are saved to a line of the CSV
            result = (
                # Topic Id
                str(topic['id']),
                # Number of posts (added above)
                str(topic['replies']),
                # Time of the OP
                format_date(post['time']['posted']),
                # Username of the OP
                post['username'],
                # Title without newlines or tabs
                re.sub(r"\t*\n*", "", topic['title'])
            )
            output.write('\t'.join(result) + '\n')
def parse_topics2(topics):
    """Quickly typed up, formats the data with bbcode"""
    with open(OUTPUT_FILE, 'w', encoding='UTF-8') as output:
        for topic in topics:
            # Get the data
            topicid = str(topic['id'])
            title = topic['title']
            username = ""  # topic['posts'][0]['username']
            date = format_date(topic['posts'][0]['time']['posted'])
            # Write bbcode to file
            output.write(
                f"[quote][b][url=scratch.mit.edu/discuss/topic/"
                f"{topicid}]{title}[/url][/b]\n"
                f"By @{username} on {date}[/quote]\n"
            )
def format_date(date_str, new_format="%b. %d, %Y %H:%M:%S", old_format="%Y-%m-%dT%H:%M:%S.000Z"):
    """Returns a reformatted date string"""
    if date_str is None:
        return "0"
    date = time.strptime(date_str, old_format)
    return time.strftime(new_format, date)
if __name__ == "__main__":
    main()
imfh
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

You can get the date of the last post by calling format_date on the newest post's timestamp in parse_topics, i.e. topic['posts'][0]['time']['posted'] (the posts come back newest first). The tuple under result = ( … ) is what gets written to each line of the CSV, so add that formatted date to it.

Be aware that if the topic has more than 50 posts, this will give the date of the 50th post (the newest post on the retrieved page), so you'll need to make an exception for topics with more than 50 posts. It's possible to get the most recent post for those too if you want, but it'll take a little more rewriting.
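Roughly, the change would look something like this (topic_to_csv_line is just a hypothetical helper to show the idea, not code from the files above):

import re
import time

def format_date(date_str, new_format="%b. %d, %Y %H:%M:%S",
                old_format="%Y-%m-%dT%H:%M:%S.000Z"):
    """Same date helper as in file 2."""
    if date_str is None:
        return "0"
    return time.strftime(new_format, time.strptime(date_str, old_format))

def topic_to_csv_line(topic):
    """Hypothetical helper: build one CSV line that also includes the date
    of the newest post on the retrieved page. ScratchDB returns posts
    newest-first, so posts[0] is the newest post on that page and posts[-1]
    is the OP. For topics with more than 50 posts this is only the newest
    post on the last page, not the true latest reply."""
    post = topic['posts'][-1]  # the OP
    return '\t'.join((
        str(topic['id']),
        str(topic['replies']),
        format_date(post['time']['posted']),               # date of the OP
        post['username'],
        format_date(topic['posts'][0]['time']['posted']),  # date of newest post on the page
        re.sub(r"\t*\n*", "", topic['title']),
    ))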

I might make a GitHub for this later so I can put up changes.
EpicGhoul993
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

(Offtopic)
The topic names look weird: Getting a list of topics with specific requirements‮ by fdreerf (New Posts)
Ihatr
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

EpicGhoul993 wrote:

(Offtopic)
The topic names look weird: Getting a list of topics with specific requirements‮ by fdreerf (New Posts)
RTL Override.

I guess fdreerf wanted to test it?
mybearworld
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

Ihatr wrote:

EpicGhoul993 wrote:

(Offtopic)
The topic names look weird: Getting a list of topics with specific requirements‮ by fdreerf (New Posts)
RTL Override.

I guess fdreerf wanted to test it?
I guess it's fixed xD
I love how the BBCode says [etouq/]…
Also, won't that hit the character limit?
EpicGhoul993
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

mybearworld wrote:

Ihatr wrote:

EpicGhoul993 wrote:

(Offtopic)
The topic names look weird: Getting a list of topics with specific requirements‮ by fdreerf (New Posts)
RTL Override.

I guess fdreerf wanted to test it?
I guess it's fixed xD
I love how the BBCode says [etouq/]…
Also, won't that hit the character limit?
It will? Still a character after all.
Zerofile
Scratcher
100+ posts

Getting a list of topics with specific requirements‮

Yeah, it's called U+202E, ‮and it just messes up formatting
mybearworld
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

EpicGhoul993 wrote:

mybearworld wrote:

Ihatr wrote:

EpicGhoul993 wrote:

(Offtopic)
The topic names look weird: Getting a list of topics with specific requirements‮ by fdreerf (New Posts)
RTL Override.

I guess fdreerf wanted to test it?
I guess it's fixed xD
I love how the BBCode says [etouq/]…
Also, won't that hit the character limit?
It will? Still a character after all.
No, I'm talking about getting every topic with more than 39 posts.
mybearworld
Scratcher
1000+ posts

Getting a list of topics with specific requirements‮

Zerofile wrote:

Yeah, it's called U+202E, ‮and it just messes up formatting
Yes, and also you can revert the character ‮by using the character ‭but that messes stuff REALLY up.
print(ascii("Yes, and also you can revert the character ‮by using the character ‭but that messes stuff REALLY up. "))
'Yes, and also you can revert the character \u202eby using the character \u202dbut that messes stuff REALLY up. '
