core

Core functions for dialogify

Introduction

The Python Standard Library documentation is very helpful for learning Python, and so is Solveit, which is a Jupyter notebook plus AI with superpowers. Learning to program with AI is both fun and productive. I therefore wanted to convert the HTML pages of the Python documentation into solveit dialogs: sequences of short note and code messages under appropriate headings, which can be extracted from each page's table of contents.

How it works:

  • We first get the HTML from the Python documentation web page.
  • We turn it into (msg_type, element) tuples, where msg_type is note or code and element is a BeautifulSoup element.
  • We turn the elements into the appropriate solveit messages for the dialog.

The goal is to use # for the title, ## for subheading, and ### for each function definition from the docs.
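
The heading convention above could be captured in a small helper. This is only a sketch: the name `md_heading` and the `kind` labels are made up for illustration, and wrapping definitions in backticks follows the formatted output shown later on this page.

```python
# Hypothetical helper: maps the three kinds of headings to markdown prefixes.
PREFIX = {'title': '#', 'section': '##', 'definition': '###'}

def md_heading(kind, text):
    """Return a markdown heading line for a title, section, or definition."""
    text = text.strip()
    if kind == 'definition':
        text = f'`{text}`'  # signatures are rendered as inline code
    return f'{PREFIX[kind]} {text}'

print(md_heading('title', 'random'))
print(md_heading('section', 'Bookkeeping functions'))
print(md_heading('definition', 'random.seed(a=None, version=2)'))
```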

First, we grab html from the documentation and create soup.

import httpx
from bs4 import BeautifulSoup

doc_url = 'https://docs.python.org/3/library/random.html'
doc_html = httpx.get(doc_url).text
doc_html[:600]
'<!DOCTYPE html>\n\n<html lang="en" data-content_root="../">\n  <head>\n    <meta charset="utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />\n<meta property="og:title" content="random — Generate pseudo-random numbers" />\n<meta property="og:type" content="website" />\n<meta property="og:url" content="https://docs.python.org/3/library/random.html" />\n<meta property="og:site_name" content="Python documentation" />\n<meta property="og:description" content="Source code: Lib/random.py This module imple'
soup = BeautifulSoup(doc_html, 'html.parser')

Some helpful utilities

Here are some utility functions for getting the main content, cleaning text, getting title, etc.



get_main


def get_main(
    soup
):

Extract the main content section from Python docs soup

ms = get_main(soup); str(ms)[:300]
'<section id="module-random">\n<span id="random-generate-pseudo-random-numbers"></span><h1><code class="xref py py-mod docutils literal notranslate"><span class="pre">random</span></code> — Generate pseudo-random numbers<a class="headerlink" href="#module-random" title="Link to this heading">¶</a></h1'


clean_txt


def clean_txt(
    el
):

Clean element text by removing paragraph signs and extra whitespace
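
A minimal sketch of this kind of cleaning, assuming the element's text has already been extracted as a string (the real clean_txt takes an element). The name `clean_text` is hypothetical; only the standard library is used.

```python
import re

def clean_text(text):
    """Drop the paragraph sign used for header links and collapse whitespace."""
    without_sign = text.replace('¶', '')
    return re.sub(r'\s+', ' ', without_sign).strip()

print(clean_text('Bookkeeping   functions¶\n'))
```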



get_title


def get_title(
    section
):

Extract the h1 title from a section

get_title(ms)
'random — Generate pseudo-random numbers'

Before turning the soup into markdown, we split the main content into sections, represented as (title, section) tuples.



get_sections


def get_sections(
    main
):

Get all direct child sections as (title, section_element) tuples
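
One way such a function might work is sketched below with BeautifulSoup. This is an assumption about the approach, not the package's code: `direct_sections` and `section_title` are hypothetical names, and the sample HTML is a reduced stand-in for a real docs page.

```python
from bs4 import BeautifulSoup

def section_title(section):
    """Text of the first heading in a section, without the paragraph sign."""
    h = section.find(['h1', 'h2', 'h3'])
    return h.get_text().replace('¶', '').strip() if h else ''

def direct_sections(main):
    """Return (title, section) tuples for direct child sections only."""
    return [(section_title(s), s)
            for s in main.find_all('section', recursive=False)]

html = '''<section id="top"><h1>Top<a class="headerlink" href="#top">¶</a></h1>
<section id="a"><h2>Alpha<a class="headerlink" href="#a">¶</a></h2>
<section id="a1"><h3>Nested</h3></section></section>
<section id="b"><h2>Beta</h2></section></section>'''
main = BeautifulSoup(html, 'html.parser').find('section')
titles = [title for title, _ in direct_sections(main)]
print(titles)
```

Passing `recursive=False` is what keeps nested subsections out of the result, so each top-level section stays intact with its children.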

len(get_sections(ms))
12

We can grab the sections and pick out the first one, "Bookkeeping functions".

sts = get_sections(ms)
bk = sts[0][1]
str(bk)[:300]
'<section id="bookkeeping-functions">\n<h2>Bookkeeping functions<a class="headerlink" href="#bookkeeping-functions" title="Link to this heading">¶</a></h2>\n<dl class="py function">\n<dt class="sig sig-object py" id="random.seed">\n<span class="sig-prename descclassname"><span class="pre">random.</span><'

Let's preview the messages to check that they look good.



preview_msgs


def preview_msgs(
    msgs
):

Preview message tuples as rendered markdown
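
The preview layout, a bracketed label followed by the content, can be imitated for plain strings as below. This is a sketch under assumptions: `preview_text` is a hypothetical name, it returns text rather than displaying rendered markdown, and it expects the content to already be a string.

```python
def preview_text(msgs):
    """Format (label, content) tuples as '[label]' blocks separated by blank lines."""
    blocks = [f'[{label}]\n\n{str(content).strip()}' for label, content in msgs]
    return '\n\n'.join(blocks)

print(preview_text([('note', 'Generate n random bytes.'),
                    ('code', 'import random')]))
```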

preview_msgs(get_sections(ms)[:2])

[Bookkeeping functions]

Bookkeeping functions

random.seed(a=None, version=2)

Initialize the random number generator.

If a is omitted or None, the current system time is used. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability).

If a is an int, its absolute value is used directly.

With version 2 (the default), a str, bytes, or bytearray object gets converted to an int and all of its bits are used.

With version 1 (provided for reproducing random sequences from older versions of Python), the algorithm for str and bytes generates a narrower range of seeds.

Changed in version 3.2: Moved to the version 2 scheme which uses all of the bits in a string seed.

Changed in version 3.11: The seed must be one of the following types: None, int, float, str, bytes, or bytearray.

random.getstate()

Return an object capturing the current internal state of the generator. This object can be passed to setstate() to restore the state.

random.setstate(state)

state should have been obtained from a previous call to getstate(), and setstate() restores the internal state of the generator to what it was at the time getstate() was called.

[Functions for bytes]

Functions for bytes

random.randbytes(n)

Generate n random bytes.

This method should not be used for generating security tokens. Use secrets.token_bytes() instead.

Added in version 3.9.

html_to_md converts an HTML element into markdown, handling the tags that have a markdown equivalent.



html_to_md


def html_to_md(
    el, in_link:bool=False
):

Recursively convert HTML element to markdown string



html_to_md_children


def html_to_md_children(
    el, in_link:bool=False
):

Convert all children of an HTML element to markdown

print(html_to_md(bk))

Bookkeeping functions[¶](#bookkeeping-functions)


random.seed(*a=None*, *version=2*)[¶](#random.seed)
Initialize the random number generator.
If *a* is omitted or `None`, the current system time is used.  If
randomness sources are provided by the operating system, they are used
instead of the system time (see the [os.urandom()](os.html#os.urandom) function for details
on availability).
If *a* is an int, its absolute value is used directly.
With version 2 (the default), a [str](stdtypes.html#str), [bytes](stdtypes.html#bytes), or [bytearray](stdtypes.html#bytearray)
object gets converted to an [int](functions.html#int) and all of its bits are used.
With version 1 (provided for reproducing random sequences from older versions
of Python), the algorithm for [str](stdtypes.html#str) and [bytes](stdtypes.html#bytes) generates a
narrower range of seeds.

Changed in version 3.2: Moved to the version 2 scheme which uses all of the bits in a string seed.


Changed in version 3.11: The *seed* must be one of the following types:
`None`, [int](functions.html#int), [float](functions.html#float), [str](stdtypes.html#str),
[bytes](stdtypes.html#bytes), or [bytearray](stdtypes.html#bytearray).




random.getstate()[¶](#random.getstate)
Return an object capturing the current internal state of the generator.  This
object can be passed to [setstate()](#random.setstate) to restore the state.



random.setstate(*state*)[¶](#random.setstate)
*state* should have been obtained from a previous call to [getstate()](#random.getstate), and
[setstate()](#random.setstate) restores the internal state of the generator to what it was at
the time [getstate()](#random.getstate) was called.
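
The conversion shown above can be approximated for inline tags with the standard library alone. The sketch below is not the package's html_to_md: it swaps the recursive BeautifulSoup walk for an event-driven `html.parser.HTMLParser`, and the class name `InlineMarkdown` and its tag table are assumptions. It mirrors one behaviour visible in the output above, where link text carries no backticks, by suppressing emphasis and code markers inside an `<a>`, which is presumably what the `in_link` flag is for.

```python
from html.parser import HTMLParser

class InlineMarkdown(HTMLParser):
    """Convert a small subset of inline HTML to markdown."""
    WRAP = {'em': '*', 'i': '*', 'strong': '**', 'b': '**', 'code': '`'}

    def __init__(self):
        super().__init__()
        self.out = []    # markdown fragments emitted so far
        self.hrefs = []  # hrefs of the <a> tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href') or '')
            self.out.append('[')
        elif tag in self.WRAP and not self.hrefs:
            self.out.append(self.WRAP[tag])

    def handle_endtag(self, tag):
        if tag == 'a' and self.hrefs:
            self.out.append(f']({self.hrefs.pop()})')
        elif tag in self.WRAP and not self.hrefs:
            self.out.append(self.WRAP[tag])

    def handle_data(self, data):
        self.out.append(data)

def inline_html_to_md(html):
    """Feed an HTML fragment through the parser and join the fragments."""
    parser = InlineMarkdown()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(inline_html_to_md('If <em>a</em> is omitted or <code>None</code>'))
print(inline_html_to_md('<a href="os.html#os.urandom"><code>os.urandom()</code></a>'))
```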

soup to (msg_type, el)

Solveit has four message types: Code, Note, Prompt, and Raw. For creating dialogs, we only need note and code. By turning the soup into (msg_type, el) tuples, we can easily turn those into solveit messages with markdown.



has_cls


def has_cls(
    el, cls
):

Check if element has a specific CSS class
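
A class check like this is short enough to sketch. The version below is an assumption, not the package's code: `has_class` is a hypothetical name, and it relies only on the element having a dict-style `get`, which a BeautifulSoup Tag provides. A plain dict stands in for a Tag in the example.

```python
def has_class(el, cls):
    """True if cls is among the element's CSS classes."""
    classes = el.get('class') or []
    if isinstance(classes, str):
        # Some parsers keep the attribute as one space-separated string.
        classes = classes.split()
    return cls in classes

# A plain dict stands in for a tag such as <dt class="sig sig-object py">.
dt_attrs = {'class': ['sig', 'sig-object', 'py']}
print(has_class(dt_attrs, 'sig'))
print(has_class(dt_attrs, 'simple'))
```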

dt is special because the Python docs use it for function definitions.



get_msg_type


def get_msg_type(
    el
):

Determine message type (‘note’, ‘code’, or ‘dt’) for an HTML element
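
The rules that decide the type are not shown on this page, so the sketch below guesses at them: a `dt` carrying the `sig` class is a definition, a `pre` block is code, and everything else is a note. The function name `classify` and its `(tag_name, classes)` arguments are hypothetical; the real function takes an element.

```python
def classify(tag_name, classes=()):
    """Guess a message type from a tag name and its CSS classes (assumed rules)."""
    if tag_name == 'dt' and 'sig' in classes:
        return 'dt'    # a function or class signature
    if tag_name == 'pre':
        return 'code'  # a preformatted code sample
    return 'note'      # paragraphs, lists, and anything else

print(classify('dt', ['sig', 'sig-object', 'py']))
print(classify('pre'))
print(classify('p'))
```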



collect_msgs


def collect_msgs(
    el
):

Recursively collect (msg_type, element) tuples from HTML tree



format_msg


def format_msg(
    msg_type, el
):

Convert (msg_type, element) tuple to (msg_type, markdown_string)



table_to_md


def table_to_md(
    table
):

Convert HTML table to markdown format
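
The markdown side of a table conversion might look like the sketch below. It assumes the cell texts have already been pulled out of the HTML into a list of rows, with the first row as the header; `rows_to_md` is a hypothetical name and the real table_to_md takes the table element itself.

```python
def rows_to_md(rows):
    """Render rows of cell text as a markdown table; the first row is the header."""
    if not rows:
        return ''

    def line(cells):
        # Escape pipes so cell text cannot break the column layout.
        safe = [str(c).replace('|', r'\|').strip() for c in cells]
        return '| ' + ' | '.join(safe) + ' |'

    header, *body = rows
    separator = '| ' + ' | '.join('---' for _ in header) + ' |'
    return '\n'.join([line(header), separator, *(line(r) for r in body)])

print(rows_to_md([['Function', 'Purpose'],
                  ['seed()', 'Initialize the generator']]))
```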

Some functions and classes in the docs have multiple signatures. In this case, the dt elements need to be merged into a single heading message.



merge_dt


def merge_dt(
    msgs
):

Merge consecutive ‘dt’ messages into single heading notes
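
The merging step can be sketched on plain tuples. This version assumes each 'dt' message already holds a formatted heading line and that a run of them should become one 'note' joined by newlines; `merge_consecutive_dt` is a hypothetical name, not the package's implementation.

```python
def merge_consecutive_dt(msgs):
    """Collapse each run of 'dt' messages into a single 'note' message."""
    merged, pending = [], []
    for msg_type, content in msgs:
        if msg_type == 'dt':
            pending.append(content)  # keep collecting the current run
            continue
        if pending:
            merged.append(('note', '\n'.join(pending)))
            pending = []
        merged.append((msg_type, content))
    if pending:  # a run that reaches the end of the list
        merged.append(('note', '\n'.join(pending)))
    return merged

msgs = [('dt', "### `class bytearray(source=b'')`"),
        ('dt', '### `class bytearray(source, encoding)`'),
        ('note', 'Return a new array of bytes.')]
result = merge_consecutive_dt(msgs)
print(result)
```

Messages that contain no 'dt' entries pass through unchanged, which is consistent with merge_dt returning its already merged input as-is in the example further down.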



format_msgs


def format_msgs(
    el
):

Convert HTML element to list of formatted (msg_type, markdown) tuples

Let’s try it on the bytearray class from https://docs.python.org/3.12/library/functions.html.

bytearray_html = '''<dl class="py class" id="func-bytearray">
<dt class="sig sig-object py">
<em class="property"><span class="k"><span class="pre">class</span></span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">bytearray</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">source</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">b''</span></span></em><span class="sig-paren">)</span></dt>
<dt class="sig sig-object py">
<em class="property"><span class="k"><span class="pre">class</span></span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">bytearray</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">source</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoding</span></span></em><span class="sig-paren">)</span></dt>
<dt class="sig sig-object py">
<em class="property"><span class="k"><span class="pre">class</span></span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">bytearray</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">source</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">encoding</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">errors</span></span></em><span class="sig-paren">)</span></dt>
<dd><p>Return a new array of bytes.</p>
<p>The optional <em>source</em> parameter can be used to initialize the array:</p>
<ul class="simple">
<li><p>If it is a <em>string</em>, you must also give the <em>encoding</em>.</p></li>
<li><p>If it is an <em>integer</em>, the array will have that size.</p></li>
</ul>
<p>Without an argument, an array of size 0 is created.</p>
</dd></dl>'''
ba_soup = BeautifulSoup(bytearray_html, 'html.parser')
preview_msgs(collect_msgs(ba_soup.dl))

[dt]

class bytearray(source=b'')

[dt]

class bytearray(source, encoding)

[dt]

class bytearray(source, encoding, errors)

[note]

Return a new array of bytes.

[note]

The optional source parameter can be used to initialize the array:

[note]

  • If it is a string, you must also give the encoding.

  • If it is an integer, the array will have that size.

[note]

Without an argument, an array of size 0 is created.

ba_msgs = format_msgs(ba_soup)
ba_msgs
[('note',
  "### `class bytearray(source=b'')`\n### `class bytearray(source, encoding)`\n### `class bytearray(source, encoding, errors)`"),
 ('note', 'Return a new array of bytes.'),
 ('note',
  'The optional *source* parameter can be used to initialize the array:'),
 ('note',
  '\n- If it is a *string*, you must also give the *encoding*.\n- If it is an *integer*, the array will have that size.\n'),
 ('note', 'Without an argument, an array of size 0 is created.')]
merge_dt(ba_msgs)
[('note',
  "### `class bytearray(source=b'')`\n### `class bytearray(source, encoding)`\n### `class bytearray(source, encoding, errors)`"),
 ('note', 'Return a new array of bytes.'),
 ('note',
  'The optional *source* parameter can be used to initialize the array:'),
 ('note',
  '\n- If it is a *string*, you must also give the *encoding*.\n- If it is an *integer*, the array will have that size.\n'),
 ('note', 'Without an argument, an array of size 0 is created.')]
preview_msgs(format_msgs(ba_soup))

[note]

class bytearray(source=b'')

class bytearray(source, encoding)

class bytearray(source, encoding, errors)

[note]

Return a new array of bytes.

[note]

The optional source parameter can be used to initialize the array:

[note]

  • If it is a string, you must also give the encoding.
  • If it is an integer, the array will have that size.

[note]

Without an argument, an array of size 0 is created.

Looks good! We can use create_msgs to create solveit messages.



create_msgs


def create_msgs(
    doc_tuples, dname:str='', **kwargs
):

Create solveit messages from list of (msg_type, content) tuples
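
The loop at the heart of this step might look like the sketch below. The solveit call that actually creates a message is not shown on this page, so it is passed in as a hypothetical `add_msg` callable; `create_messages` and `fake_add_msg` are also made-up names, and the fake backend only records what it receives.

```python
def create_messages(doc_tuples, add_msg, **kwargs):
    """Call add_msg once per (msg_type, content) tuple, preserving order."""
    return [add_msg(content, msg_type=msg_type, **kwargs)
            for msg_type, content in doc_tuples]

# Fake backend for illustration: stores each request instead of creating messages.
created = []

def fake_add_msg(content, msg_type='note', **kwargs):
    created.append({'type': msg_type, 'content': content, **kwargs})
    return len(created)  # stand-in for a message id

ids = create_messages(
    [('note', '## Bookkeeping functions'), ('code', 'import random')],
    fake_add_msg,
    dname='dialogify/testing',
)
print(ids)
print(created)
```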

# create_msgs(format_msgs(ms))

And we can make dialogs.



mk_dialog


def mk_dialog(
    url, dname:str=''
):

Fetch Python docs URL and create a solveit dialog from it

Here are some examples of creating solveit dialogs:

# mk_dialog('https://docs.python.org/3.12/library/functions.html', dname='dialogify/testing')
# mk_dialog('https://docs.python.org/3.12/howto/regex.html#regex-howto', dname='dialogify/regex_howto')
# mk_dialog('https://docs.python.org/3.12/howto/regex.html#regex-howto')