http.webpage

class platypush.plugins.http.webpage.HttpWebpagePlugin(**kwargs)

Plugin to parse and simplify web pages. It originally relied on the Mercury Reader web API; since that API was discontinued, the plugin is a wrapper around the mercury-parser JavaScript library.

Requires:

  • weasyprint (pip install weasyprint), optional, for HTML->PDF conversion

  • node and npm installed on your system (to use the mercury-parser interface)

  • The mercury-parser library installed (npm install -g @postlight/mercury-parser)
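The requirements above can be checked before the plugin is enabled. The following is a minimal sketch, not part of the plugin: `check_requirements` is a hypothetical helper name, and it only verifies that `node` and `npm` are on the `PATH` and that the optional `weasyprint` package is importable. It does not verify that mercury-parser itself is installed.

```python
import importlib.util
import shutil


def check_requirements():
    """Return a dict mapping each dependency name to True (found) or False.

    Hypothetical helper for illustration; it is not provided by platypush.
    """
    return {
        # Executables needed to use the mercury-parser interface.
        "node": shutil.which("node") is not None,
        "npm": shutil.which("npm") is not None,
        # Optional Python package, only needed for HTML->PDF conversion.
        "weasyprint": importlib.util.find_spec("weasyprint") is not None,
    }


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing `weasyprint` only matters if PDF export is needed; the other two entries are required for any parsing.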

__init__(**kwargs)

simplify(url, type='html', html=None, outfile=None)

Parse the readable content of a web page using Mercury, removing any extra HTML elements.

Parameters:
  • url – URL to parse.

  • type – Output format. Supported types: html, markdown, text (default: html).

  • html – Set this parameter to parse HTML content that has already been fetched. Note that url is still required, because Mercury uses it to properly style the output, but it won’t be used to actually fetch the content.

  • outfile – If set, the output will be written to the specified file. If the file extension is .pdf, the content will be exported in PDF format. If the output type is not specified, it can also be inferred from the extension of the output file.
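As an illustration of how these parameters fit together, here is a minimal sketch that assembles a request for the simplify action. It assumes the usual platypush JSON request layout (`type`, `action`, `args`) and the action name `http.webpage.simplify` derived from the plugin and method names; `build_simplify_request` is a hypothetical helper and is not part of the plugin.

```python
import json


def build_simplify_request(url, output_type=None, html=None, outfile=None):
    """Build a request dict for the simplify action.

    Hypothetical helper for illustration. Parameters left as None are
    omitted, so the plugin's own defaults apply.
    """
    args = {"url": url}
    optional = (("type", output_type), ("html", html), ("outfile", outfile))
    for key, value in optional:
        if value is not None:
            args[key] = value
    # Assumed request layout: type / action / args.
    return {"type": "request", "action": "http.webpage.simplify", "args": args}


if __name__ == "__main__":
    request = build_simplify_request(
        "https://example.com/article",
        outfile="/tmp/article.pdf",
    )
    print(json.dumps(request, indent=2))
```

Here `type` is left out on purpose, since the parameter description above says the output format can be inferred from the extension of `outfile`.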

Returns:

dict

Example if outfile is not specified:

{
    "url": <url>,
    "title": <page title>,
    "content": <page parsed content>
}

Example if outfile is specified:

{
    "url": <url>,
    "title": <page title>,
    "outfile": <output file absolute path>
}
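Because the returned dict has two possible shapes, calling code may want to branch on which key is present. The sketch below shows one way to do that; `describe_result` is a hypothetical helper, and the sample dicts are made-up values that only mirror the two documented shapes.

```python
def describe_result(result):
    """Summarize a simplify() response in one line.

    Hypothetical helper for illustration. It relies only on the documented
    keys: "outfile" is present when the output was written to a file,
    otherwise "content" holds the parsed page.
    """
    if "outfile" in result:
        return f"{result['title']}: saved to {result['outfile']}"
    return f"{result['title']}: {len(result['content'])} characters of content"


if __name__ == "__main__":
    # Made-up sample responses mirroring the two documented shapes.
    inline = {"url": "https://example.com", "title": "Example", "content": "<p>Hi</p>"}
    saved = {"url": "https://example.com", "title": "Example", "outfile": "/tmp/example.pdf"}
    print(describe_result(inline))
    print(describe_result(saved))
```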