http.webpage#

Description#

Plugin to handle and parse/simplify web pages. It previously relied on the Mercury Reader web API; since that API has been discontinued, the plugin now wraps the mercury-parser JavaScript library.

Requires:

  • The mercury-parser library installed (npm install -g @postlight/mercury-parser)

Configuration#

http.webpage:
  # [Optional]
  # Custom headers to be sent to the Mercury API.
  # headers:   # type=Optional[dict]

Dependencies#

pip

pip install weasyprint

Alpine

apk add nodejs npm sudo

Debian

apt install nodejs npm sudo

Fedora

yum install nodejs npm sudo

Arch Linux

pacman -S nodejs npm sudo

Post-install

sudo npm install -g @postlight/mercury-parser
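
Once the post-install step has run, you may want to confirm that the global npm install is visible from Python. The following is a minimal sketch, assuming the package exposes a command-line entry point named mercury-parser on the PATH. The helper name find_cli is hypothetical, and the check says nothing about how the plugin itself locates the library.

```python
import shutil
from typing import Optional


def find_cli(name: str = "mercury-parser") -> Optional[str]:
    """Return the absolute path of an executable on the PATH, or None.

    The default name is an assumption about the command installed by
    `npm install -g @postlight/mercury-parser`.
    """
    return shutil.which(name)


path = find_cli()
if path:
    print(f"Found Mercury parser CLI at {path}")
else:
    print("Mercury parser CLI not found on the PATH; re-run the post-install step")
```

If the command is reported as missing, check that the global npm bin directory is on the PATH of the user running Platypush.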

Actions#

Module reference#

class platypush.plugins.http.webpage.HttpWebpagePlugin(*args, headers: dict | None = None, **kwargs)[source]#

Bases: Plugin

__init__(*args, headers: dict | None = None, **kwargs)[source]#
Parameters:

headers – Custom headers to be sent to the Mercury API.

simplify(url: str, type: str | OutputFormats = OutputFormats.HTML, html: str | None = None, headers: dict | None = None, outfile: str | None = None, font_size: str = '19px', font_family: str | Iterable[str] = ('-apple-system', 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', 'Fira Sans', 'Open Sans', 'Droid Sans', 'Helvetica Neue', 'Helvetica', 'Arial', 'sans-serif'))[source]#

Use Mercury to extract the readable content of a web page, removing any extra HTML elements.
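
As an illustration, the action can be triggered by sending a request message to a running Platypush instance. The sketch below only builds such a message. It assumes the general Platypush request layout (type, action, args), and the helper name build_simplify_request and the example URL are hypothetical. How the message is delivered (HTTP, websocket, or another backend) is left out.

```python
import json
from typing import Any, Dict, Optional


def build_simplify_request(
    url: str,
    output_type: str = "html",
    outfile: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a request message for the http.webpage.simplify action.

    Only the arguments that are actually set are included, so the
    plugin's own defaults apply to everything else.
    """
    args: Dict[str, Any] = {"url": url, "type": output_type}
    if outfile:
        args["outfile"] = outfile
    if headers:
        args["headers"] = headers

    return {"type": "request", "action": "http.webpage.simplify", "args": args}


# Placeholder URL, for illustration only
request = build_simplify_request("https://example.com/article", output_type="markdown")
print(json.dumps(request, indent=2))
```

Leaving out outfile means the parsed content is expected back in the response payload; setting it means the response should carry the output file path instead.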

Parameters:
  • url – URL to parse.

  • type – Output format. Supported types: html, markdown, text, pdf (default: html).

  • html – Set this parameter to parse HTML content that has already been fetched. Note that url is still required by Mercury to properly style the output, but it won’t be used to fetch the content.

  • headers – Custom headers to be sent to the Mercury API.

  • outfile – If set, the output is written to the specified file. If the file extension is .pdf, the content is exported in PDF format. If the output type is not specified, it can also be inferred from the extension of the output file.

  • font_size – Font size to use for the output (default: 19px).

  • font_family – Custom font family (or list of font families, in decreasing order of preference) to use for the output. It only applies to HTML and PDF.

Returns:

dict

Example return payload if outfile is not specified:

{
    "url": <url>,
    "title": <page title>,
    "content": <page parsed content>
}

Example return payload if outfile is specified:

{
    "url": <url>,
    "title": <page title>,
    "outfile": <output file absolute path>
}

class platypush.plugins.http.webpage.OutputFormat(name: str, cmd_fmt: str, extensions: Iterable[str] = ())[source]#

Bases: object

Definition of a supported output format.

__init__(name: str, cmd_fmt: str, extensions: Iterable[str] = ()) → None#

class platypush.plugins.http.webpage.OutputFormats(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

Supported output formats.

classmethod parse(type: str | OutputFormats, outfile: str | None = None) → OutputFormats[source]#

Parse the format given a type argument and an output file name.
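
The following sketch illustrates one way the format resolution could work. It is not the plugin's implementation. The Format enum, its file extensions and the infer_format helper are hypothetical stand-ins, and the precedence shown (a .pdf extension forces PDF, then an explicit type, then the file extension, then html) is only one reading of the outfile and type descriptions above.

```python
import os
from enum import Enum
from typing import Optional


class Format(Enum):
    """Hypothetical stand-in for OutputFormats (label, file extensions)."""

    HTML = ("html", (".html", ".htm"))
    MARKDOWN = ("markdown", (".md",))
    TEXT = ("text", (".txt",))
    PDF = ("pdf", (".pdf",))

    def __init__(self, label: str, extensions: tuple):
        self.label = label
        self.extensions = extensions


def infer_format(type_: Optional[str] = None, outfile: Optional[str] = None) -> Format:
    """Resolve the output format from a type name and/or an output file name."""
    ext = os.path.splitext(outfile)[1].lower() if outfile else ""

    # Assumption: a .pdf extension always selects PDF
    if ext in Format.PDF.extensions:
        return Format.PDF

    # An explicit type comes next
    if type_:
        for fmt in Format:
            if fmt.label == type_.lower():
                return fmt
        raise ValueError(f"Unsupported output type: {type_}")

    # Otherwise fall back to the file extension, then to HTML
    for fmt in Format:
        if ext and ext in fmt.extensions:
            return fmt
    return Format.HTML


print(infer_format(outfile="article.md"))
print(infer_format("text", outfile="article.pdf"))
```

Raising on an unknown type name rather than silently falling back to HTML is a design choice of this sketch, made so that a misspelled type is caught early.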