http.webpage#
Description#
Plugin to handle and parse/simplify web pages. It uses the Mozilla Readability JavaScript library.
Configuration#
http.webpage:
# [Optional]
# Custom headers to be sent to the Readability API.
# headers: # type=dict | None
Dependencies#
pip
pip install weasyprint markdownify
Alpine
apk add npm nodejs
Debian
apt install npm nodejs
Fedora
yum install npm nodejs
Arch Linux
pacman -S npm python-markdownify nodejs
Actions#
Module reference#
- class platypush.plugins.http.webpage.HttpWebpagePlugin(*args, headers: dict | None = None, **kwargs)[source]#
Bases: Plugin
Plugin to handle and parse/simplify web pages. It uses the Mozilla Readability JavaScript library.
- __init__(*args, headers: dict | None = None, **kwargs)[source]#
- Parameters:
headers – Custom headers to be sent to the Readability API.
- simplify(url: str, *, type: str | OutputFormats = OutputFormats.HTML, html: str | None = None, headers: dict | None = None, outfile: str | None = None, font_size: str = '19px', font_family: str | Iterable[str] = ('-apple-system', 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', 'Fira Sans', 'Open Sans', 'Droid Sans', 'Helvetica Neue', 'Helvetica', 'Arial', 'sans-serif'))[source]#
Parse the readable content of a web page removing any extra HTML elements using Readability.
- Parameters:
url – URL to parse.
type – Output format. Supported types: html, markdown, text, pdf (default: html).
html – Set this parameter if you want to parse some HTML content already fetched. Note that URL is still required by Readability to properly style the output, but it won’t be used to actually fetch the content.
headers – Custom headers to be sent to the Readability API.
outfile – If set then the output will be written to the specified file. If the file extension is .pdf then the content will be exported in PDF format. If the output type is not specified then it can also be inferred from the extension of the output file.
font_size – Font size to use for the output (default: 19px).
font_family – Custom font family (or list of font families, in decreasing order) to use for the output. It only applies to HTML and PDF.
- Returns:
dict
Example return payload if outfile is not specified:
{ "url": <url>, "title": <page title>, "content": <page parsed content> }
Example return payload if outfile is specified:
{ "url": <url>, "title": <page title>, "outfile": <output file absolute path> }
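As an illustration of the parameters and return payloads above, the following sketch builds a request for the simplify action and interprets both payload shapes. The helper names `build_simplify_request` and `summarize_response` are hypothetical and not part of the plugin. The request envelope (`type`, `action`, `args`) is assumed to follow the usual platypush JSON request format, and the sample payloads are made-up data rather than real output.

```python
import json


def build_simplify_request(url, output_type=None, outfile=None):
    """Build a request dict for the http.webpage.simplify action.

    Hypothetical helper: the envelope layout is an assumption,
    only the argument names come from the documentation above.
    """
    args = {"url": url}
    if output_type is not None:
        args["type"] = output_type
    if outfile is not None:
        args["outfile"] = outfile
    return {"type": "request", "action": "http.webpage.simplify", "args": args}


def summarize_response(output):
    """Describe a simplify() return payload.

    The payload carries "outfile" when an output file was requested,
    and "content" otherwise, as documented above.
    """
    if "outfile" in output:
        return f"{output['title']}: saved to {output['outfile']}"
    return f"{output['title']}: {len(output['content'])} characters of content"


request = build_simplify_request("https://example.com/article", output_type="markdown")
print(json.dumps(request, indent=2))

# Made-up payloads that mirror the two documented shapes.
inline = {"url": "https://example.com/article", "title": "Example", "content": "<p>Hello</p>"}
saved = {"url": "https://example.com/article", "title": "Example", "outfile": "/tmp/example.pdf"}
print(summarize_response(inline))
print(summarize_response(saved))
```

Checking for the `outfile` key first keeps the caller independent of which output type was requested.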
- class platypush.plugins.http.webpage.OutputFormat(name: str, cmd_fmt: str, extensions: Iterable[str] = ())[source]#
Bases: object
Definition of a supported output format.
- class platypush.plugins.http.webpage.OutputFormats(*values)[source]#
Bases: Enum
Supported output formats.
- classmethod parse(type: str | OutputFormats, outfile: str | None = None) → OutputFormats[source]#
Parse the format given a type argument and an output file name.
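The resolution of an output format from a type name or a file extension could look roughly like the sketch below. This is not the plugin's implementation: `SketchFormats` is a hypothetical stand-in for `OutputFormats`, the extension lists other than `.pdf` are assumptions, and so is the rule that an explicit type takes precedence over the file extension.

```python
import os
from enum import Enum


class SketchFormats(Enum):
    """Hypothetical stand-in for OutputFormats: (name, extensions)."""

    HTML = ("html", (".html", ".htm"))
    MARKDOWN = ("markdown", (".md",))
    TEXT = ("text", (".txt",))
    PDF = ("pdf", (".pdf",))

    @classmethod
    def parse(cls, type_=None, outfile=None):
        # An enum member is returned unchanged.
        if isinstance(type_, cls):
            return type_
        # Assumption: an explicit type name wins over the file extension.
        if type_:
            wanted = type_.lower()
            for fmt in cls:
                if fmt.value[0] == wanted:
                    return fmt
            raise ValueError(f"Unsupported output type: {type_}")
        # Otherwise try to infer the format from the output file extension.
        if outfile:
            ext = os.path.splitext(outfile)[1].lower()
            for fmt in cls:
                if ext in fmt.value[1]:
                    return fmt
        # Fall back to HTML, the documented default output type.
        return cls.HTML


print(SketchFormats.parse(outfile="article.pdf"))
print(SketchFormats.parse("markdown", outfile="article.pdf"))
print(SketchFormats.parse())
```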