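Scraping pages using JavaScript¶
The shot-scraper javascript command executes JavaScript directly against a page and returns the result as JSON.
This command does not produce a screenshot, but it is useful for scraping data from pages.
To retrieve the title of a document as a string: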
shot-scraper javascript https://datasette.io/ "document.title"
This returns a JSON string:
"Datasette: An open source multi-tool for exploring and publishing data"
To return a raw string instead, use the -r or --raw option:
shot-scraper javascript https://datasette.io/ "document.title" -r
Output:
Datasette: An open source multi-tool for exploring and publishing data
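The difference between the two output modes can be illustrated outside shot-scraper. The following is a minimal sketch, assuming the default output behaves like a standard JSON encoder applied to the returned value; the variable names are illustrative only.

```javascript
// The value that "document.title" evaluates to in the examples above.
const title = 'Datasette: An open source multi-tool for exploring and publishing data';

// Default mode: the string is JSON-encoded, so it is wrapped in double quotes.
const asJson = JSON.stringify(title);

// Raw mode (-r / --raw): the string is written as-is, without the quotes.
const asRaw = title;

console.log(asJson);
console.log(asRaw);
```

The raw form is usually more convenient when the value is passed on to other shell commands, while the JSON form can be parsed back unambiguously.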
To return a JSON object, wrap an object literal in parentheses:
shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
This returns:
{
"title": "Datasette: An open source multi-tool for exploring and publishing data",
"tagline": "An open source multi-tool for exploring and publishing data"
}
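If a page has no element matching .tagline, querySelector() returns null and reading .innerText throws. A minimal sketch of a more defensive variant follows. The function name pageSummary, its doc parameter, and the stub object are hypothetical, introduced so the logic can be exercised outside a browser; inside shot-scraper the same logic would read from the global document.

```javascript
// Hypothetical helper: build the same {title, tagline} object as above,
// but return null for the tagline when no .tagline element exists.
function pageSummary(doc) {
  const taglineEl = doc.querySelector('.tagline');
  return {
    title: doc.title,
    tagline: taglineEl ? taglineEl.innerText : null
  };
}

// Stub standing in for a browser document, for illustration only.
const stubDocument = {
  title: 'Example',
  querySelector: selector => (selector === '.tagline' ? { innerText: 'A tagline' } : null)
};

console.log(JSON.stringify(pageSummary(stubDocument), null, 2));
```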
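Running more than one statement¶
You can use the () => { ... } function syntax to run multiple statements, returning a result at the end of the function.
This example raises an error if no paragraphs are found: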
shot-scraper javascript https://www.example.com/ "
() => {
  var paragraphs = document.querySelectorAll('p');
  if (paragraphs.length == 0) {
    throw 'No paragraphs found';
  }
  return Array.from(paragraphs, el => el.innerText);
}"
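Using async/await¶
If you want to use await in your code, for example to import modules from external URLs, you can pass an async function. This example loads the Readability.js library from jsdelivr and uses it to extract the core content of a page: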
shot-scraper javascript \
  https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/ "
async () => {
  const readability = await import('https://cdn.jsdelivr.net/npm/@mozilla/readability@0.6.0/+esm');
  return (new readability.Readability(document)).parse();
}"
To use timer functions such as setInterval(), for example to wait one second for an animation to finish before reading the page, return a Promise:
shot-scraper javascript datasette.io "
new Promise(done => setInterval(
  () => {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"
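A promise settles only once, so although setInterval() keeps firing, only the first call to done() has any effect. A one-shot timer expresses the same delay more directly. The following is a sketch with setTimeout() swapped in for setInterval(); the helper name resolveAfter and its parameters are hypothetical.

```javascript
// Hypothetical helper: wait `ms` milliseconds, then resolve with whatever
// `read()` returns. Errors thrown by `read()` reject the promise.
function resolveAfter(ms, read) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      try {
        resolve(read());
      } catch (err) {
        reject(err);
      }
    }, ms);
  });
}

// In a page this might be: resolveAfter(1000, () => document.title)
resolveAfter(10, () => 'ready').then(value => console.log(value));
```

Rejecting on errors means a failure inside the delayed callback surfaces as a rejected promise instead of an uncaught exception inside the timer.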
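Bypassing Content Security Policy headers¶
Some websites use Content Security Policy (CSP) headers to prevent additional JavaScript from executing on the page, as a security measure.
When using shot-scraper, this can prevent some JavaScript features from working. For example, this command fails on a site that sends such headers: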
shot-scraper javascript github.com "
  async () => {
    await import('https://cdn.jsdelivr.net/npm/left-pad/+esm');
    return 'content-security-policy ignored'
  }
"
Output:
Error: TypeError: Failed to fetch dynamically imported module:
https://cdn.jsdelivr.net/npm/left-pad/+esm
You can use the --bypass-csp option to have shot-scraper run the browser in a mode that ignores these headers:
shot-scraper javascript github.com "
  async () => {
    await import('https://cdn.jsdelivr.net/npm/left-pad/+esm');
    return 'content-security-policy ignored'
  }
" --bypass-csp
Output:
"content-security-policy ignored"
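Using this for automated tests¶
If a JavaScript error occurs, a stack trace is written to standard error and the tool terminates with an exit code of 1.
This can be used to run JavaScript tests in continuous integration environments by using the JavaScript throw "error message" statement, for example as a step in a GitHub Actions workflow: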
- name: Test page title
  run: |-
    shot-scraper javascript datasette.io "
      if (document.title != 'Datasette') {
        throw 'Wrong title detected';
      }"
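The example above throws a plain string. A sketch of an alternative that throws an Error object, with both the expected and the actual title in the message, is shown below. The function name checkTitle and the stub objects are hypothetical, and how much of the message appears in the reported stack trace is an assumption.

```javascript
// Hypothetical helper: throw a descriptive Error when the title is wrong.
function checkTitle(doc, expected) {
  if (doc.title !== expected) {
    throw new Error(`Wrong title detected: expected "${expected}", got "${doc.title}"`);
  }
  return true;
}

// Stub object standing in for a browser document, for illustration only.
console.log(checkTitle({ title: 'Datasette' }, 'Datasette'));
```

Including both values in the message makes a failed continuous integration run easier to diagnose from the log alone.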
Running JavaScript from a file¶
You can also save JavaScript to a file and execute it like this:
shot-scraper javascript datasette.io -i script.js
Or read it from standard input like this:
echo "document.title" | shot-scraper javascript datasette.io
Running JavaScript from GitHub¶
A special gh: prefix can be used to load scripts from GitHub.
You can use it with the full path to a script.js file in a public GitHub repository, like this:
shot-scraper javascript datasette.io -i gh:simonw/shot-scraper-scripts/readability.js
Or, by convention, if the script lives in a repository called shot-scraper-scripts, you can omit that part (and the .js extension), like this:
shot-scraper javascript datasette.io -i gh:simonw/readability
Both of these examples execute readability.js, which is explained in the next section.
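Example: Extracting page content with Readability.js¶
Readability.js is "A standalone version of the readability library used for Firefox Reader View." It lets you parse the content of a web page and extract just the title, content, byline and some other key metadata.
The following recipe imports the library from the jsdelivr CDN, runs it against the current page and returns the result to the console as JSON: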
shot-scraper javascript https://simonwillison.net/2022/Mar/24/datasette-061/ "
async () => {
  const readability = await import('https://cdn.jsdelivr.net/npm/@mozilla/readability@0.6.0/+esm');
  return (new readability.Readability(document)).parse();
}"
The output looks like this:
{
"title": "Datasette 0.61: The annotated release notes",
"byline": null,
"dir": null,
"lang": "en-gb",
"content": "<div id=\"readability-page-1\" class=\"page\"><div id=\"primary\">\n\n\n\n\n<p>I released ... <this is a very long string>",
"length": 8625,
"excerpt": "I released Datasette 0.61 this morning\u2014closely followed by 0.61.1 to fix a minor bug. Here are the annotated release notes. In preparation for Datasette 1.0, this release includes two potentially \u2026",
"siteName": null
}
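The parsed result is an ordinary object, so a script can return only the fields of interest instead of the full content string. The following is a minimal sketch, assuming the field names shown in the output above. The helper name pickArticleFields is hypothetical, and the null check assumes parse() can return null when no article is found.

```javascript
// Hypothetical helper: keep a few metadata fields and drop the long HTML content.
function pickArticleFields(article) {
  if (article === null) {
    // Assumption: parse() yields null when no article content is detected.
    throw new Error('No article content found');
  }
  const { title, byline, excerpt, length } = article;
  return { title, byline, excerpt, length };
}

// Sample values based on the output above (content and excerpt shortened).
const sample = {
  title: 'Datasette 0.61: The annotated release notes',
  byline: null,
  dir: null,
  lang: 'en-gb',
  content: '<div id="readability-page-1" class="page">...</div>',
  length: 8625,
  excerpt: 'I released Datasette 0.61 this morning...',
  siteName: null
};

console.log(pickArticleFields(sample));
```

In the recipe above, the helper could be defined inside the async function and applied to the value returned by parse() before it is returned.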
shot-scraper javascript --help¶
Full --help for this command:
Usage: shot-scraper javascript [OPTIONS] URL [JAVASCRIPT]

  Execute JavaScript against the page and return the result as JSON

  Usage:

      shot-scraper javascript https://datasette.io/ "document.title"

  To return a JSON object, use this:

      "({title: document.title, location: document.location})"

  To use setInterval() or similar, pass a promise:

      "new Promise(done => setInterval(
        () => {
          done({
            title: document.title,
            h2: document.querySelector('h2').innerHTML
          });
        }, 1000
      ));"

  If a JavaScript error occurs an exit code of 1 will be returned.

Options:
  -i, --input TEXT                Read input JavaScript from this file or use
                                  gh:username/script to load from
                                  github.com/username/shot-scraper-
                                  scripts/script.js
  -a, --auth FILENAME             Path to JSON authentication context file
  -o, --output FILENAME           Save output JSON to this file
  -r, --raw                       Output JSON strings as raw text
  -b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
                                  Which browser to use
  --browser-arg TEXT              Additional arguments to pass to the browser
  --user-agent TEXT               User-Agent header to use
  --reduced-motion                Emulate 'prefers-reduced-motion' media feature
  --log-console                   Write console.log() to stderr
  --fail                          Fail with an error code if a page returns an
                                  HTTP error
  --skip                          Skip pages that return HTTP errors
  --bypass-csp                    Bypass Content-Security-Policy
  --auth-password TEXT            Password for HTTP Basic authentication
  --auth-username TEXT            Username for HTTP Basic authentication
  --help                          Show this message and exit.