Outlined below is the setup for an AWS Lambda function which combines fetching the HTML for a URL, stripping it back to just the essential article content, and then converting it to Markdown. To deploy it you’ll need an AWS account, and to have the Serverless framework installed.
First, fetch the full HTML of the URL being converted. As this is running in a Lambda function I decided to try out an ultra-lightweight Node HTTP client called phin (which is 95% smaller than my usual favourite, Axios):
const phin = require('phin')
const fetchPageHtml = async (fetchUrl) => {
const response = await phin(fetchUrl)
return response.body;
};
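Lambda invocations are time-limited, so it can be worth guarding the fetch with a timeout. A minimal sketch using Promise.race — withTimeout and fakeFetch are illustrative names I've made up, and fakeFetch stands in for fetchPageHtml so the snippet runs offline:

```javascript
// Wrap a promise so it rejects if it takes longer than `ms` milliseconds.
const withTimeout = (promise, ms) =>
  Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
    ),
  ]);

// Stand-in for fetchPageHtml so the sketch runs offline: resolves after 10ms.
const fakeFetch = () =>
  new Promise((resolve) => setTimeout(() => resolve('<html></html>'), 10));

const fetched = withTimeout(fakeFetch(), 100);
fetched.then((html) => console.log(html)); // prints "<html></html>"
```

In the real handler you would wrap the phin call instead, e.g. `withTimeout(fetchPageHtml(url), 3000)`.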
const { Readability } = require("@mozilla/readability");
const JSDOM = require("jsdom").JSDOM;
const extractMainContent = (pageHtml, url) => {
const doc = new JSDOM(pageHtml, {
url,
});
const reader = new Readability(doc.window.document);
const article = reader.parse();
return article;
};
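One gotcha worth handling: parse() returns null when Readability can't find any article content. A small guard, sketched here with stubbed readers standing in for real Readability instances so it runs without jsdom (safeParse is an illustrative name):

```javascript
// parse() returns null when no article can be extracted; fail loudly instead
// of letting a later `article.content` access blow up with a vague TypeError.
const safeParse = (reader) => {
  const article = reader.parse();
  if (!article) {
    throw new Error('Could not extract article content');
  }
  return article;
};

// Stub readers standing in for `new Readability(doc.window.document)`.
const goodReader = { parse: () => ({ title: 'Hi', content: '<p>Hi</p>' }) };
const emptyReader = { parse: () => null };

console.log(safeParse(goodReader).title); // prints "Hi"
```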
If you need to convert files from one markup format into another, pandoc is your swiss-army knife. To try this out locally before running it from the Lambda function, you can follow one of their installation guides, and then test it from the command line by piping an HTML file as the input:
cat sample.html | pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none
-f html is the input format.
-t commonmark-raw_html+backtick_code_blocks is the output format: commonmark is a particular Markdown flavour, -raw_html disables the raw_html extension so no plain HTML is included in the output, and +backtick_code_blocks enables the backtick_code_blocks extension so that any code blocks are fenced with backticks rather than being indented.

The pandoc tool needs to be executed from within the Node script, which involves spawning it in a child process, writing the HTML to the child's stdin and then collecting the markdown output via the child's stdout. Most of these functions have been taken from a guide on working with stdout and stdin in Node.js.

First off, this is the generic streamWrite function, which allows you to pipe the HTML to the pandoc process by writing to the stdin stream of the child process:

const streamWrite = async (stream, chunk, encoding = 'utf8') =>
new Promise((resolve, reject) => {
const errListener = (err) => {
stream.removeListener('error', errListener);
reject(err);
};
stream.addListener('error', errListener);
const callback = () => {
stream.removeListener('error', errListener);
resolve(undefined);
};
stream.write(chunk, encoding, callback);
});
This similar function reads from the
stdout
stream of the child process, so you can collect the markdown that is output:

const {chunksToLinesAsync, chomp} = require('@rauschma/stringio');
const collectFromReadable = async (readable) => {
let lines = [];
for await (const line of chunksToLinesAsync(readable)) {
lines.push(chomp(line));
}
return lines;
}
const onExit = async (childProcess) =>
new Promise((resolve, reject) => {
childProcess.once('exit', (code) => {
if (code === 0) {
resolve(undefined);
} else {
reject(new Error('Exit with error code: '+code));
}
});
childProcess.once('error', (err) => {
reject(err);
});
});
const { spawn } = require('child_process');

// spawns a child process, supplying stdin to the child's stdin, then reads from
// the child's stdout and returns this as a string
const spawnHelper = async (command, stdin) => {
const commandParts = command.split(" ");
const childProcess = spawn(commandParts[0], commandParts.slice(1));
await streamWrite(childProcess.stdin, stdin);
childProcess.stdin.end();
const outputLines = await collectFromReadable(childProcess.stdout);
await onExit(childProcess);
return outputLines.join("\n");
}
const convertToMarkdown = async (html) => {
const convertedOutput = await spawnHelper('/opt/bin/pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none', html)
return convertedOutput;
}
To run this as an AWS Lambda you need to include the pandoc binary. This is achieved by adding a shared Lambda layer which includes a precompiled pandoc binary. You can build the layer yourself, or just include the published layer ARN in your serverless config.
# function config
layers:
- arn:aws:lambda:us-east-1:5:layer:pandoc:1
Export a function from this module which has been configured as the
handler. This is the function AWS will run every time the lambda
receives a request.
module.exports.endpoint = async (event) => {
const url = event.body
const pageHtml = await fetchPageHtml(url);
const article = await extractMainContent(pageHtml, url);
const bodyMarkdown = await convertToMarkdown(article.content);
// add the title and source url to the top of the markdown
const markdown = `# ${article.title}\n\nSource: ${url}\n\n${bodyMarkdown}`
return {
statusCode: 200,
body: markdown,
headers: {
'Content-type': 'text/markdown'
}
}
}
This is the full
serverless.yml
configuration that is needed for serverless to deploy everything:

service: url-to-markdown
frameworkVersion: ">=1.1.0 <2.0.0"
provider:
name: aws
runtime: nodejs12.x
region: us-east-1
functions:
downloadAndConvert:
handler: handler.endpoint
timeout: 10
layers:
- arn:aws:lambda:us-east-1:5:layer:pandoc:1
events:
- http:
path: convert
method: post
Once deployed, you can test the endpoint with curl:

curl -X POST -d 'https://www.atlasobscura.com/articles/actual-1950s-proposal-nuke-alaska' https://zm13c3gpzh.execute-api.us-east-1.amazonaws.com/dev/convert
Previously published at