Outlined below is the setup for an AWS Lambda function which combines fetching the HTML for a URL, stripping it back to just the essential article content, and then converting it to Markdown. To deploy it you’ll need an AWS account, and to have the Serverless framework installed.
First, fetch the full HTML of the URL being converted. As this is running in a Lambda function I decided to try out an ultra-lightweight Node HTTP client called phin (which is 95% smaller than my usual favourite, Axios):
const phin = require('phin')
const fetchPageHtml = async (fetchUrl) => {
const response = await phin(fetchUrl)
return response.body;
};
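Lambda invocations are time-limited, so it can be worth guarding the fetch with a timeout. A minimal sketch using Promise.race — withTimeout and fakeFetch are illustrative names I've made up, and fakeFetch stands in for fetchPageHtml so the snippet runs offline:

```javascript
// Wrap a promise so it rejects if it takes longer than `ms` milliseconds.
const withTimeout = (promise, ms) =>
  Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
    ),
  ]);

// Stand-in for fetchPageHtml so the sketch runs offline: resolves after 10ms.
const fakeFetch = () =>
  new Promise((resolve) => setTimeout(() => resolve('<html></html>'), 10));

const fetched = withTimeout(fakeFetch(), 100);
fetched.then((html) => console.log(html)); // prints "<html></html>"
```

In the real handler you would wrap the phin call instead, e.g. `withTimeout(fetchPageHtml(url), 3000)`.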
const { Readability } = require("@mozilla/readability");
const JSDOM = require("jsdom").JSDOM;
const extractMainContent = (pageHtml, url) => {
const doc = new JSDOM(pageHtml, {
url,
});
const reader = new Readability(doc.window.document);
const article = reader.parse();
return article;
};
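One gotcha worth handling: parse() returns null when Readability can't find any article content. A small guard, sketched here with stubbed readers standing in for real Readability instances so it runs without jsdom (safeParse is an illustrative name):

```javascript
// parse() returns null when no article can be extracted; fail loudly instead
// of letting a later `article.content` access blow up with a vague TypeError.
const safeParse = (reader) => {
  const article = reader.parse();
  if (!article) {
    throw new Error('Could not extract article content');
  }
  return article;
};

// Stub readers standing in for `new Readability(doc.window.document)`.
const goodReader = { parse: () => ({ title: 'Hi', content: '<p>Hi</p>' }) };
const emptyReader = { parse: () => null };

console.log(safeParse(goodReader).title); // prints "Hi"
```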
If you need to convert files from one markup format into another, pandoc is your swiss-army knife. To try this out locally before running it from the Lambda function, you can follow one of their installation guides, and then test it from the command line by piping an HTML file as the input:
cat sample.html | pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none
-f html is the input format.
-t commonmark-raw_html+backtick_code_blocks is the output format: commonmark is a particular Markdown flavour, -raw_html disables the raw_html extension so no plain HTML is included in the output, and +backtick_code_blocks enables the backtick_code_blocks extension so that any code blocks are fenced with backticks rather than being indented.

The pandoc tool needs to be executed from within the Node script, which involves spawning it in a child process, writing the HTML to the child's stdin and then collecting the markdown output via the child's stdout. Most of these functions have been taken from a guide on working with stdout and stdin in Node.js.

First off, this is the generic streamWrite function, which allows you to pipe the HTML to the pandoc process by writing to the stdin stream of the child process:

const streamWrite = async (stream, chunk, encoding = 'utf8') =>
new Promise((resolve, reject) => {
const errListener = (err) => {
stream.removeListener('error', errListener);
reject(err);
};
stream.addListener('error', errListener);
const callback = () => {
stream.removeListener('error', errListener);
resolve(undefined);
};
stream.write(chunk, encoding, callback);
});
This similar function reads from the
stdout
stream of the child process, so you can collect the markdown that is output:

const {chunksToLinesAsync, chomp} = require('@rauschma/stringio');
const collectFromReadable = async (readable) => {
let lines = [];
for await (const line of chunksToLinesAsync(readable)) {
lines.push(chomp(line));
}
return lines;
}
const onExit = async (childProcess) =>
new Promise((resolve, reject) => {
childProcess.once('exit', (code) => {
if (code === 0) {
resolve(undefined);
} else {
reject(new Error('Exit with error code: '+code));
}
});
childProcess.once('error', (err) => {
reject(err);
});
});
const { spawn } = require('child_process');

// spawns a child process, supplying stdin to the child's stdin, then reads from
// the child's stdout and returns this as a string
const spawnHelper = async (command, stdin) => {
const commandParts = command.split(" ");
const childProcess = spawn(commandParts[0], commandParts.slice(1));
await streamWrite(childProcess.stdin, stdin);
childProcess.stdin.end();
const outputLines = await collectFromReadable(childProcess.stdout);
await onExit(childProcess);
return outputLines.join("\n");
}
const convertToMarkdown = async (html) => {
const convertedOutput = await spawnHelper('/opt/bin/pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none', html)
return convertedOutput;
}
To run this as an AWS Lambda you need to include the pandoc binary. This is achieved by adding a shared Lambda layer which includes a precompiled pandoc binary. You can build the layer yourself, or just include the published layer ARN in your serverless config.
# function config
layers:
- arn:aws:lambda:us-east-1:5:layer:pandoc:1
Export a function from this module which has been configured as the
handler. This is the function AWS will run every time the lambda
receives a request.
module.exports.endpoint = async (event) => {
const url = event.body
const pageHtml = await fetchPageHtml(url);
const article = await extractMainContent(pageHtml, url);
const bodyMarkdown = await convertToMarkdown(article.content);
// add the title and source url to the top of the markdown
const markdown = `# ${article.title}\n\nSource: ${url}\n\n${bodyMarkdown}`
return {
statusCode: 200,
body: markdown,
headers: {
'Content-type': 'text/markdown'
}
}
}
This is the full
serverless.yml
configuration that is needed for serverless to deploy everything:

service: url-to-markdown
frameworkVersion: ">=1.1.0 <2.0.0"
provider:
name: aws
runtime: nodejs12.x
region: us-east-1
functions:
downloadAndConvert:
handler: handler.endpoint
timeout: 10
layers:
- arn:aws:lambda:us-east-1:5:layer:pandoc:1
events:
- http:
path: convert
method: post
Once deployed, you can test the endpoint with curl:

curl -X POST -d 'https://www.atlasobscura.com/articles/actual-1950s-proposal-nuke-alaska' https://zm13c3gpzh.execute-api.us-east-1.amazonaws.com/dev/convert
Previously published at