In this article, we will explore how to create, build, and deploy every component behind this bike recommendation website: 🔗
The idea behind this project is to test whether it is feasible to build a recommendation system using public data, unsupervised machine learning (ML) models, and only free resources.
Garrascobike is a mountain bike recommendation system. In other words, you can choose a bike brand or model that you like and the system will suggest 3 bikes considered interesting and related to your chosen bike.
The idea behind the recommendation system is this: when people talk about some bikes in the same subreddit thread, those bikes should be related in some way. So we can extract the bike names and/or brands from one thread's comments and intersect that information with other Reddit threads that mention similar bikes.
The goal of this guide is to cover all the aspects involved in the creation of a web app that serves a simple recommendation system, trying to keep the complexity level as low as possible.
So the technical level of this experiment won't be too deep and we won't follow industrial-level best practices; nevertheless, this is the guide I would have liked to have a year ago, before starting a simple project: create a web app with an ML model at its core.
To achieve this, we will:
Download text comments from Reddit 🐍
Extract interesting entities from the comments 🐍🤖
Create a simple recommendation system model 🐍🤖
Deploy the model on a back-end 🐍
Create a front-end that exposes the model predictions 🌐
🐍 = blocks that use Python
🤖 = blocks with Machine Learning topics involved
🌐 = blocks that use HTML, CSS & Javascript
First of all, we need the data, and Reddit is an amazing social network where people talk about any topic. Moreover, it exposes some APIs that Python packages can use to scrape the data.
In Garrascobike we want to build a Mountain Bike recommendation system, so we need subreddits that talk about those bikes.
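For illustration, here is a minimal sketch of the comment download, assuming the praw package (the original project may rely on a different Reddit client; credentials and the subreddit name are placeholders):

```python
# Minimal sketch of the comment download, assuming the praw package.
# Credentials and the subreddit name below are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="garrascobike-scraper",
)

comments = []
for submission in reddit.subreddit("MTB").new(limit=100):
    submission.comments.replace_more(limit=0)  # flatten the "load more comments" stubs
    for comment in submission.comments.list():
        comments.append({
            "thread_id": submission.id,
            "comment_id": comment.id,
            "text": comment.body,
        })
```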
Contains all the information about the submissions: a submission is a post that appears in each subreddit
e.g. from a text like "I have a 2019 Enduro" we would like to extract something like "2019" is a DATE and "Enduro" is a PRODUCT
So this block takes the Reddit comments and extracts the bike names contained within them (extraction.parquet)
Extract the entities:
Take the comments stored under /data/01_subreddit_extractions/
Run the script /garrascobike/01_entities_extraction.py
Under the hood, the script will use a named-entity recognition (NER) system backed by an ML model
If, like me, you aren't rich, chances are you don't have a powerful GPU for ML tasks, so use the above notebook and leverage Google's GPUs for free 💘
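The extraction itself can be sketched with spaCy, which is an assumption here: the real 01_entities_extraction.py may use a different NER library or model.

```python
# Hedged sketch of the entity extraction, assuming spaCy as the NER library.
import spacy

# A small model for illustration; a transformer model is slower but more accurate
nlp = spacy.load("en_core_web_sm")

doc = nlp("I have a 2019 Enduro and I love it")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Ideally prints something like:
#   2019 DATE
#   Enduro PRODUCT
```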
Load on Elasticsearch
Using ES in this scouting phase was very helpful for data discovery and for expressing queries like "how many threads exist with at least 1 product entity", "get all the entity types" or "list all the PRODUCT entities"
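For example, the first of those questions could be answered with a count query; a hedged sketch follows (the index name and the "products" field name are assumptions about the mapping):

```python
# Hedged sketch: index and field names are assumptions about the mapping.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "How many threads exist with at least 1 product entity?"
response = es.count(index="my_index", body={"query": {"exists": {"field": "products"}}})
print(response["count"])
```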
To run a mini-cluster locally, the 01_single-node cluster setup is suggested
Run the /garrascobike/02_es_uploader.py script, providing the parquet file and the ES endpoint:
# folder garrascobike-core
$ python garrascobike/02_es_uploader.py --es_index my_index \
--es_host http://localhost \
--es_port 9200 \
--input_file ./data/02_entities_extractions/extraction.parquet
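Roughly, an uploader like this could read the parquet file with pandas and bulk-index the rows; here is a hedged sketch (the real script's field handling may differ):

```python
# Hedged sketch of what an uploader like 02_es_uploader.py could do internally.
import pandas as pd
from elasticsearch import Elasticsearch, helpers

df = pd.read_parquet("./data/02_entities_extractions/extraction.parquet")

es = Elasticsearch("http://localhost:9200")
actions = (
    {"_index": "my_index", "_source": row.to_dict()}
    for _, row in df.iterrows()
)
helpers.bulk(es, actions)
```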
The parsed comments are now associated with text snippets labelled as product by the entity extraction process
After a look at those snippets, we can see that the product entities here are the bike names and brands that users on the subreddits talk about
🦂 Of course, there is also noise labelled as product, e.g. "canyon/giant" is a snippet that contains two brands and should be either discarded or split into the two entities "canyon" and "giant". Moreover, "12speed" isn't a brand but the number of gears of the bike.
We will create a semi-finished artifact: correlations.npz (newer code versions store a file named presences.csv); this file contains the product entities found in each subreddit thread
In the last step, 04_recommendation_trainer.py will train and store an ML model, based on the KNN algorithm, that powers the bike recommendation process
Finally, we will upload by hand (no automatic script provided) the files created by 04_recommendation_trainer.py
Run the 03_correlation_extraction.py script, parameters:
es_host: the Elasticsearch instance address
es_port: the Elasticsearch instance port
es_index_list: the Elasticsearch index names. More than one can be passed because it's possible to join multiple entity extractions; however, for this guide we will use only one: my_index.
$ python garrascobike/03_correlation_extraction.py --es_host localhost \
--es_port 9200 \
--es_index_list my_index
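To give an idea of what the presence extraction produces, here is a hedged sketch of building the thread x product table (the toy data and column names are illustrative, not the real script's code):

```python
# Hedged sketch of the thread x product "presence" table.
import pandas as pd

# One row per (thread, product) pair extracted from the comments
pairs = pd.DataFrame({
    "thread_id": ["t1", "t1", "t2", "t3", "t3"],
    "product": ["enduro", "stumpjumper", "enduro", "spectral", "enduro"],
})

# 1 if the product is mentioned in the thread, 0 otherwise
presences = pd.crosstab(pairs["thread_id"], pairs["product"]).clip(upper=1)
print(presences)
```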
Run the 04_recommendation_trainer.py script, parameters:
$ python garrascobike/04_recommendation_trainer.py --presence_data_path ./data/03_correlation_data/presence_dataset/202/presences.csv \
--output_path ./data/04_recommendation_models/knn/
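Under the hood, the trainer could look roughly like the hedged sketch below: each bike is described by the threads it appears in, and a scikit-learn NearestNeighbors model is fitted on that matrix (column names, paths and preprocessing are assumptions, not the real script's code):

```python
# Hedged sketch of a KNN-based trainer: bikes that co-occur in the same
# threads become neighbours. Column names and paths are assumptions.
import joblib
import pandas as pd
from sklearn.neighbors import NearestNeighbors

presences = pd.read_csv("./data/03_correlation_data/presence_dataset/202/presences.csv")
matrix = pd.crosstab(presences["product"], presences["thread_id"])

knn = NearestNeighbors(n_neighbors=4, metric="cosine")  # 1 self-match + 3 suggestions
knn.fit(matrix.values)

joblib.dump(knn, "./data/04_recommendation_models/knn/knn_model.joblib")
pd.Series(matrix.index).to_csv("./data/04_recommendation_models/knn/bike_names.csv", index=False)
```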
/recommender/{bike_name}: takes a bike name and returns 3 suggested bikes
/brands/available: returns the list of supported bike names
/health: returns a timestamp and will be used to check if the back-end is up and running
We need the list of bikes that the recommendation system model can manage: brands.csv
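The three endpoints listed above can be sketched, for instance, with FastAPI; this is an assumption, and the real garrascobike_be code, model loading and response schema may differ:

```python
# Minimal back-end sketch assuming FastAPI; the bike list is a placeholder.
from datetime import datetime

from fastapi import FastAPI, HTTPException

app = FastAPI()

SUPPORTED_BIKES = ["specialized enduro", "canyon spectral", "giant trance"]  # would come from brands.csv


@app.get("/health")
def health():
    return {"timestamp": datetime.utcnow().isoformat()}


@app.get("/brands/available")
def brands_available():
    return {"brands": SUPPORTED_BIKES}


@app.get("/recommender/{bike_name}")
def recommender(bike_name: str):
    if bike_name not in SUPPORTED_BIKES:
        raise HTTPException(status_code=404, detail="Bike not found")
    # Placeholder: the real implementation would query the trained KNN model
    suggestions = [b for b in SUPPORTED_BIKES if b != bike_name][:3]
    return {"bike": bike_name, "suggestions": suggestions}
```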
To set the Backblaze bucket name and path modify the file at: /garrascobike_be/ml_model/hosted-model-info.json
To set the Backblaze connection credentials, modify the file /garrascobike_be/.env_example and rename it to .env
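For illustration only, one hedged way the back-end could fetch the model files from Backblaze is through its S3-compatible API; the bucket, key, endpoint and environment-variable names below are assumptions, not the real back-end code:

```python
# Hedged sketch: download the trained model from Backblaze via the S3-compatible API.
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-002.backblazeb2.com",
    aws_access_key_id=os.environ["B2_KEY_ID"],
    aws_secret_access_key=os.environ["B2_APPLICATION_KEY"],
)
s3.download_file("my-garrascobike-bucket", "knn/knn_model.joblib", "/tmp/knn_model.joblib")
```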
The back-end OpenAPI specification can be found under https://<heroku-app-url>/docs
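As a quick smoke test of the deployed back-end, a minimal sketch with requests (replace the placeholder with your Heroku app URL):

```python
# Quick smoke test of the deployed back-end; replace the placeholder URL.
import requests

base_url = "https://<heroku-app-url>"
print(requests.get(f"{base_url}/health").json())
print(requests.get(f"{base_url}/brands/available").json())
print(requests.get(f"{base_url}/recommender/enduro").json())
```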
You can use the live-server program to run the website locally
**📧 Found an error or have a question? Let's [connect](https://www.pistocop.dev/)**
This article was first published here: