Introduction
The following is a complete guide that will teach you how to create your own algorithmic trading bot that makes trades based on quarterly earnings reports (10-Q) filed with the SEC by publicly traded US companies. We will cover everything from downloading historical 10-Q filings and cleaning the text to building your machine learning model. The training set for the ML model uses the text from these historical filings, labeled with the next-day price action after each filing. We then download and clean the 10-Q filings from each day and use our trained model to make predictions on each filing. Depending on the prediction, we automatically execute a commission-free trade on that company's ticker symbol.

0. Our Project code:
1. Basic Dependencies: Python 3.4, Pandas, BeautifulSoup, yfinance, fuzzywuzzy, nltk
Python is always my go-to language of choice for projects like these, for the same reasons many others choose it: fast development, readable syntax, and an awesome wealth of good-quality libraries available for a huge range of tasks.

Pandas is a classic for any data science project, used to store, manipulate, and analyze your data in dataframe tables.

yfinance is a Python library used to retrieve stock prices from Yahoo Finance.

fuzzywuzzy provides fuzzy text similarity results for creating our 10-Q diffs.

NLTK allows us to split the text of the 10-Q reports into sentences.
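If you are starting from a fresh environment, all of these can be installed with pip (the package names below are the current PyPI names; fuzzywuzzy also runs faster with the optional python-Levenshtein package):

> pip install pandas beautifulsoup4 yfinance fuzzywuzzy nltk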
2. Alpaca Commission-Free Trading API

I researched and tried several solutions for equity brokers that claim to offer APIs to retail traders. Alpaca was far and away the easiest to use, with the clearest documentation. (I have no affiliation with them.) Other brokers that I tested were:

Interactive Brokers
Searching around the web, these guys seemed to have a reputation as the "gold standard" in the space. However, testing their software was a real eye-opener to the sad state of most retail trading APIs. It turned out they did not have an actual API as you'd expect, but instead a shockingly ancient desktop application for you to install, plus libraries to automate control of that desktop app. Their documentation was messy and convoluted. After a few attempts at running their example code and attempting some test trades, I could tell that making IB a stable part of an algo trading bot would be possible, but a significant project in and of itself.

Think-or-swim by TD Ameritrade
It was clear that ToS's API was much newer and more sensible to use than Interactive Brokers'. However, it was also clear that it was not a mature solution. Although I did get it working, even the initial authentication process for the API was strange and required undocumented information found on various forums. The trade execution APIs themselves appeared straightforward to use, but the written documentation on them was extremely sparse.

3. Google Cloud AutoML Natural Language
Google naturally has vast experience and investment in ML natural language processing due to the nature of its business as a search engine. After trialing several other ML techniques and solutions, Google's commercial offering produced the best model accuracy while being easy enough to use that this project would not get caught up in an academic exercise of endlessly tuning and testing various ML algorithms.

Other ML libraries trialed: Initially I tried the following ML libraries, along with creating a bag of bigrams from the filings' text to use as the feature set: h2o.ai, Keras, auto-sklearn, and AWS SageMaker.
A big challenge with this technique is that vectorizing a bag of bigrams creates a huge number of features for each data point in the training set. There are various techniques available to deal with this, but they can discard predictive signal to varying degrees.
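To make the dimensionality problem concrete, here is a minimal sketch of bag-of-bigrams vectorization using scikit-learn (my choice for illustration; the article's trials used the libraries listed above):

# Toy example: even two short documents produce a wide bigram feature space.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "revenue increased due to higher sales volume this quarter",
    "revenue decreased due to lower sales volume this quarter",
]

vectorizer = CountVectorizer(ngram_range=(2, 2))  # bigrams only
X = vectorizer.fit_transform(documents)

# A real corpus of thousands of 10-Qs yields hundreds of thousands of columns.
print(X.shape)
print(len(vectorizer.vocabulary_))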
4. Python-edgar: A neat little Python library for bulk downloading lists of historical SEC filings

> python run.py -y 2010

I chose to download nearly 10 years' worth of filing indexes (since 2010) to build our ML model with. You may download all the way back to 1993, or download less with a different year argument. Once this is finished, we can compile the results into a single master file:
> cat *.tsv > master.tsv

Now use quarterly-earnings-machine-learning-algo/download_raw_html.py to download all the raw HTML 10-Q filings listed in the index we just created:
> python download_raw_html.py path/to/master.tsv
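Under the hood this boils down to walking master.tsv and pulling each filing from the EDGAR archive. A rough sketch of the idea (the column layout of master.tsv is an assumption here; inspect your copy and adjust the field indices):

import os
import requests

os.makedirs("filings", exist_ok=True)
with open("master.tsv") as index:
    for line in index:
        fields = line.strip().split("|")  # assumed delimiter; check your file
        if len(fields) < 5 or fields[2] != "10-Q":
            continue  # only quarterly reports
        path = fields[-1]  # assumed: relative EDGAR path of the filing
        # The SEC asks for a descriptive User-Agent on automated requests
        resp = requests.get("https://www.sec.gov/Archives/" + path,
                            headers={"User-Agent": "you@example.com"})
        with open(os.path.join("filings", path.replace("/", "_")), "w") as out:
            out.write(resp.text)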
This will take a significant amount of time to run, as it will download many GB of data from the SEC website. Each filing averages several dozen MB of data. When it finishes, we will have a "./filings" folder that contains the raw HTML of all of the filings.

> python filing_cleaner.py

This will take the HTML files from the "filings" directory created in the previous step and output cleaned text files into a "cleaned_filings" directory. It performs several cleaning steps to prepare the text for natural language processing.
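The exact rules live in filing_cleaner.py; the core idea is stripping the filing's HTML down to plain prose. A minimal sketch with BeautifulSoup (the choice of which tags to drop is an illustrative assumption, not the script's exact rules):

import re
from bs4 import BeautifulSoup

def clean_filing(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-prose elements before extracting the text
    for tag in soup(["script", "style", "table"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse the runs of whitespace the markup leaves behind
    return re.sub(r"\s+", " ", text).strip()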
> python add_financial.py

This reads the filenames with ticker symbols from the "cleaned_filings" directory created in the previous step and outputs financials.pkl, a Pandas dataframe containing the next-day price change for each filing.
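The labeling itself is straightforward with yfinance. A minimal sketch, assuming a close-to-next-close percent change (the exact window and the thresholds add_financial.py uses to bucket these changes into categories are not reproduced here):

import yfinance as yf

def next_day_change(ticker, start, end):
    # Daily bars starting at the filing date; dates are "YYYY-MM-DD" strings
    hist = yf.Ticker(ticker).history(start=start, end=end)
    if len(hist) < 2:
        return None  # not enough trading days (holiday, delisting, bad ticker)
    # Percent change from the filing day's close to the next session's close
    return (hist["Close"].iloc[1] - hist["Close"].iloc[0]) / hist["Close"].iloc[0]

print(next_day_change("AAPL", "2020-02-03", "2020-02-07"))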
Step 4. Produce Text Deltas of Each Quarterly Earnings Report From the Company's Last 10-Q Filing
In this step, we take each cleaned quarterly earnings report and take sentence-by-sentence fuzzy diffs against the company's last 10-Q filing to remove text that also appeared in that last filing. This is an important step that strips away a huge amount of excess text and creates a clean report of what the company has added since its last quarterly earnings report. This gives a beautifully clean signal to build our machine learning model on, because only the information that the company deemed important enough to add in its latest filing becomes part of our training data. Remember to use pip to install the nltk and fuzzywuzzy dependencies before running.

> python diff_cleaned_filings.py

This command will take the cleaned text files from the "cleaned_filings" directory and output the text delta for each one in the "whole_file_diffs" directory.
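The heart of the diff is the sentence-level fuzzy comparison. A minimal sketch: tokenize both filings into sentences with NLTK, and keep a sentence only if its best fuzzywuzzy match in the previous filing falls below a similarity threshold (the threshold of 85 here is an illustrative assumption, not necessarily what diff_cleaned_filings.py uses):

import nltk
from fuzzywuzzy import fuzz

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def filing_delta(new_text, old_text, threshold=85):
    old_sentences = nltk.sent_tokenize(old_text)
    delta = []
    for sentence in nltk.sent_tokenize(new_text):
        # Keep the sentence only if nothing in the old filing closely matches it
        best = max((fuzz.ratio(sentence, old) for old in old_sentences), default=0)
        if best < threshold:
            delta.append(sentence)
    return " ".join(delta)

print(filing_delta("Revenue rose. We also opened a new plant in Texas.",
                   "Revenue rose."))

Note that this is quadratic in the number of sentences, which is one reason the diff step takes a while on full filings.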
> python cloudml_prepare_local_csv.py

This will output a file in the current directory called "training_data.csv", which is ready to be uploaded to Google.
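AutoML Natural Language ingests CSV rows of text followed by its label, with no header row. A sketch of the preparation, assuming financials.pkl is indexed by filing filename with a "label" column (those names are illustrative; check the dataframe produced earlier):

import os
import pandas as pd

financials = pd.read_pickle("financials.pkl")
rows = []
for filename, label in zip(financials.index, financials["label"]):
    with open(os.path.join("whole_file_diffs", filename)) as f:
        rows.append((f.read(), label))

# No header row: AutoML expects just content followed by its label
pd.DataFrame(rows).to_csv("training_data.csv", index=False, header=False)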
> python MakeTrades.py <Alpaca API Key ID> <Alpaca Secret Key> <Google Model Name>

This command will:

1. Download the latest market day's 10-Q filings from the SEC website. This should only be run late on a market day, since that is when all the filings for the day will be available. If you run it earlier, it will give you yesterday's filings.

2. Clean each 10-Q filing and diff it with the company's last 10-Q filing, as we did in our training preparation. If the company did not have a 10-Q filed in the past 3 months or so, it is skipped.

3. Submit the text delta for an online prediction with our ML model.

4. If our model returns a prediction of 0 (predicting the most dramatic price-drop category), use the Alpaca API to put in a short order for that stock that will execute at the following day's market open.

Remember to close the short positions after they have been held for a day; you can write a script for this if you would like (a sketch follows below). You can also schedule this command with a cron job to be run at the end of each market day for complete automation.
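For the cleanup script the article leaves to you, a minimal sketch with the alpaca-trade-api client: list the open positions and buy back anything short (the paper-trading base URL is shown; swap in the live endpoint when ready):

import sys
import alpaca_trade_api as tradeapi

api = tradeapi.REST(sys.argv[1], sys.argv[2],
                    base_url="https://paper-api.alpaca.markets")

# Buy to cover every open short so each position is held for roughly one day
for position in api.list_positions():
    qty = int(position.qty)  # negative for short positions
    if qty < 0:
        api.submit_order(symbol=position.symbol, qty=abs(qty), side="buy",
                         type="market", time_in_force="day")

Scheduled with cron alongside MakeTrades.py, this keeps the bot fully hands-off.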
Originally featured on:
I am available for software consulting for any of your company's projects. Click here to book a free initial consultation.