Retrieving historical stock data for analysis can be a chore. Many APIs that provide this information require a membership, an account, or even a fee before you can access the data. Fortunately, Yahoo Finance offers historical stock data free of charge on its site, with the option to download it. However, using the site becomes a hassle when you want to retrieve new data on the fly or need data for many different companies, ETFs, index funds, etc. In this post, I will show you how to use Python to create a class that can retrieve data for any ticker that Yahoo has data for. This can be done easily and on the fly for any number of tickers. In subsequent posts, I will be using this “API wrapper” for projects pertaining to stock price/time series simulation and analysis.
The API
I’m not familiar with any official APIs offered by Yahoo Finance but many people take advantage of the fact that historic data can be directly downloaded from the site. When downloading this data you’re essentially “navigating” to a different URL but rather than leaving the current page this URL will just download a file. We can take advantage of this using the Pandas library for Python.
The great thing about Pandas, besides almost everything, is that its read_csv function accepts anything that resolves to a CSV file, including URLs that download CSVs. The following API takes advantage of this.
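To make the point concrete, here is a minimal sketch of pd.read_csv parsing CSV text. It reads from an in-memory buffer instead of a live URL so that it runs offline; the two sample rows are copied from the MSFT output shown later in this post, trimmed to three columns for brevity. A URL string could be passed in the same position.

```python
import io

import pandas as pd

# Two rows copied from the MSFT output in this post (columns trimmed).
SAMPLE_CSV = """Date,Open,Close
2020-01-02,158.779999,160.619995
2020-01-03,158.320007,158.619995
"""

# read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df)
```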
The Class
We define a class to be used in other Python projects for the API and include a few libraries:
import datetime  # for start and end periods
import time

import pandas as pd  # to return a data frame with the stock data


class YahooAPI(object):
    def __init__(self, interval="1d"):
        pass

    def __build_url(self, ticker, start_date, end_date):
        pass

    def get_ticker_data(self, ticker, start_date, end_date):
        pass
As seen above, this API is going to be very simple and straightforward; its only intent is to retrieve historic stock data. The class’s __init__ method takes one optional parameter specifying the interval at which we want the data returned. Valid options are 1d (one day), 1wk (one week), and 1mo (one month); by default we’ll just pull daily data.
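Since the interval is just a string, a typo would only surface when the request fails. A small guard can catch it earlier. The sketch below is not part of the class as written; validate_interval and VALID_INTERVALS are hypothetical names, and it assumes the monthly value is spelled 1mo (on Yahoo, 1m is understood to mean one minute).

```python
# Assumed set of supported interval strings; "1mo" is assumed to be the monthly value.
VALID_INTERVALS = {"1d", "1wk", "1mo"}


def validate_interval(interval):
    """Return the interval unchanged, or raise ValueError if it is not supported."""
    if interval not in VALID_INTERVALS:
        raise ValueError(
            f"Unsupported interval {interval!r}; expected one of {sorted(VALID_INTERVALS)}"
        )
    return interval


print(validate_interval("1wk"))
```

Calling this at the top of __init__ would be one way to reject a bad interval before any URL is built.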
The Functions
Now we’ll fill in the functions:
__init__(…)
def __init__(self, interval="1d"):
    self.base_url = "https://query1.finance.yahoo.com/v7/finance/download/{ticker}?period1={start_time}&period2={end_time}&interval={interval}&events=history"
    self.interval = interval
The setup is simple: define the base URL and store the desired interval.
__build_url(…)
def __build_url(self, ticker, start_date, end_date):
    return self.base_url.format(ticker=ticker, start_time=start_date, end_time=end_date, interval=self.interval)
The private method __build_url(…) takes the ticker symbol of the stock we’re interested in retrieving data for, a start date, and an end date (correctly formatted) and builds the URL that can be used to get the stock data.
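Some ticker symbols, such as index symbols that start with a caret, contain characters that are not safe to place in a URL path as-is. The following standalone sketch shows one way to handle that by percent-encoding the ticker before formatting. build_download_url is a hypothetical helper written for illustration, not a method of the class above, and the epoch values are example inputs.

```python
from urllib.parse import quote

# Same URL template as the class above, split across lines for readability.
BASE_URL = (
    "https://query1.finance.yahoo.com/v7/finance/download/{ticker}"
    "?period1={start_time}&period2={end_time}&interval={interval}&events=history"
)


def build_download_url(ticker, start_time, end_time, interval="1d"):
    """Fill in the template, percent-encoding the ticker symbol."""
    return BASE_URL.format(
        ticker=quote(ticker, safe=""),  # e.g. "^" becomes "%5E"
        start_time=start_time,
        end_time=end_time,
        interval=interval,
    )


print(build_download_url("msft", 1577836800, 1593302400))
print(build_download_url("^GSPC", 1577836800, 1593302400))
```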
get_ticker_data(…)
def get_ticker_data(self, ticker, start_date, end_date):
    # must pass datetime into this function
    epoch_start = int(time.mktime(start_date.timetuple()))
    epoch_end = int(time.mktime(end_date.timetuple()))

    return pd.read_csv(self.__build_url(ticker, epoch_start, epoch_end))
The get_ticker_data(…) function is the access point to the API for the developer who wants the stock data. This function takes the ticker symbol and start and end dates as Python datetime objects. The start and end dates are transformed into the correct format for Yahoo (timestamps representing seconds since the Unix epoch). A Pandas data frame containing all of the historic stock data is returned from this function call. As seen here, the data frame is created simply by passing the URL for the CSV file to pd.read_csv.
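One detail worth knowing about the conversion: time.mktime interprets a naive datetime as local time, so the resulting timestamp depends on the time zone of the machine running the code. The sketch below compares it with calendar.timegm, which treats the same value as UTC. to_epoch_local and to_epoch_utc are hypothetical helper names used only for this comparison.

```python
import calendar
import datetime
import time


def to_epoch_local(dt):
    """Interpret a naive datetime as local time (this is what time.mktime does)."""
    return int(time.mktime(dt.timetuple()))


def to_epoch_utc(dt):
    """Interpret a naive datetime as UTC, independent of the machine's time zone."""
    return calendar.timegm(dt.timetuple())


start = datetime.datetime(2020, 1, 1)
print(to_epoch_utc(start))    # 1577836800
print(to_epoch_local(start))  # varies with the local time zone
```

For daily data the difference is usually harmless, but it can shift the first or last row returned when the local offset pushes the timestamp across a day boundary.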
Test
if __name__ == '__main__':
    dh = YahooAPI()
    now = datetime.datetime(2020, 6, 28)  # get data up to 6/28/2020
    then = datetime.datetime(2020, 1, 1)  # get data from 01/01/2020
    df = dh.get_ticker_data("msft", then, now)
    print(df)
Finally, the snippet above tests the API. For those unfamiliar, the code under the statement if __name__ == '__main__': executes only when this source file is the main entry point to the program. That is, if I run this file standalone the logic will execute; if the code in this file is imported into another file, it will not. For a quick-and-dirty test, I’ve used this if statement to verify that I am able to fetch data for Microsoft stock (MSFT). Running this logic produces the following results:
           Date        Open        High  ...       Close   Adj Close    Volume
0    2020-01-02  158.779999  160.729996  ...  160.619995  159.737595  22622100
1    2020-01-03  158.320007  159.949997  ...  158.619995  157.748581  21116200
2    2020-01-06  157.080002  159.100006  ...  159.029999  158.156342  20813700
3    2020-01-07  159.320007  159.669998  ...  157.580002  156.714310  21634100
4    2020-01-08  158.929993  160.800003  ...  160.089996  159.210495  27746500
..          ...         ...         ...  ...         ...         ...       ...
118  2020-06-22  195.789993  200.759995  ...  200.570007  200.570007  32818900
119  2020-06-23  202.089996  203.949997  ...  201.910004  201.910004  30917400
120  2020-06-24  201.600006  203.250000  ...  197.839996  197.839996  36740600
121  2020-06-25  197.800003  200.610001  ...  200.339996  200.339996  27803900
122  2020-06-26  199.729996  199.889999  ...  196.330002  196.330002  54649200

[123 rows x 7 columns]
Conclusion
As seen above, retrieving data from Yahoo Finance is very straightforward in Python. In under 20 lines of code we’ve developed the ability to get daily, weekly, or monthly data for any ticker symbol listed on Yahoo Finance. This data can be used for a plethora of applications, including stock data analysis, training and testing machine learning algorithms, and developing stock trading bots. In future posts, I will use this API to build up portfolios of securities to work with some ideas in modern portfolio theory and stochastic process modeling.
Full Code
import datetime  # for start and end periods
import time

import pandas as pd  # to return a data frame with the stock data


class YahooAPI(object):
    def __init__(self, interval="1d"):
        self.base_url = "https://query1.finance.yahoo.com/v7/finance/download/{ticker}?period1={start_time}&period2={end_time}&interval={interval}&events=history"
        self.interval = interval

    def __build_url(self, ticker, start_date, end_date):
        return self.base_url.format(ticker=ticker, start_time=start_date, end_time=end_date, interval=self.interval)

    def get_ticker_data(self, ticker, start_date, end_date):
        # must pass datetime objects into this function, not strings
        epoch_start = int(time.mktime(start_date.timetuple()))
        epoch_end = int(time.mktime(end_date.timetuple()))

        return pd.read_csv(self.__build_url(ticker, epoch_start, epoch_end))


if __name__ == '__main__':
    dh = YahooAPI()
    # dates are datetime objects; plain strings would fail at timetuple()
    df = dh.get_ticker_data("msft", datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
    print(df)
Thanks for putting this together. I was on the verge of writing something similar myself after discovering that pandas_datareader.data.DataReader pulls in the entire Yahoo finance page much like a web browser, which is a massive overhead, and then extracts the data from it. In contrast, this targets the exact data point.
With this, the overhead is gone: latency and throughput are improved and traffic is reduced. Well done.
I expanded your code a little bit to include support for requests.Session, which (in theory) allows some connection pooling for multiple requests and for proxy servers to be specified.
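The expanded code itself is not included in this comment. As a rough, hypothetical sketch of what session support might look like, the function below downloads the CSV through a caller-supplied session and parses the response text. fetch_csv is an invented name, and the session is only assumed to behave like requests.Session (a get method returning a response with raise_for_status() and text).

```python
import io

import pandas as pd


def fetch_csv(session, url, timeout=10):
    """Download a CSV through the given session and parse it into a data frame.

    The session is assumed to behave like requests.Session: session.get(url,
    timeout=...) returns a response with raise_for_status() and a .text attribute.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return pd.read_csv(io.StringIO(response.text))
```

With the requests library installed, a single requests.Session() could be created once, given proxy settings through its proxies mapping if needed, and then reused across calls for many tickers.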
Thanks again!
One last thing (sorry!), to return the exact same results as pandas DataReader does (which a few people use), I’d include the datetime parsing and indexing in the get_ticker_data method:
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')
data.set_index('Date', inplace=True)
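As a self-contained illustration of those two lines, the sketch below applies them to a small in-memory CSV instead of a live download. The two sample rows are copied from the MSFT output in the post, trimmed to three columns; the variable names are only for this example.

```python
import io

import pandas as pd

# Two rows copied from the MSFT output in the post (columns trimmed).
SAMPLE_CSV = """Date,Open,Close
2020-01-02,158.779999,160.619995
2020-01-03,158.320007,158.619995
"""

data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Parse the Date column and use it as the index.
data["Date"] = pd.to_datetime(data["Date"], format="%Y-%m-%d")
data = data.set_index("Date")

# With a DatetimeIndex, rows can be looked up by timestamp.
print(data.loc[pd.Timestamp("2020-01-03"), "Close"])
```

The same result could likely be had in one step by passing parse_dates and index_col to pd.read_csv, which would avoid the separate conversion.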
Thanks for the comments.
That’s an excellent point. I’ve found myself using the API a few times now and having to convert the column to a datetime in the other scripts rather than the API. I’m actually not familiar with the DataReader in Pandas, although I have heard of it, so I wasn’t focused on being consistent with that but I think that would add a lot of convenience to my implementation.
Thanks again!