DataTables: A C++ Tabular Data Structure Project

This project’s GitHub can be found here.

Quick-Nav: Implementation, Installation, Examples, Future Work

For statistical programming languages, or languages with good statistics processing libraries, the DataFrame is an essential structure. Most features of these languages and libraries (e.g. the R programming language or the Pandas package for Python) revolve around the DataFrame object, which provides useful functionality for working with datasets. There has been a big push to incorporate this type of structure in C++, with a few open-source libraries on GitHub, especially the xtensor project, which works to imitate NumPy arrays.

Although I’m sure these libraries are great, for the sake of learning by doing, I decided to create my own implementation of a data storage object in C++ to efficiently handle datasets. Initially, this functionality was part of a library I was creating (also for the sake of learning by doing) called YALL (Yet Another Learning Library) [the name was thought up independently but it’s not very original]. However, I found this functionality useful and, since it really can be a standalone project, decided to pull it out of YALL and put it in its own repository.

The goal of this project is to provide the basic functionality of Pandas or R DataFrames without too much bloat. Since this project is much smaller than those two implementations, initially, only the most commonly used functionalities will be incorporated into the DataTable object. Ideally, these tables will have a small computational footprint and remain memory efficient, i.e. there will be a minimal amount of metadata so that using the DataTable class doesn’t decimate your RAM. In this post, I will discuss the first version of the DataTable class, which incorporates some limited functionality, by walking through the current implementation. Note that this is an initial version, the project is just getting started, and I plan on adding much-needed functionality and (code/project) quality in the future.

Implementation

DataTables is written in C++ and was tested on Ubuntu 19.04 and Netrunner 19.08, both Debian-based distributions. CMake is used for the build process, so the project should be cross-platform; however, it has not been tested on macOS or Windows.

Due to the length of the code for the function implementations (>600 lines), I won’t share all of it here. All of the project files, including the function implementations, can be found on my GitHub. What I will go over here is, essentially, the header file that defines the DataTable class. I will walk through creating data tables, operating on them, reading and writing data, and viewing the data in the table to give some idea of how the structure is intended to be used. At times I will also try to draw parallels with Pandas and R DataFrames for additional context.

Creating Tables

Below are the various constructors for the DataTable class:

// load no data; do nothing
DataTable();

// load data from a CSV, not necessarily with a specified response
DataTable(std::string csv_file_name, std::string response_column="", bool has_headers=true);

// load data from an array, specify response name
DataTable(std::string* headers, std::string response_name, double** data, int nrows, int ncols, bool has_headers=true);

// load data from an array, specify response column number
DataTable(std::string* headers, int response_column, double** data, int nrows, int ncols, bool has_headers=true);

As seen here, the DataTable class only handles numeric data. The class was primarily designed for use in machine learning algorithms, many of which are built on operations in linear algebra that only operate on numeric data. Conversion from qualitative to quantitative columns is left to the class’ user.
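
Because the class only stores doubles, a text column (such as a class label) has to be mapped to numbers before it reaches a DataTable. Below is a minimal sketch of one way to do that. Note that encode_labels is a hypothetical helper, not part of the DataTable API, and that assigning codes in order of first appearance is just one possible convention.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of DataTable): map each distinct label to a
// numeric code, assigned in order of first appearance (0, 1, 2, ...).
std::vector<double> encode_labels(const std::vector<std::string>& labels)
{
    std::map<std::string, double> codes;
    std::vector<double> encoded;
    encoded.reserve(labels.size());
    for (const std::string& label : labels)
    {
        // emplace only inserts when the label is new; either way it returns
        // an iterator to the entry holding this label's code
        auto it = codes.emplace(label, static_cast<double>(codes.size())).first;
        encoded.push_back(it->second);
    }
    assert(encoded.size() == labels.size());
    return encoded;
}
```

The encoded values could then be written back into the CSV, or placed in the data array handed to one of the array-based constructors.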

Arguably the most useful constructor is the second one listed, which loads data from a CSV file. In using the DataTable class, I’ve found the array-based constructors particularly useful for copying DataTables, since they are faster (no CSV needs to be read) and their arguments can be taken straight from an existing DataTable.
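
When the data does not come from an existing DataTable, the array-based constructors need a raw double** to be built by hand. The sketch below shows one way to do that from standard containers. Both helpers are hypothetical, and the data[row][col] layout is an assumption that should be checked against the implementation before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a rectangular vector-of-vectors into a freshly
// allocated double** laid out as data[row][col] (assumed layout).
double** make_raw_matrix(const std::vector<std::vector<double>>& rows)
{
    double** data = new double*[rows.size()];
    for (std::size_t i = 0; i < rows.size(); i++)
    {
        assert(rows[i].size() == rows[0].size()); // every row must have the same width
        data[i] = new double[rows[i].size()];
        for (std::size_t j = 0; j < rows[i].size(); j++)
            data[i][j] = rows[i][j];
    }
    return data;
}

// Release a matrix that was allocated by make_raw_matrix.
void free_raw_matrix(double** data, std::size_t nrows)
{
    for (std::size_t i = 0; i < nrows; i++)
        delete[] data[i];
    delete[] data;
}
```

The result, together with a std::string array of headers, has the shape the array constructors expect. Whether the constructor copies the array or keeps the pointer is not covered in this post, so only free the matrix once that is known.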

Modifying Data

There are a few operations to modify the data in a DataTable. Again, these were implemented with machine learning applications in mind. Some things a person might want to do in that realm are shuffle the data, remove columns from the dataset (e.g. to separate the response from the explanatory variables), and remove rows from the dataset (e.g. for train/test splitting). I considered these operations paramount for machine learning/data science applications, so they were the first operations added to the class. The function declarations are shown below.

void drop_columns(int* columns, int count);
void drop_columns(std::string* column_names, int count);
void drop_rows(int* rows, int count);
void shuffle_rows(int passes=100);

The DataTable class offers two methods to remove columns from the dataset: the first by column position and the second by column name. There is one method to remove rows from the dataset, by position only. One way to implement a train/test split with DataTables is to create two identical data tables, generate two disjoint sets of random row numbers (one for train and one for test), and drop one set of rows from one DataTable and the other set from the other DataTable; a simpler method is shown in the Examples section below. Lastly, there is a method to shuffle the rows of the dataset, which is helpful in randomly selecting training and testing data.
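
The two sets of random row numbers mentioned above can be produced with the standard library alone. The sketch below is one possible approach, not code from the project: split_row_indices is a hypothetical helper that shuffles the row numbers with std::shuffle and cuts them into two disjoint groups.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: shuffle the row numbers 0..nrows-1 and cut them into two
// disjoint sets. The first set holds roughly train_fraction of the rows.
std::pair<std::vector<int>, std::vector<int>>
split_row_indices(int nrows, double train_fraction, unsigned seed)
{
    assert(nrows >= 0 && train_fraction >= 0.0 && train_fraction <= 1.0);
    std::vector<int> rows(nrows);
    std::iota(rows.begin(), rows.end(), 0);      // fill with 0, 1, ..., nrows-1
    std::mt19937 rng(seed);                      // seeded so the split is repeatable
    std::shuffle(rows.begin(), rows.end(), rng);

    int train_size = static_cast<int>(nrows * train_fraction);
    std::vector<int> train_rows(rows.begin(), rows.begin() + train_size);
    std::vector<int> test_rows(rows.begin() + train_size, rows.end());
    return {train_rows, test_rows};
}
```

To build the training table, the test rows would be dropped from one copy, e.g. drop_rows(test_rows.data(), static_cast<int>(test_rows.size())), and the train rows from the other copy. Whether drop_rows expects the row numbers in sorted order is not something this post specifies.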

Viewing Data

Personally, when using tabular data structures from other programming languages (e.g. Python/Pandas and R), I’ve found it useful to be able to view a particular column or row, view the columns in a dataset, and determine the ‘shape’ of a dataset (i.e. the number of rows and columns). Thus, I’ve implemented features in the DataTable class that provide this type of functionality.

friend std::ostream& operator<<(std::ostream& os, const DataTable &table);
void print(std::ostream& stream);
void print_column(std::ostream& stream, int column);
void print_column(std::ostream& stream, std::string column_name);
void print_row(std::ostream& stream, int row);
void print_headers(std::ostream& stream);
void print_shape(std::ostream& stream);

These function declarations are pretty straightforward. The DataTable class provides the functionality to view a column by name or position and a row by position. A user can also use the print_headers(…) method to view the column names, as is done via DataFrame.columns in Pandas. Another useful feature in Pandas is viewing the number of rows and columns in a dataset, i.e. DataFrame.shape; the DataTable class provides this functionality via the print_shape(…) method. Later, I will show how to actually retrieve the numeric values for the number of rows and columns in the dataset. Finally, the DataTable can be printed in two ways:

#include <iostream>
using namespace std;
...
datatable::DataTable dt(...);
dt.print(cout);    // method 1: print via the print method
cout << dt;          // method 2: print via the overloaded operator

both of which should print the headers of the dataset (if available) followed by one row of data for each row in the dataset. An error is displayed if the table is empty.
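
For readers unfamiliar with the pattern, the sketch below shows how a print method and an overloaded stream operator can share one implementation. TinyTable is a hypothetical stand-in written only for illustration; it is not the DataTable source and does not reproduce the empty-table error.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// TinyTable is a hypothetical stand-in, not the real DataTable.
class TinyTable
{
public:
    TinyTable(std::vector<std::string> headers, std::vector<std::vector<double>> rows)
        : _headers(std::move(headers)), _rows(std::move(rows)) {}

    // method 1: write the headers, then one line per row, to any stream
    void print(std::ostream& stream) const
    {
        for (std::size_t j = 0; j < _headers.size(); j++)
            stream << (j ? "," : "") << _headers[j];
        stream << "\n";
        for (const auto& row : _rows)
        {
            for (std::size_t j = 0; j < row.size(); j++)
                stream << (j ? "," : "") << row[j];
            stream << "\n";
        }
    }

    // method 2: the overloaded operator simply forwards to print()
    friend std::ostream& operator<<(std::ostream& os, const TinyTable& table)
    {
        table.print(os);
        return os;
    }

private:
    std::vector<std::string> _headers;
    std::vector<std::vector<double>> _rows;
};

// Render a table to a string through the overloaded operator.
std::string to_text(const TinyTable& table)
{
    std::ostringstream out;
    out << table;
    return out.str();
}
```

Because the operator takes any std::ostream, the same code path serves std::cout, file streams, and string streams.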

Selecting Data

Below is a list of the currently implemented functionality that allows the selection of data in the DataTable. Since there are so many methods I won’t go over them individually. However, the function names are pretty self-explanatory.

double** get_data();					
double* get_row(int row);		
double* get_column(int column);
double* get_column(std::string column_name);
double* get_response();
double** get_all_explanatory();
DataTable select_columns(int* column_numbers, int number_columns);
DataTable select_columns(std::string* variables, int number_cols);
DataTable select_rows(int* row_numbers, int number_rows);
DataTable top_n_rows(int n);
DataTable bottom_n_rows(int n);
DataTable select_row_range(int start, int end);
std::string get_header_at(int col);
std::string* get_headers();
std::string* get_explanatory_headers();
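
Several of these getters return raw pointers, so the caller has to supply the length separately. As a small illustration, the hypothetical helper below averages a raw column; it is not part of the DataTable API.

```cpp
#include <cassert>

// Hypothetical helper: average the n values in a raw column, such as one
// returned by get_column(...). n would typically come from nrows().
double column_mean(const double* column, int n)
{
    assert(column != nullptr && n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += column[i];
    return sum / n;
}
```

With a loaded table this might be called as column_mean(dt.get_column("x"), dt.nrows()). Whether the returned pointer is a copy that the caller must delete[] is not stated in this post, so check the implementation before deciding who frees it.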

Reading and Writing Data

Two methods are provided for file I/O: from_csv(…) reads data from a file and to_file(…) writes data to a file. The declarations for these functions are below.

void from_csv(std::string filename, std::string response, bool has_headers=true);
void to_file(std::string filename, char delimiter=',');

One thing to note is that to_file(…) lets the user set the delimiter for the output file, in case the file needs to be delimited by tabs, colons, or some other character. The from_csv(…) function loads data from a CSV file into an empty DataTable; as shown above, there is also a constructor that loads data from a CSV file.
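
Since to_file(…) can write with any delimiter but the reader shown here is CSV-specific, reading such a file back may require splitting lines manually. The sketch below shows one way to parse a single delimited line with the standard library; parse_delimited_line is a hypothetical helper and not part of the project.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one line of delimited text into doubles.
// std::stod throws std::invalid_argument if a field is not numeric.
std::vector<double> parse_delimited_line(const std::string& line, char delimiter = ',')
{
    // a delimiter that can appear inside a number would corrupt the fields
    assert(std::string("0123456789.+-eE").find(delimiter) == std::string::npos);

    std::vector<double> values;
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, delimiter))
        values.push_back(std::stod(field));
    return values;
}
```

Header lines are not numeric, so they would need to be read separately before passing the remaining lines to a helper like this.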

Misc.

There are a few functions that don’t really fit under the other headings above. These functions provide some meta-information about the dataset and the DataTable.

bool has_response();
int nrows() { return _rows; }
int ncols() { return _cols; }
int* shape() { return new int[2] { _rows, _cols }; }
int response_column_number() { return _response_column; }

The has_response() function returns true if a response column is set and false otherwise. Some functionality in the DataTable is inaccessible if there is no response column specified. nrows() and ncols() return the number of rows and columns in the dataset, respectively. This same information, number of rows and columns, is also returned as an array via the shape() function. Presently, the response_column_number() function is the only way to determine which column is the response column.
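
One detail worth noting from the declarations above: shape() allocates a new two-element array on every call, so the caller is responsible for releasing it with delete[]. The sketch below shows one way to make that automatic. shape_as_pair is a hypothetical wrapper, not part of the class.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical helper: take ownership of the two-element array returned by
// shape(), copy the values out, and release the array automatically.
std::pair<int, int> shape_as_pair(int* raw_shape)
{
    assert(raw_shape != nullptr);
    std::unique_ptr<int[]> owned(raw_shape);   // delete[] runs when owned goes out of scope
    return std::make_pair(owned[0], owned[1]); // (rows, columns)
}
```

A call such as shape_as_pair(dt.shape()) would then yield the row and column counts without leaving an allocation behind. When only the numbers are needed, nrows() and ncols() avoid the allocation altogether.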

Installation and Usage

As mentioned above, CMake is used for the project’s build process. I’ve only tested the build and installation on Debian-based Linux distributions, but the project should build on other systems as well. To build and install the project, download the source code from the project’s GitHub repository. Once you have the source, navigate to its root directory, then create and enter a build directory. Then, using CMake, generate the appropriate files for your system. The instructions below should work for many Linux distributions. Note that these instructions assume your current working directory is the source code’s root directory.

mkdir build
cd build
cmake ..
make && sudo make install

With the project installed there is just one include needed to use the DataTable class.

#include <DataTable/DataTable.hpp>

The DataTable project lives in the datatable namespace. Below a DataTable is instantiated in two different ways, depending on your preference and the other libraries you might be using.

#include <DataTable/DataTable.hpp>
...
datatable::DataTable dt(<params>);

#include <DataTable/DataTable.hpp>
using namespace datatable;
...
DataTable dt(<params>);

Examples

Below are a few examples of using DataTables in C++.

Reading and Writing Data

  
#include <DataTable/DataTable.hpp>

#include <iostream>
using namespace std;

int main()
{
	datatable::DataTable dt("x_to_x_squared.csv", "x2", true);

	cout << dt << endl;
	dt.print_headers(cout);
	dt.print_column(cout, 0);
	dt.print_row(cout, 10);
	cout << dt.nrows() << ", " << dt.ncols() << endl;
	int* shape = dt.shape();
	cout << shape[0] << ", " << shape[1] << endl;
	delete[] shape;	// shape() allocates a new array on every call
	dt.print_shape(cout);

	dt.to_file("same_but_dots.csv", '*');
}

The code snippet above demonstrates some basic functionality of DataTables. It loads a CSV file with two columns, x and x2, where x2 is x multiplied by itself, into a DataTable. A few rows are shown below.

x,x2
0.0,0.0
0.10101010101010101,0.010203040506070809
0.20202020202020202,0.040812162024283234
0.30303030303030304,0.09182736455463729
0.40404040404040403,0.16324864809713294
0.5050505050505051,0.25507601265177027
0.6060606060606061,0.36730945821854916
0.7070707070707071,0.49994898479746963
0.8080808080808081,0.6529945923885317
0.9090909090909091,0.8264462809917354
1.0101010101010102,1.020304050607081
1.1111111111111112,1.234567901234568
1.2121212121212122,1.4692378328741966
1.3131313131313131,1.7243138455259668
1.4141414141414141,1.9997959391898785

After reading this dataset, some metadata is displayed, such as the dataset’s shape, the dataset’s headers, and a column of the dataset. Afterwards, the same data is written to a file, now delimited by asterisks (*):

x*x2
0.10101*0.010203
0.20202*0.0408122
0.30303*0.0918274
0.40404*0.163249
0.505051*0.255076
0.606061*0.367309
0.707071*0.499949
0.808081*0.652995
0.909091*0.826446
1.0101*1.0203
1.11111*1.23457
1.21212*1.46924
1.31313*1.72431
1.41414*1.9998

Splitting Into Train/Test Datasets

One of the most common things to do when training machine learning models is to split the dataset into test and train or test, train, and validation datasets. This can easily be done with DataTables as demonstrated below.

#include <DataTable/DataTable.hpp>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // load the iris dataset (response = 'class')
    datatable::DataTable table("iris.data", "class");

    // randomize the rows of the DataTable
    table.shuffle_rows();

    // use ~80% of the data for training, the rest for testing
    int train_size = table.nrows() * 0.8;
    int test_size = table.nrows() - train_size;

    // grab the top 'train_size' rows for the training DataTable
    datatable::DataTable train = table.select_row_range(0, train_size);
    // similarly: datatable::DataTable train = table.top_n_rows(train_size);

    // select bottom (nrows-train_size) rows
    datatable::DataTable test = table.select_row_range(train_size, table.nrows());
    // similarly: datatable::DataTable test = table.bottom_n_rows(test_size);

    // print new table shapes to verify the split was done correctly
    train.print_shape(cout);
    test.print_shape(cout);
    table.print_shape(cout);

    // after the data is split we typically need to 'break-out' the response column
    // an example of this with the train data is shown below
    datatable::DataTable trainx(train.get_headers(), train.response_column_number(), train.get_data(), train.nrows(), train.ncols());
    int response_column[1] = {trainx.response_column_number()};
    trainx.drop_columns(response_column, 1);  // a local array avoids leaking a new[] allocation
    // similar: string name[1] = {trainx.get_header_at(trainx.response_column_number())}; trainx.drop_columns(name, 1);

    datatable::DataTable trainy(train.get_headers(), train.response_column_number(), train.get_data(), train.nrows(), train.ncols());
    trainy.drop_columns(train.get_explanatory_headers(), (train.ncols() - 1));
}

Iterating a DataTable

Unlike many data types in the STL, the DataTable has no built-in iteration support (no iterators). For the time being, a more primitive approach must be used to iterate over the rows or columns of a DataTable. In future releases, I plan to look into better ways of iterating over the dataset and working more conveniently with its rows and columns.

#include <DataTable/DataTable.hpp>
#include <iostream>

using namespace std;
using namespace datatable;

int main()
{
    DataTable dt("x_to_x_squared.csv", "x2", true);
    // iterate columns
    for(int i = 0; i < dt.ncols(); i++)
    {
        double* col = dt.get_column(i);
        // ...
    }

    // iterate rows 
    for(int i = 0; i < dt.nrows(); i++)
    {
        double* row = dt.get_row(i);
        // ...
    }

    // iterate rows and columns 
    for(int i = 0; i < dt.nrows(); i++)
    {
        double* row = dt.get_row(i);
        for(int j = 0; j < dt.ncols(); j++) 
        {
            // ... row[j] ...
        }
    }
}
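
Until the class gains iterators, a thin wrapper can make a raw row usable in a range-based for loop. The sketch below is one possible approach and not part of the project: RowView is a hypothetical, non-owning view over a pointer and a length.

```cpp
#include <cassert>

// Hypothetical, non-owning view over one row so that it can be used in a
// range-based for loop. It never frees the pointer it is given.
class RowView
{
public:
    RowView(double* data, int length) : _data(data), _length(length)
    {
        assert(data != nullptr && length >= 0);
    }
    double* begin() const { return _data; }
    double* end() const { return _data + _length; }
    int size() const { return _length; }

private:
    double* _data;
    int _length;
};

// Example use: sum every value in a row.
double row_sum(const RowView& row)
{
    double total = 0.0;
    for (double value : row)
        total += value;
    return total;
}
```

With a loaded table, a row could be wrapped as RowView(dt.get_row(i), dt.ncols()). Plain pointers already satisfy the iterator requirements of a range-based for loop, which is why no custom iterator type is needed here.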

Conclusion and Future Work

The DataTable object discussed above is the initial version of this project. The functionality described provides just enough utility to make these objects valuable in some C++ machine learning and data science applications. In the future, I plan on extending these objects in two ways: 1) using the DataTable data structure in various C++ projects and adding functionality that is useful to those projects (and likely useful to many others) and 2) implementing commonly used functionality from Pandas and R data frames. As a first step, in the coming weeks, I plan to implement the ability to extend a DataTable by appending rows and columns to the dataset, as part of a time-series forecasting project I am working on. Another immediately useful feature, which would make the train/test split example shorter, is an additional constructor that creates a DataTable from another DataTable. The Issues on the GitHub repository lay out some of the other things I am currently working on in this project, as well as contributions that anyone interested is welcome to make.
