# Using GPGPU Computing to Generate Normalized Latin Squares

## Introduction

Many problems in combinatorics pertaining to mutually orthogonal Latin squares (MOLS) require the efficient generation of (at least) a subset of all normalized Latin squares (NLS). This is because the search space for MOLS can be restricted by requiring the first square in the set of MOLS to have 1, 2, …, n as both its first row and first column, where n is the order of the Latin square. Moreover, the subsequent squares can be restricted to those whose first row is also ordered 1, 2, …, n. Therefore, having all normalized Latin squares of order n allows a researcher to quickly generate all possible squares that need to be checked for orthogonality.
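For instance, order 3 has exactly one normalized (reduced) Latin square, with the first row and first column both in natural order:

```
1 2 3
2 3 1
3 1 2
```

Every other order-3 Latin square can be reached from this one by permuting its rows, columns, or symbols.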

Most algorithms to date have been implemented and executed on CPUs (Central Processing Units). With the recent surge in fields such as artificial intelligence and computer graphics, there has been a push to move as much computation as possible to the GPU (Graphics Processing Unit), mostly due to its large number of cores. For example, the CPU in the machine I’m writing this post on has 8 physical and 16 logical cores, while the GPU offers 4,352 CUDA compute cores. Although these cores are not as general-purpose as those on the CPU, many computational tasks can be offloaded to the GPU to increase performance.

In this article, I show how to use General-Purpose GPU (GPGPU) computing, which is performing computations on the GPU that are typically done on a CPU, to generate every NLS of any order, provided the isotopy class representatives. If you aren’t familiar with Latin squares, don’t worry: there is still knowledge of GPGPU computing to be gained from reading this post. If you are interested in Latin squares, the Wikipedia page is an excellent place to start learning.

## The Algorithm

The algorithm I used to generate the NLS actually generates all Latin squares (LS) of a particular order, provided the isotopy class representatives. This raises some questions. First and foremost: if I’m generating all of the squares, why not just keep all of them? The answer is twofold. First, there are far fewer NLS than LS of any order, so keeping only the NLS is more memory efficient while the process is running and more storage efficient when the results are written to a file. Second, my primary goal is to work with MOLS. Thus, as mentioned above, I really only care about the NLS. The other squares in my ‘orthogonality-checking universe’ can easily be found by permuting the rows below the first row of each NLS.

The LS can be generated via isotopy class equivalence by permuting the rows, columns, and symbols of the isotopy class representative, and of each square generated along the way. Permuting rows means swapping rows of the square (e.g. row 1 becomes row 2 and row 2 becomes row 1); permuting columns swaps columns in the same way; and permuting symbols relabels the entries (e.g. all 2’s in a square become 3’s and all 3’s become 2’s). Each of these is applied for each of the n! permutations of the symbols (where, again, n is the order of the LS).

Performing these permutations on the isotopy class representatives creates a new batch of squares. These squares are normalized, then filtered to remove duplicates and squares that have already been ‘checked’ (i.e. squares that have had the permutations applied to them). The new ‘unchecked’ squares are then permuted and filtered, creating yet another set of squares. This process repeats until no new squares are generated.

## Implementation

The algorithm was implemented on an RTX 2080 Ti GPU using Nvidia’s CUDA API.

### Issues

A couple of issues arose when implementing the GPGPU Latin square generator. First and foremost, everything computed on the GPU has to be fairly “old-fashioned”: it is much easier to use primitive objects (arrays) than classes (even STL classes), and at times the latter are impossible to use at all. Moreover, much of the computation has to be done on 1D arrays, which requires flattening 2D arrays or vectors and passing in extra bookkeeping information so the 2D objects (the squares) can be recovered from them. Because of this, the Latin square class I implemented to run these algorithms on a CPU does all of its processing on 1D arrays, so the transition was fairly painless.

The second issue was batching. The GPU used for this experiment has 4,352 CUDA cores, and for some orders of Latin squares (primarily orders 6 and up) the number of squares being checked becomes much larger than this. Additionally, storing all of the generated squares before the duplicates can be filtered takes up a lot of RAM. Therefore, batching was implemented so the processing is chunked into a user-specified number of squares at a time. This batch size is used until all of the squares of a given order have been checked.

### Device Code

There are four functions run on the CUDA-enabled device that implement the bulk of the square generation. The rest of the logic, executed on the CPU, primarily deals with batching, filtering, and translating between more complex objects and arrays.

Three of them are device functions, prefaced with the `__device__` qualifier:

#### Permute Rows

```cpp
__device__ void permute_rows(short* new_rows, short* values, short* newSquares,
                             int order, int rowOffset, int myOffset)
{
    for(short i = 0; i < order; i++)
    {
        for(short j = 0; j < order; j++)
        {
            newSquares[(i * order) + myOffset + rowOffset + j] = values[new_rows[i] * order + j];
        }
    }
}
```

As its name suggests, this function takes in a permutation (`new_rows`), the current square’s values (`values`), the array that holds the new squares being generated (`newSquares`), and some offsets that locate where the values should be written. The result is a new entry in the `newSquares` array: the row permutation of the current square under the current permutation vector.

#### Permute Columns

```cpp
__device__ void permute_cols(short* new_cols, short* values, short* newSquares,
                             int order, int colOffset, int myOffset)
{
    for(short i = 0; i < order; i++)
    {
        for(short j = 0; j < order; j++)
        {
            newSquares[(i + myOffset + colOffset) + (j * order)] = values[j * order + new_cols[i]];
        }
    }
}
```

In keeping with the `permute_rows` function, this function does essentially the same thing, except that it produces a new Latin square by applying the current permutation to the columns.

#### Permute Symbols

```cpp
__device__ void permute_symbols(short* syms, short* values, short* newSquares,
                                int order, int symOffset, int myOffset)
{
    short osq = order*order;
    for(short i = 0; i < osq; i++)
    {
        newSquares[i + myOffset + symOffset] = syms[values[i]];
    }
}
```

The `permute_symbols` function does just that: given a permutation vector of size n, it creates a new Latin square in the `newSquares` array by relabeling the symbols of the current square according to the current permutation.

#### Generate Squares

The next function is prefaced with the `__global__` qualifier, which makes it callable from the host (typically the CPU). This is the entry point for the GPU logic and handles much of the administrative processing behind generating the new Latin squares.

```cpp
__global__ void generate_squares(short* squareList, int order, short* newSquares,
                                 short* perms, int batchSize, int totalPerms)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < batchSize)
    {
        int osq = order*order;
        int myOffset = idx * totalPerms * osq * 3; // where in the new square list is this thread's data?
        int squareOffset = idx * osq; // where in the squares list is this thread's data?

        // where after the offset do we start storing the data in the new square list
        int rowOffset = 0;
        int colOffset = osq;
        int symOffset = 2*(osq);
        short* my_square = (short*)malloc(sizeof(short) * osq);
        short* perm = (short*)malloc(sizeof(short) * order);

        for(int i = 0; i < osq; i++)
        {
            my_square[i] = squareList[i + squareOffset];
        }

        for(int pCount = 0; pCount < totalPerms; pCount++)
        {
            for(int i = 0; i < order; i++)
            {
                perm[i] = perms[(pCount*order) + i];
            }
            permute_cols(perm, my_square, newSquares, order, colOffset, myOffset);
            permute_rows(perm, my_square, newSquares, order, rowOffset, myOffset);
            permute_symbols(perm, my_square, newSquares, order, symOffset, myOffset);
            myOffset += (osq*3);
        }
        free(my_square);    // !!!! ALWAYS FREE YOUR MEMORY !!!! (malloc pairs with free, not delete[])
        free(perm);
    }
}
```

The `generate_squares(…)` function first retrieves the thread’s square from the array containing all of the squares. Then the other three GPU functions are called for every permutation, creating three new LS from the square’s rows, columns, and symbols. The new squares are stored in a separate array that is 3 × n! times the size of the original square array. This is done for each square passed to the function.

### Host Functionality

There are a few other functions included in the full source code below. These functions are largely supplementary to the GPU logic which actually does the square processing. The operations performed by the host are: handling GPU memory and starting the GPU process, copying the squares from arrays to vectors to utilize some STL functionality, creating batches of squares to be processed, and filtering out non-unique squares.

### Future Work

The algorithm has been tested on orders 3 through 6 with good results. Going forward, since the number of Latin squares grows combinatorially, I plan to implement more efficient storage of the squares as well as file I/O to persist them while the algorithm is running. This should eliminate much of the memory overhead, as holding too many Latin squares in memory can eat up a lot of RAM. It should also be fairly easy: I am only storing small integer values, which can be stored compactly in binary files and should also offer improved read and write times.

## Full Code

```cpp
#include "../GenerateSquares.h"

#include <stdio.h>

using namespace std;

__device__ void permute_rows(short* new_rows, short* values, short* newSquares,
                             int order, int rowOffset, int myOffset)
{
    for(short i = 0; i < order; i++)
    {
        for(short j = 0; j < order; j++)
        {
            newSquares[(i * order) + myOffset + rowOffset + j] = values[new_rows[i] * order + j];
        }
    }
}

__device__ void permute_cols(short* new_cols, short* values, short* newSquares,
                             int order, int colOffset, int myOffset)
{
    for(short i = 0; i < order; i++)
    {
        for(short j = 0; j < order; j++)
        {
            newSquares[(i + myOffset + colOffset) + (j * order)] = values[j * order + new_cols[i]];
        }
    }
}

__device__ void permute_symbols(short* syms, short* values, short* newSquares,
                                int order, int symOffset, int myOffset)
{
    short osq = order*order;
    for(short i = 0; i < osq; i++)
    {
        newSquares[i + myOffset + symOffset] = syms[values[i]];
    }
}

__global__ void generate_squares(short* squareList, int order, short* newSquares,
                                 short* perms, int batchSize, int totalPerms)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < batchSize)
    {
        int osq = order*order;
        int myOffset = idx * totalPerms * osq * 3; // where in the new square list is this thread's data?
        int squareOffset = idx * osq; // where in the squares list is this thread's data?

        // where after the offset do we start storing the data in the new square list
        int rowOffset = 0;
        int colOffset = osq;
        int symOffset = 2*(osq);
        short* my_square = (short*)malloc(sizeof(short) * osq);
        short* perm = (short*)malloc(sizeof(short) * order);

        for(int i = 0; i < osq; i++)
        {
            my_square[i] = squareList[i + squareOffset];
        }

        for(int pCount = 0; pCount < totalPerms; pCount++)
        {
            for(int i = 0; i < order; i++)
            {
                perm[i] = perms[(pCount*order) + i];
            }
            permute_cols(perm, my_square, newSquares, order, colOffset, myOffset);
            permute_rows(perm, my_square, newSquares, order, rowOffset, myOffset);
            permute_symbols(perm, my_square, newSquares, order, symOffset, myOffset);
            myOffset += (osq*3);
        }
        free(my_square);    // !!!! ALWAYS FREE YOUR MEMORY !!!! (malloc pairs with free, not delete[])
        free(perm);
    }
}

void run_on_gpu(short* squaresToRun, int order, short* newSquares, short* perm,
                int squareArraySize, int permArraySize, int newSquareArraySize,
                int squaresToCheck, int totalPerms,
                short* dev_squares, short* dev_perm, short* dev_new_squares)
{
    cudaMemcpy(dev_squares, squaresToRun, squareArraySize, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_perm, perm, permArraySize, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_new_squares, newSquares, newSquareArraySize, cudaMemcpyHostToDevice);

    // one thread per square; round the block count up so the whole batch is covered
    int nThreads = 256; // keep this a multiple of the 32-thread warp size
    int nBlocks = (squaresToCheck + nThreads - 1) / nThreads;
    generate_squares<<<nBlocks, nThreads>>>(dev_squares, order, dev_new_squares, dev_perm, squaresToCheck, totalPerms);

    cudaMemcpy(newSquares, dev_new_squares, newSquareArraySize, cudaMemcpyDeviceToHost);
}

void copy_to_vectors(short* newSquares, vector<LatinSquare> &checkSqs,
                     vector<LatinSquare> &appendToSquares, int numberSquares, int order,
                     bool updateCheckSquares, int totalPerms)
{
    int osq = order*order;
    long iterRange = (long)numberSquares * totalPerms * 3 * osq;
    for(long i = 0; i < iterRange; i += osq)    // one square per osq-sized block
    {
        short* values = (short*)malloc(sizeof(short) * osq);
        for(int j = 0; j < osq; j++)
        {
            values[j] = newSquares[i + j];
        }
        LatinSquare sq(order, values);
        appendToSquares.push_back(sq);
        if(updateCheckSquares)
        {
            checkSqs.push_back(sq);
        }
    }
}

int main(int argc, char* argv[])
{
    // timers
    clock_t start, end;
    start = clock();

    // TODO: make some things globals (e.g. order, osq) to stop passing them around
    if (argc < 3)
    {
        print_usage();
        return 0;
    }

    short order = stoi(string(argv[1]));
    short osq = order*order;
    string filename_iso = string(argv[2]);
    string filename_n = to_string(order) + "_perm.dat";
    bool cont = true;

    if (!file_exists(filename_n))
    {
        cout << filename_n << " does not exist. Please use the utilities to generate the file." << endl;
        cont = false;
    }
    if(!file_exists(filename_iso))
    {
        cout << filename_iso << " does not exist." << endl;
        cont = false;
    }

    if (!cont)
        return 0;

    end = clock();
    double paramTime = double(end-start) / double(CLOCKS_PER_SEC);
    cout << "CUDA Param Time Taken: " << paramTime << " seconds" << endl;

    start = clock();

    ifstream isofile; isofile.open(filename_iso);
    ofstream sqfile; sqfile.open(to_string(order) + "_squares.dat");

    string line;
    vector<LatinSquare> allSqs;
    vector<LatinSquare> checkSqs;    // squares to permute; do not re-permute every square each pass
    while(getline(isofile, line))
    {
        LatinSquare isoSq(order, get_array_from_line(line, osq));
        allSqs.push_back(isoSq);
        checkSqs.push_back(isoSq);
    }
    isofile.close();

    long totalPerms = my_factorial(order);
    short* perms = (short*)malloc(sizeof(short) * totalPerms * order);
    vector<short*> permVec;
    ifstream permfile; permfile.open(filename_n);
    string permline;
    int count = 0;
    while(getline(permfile, permline))
    {
        short* permArr = get_array_from_line(permline, order);
        permVec.push_back(permArr);
        for(int i = 0; i < order; i++)
        {
            perms[(count*order) + i] = permArr[i];
        }
        count++;
    }
    permfile.close();

    end = clock();
    double fileTime = double(end-start) / double(CLOCKS_PER_SEC);
    cout << "CUDA File Read Time Taken: " << fileTime << " seconds" << endl;

    // batch size; keep it divisible by nThreads (which should be a multiple of 32)
    int maxBatchSize = 2048;
    long unsigned int numSqs;
    int permArraySize = order * sizeof(short) * totalPerms;
    int lastSquaresToCheck = 0;
    short* dev_squares; short* dev_perm; short* dev_new_squares;
    do {
        numSqs = allSqs.size();
        vector<LatinSquare> newSqVec;

        // TODO: add a permutation batch
        int checkedSquares = 0;

        while(checkedSquares < (int)checkSqs.size())
        {
            if(checkedSquares % (maxBatchSize * 3) == 0 && checkedSquares > 0)
            {
                printf("Checked %d out of %ld squares\n", checkedSquares, (long)checkSqs.size());
            }
            // only process up to maxBatchSize squares at a time to conserve RAM
            int squaresToCheck = ((int)checkSqs.size() - checkedSquares) > maxBatchSize
                ? maxBatchSize : ((int)checkSqs.size() - checkedSquares);
            int squareArraySize = squaresToCheck * osq * sizeof(short);
            int newSquareArraySize = squareArraySize * totalPerms * 3;

            short* squares = (short*)malloc(squareArraySize);
            short* newSquares = (short*)malloc(newSquareArraySize);

            for(int i = 0; i < squaresToCheck; i++)    // each square in the batch
            {
                // start past the squares that have already been checked
                short* values = checkSqs.at(checkedSquares + i).get_values();
                for(int j = 0; j < osq; j++)
                {
                    squares[(i*osq) + j] = values[j];
                }
            }

            start = clock();
            // only reallocate device memory when the batch size changes
            if(lastSquaresToCheck != squaresToCheck)
            {
                if(lastSquaresToCheck > 0)
                {
                    cudaFree(dev_squares);
                    cudaFree(dev_perm);
                    cudaFree(dev_new_squares);
                }
                cudaMalloc((void**)&dev_squares, squareArraySize);
                cudaMalloc((void**)&dev_perm, permArraySize);
                cudaMalloc((void**)&dev_new_squares, newSquareArraySize);
            }
            end = clock();
            double cudaMallocTime = double(end-start) / double(CLOCKS_PER_SEC);
            cout << "CUDA Malloc Time Taken: " << cudaMallocTime << " seconds" << endl;

            start = clock();
            run_on_gpu(squares, order, newSquares, perms, squareArraySize,
                       permArraySize, newSquareArraySize, squaresToCheck, totalPerms,
                       dev_squares, dev_perm, dev_new_squares);
            end = clock();
            double gpuRunTime = double(end-start) / double(CLOCKS_PER_SEC);
            cout << "CUDA GPU Run Time Taken: " << gpuRunTime << " seconds" << endl;

            // store into newSqVec so that only new, unique squares are
            // added to the checkSqs vector after filtering
            start = clock();
            copy_to_vectors(newSquares, checkSqs, newSqVec, squaresToCheck, order, false, totalPerms);
            end = clock();
            double copyTime = double(end-start) / double(CLOCKS_PER_SEC);
            cout << "CUDA Copy Time Taken: " << copyTime << " seconds" << endl;

            checkedSquares += squaresToCheck;
            lastSquaresToCheck = squaresToCheck;

            free(squares);      // malloc'd host buffers pair with free
            free(newSquares);
        }

        checkSqs.clear(); // refill below with only the squares that still need checking
        for(auto it = newSqVec.begin(); it != newSqVec.end(); it++)
        {
            // normalize each new square and keep it only if it has not been seen
            // before; unseen squares go into both allSqs and checkSqs
            // (the filtering helpers live in GenerateSquares.h)
        }

        cout << "Start Count: " << numSqs << ", End Count: " << allSqs.size() << endl;
    } while(numSqs < allSqs.size());

    end = clock();
    double timeTaken = double(end-start) / double(CLOCKS_PER_SEC);
    cout << "CUDA Time Taken: " << timeTaken << " seconds" << endl;

    for(auto it = allSqs.begin(); it != allSqs.end(); it++)
    {
        sqfile << (*it) << endl << endl;
    }

    sqfile.close();
    return 0;
}
```