The problem we address here originated at a transportation company. When analyzing the results of its shipments, the company wanted to break them down by category: for example, how many packages contain clothing, how many contain laptops, and whether different contents lead to different shipment outcomes. Instead of labelling all these packages manually, we looked for a more efficient solution. This blog post walks through the steps we took to solve the problem with machine learning, steps you can follow in your own machine learning projects.
Our starting point was an unlabeled data set. It contained a description of each item, along with further information about the shipment. We wanted to categorize each row based on its package description, and the resulting categories had to match those of an earlier, already labelled data set.
Since we already had a labelled data set, we turned to machine learning to solve this challenge. In short, we trained a model on the labelled data set and used it to label the unlabeled one.
The first step when dealing with textual data is usually pre-processing. In this case we chose to apply stemming and to remove punctuation and stop words. We did this in Python.
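To make the three pre-processing steps concrete, here is a minimal sketch using only the Python standard library. It is not the code we used: the stop-word list is a tiny made-up sample, and `toy_stem` is a deliberately crude stand-in for a real stemmer such as the Porter stemmer.

```python
import string

# Tiny illustrative stop-word list (assumption); a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(description):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    no_punct = description.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    tokens = [t for t in no_punct.split() if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


print(preprocess("The shoes, and a laptop!"))
```

Wrapping the steps in a single `preprocess` function makes it easy to run exactly the same steps on every data set.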
Once this was done, we loaded the data into R and started by dividing it into a training set and a test set. We used 80% of the data for training and 20% for testing. This is a common split, but other ratios such as 90%-10% are also used.
When dividing the data, we used stratified sampling. This means we took the categories of the data into account. So instead of just taking 80% of all data, we used 80% of each category for the training set and the remaining 20% of each category for the test set. This is important because it ensures that every category is represented in the same proportion in the training data and the test data.
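The idea behind a stratified 80/20 split can be sketched in a few lines of Python. We did this step in R, so this is only an illustration: the `stratified_split` helper and the example rows are made up.

```python
import random
from collections import defaultdict


def stratified_split(rows, train_fraction=0.8, seed=42):
    """Split (text, category) rows so each category keeps the same ratio."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[1]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        # Cut each category separately instead of cutting the whole data set.
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Made-up example data: 10 clothing rows and 5 electronics rows.
rows = [(f"shirt {i}", "clothing") for i in range(10)]
rows += [(f"laptop {i}", "electronics") for i in range(5)]

train, test = stratified_split(rows)
print(len(train), len(test))
```

With these example rows, the training set receives 8 clothing and 4 electronics rows, so both sets keep the original 2:1 ratio between the categories.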
Building the right model
We then started using the R package RTextTools to test different machine learning algorithms on the data. The package simplifies turning your data into the right format and feeding it to the machine learning algorithms. It also helps with dividing the data and with creating analytics that describe the performance of the algorithms.
The algorithms we tested are among those available in the package:
- GLMNET: generalized linear model via penalized maximum likelihood
- MAXENT: maximum entropy classifier
- SLDA: scaled linear discriminant analysis
- SVM: support vector machine
In this step we ran into memory problems. Some of the machine learning techniques needed more memory than my laptop had to offer. To address this, I ran some of the experiments on the high-performance computing infrastructure provided by the Flemish Supercomputer Centre (VSC), which let us test algorithms with higher memory requirements.
We then compared the results of each of the different algorithms to choose the most appropriate one. The statistics about the different algorithms can be seen in the following table:
We ended up choosing MAXENT, since it scored very well and its memory requirements were low enough that I could run it locally. This made testing a lot easier.
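Comparing algorithms comes down to scoring each one's predictions on the test set. The sketch below shows one common way to do that, with precision, recall and F-score per category. It assumes these are the kind of scores being compared, and the labels and predictions are made up, not our actual results.

```python
def category_scores(true_labels, predicted_labels):
    """Return {category: (precision, recall, f_score)} for each true category."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for category in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == category and p == category)
        fp = sum(1 for t, p in pairs if t != category and p == category)
        fn = sum(1 for t, p in pairs if t == category and p != category)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[category] = (precision, recall, f_score)
    return scores


# Made-up test-set labels and predictions, for illustration only.
true_labels = ["clothing", "clothing", "laptop", "laptop"]
predicted = ["clothing", "laptop", "laptop", "laptop"]

for category, (p, r, f) in category_scores(true_labels, predicted).items():
    print(f"{category}: precision={p:.2f} recall={r:.2f} f_score={f:.2f}")
```

Running the same function on the predictions of each algorithm gives numbers that can be compared side by side.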
Using the model on the unlabeled data
Once a model has been chosen, we can unleash it on the unlabeled data set. When applying the model to new data, we need to take the original vocabulary into account: we limit the terms (words) to those that occur in the original data set. This is important because those are the only terms the model recognizes. Any terms that appear only in the new data set have no meaning for the model.
Another thing to note is that both data sets must be pre-processed in the same way. For example, if you use stemming on your original data set, your model will learn to map “shoe” onto the shoes category. If you have not applied the same pre-processing steps to your new data, it will contain the term “shoes”, which the model does not recognize.
The output of applying the model to our data is labelled data, with a confidence value for each label. We can use this value to set a threshold. For example, we could map anything with a confidence under 0.5 to “unlabeled” or another category, so that our analysis of the categories is not skewed by mislabeled packages.
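Restricting new data to the original vocabulary can be illustrated with a simple term-count representation. This is a sketch of the idea, not the RTextTools implementation, and the function names and documents are made up.

```python
from collections import Counter


def build_vocabulary(training_docs):
    """Collect every term seen in the (pre-processed) training documents."""
    return sorted({term for doc in training_docs for term in doc})


def to_term_counts(doc, vocabulary):
    """Count only the terms the model knows; unseen terms are dropped."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


# Made-up, already tokenized documents.
training_docs = [["shoe", "leather"], ["laptop", "charger"]]
new_doc = ["shoe", "shoe", "sneaker"]  # "sneaker" is not in the vocabulary

vocabulary = build_vocabulary(training_docs)
print(vocabulary)
print(to_term_counts(new_doc, vocabulary))
```

Because the count vector has one position per known term, "sneaker" has nowhere to go and is ignored, while "shoe" is counted twice.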
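A confidence threshold like this takes only a few lines. In the sketch below, the `apply_threshold` helper, the fallback label and the example predictions are assumptions made for illustration.

```python
def apply_threshold(predictions, threshold=0.5, fallback="unlabeled"):
    """Replace labels whose confidence is below the threshold."""
    return [
        label if confidence >= threshold else fallback
        for label, confidence in predictions
    ]


# Made-up (label, confidence) pairs, as a model might return them.
predictions = [("clothing", 0.92), ("laptop", 0.41), ("clothing", 0.50)]
print(apply_threshold(predictions))
```

Raising the threshold trades coverage for reliability: fewer packages keep a label, but the labels that remain are the ones the model is more confident about.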
To summarize the work we show the input:
And the resulting output:
To wrap things up, here is an overview of the steps to take when applying machine learning to a similar problem. The specifics of this problem are that we work with textual data and that we train the model on a labelled data set in order to apply it to an unlabeled one.
- Pre-process the textual data. Make sure to apply the same techniques to both data sets.
- Test different machine learning algorithms to find the most effective one.
- Train a model using the chosen algorithm.
- Apply this model to the unlabeled data.