Using Machine Learning and NodeJS to detect the gender of Instagram Users

September 25, 2014 Reading time: 13 minutes

The goal of this article is to provide a very practical guide to deploying a machine learning solution at scale. Not everything is proven right or optimal, and as with any real-life deployment, we made some trade-offs and took some shortcuts on the go without necessarily building all the evidence that would have been required in an academic setting. We apologize for that, and we will try to clearly point out throughout the post the places where we did so and hope that it will be helpful to you nonetheless.

Let's start with a little bit of context: TOTEMS Analytics provides analytics on Instagram (audiences and communities around hashtags). Over the past year, we noticed an ever-increasing need among our clients for demographic information about their Instagram audience (Instagram does not disclose demographic information about its users through its API). This led us to decide, six months ago, to invest time in building a gender classifier based on social signals we could find on the platform. We invested two man-months of work in this project. After trying basic methods such as name extraction combined with census data (which yielded a 0.65 success rate, barely above the 0.5 success rate of a random classifier), we came up with a rather simple neural-network approach that now enables us to provide our clients with unique information that they can't find anywhere else. So we figured we'd share how we did it, so that you too can leverage simple machine learning techniques to enhance and differentiate your customers' and users' experience.

To give you a more explicit idea of what we came up with, we've embedded the resulting classifier in a simple demo available right from this post. The classifier uses the hashtags found in a user's most recent posts. Try it out now! It should work with most Instagram users (provided they're public and relatively active):

The constraints we had

Our platform retrieves or refreshes around 400 user profiles per second (this is managed using 4 high-bandwidth servers co-located with Instagram's API servers on AWS). These profiles are stored in a sharded MySQL table and used to compute aggregated information about audiences (follower/followee relationships) or communities (contributors to a particular hashtag). This context led us to set the following prerequisites for the gender classifier we wanted to build:

  • Real-time classification: We already knew the load we would have, and it was important that the classifier not drastically increase the number of machines needed to process these profiles.
  • >0.9 success rate: This was our initial target. That number is quite arbitrary, but we figured the aggregated data computed based on a classifier with this kind of success rate would be good enough for all of our clients, and a good first milestone.

The training set

This is probably the most crucial part of the overall exercise. If you're building a classifier, it is likely because you lack the data you want to infer in the first place. Yet you'll need a fairly large, properly labeled data set to use as a training set. In our case, we needed to find the gender of a large population of Instagram accounts. We knew that this gender information was readily available on other platforms, in particular Facebook[1], where most users provide their gender publicly. Focusing on Instagram profiles that included a link to a Facebook URL (of which there are many) was the way forward.

We built a few simple regular expressions to rule out anything that was not a proper Facebook profile URL and went over all the user ids we ever came across… and went even a little bit further by looking at their followers when relevant. Upon inspection of several hundred million Instagram users (most of the user base at that time), we were able to extract the Instagram username to Facebook profile relationships whenever relevant. The next step was to retrieve gender information on Facebook when available. With 570k profiles, we had our training set.
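The filtering step can be sketched as follows. The patterns are illustrative assumptions, not the exact expressions used in production: the sketch accepts `facebook.com/<username>` and `facebook.com/profile.php?id=<digits>`, and rejects a few well-known non-profile paths. `extractFacebookProfile`, `NON_PROFILE_PATHS` and the 5-character minimum for usernames are hypothetical choices made for this example.

```javascript
// Hypothetical list of first path segments that never denote a personal profile.
const NON_PROFILE_PATHS = new Set([
  'pages', 'groups', 'events', 'photo.php', 'sharer.php', 'media', 'hashtag'
]);

// Matches (http(s)://)(www.|m.)facebook.com/<rest>; the scheme is optional.
const FACEBOOK_URL = /^(?:https?:\/\/)?(?:www\.|m\.)?facebook\.com\/([^\s]*)$/i;

// Returns a normalized profile identifier, or null when the URL does not
// look like a personal profile.
function extractFacebookProfile(url) {
  const match = FACEBOOK_URL.exec(url.trim());
  if (!match) return null;
  const rest = match[1];

  // Numeric profiles: profile.php?id=123456
  const byId = /^profile\.php\?id=(\d+)/i.exec(rest);
  if (byId) return 'id:' + byId[1];

  // Vanity usernames: letters, digits and dots (assumed minimum of 5
  // characters), optionally followed by a slash or a query string.
  const byName = /^([a-z0-9.]{5,})\/?(?:\?.*)?$/i.exec(rest);
  if (!byName) return null;
  const name = byName[1].toLowerCase();
  return NON_PROFILE_PATHS.has(name) ? null : 'name:' + name;
}
```

Anything that returns `null` (pages, groups, URLs on other domains) would simply be skipped when building the training set.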

Extracting the proper input signals

Before building our neural network, we needed to get a better understanding of which input signal to use. Several approaches could have suited our goal here:

  • full words from a user's post captions and/or their bio and full name
  • n-grams from a user's post captions and/or their bio and full name

The main problem here is that the input signal space is vast (set of all n-grams, or set of all words), and it is hard to design coherent fixed-sized input vectors (expected by most classifiers) that work on all profiles.

An interesting solution to that problem is to rely on mutual information to order the input space[2], and select the top N n-grams or full-words that have the greatest mutual-information[3] with the gender random variable. Intuitively, the mutual information between two random variables measures how much knowing one of these variables reduces uncertainty about the other. Here's how we computed the mutual information in our case:
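For reference, the textbook definition of mutual information[3], applied to a binary feature indicator and the gender variable, reads as follows. The notation is an assumption of this sketch: $F \in \{0,1\}$ indicates whether a given feature (hashtag, word or n-gram) appears in a user's recent posts, $G$ is the user's gender, $N$ is the number of users in the training set and $N_{f,g}$ is the number of users with feature value $f$ and gender $g$.

```latex
I(F;G) = \sum_{f \in \{0,1\}} \; \sum_{g \in \{\mathrm{male},\,\mathrm{female}\}}
         P(f,g)\,\log_2 \frac{P(f,g)}{P(f)\,P(g)},
\qquad
P(f,g) \approx \frac{N_{f,g}}{N},\quad
P(f) = \sum_{g} P(f,g),\quad
P(g) = \sum_{f} P(f,g)
```

Terms with $P(f,g) = 0$ are taken to contribute zero. A feature that is independent of gender has $I(F;G) = 0$; the more the feature tells us about gender, the larger the value.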

We extracted the hashtags, full-words and n-grams used in the captions of recent posts from the users we had in our training set and ran the computation described above.

Out of curiosity, we ordered hashtags by the conditional probability of being classified as female given the use of a given hashtag. These are the hashtags most strongly associated with female users:

We also computed the hashtags most strongly associated with male users:

And finally, we generated the top 10k hashtags most mutually dependent with gender:

Using this list, we were then able to generate fixed-size binary input vectors of arbitrary size (1k, 2k, 4k, 10k), representing the presence (or absence) of these most mutually dependent hashtags in the captions of a user's recent posts. We repeated this operation with n-grams and full words, and deferred the evaluation of which input signal would be most effective until after we had built our neural network.
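A minimal sketch of that vectorization step, assuming the ranked feature list is already available as an array of strings. The function names, the ASCII-only hashtag pattern and the sample hashtags are made up for illustration.

```javascript
// Builds a lookup from feature (e.g. hashtag) to its position in the vector.
// `topFeatures` is assumed to be sorted by decreasing mutual information.
function buildFeatureIndex(topFeatures, size) {
  const index = new Map();
  topFeatures.slice(0, size).forEach((feature, position) => {
    index.set(feature, position);
  });
  return index;
}

// Extracts lowercase hashtags (without the leading '#') from a caption.
// ASCII-only for brevity; real captions would need Unicode-aware matching.
function extractHashtags(caption) {
  const matches = caption.toLowerCase().match(/#\w+/g) || [];
  return matches.map((tag) => tag.slice(1));
}

// Returns a fixed-size vector of 0/1 values: 1 when the feature at that
// position appears in at least one of the user's recent captions.
function toInputVector(captions, featureIndex) {
  const vector = new Array(featureIndex.size).fill(0);
  for (const caption of captions) {
    for (const tag of extractHashtags(caption)) {
      const position = featureIndex.get(tag);
      if (position !== undefined) vector[position] = 1;
    }
  }
  return vector;
}

// Made-up feature list and captions.
const featureIndex = buildFeatureIndex(['love', 'gym', 'nails', 'football'], 4);
const vector = toInputVector(
  ['Leg day #Gym #nofilter', 'Match night #football'],
  featureIndex
);
// vector is [0, 1, 0, 1]
```

Hashtags outside the selected list (`#nofilter` here) are ignored, which is what keeps the vector size fixed regardless of what a user posts.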

Intuitively, the conditional probability of being male or female could be seen as a more efficient marker than mutual information (these lists are certainly more expressive), but mutual information takes into account the probability of finding the feature (see how the top mutually dependent terms are more probable than the top conditional ones: their count is much higher). In other words, it's more useful for a classifier to have a less strongly associated but much more probable term than a very strongly associated but highly improbable one. The latter will simply be useless for most of the users to be classified.
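The trade-off can be made concrete with a small worked example. The counts below are invented for illustration (100,000 users, half of them female), and `mutualInformation` is a sketch based on the standard definition of mutual information for a 2x2 table of counts, not the code we ran.

```javascript
// Mutual information (in bits) between a binary feature and a binary label,
// computed from a 2x2 table of user counts.
function mutualInformation(counts) {
  const cells = [
    [counts.withFemale, counts.withMale],
    [counts.withoutFemale, counts.withoutMale]
  ];
  const total = cells[0][0] + cells[0][1] + cells[1][0] + cells[1][1];
  const rowSums = cells.map((row) => row[0] + row[1]);
  const colSums = [cells[0][0] + cells[1][0], cells[0][1] + cells[1][1]];

  let mi = 0;
  for (let f = 0; f < 2; f++) {
    for (let g = 0; g < 2; g++) {
      if (cells[f][g] === 0) continue; // 0 * log(0) is taken as 0
      const joint = cells[f][g] / total;
      const independent = (rowSums[f] / total) * (colSums[g] / total);
      mi += joint * Math.log2(joint / independent);
    }
  }
  return mi;
}

// P(female | feature present)
function femaleGivenFeature(counts) {
  return counts.withFemale / (counts.withFemale + counts.withMale);
}

// Made-up hashtag used by 50 users, 98% of them female.
const rareButStrong = {
  withFemale: 49, withMale: 1, withoutFemale: 49951, withoutMale: 49999
};
// Made-up hashtag used by 20,000 users, 70% of them female.
const commonButWeaker = {
  withFemale: 14000, withMale: 6000, withoutFemale: 36000, withoutMale: 44000
};

const ranking = [rareButStrong, commonButWeaker].map((counts) => ({
  conditional: femaleGivenFeature(counts),
  mutualInformation: mutualInformation(counts)
}));
```

With these numbers the rare hashtag wins on conditional probability (0.98 against 0.7), but its mutual information is roughly 0.0004 bits against roughly 0.03 bits for the common one, so the common hashtag ranks far higher in a list sorted by mutual information.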

Building a neural network using NodeJS

We implemented our own neural network based on popular references[4][5] using NodeJS. We started with small training sets and progressively increased their size until we hit the limitations of JavaScript's garbage collector (stopping the world too frequently), at which point we rewrote it in C++ as a NodeJS native add-on (available here: https://github.com/totemstech/neuraln).

A neural network is a graph composed of layers. The first layer is set to the value of the input vector that needs to be classified: in our case, the presence or absence of the top N most mutually dependent hashtags or n-grams in a user's recent posts. Each layer is linked to the next by weight values, and every node in a layer is linked to all the nodes in the next layer. There can be any number of layers between the input layer and the output layer; these are called inner layers. Finally, the output layer represents the output vector, whose dimension depends on the value that needs to be inferred.

A neural network is therefore defined by its layer structure, `layers_[l]` (the number of nodes in each layer), and by the weight values between consecutive layers, `W_[l][i][j]`. Omitting other members, this is exactly how our neural network is defined:
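As a rough illustration, assuming plain JavaScript arrays rather than the C++ members of the add-on, a structure of this shape could be built as follows. `createNetwork`, the `random` parameter and the [-0.5, 0.5) initialization range are assumptions of this sketch.

```javascript
// layers[l]  : number of nodes in layer l (layer 0 is the input layer)
// W[l][i][j] : weight of the link from node i in layer l to node j in layer l+1
function createNetwork(layers, random = Math.random) {
  const W = [];
  for (let l = 0; l < layers.length - 1; l++) {
    W.push([]);
    for (let i = 0; i < layers[l]; i++) {
      W[l].push([]);
      for (let j = 0; j < layers[l + 1]; j++) {
        // Small random initial weights in [-0.5, 0.5); the range is an assumption.
        W[l][i].push(random() - 0.5);
      }
    }
  }
  return { layers: layers.slice(), W };
}

// 4 input nodes, one inner layer of 3 nodes, 1 output node.
const network = createNetwork([4, 3, 1]);
```

A network with `n` layers has `n - 1` weight matrices, and fully connecting a 10k-wide input layer to even a small inner layer already means tens of thousands of weights.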

Additionally, each node within the network is defined by its activation function. This function defines what a node outputs to the next layer given the sum of the inputs it received from the previous layer. It applies to all layers except the input layer whose value is set by the input vector. In our implementation we use the following code for our nodes' activation function:
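A common choice that satisfies these requirements is the logistic sigmoid. The sketch below assumes that choice purely for illustration; it is not taken from the add-on's source.

```javascript
// Logistic sigmoid: maps any real input to the open interval (0, 1).
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// Derivative expressed in terms of the node's output y = sigmoid(x),
// which is the form backpropagation needs.
function sigmoidDerivativeFromOutput(y) {
  return y * (1 - y);
}
```

The sigmoid is smooth everywhere, and its derivative can be computed from the node's output alone, which avoids storing the summed input of each node during training.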

Now that we have the proper data structure to represent our neural network, we need to be able to train it. The network we described is a feed-forward network: values are propagated from the input vector layer to the output layer, along the weights, using at each node its activation function to feed the next layer. Training of such networks relies on backpropagation[6].
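The forward propagation just described can be sketched as follows, assuming a sigmoid activation and a network stored as `{ layers, W }` with `W[l][i][j]` linking node `i` of layer `l` to node `j` of layer `l + 1`. `feedForward` and the hand-picked weights are illustrative.

```javascript
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// Propagates `input` through the network and returns the values of every
// layer; the last entry is the output vector.
function feedForward(network, input) {
  const values = [input.slice()];
  for (let l = 0; l < network.layers.length - 1; l++) {
    const next = [];
    for (let j = 0; j < network.layers[l + 1]; j++) {
      // Weighted sum of everything node j receives from layer l.
      let sum = 0;
      for (let i = 0; i < network.layers[l]; i++) {
        sum += values[l][i] * network.W[l][i][j];
      }
      next.push(sigmoid(sum));
    }
    values.push(next);
  }
  return values;
}

// A tiny 2-2-1 network with hand-picked weights.
const tiny = {
  layers: [2, 2, 1],
  W: [
    [[0, 0], [0, 0]], // input -> inner: all zero, so both inner nodes output 0.5
    [[2], [-2]]       // inner -> output: the two contributions cancel out
  ]
};
const output = feedForward(tiny, [1, 0]).pop();
// output[0] is sigmoid(0) = 0.5
```

Classification is nothing more than this triple loop, which is why it is cheap once the weights are in memory.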

Backpropagation alone deserves a whole blog post[7] and is very well described in textbooks[4], so we suggest you refer to them if you want to understand its internals. What's important to remember is that backpropagation lets you update the weights of the neural network by propagating the error (the difference between the output value and the expected value from the training set) backward through the network, adjusting each weight along the gradient of that error. Computing that gradient, by the way, requires your activation function to be differentiable. Backpropagation was invented in 1970 but only popularized in 1986, enabling a much broader use of neural networks. It has since become the go-to solution for training neural networks.

Using backpropagation, we can repeatedly iterate on the elements of the training set and reduce the error rate until it reaches the desired value. In our implementation, we compute the sum of the quadratic errors between the computed output `res` and the expected one `train[i]` and average it over the entire training set, and repeat that computation until it reaches the target value `error` passed as argument:
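A compact JavaScript sketch of such a training loop is shown below. It is an illustration, not the add-on's code, and it makes several assumptions: a sigmoid activation, online (per-example) weight updates with a fixed learning rate `rate`, an iteration cap, a seeded random generator for reproducibility, and an always-on bias node appended to every non-output layer (stored as an extra last row in each `W[l]`), which the description above does not mention.

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Deterministic pseudo-random generator so that runs are reproducible.
function makeRandom(seed) {
  let state = seed;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

function createNetwork(layers, random) {
  const W = [];
  for (let l = 0; l < layers.length - 1; l++) {
    // layers[l] + 1 rows: the extra last row holds the bias weights.
    W.push(Array.from({ length: layers[l] + 1 }, () =>
      Array.from({ length: layers[l + 1] }, () => random() - 0.5)));
  }
  return { layers, W };
}

function feedForward(net, input) {
  const values = [input.concat(1)]; // append the always-on bias node
  for (let l = 0; l < net.W.length; l++) {
    const next = net.W[l][0].map((_, j) =>
      sigmoid(values[l].reduce((sum, v, i) => sum + v * net.W[l][i][j], 0)));
    values.push(l < net.W.length - 1 ? next.concat(1) : next);
  }
  return values;
}

// One backpropagation step for a single example; returns its quadratic error.
function backpropagate(net, input, expected, rate) {
  const values = feedForward(net, input);
  const res = values[values.length - 1];
  // Error term of each output node: (expected - output) * sigmoid'(sum).
  let deltas = res.map((y, j) => (expected[j] - y) * y * (1 - y));
  for (let l = net.W.length - 1; l >= 0; l--) {
    // Error terms of layer l, computed before its outgoing weights change.
    const previous = values[l].map((y, i) =>
      y * (1 - y) * deltas.reduce((sum, d, j) => sum + d * net.W[l][i][j], 0));
    values[l].forEach((y, i) => {
      deltas.forEach((d, j) => { net.W[l][i][j] += rate * d * y; });
    });
    // Drop the bias node's term: nothing feeds into it.
    deltas = previous.slice(0, net.layers[l]);
  }
  return res.reduce((sum, y, j) => sum + (expected[j] - y) ** 2, 0);
}

// Repeats passes over the training set until the average quadratic error
// drops below `error` or `maxIterations` is reached.
function train(net, examples, error, rate, maxIterations) {
  let average = Infinity;
  let iterations = 0;
  while (average > error && iterations < maxIterations) {
    let total = 0;
    for (const { input, expected } of examples) {
      total += backpropagate(net, input, expected, rate);
    }
    average = total / examples.length;
    iterations++;
  }
  return { error: average, iterations };
}

// Toy training set: feature 0 marks one class, feature 1 the other.
const examples = [
  { input: [1, 0], expected: [1] },
  { input: [0, 1], expected: [0] }
];
const net = createNetwork([2, 2, 1], makeRandom(42));
const result = train(net, examples, 0.01, 0.5, 20000);
```

The returned `iterations` count plays the role of the number of passes over the training set needed to reach the target error, and the cap guards against a target that is never reached.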

Choosing the best input signal

Once we had a functioning neural network and training algorithm, the next step was to test the different network structures and input signals to pick the most promising one. Here are the initial results we got:

  • `Type` is the input type (n-grams, words or hashtags) taken over the captions of the user's recent posts (early results had shown that using the bio and full name did not yield better results, contrary to the results in [2])
  • `L1,L2,L3` is the network structure, that is, the size of each layer. We restricted ourselves to 2- and 3-layer networks. `L1` also defines the size of the input vector, computed as described before using the features with the best mutual information.
  • `Cov` is an indicative value equal to the average number of features found in each element of the training set (number of matching words/n-grams/hashtags on average)
  • `Err` is the target error rate for training. (Setting an exceedingly low value can cause the network to overfit, that is, become too specific to the training set)
  • `Train` is the size of the training set
  • `It` is the number of iterations over the training set needed to reach the target error rate
  • `Res` is the success rate on the test set (generally 10% of the size of the training set). It's important to remember that the baseline is 0.5, not 0. A random function would be expected to perform at ~0.5 for gender classification over a random set of individuals.
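To make the `Res` column and its 0.5 baseline concrete, here is a minimal sketch of how a success rate can be measured on a held-out test set. The function name, the shape of the test examples and the two toy classifiers are assumptions made for this example.

```javascript
// Fraction of test examples whose predicted label matches the known one.
// `classify` is any function mapping an input vector to 'male' or 'female'.
function successRate(classify, testSet) {
  let correct = 0;
  for (const { input, gender } of testSet) {
    if (classify(input) === gender) correct++;
  }
  return correct / testSet.length;
}

// Made-up test set with as many men as women.
const testSet = [
  { input: [1, 0], gender: 'female' },
  { input: [0, 1], gender: 'male' },
  { input: [1, 1], gender: 'female' },
  { input: [0, 0], gender: 'male' }
];

// A classifier that ignores its input scores exactly 0.5 on a balanced set.
const alwaysFemale = () => 'female';
const baseline = successRate(alwaysFemale, testSet);

// A classifier that looks at the first feature gets everything right here.
const firstFeature = (input) => (input[0] === 1 ? 'female' : 'male');
const perfect = successRate(firstFeature, testSet);
```

Here `baseline` is 0.5 and `perfect` is 1, so a useful classifier has to be judged by how far it sits above 0.5 rather than above 0.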

It is worth noting here that networks without an inner layer (called perceptrons) are linear classifiers. Their prediction is based on a simple linear combination of the input features. We were quite surprised by the efficiency of these networks given the simplicity of the model they encode.

The best results were obtained with networks with one hidden layer, and they suggested that a large (10k) hashtag-based input vector would potentially yield the best results, especially if we were able to use a larger training set.

Tweaking and Results

We were able to train a few networks using a 200k-element training set. The process became quite painful, as training on such large data sets would take up to 5h on the servers we had available. Using a 200k-element training set, we were able to increase the success rate to 0.83.

Finally, inspired by boosting[8], we experimented with training a male network and a female network using two distinct training sets of 200k elements each, and using both networks to classify, picking whichever network had the strongest response to the given input. This is a purely operational solution that is not grounded in anything other than the non-exhaustive experimental approach we described here, but it is what worked best for us. Using this technique, we were able to reach a success rate of 0.88 on the 70k-element test set, close enough to our initial target.
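The decision rule can be sketched as follows, reading "strongest response" as the highest output value. The `run(input)` method name, the tie-breaking rule and the two stand-in networks are hypothetical and exist only to make the example runnable.

```javascript
// `maleNet` and `femaleNet` are assumed to expose a run(input) method that
// returns a single response in [0, 1]; the method name is hypothetical.
function classifyGender(maleNet, femaleNet, input) {
  const male = maleNet.run(input);
  const female = femaleNet.run(input);
  return {
    // Ties go to 'male'; this is an arbitrary choice for the sketch.
    gender: female > male ? 'female' : 'male',
    male,
    female
  };
}

// Stand-ins for two trained networks, for illustration only.
const maleNet = { run: (input) => 0.2 + 0.6 * input[1] };
const femaleNet = { run: (input) => 0.2 + 0.6 * input[0] };

const first = classifyGender(maleNet, femaleNet, [1, 0]);
const second = classifyGender(maleNet, femaleNet, [0, 1]);
// first.gender is 'female', second.gender is 'male'
```

Returning both raw responses alongside the label makes it possible to inspect how close the two networks were for a given user.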

We added serialization (`to_string`) and deserialization (constructor) functions to our library and built the resulting classifier into our product. The nice property of feed-forward neural networks is that classification is really fast once the network is loaded in memory. Adding the classifier to our infrastructure had no visible impact on the load of our aggregation servers.
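The round trip can be illustrated with a short JavaScript sketch. The actual serialized format is not described here, so JSON is used as a stand-in; `serializeNetwork` and `deserializeNetwork` are hypothetical names, not the library's API.

```javascript
// Serializes the structure and weights; JSON is an assumption of this sketch.
function serializeNetwork(network) {
  return JSON.stringify({ layers: network.layers, W: network.W });
}

// Rebuilds a network from its serialized form, checking basic consistency.
function deserializeNetwork(serialized) {
  const { layers, W } = JSON.parse(serialized);
  if (!Array.isArray(layers) || !Array.isArray(W) ||
      W.length !== layers.length - 1) {
    throw new Error('Malformed network');
  }
  return { layers, W };
}

const original = {
  layers: [2, 2, 1],
  W: [[[0.1, -0.2], [0.3, 0.4]], [[0.5], [-0.6]]]
};
const restored = deserializeNetwork(serializeNetwork(original));
```

Since a trained network is fully described by its layer sizes and weights, loading it in a production process requires no training code at all.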

Analysis of the resulting Neural Networks

To prepare this post, we added a few scripts to generate SVG representations of the neural network serialized format in order to visualize the structure of the network. We ran these scripts on the networks (male and female) we have been using for the past few months in production.

You can see the result of this visualisation below. We only display "heavy" links, that is, links with an absolute weight above 0.8. Positive weights are in blue, negative weights are in red. The heavier the link is, the more colored it is (semi-transparent links are close to 0.8 while others have a higher absolute weight).
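The rendering rules above can be sketched in a few lines of JavaScript. This is not the script we used: the layout constants, the opacity scale and the `networkToSvg` name are assumptions, and the network is taken to be a plain `{ layers, W }` object.

```javascript
// Renders links whose absolute weight exceeds `threshold` as SVG lines.
// Layout constants (spacing) are arbitrary choices for this sketch.
function networkToSvg(network, threshold = 0.8) {
  const xStep = 200;
  const yStep = 20;

  // Heaviest absolute weight, used to scale the opacity.
  let maxAbs = threshold;
  for (const layer of network.W) {
    for (const row of layer) {
      for (const weight of row) maxAbs = Math.max(maxAbs, Math.abs(weight));
    }
  }
  const span = (maxAbs - threshold) || 1;

  const lines = [];
  network.W.forEach((layer, l) => {
    layer.forEach((row, i) => {
      row.forEach((weight, j) => {
        if (Math.abs(weight) <= threshold) return; // skip "light" links
        const color = weight > 0 ? 'blue' : 'red';
        // Opacity grows from 0.2 (just above threshold) to 1 (heaviest link).
        const opacity = 0.2 + 0.8 * (Math.abs(weight) - threshold) / span;
        lines.push(
          `<line x1="${l * xStep}" y1="${i * yStep}" ` +
          `x2="${(l + 1) * xStep}" y2="${j * yStep}" ` +
          `stroke="${color}" stroke-opacity="${opacity.toFixed(2)}"/>`
        );
      });
    });
  });

  const width = (network.layers.length - 1) * xStep;
  const height = Math.max(...network.layers) * yStep;
  return `<svg xmlns="http://www.w3.org/2000/svg" ` +
    `width="${width}" height="${height}">` + lines.join('') + '</svg>';
}

// Made-up weights: three links are above the 0.8 threshold.
const svg = networkToSvg({
  layers: [2, 2, 1],
  W: [[[1.5, 0.1], [-0.9, 0.3]], [[2.0], [-0.5]]]
});
```

In this toy network only the links with weights 1.5, -0.9 and 2.0 are drawn, two in blue and one in red.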

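[Figure: visualization of the male and female networks (visu_nn_threshold_0_8), showing only links with an absolute weight above 0.8]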

We were surprised by the simplicity of the resulting structure, but we are aware that the lower-weight links not displayed here might play a significant role, especially since a certain number of inner-layer nodes without any incoming link in the visualization do have a significant output link.

The other surprising aspect is the similarity of the two networks. They both have a mainly positive component feeding an inner-layer node with a positive impact on the result, and a mixed positive/negative component feeding an inner-layer node with a negative impact on the result. The former component seemingly plays the role of an enabler, and the latter that of an inhibitor.

Conclusion

We hope that this description of our experience deploying a machine learning solution in production will serve as a useful practical example to illustrate the numerous theoretical resources available online as well as in textbooks. The code we used to build and train our neural network, currently used in production at TOTEMS, has been open-sourced on our company GitHub account: https://github.com/totemstech/neuraln. We hope it can serve in other production settings where a pure JavaScript network implementation may lack the speed of a C++ implementation.

-stan


[1] Facebook Graph API https://developers.facebook.com/docs/graph-api/reference/v2.1/user
[2] Discriminating Gender on Twitter – The MITRE Corporation https://www.mitre.org/sites/default/files/pdf/11_0170.pdf
[3] Mutual information – Wikipedia http://en.wikipedia.org/wiki/Mutual_information
[4] Artificial Intelligence: A Modern Approach – S. Russell, P. Norvig http://www.amazon.com/Artificial-Intelligence-Modern-Approach-Edition/dp/0136042597
[5] Brain.js – @harthur https://github.com/harthur/brain
[6] Backpropagation – Wikipedia http://en.wikipedia.org/wiki/Backpropagation
[7] How the backpropagation algorithm works – M.Nielsen http://neuralnetworksanddeeplearning.com/chap2.html
[8] Boosting (Machine Learning) – Wikipedia http://en.wikipedia.org/wiki/Boosting_(machine_learning)