# How We Chose Our Senators

**July 12, 2016**

##### This is *Data Science Nuggets*, a new column applying concepts and techniques of data science to everyday life.


**At the supermarket, we often buy things that go together.**

When we want to cook spaghetti, we buy pasta and, of course, we need tomato sauce as well. And while we’re drinking our beer, we munch on chips, so there’s got to be chips with the beer in our cart. Supermarkets, more or less, have read our purchasing patterns—and they’ve got this down to a science.

Even tech companies know this all too well. Amazon, for instance, created an algorithm to exploit these relationships in its recommendation engine. The internet retail giant does this by looking for items that are “frequently bought together” and items that “customers who bought this item also bought.”

Looking for relationships between pairs or sets of items that tend to be purchased together is a data mining technique known as **market basket analysis**. It has been used to study customer purchasing patterns, and it has also been applied to examine whether library users borrow books under the same subject category.

But can this technique tell us something about our political purchasing pattern, that is, our choices for senators in the last election? Did our choices also come in pairs? More importantly, what can this technique tell us about our current political system?

The short answer to all these questions is that it can. What I found is that this technique can help strategists and analysts mine for interesting associations between candidates, and assess the level of same-party and cross-party associations. To the politically inclined, the finding resurfaces one sad reality about our political party system: it is too weak to marshal the electorate to vote along party lines.

At election time, a political party acts as a mere affiliation rather than a formidable electoral machine.

For this analysis, I used precinct-level election returns from NAMFREL, transmitted as of May 12, 2016, 3:45PM. This dataset includes vote tallies from 90,357 precincts or around 98% of the total.

How do we apply this technique to our data? First, we must translate election returns into a market basket framework. To do this, we take the top 12 senatorial candidates in each precinct as the market basket for that precinct.

This gives us 90,357 transactions (precincts), each made up of 12 items (candidates). What’s surprising is that all 50 senatorial candidates are included in this set of transactions. This means that every candidate made it to the top 12 of some precinct, even the candidate with the smallest number of total votes!

In market basket analysis, we look for collections of **association rules** and analyze them. Association rules specify patterns in sets of items (also called *itemsets*) that are often purchased together. For example, the association rule “{beer} -> {chips}” means “if beer is purchased, then chips are also likely to be purchased,” while “{beer, chips} -> {dip}” means “if beer and chips are purchased together, then dip is also likely to be purchased.” The challenge at hand is to surface the most interesting and significant association rules from a given transaction database.

There are three statistical measures to assess the significance of an association rule, namely **support**, **confidence**, and **lift**. Support tells us how frequently the items in an association rule appear in transactions, while confidence and lift are two different measures of how significant or strong the association between items is.

Now, let’s apply these measures to our dataset. First, the **support** of an itemset or rule simply measures how often it appears in the data:

Support(X) = count(X)/N,

where *N* is the number of transactions in the database and *count(X)* is the number of transactions containing itemset *X*. In this case, I want to find out how often one candidate appears in the market basket per precinct.
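As a rough illustration of the two steps above, the sketch below builds one basket per precinct from vote tallies and then computes the support of each candidate. The precinct names, candidate labels, and vote counts are invented for the example, and `top_k` is set to 2 rather than 12 to keep the toy data small. This is not the script used for the analysis.

```python
# Hypothetical tallies: precinct -> {candidate: votes}. Not real election data.
tallies = {
    "P1": {"A": 120, "B": 90, "C": 40, "D": 10},
    "P2": {"A": 80, "B": 30, "C": 95, "D": 20},
    "P3": {"A": 15, "B": 70, "C": 60, "D": 65},
    "P4": {"A": 50, "B": 45, "C": 10, "D": 5},
}


def build_baskets(tallies, top_k):
    """Return one basket (the top_k candidates by votes) per precinct."""
    baskets = []
    for votes in tallies.values():
        # Rank candidates by vote count, highest first.
        ranked = sorted(votes, key=votes.get, reverse=True)
        baskets.append(frozenset(ranked[:top_k]))
    return baskets


def support(itemset, baskets):
    """Support(X) = count(X) / N: share of baskets containing all of X."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


baskets = build_baskets(tallies, top_k=2)
single_support = {c: support({c}, baskets) for c in "ABCD"}
print(single_support)
```

In the toy data, candidates A and B each land in the top 2 of three out of four precincts, so their support is 0.75, while C and D each have a support of 0.25.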

(*This indicates that Sotto garnered a disproportionately large number of votes from a smaller group of precincts, compared with the other top candidates.*)

We find that the top 12 all have support greater than 50%. This means that they all made it to the top 12 in a majority of the precincts, even if none of them garnered a majority of the votes. This finding has important implications for the next elections. For future candidates, getting into the circle of 12 senators in the majority of precincts seems to be an important prerequisite to winning.

Relying on a handful of vote-rich areas or bailiwicks is not a good strategy.

The bottom three candidates in the top 20 (Petilla, Lapid, and Colmenares) all garnered around 15% of the total votes, but their support values were 14% for Petilla, 9% for Lapid, and 5% for Colmenares. This tells us that Colmenares’ votes were concentrated in fewer precincts than those of the other two.

We can also calculate the support for the senatorial slate of a given party. Only two parties, the Liberal Party (LP) and the United Nationalist Alliance (UNA), fielded more than two candidates: 8 candidates ran under LP, while 6 were affiliated with UNA. We find that the full LP slate made the top 12 in 1% of the precincts (1 out of every 100), while the full UNA slate placed in the top 12 in only 0.06% of the precincts (6 out of every 10,000).

What this finding seems to indicate is that voters are very unlikely to vote based on party principles or to support a political party wholesale, reflecting our weak party system.

Voters, it seems, look at candidates as individuals more than by their party associations.
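The slate calculation is the same support measure applied to an itemset with more than one candidate: a precinct counts only if every member of the slate is in its basket. A minimal sketch, using invented baskets and a hypothetical two-person slate:

```python
# Hypothetical baskets (the set of top candidates per precinct); not real data.
baskets = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "D"},
    {"A", "B", "C"},
]


def slate_support(slate, baskets):
    """Share of precincts whose basket contains the entire slate."""
    slate = set(slate)
    return sum(slate.issubset(basket) for basket in baskets) / len(baskets)


full = slate_support({"A", "B"}, baskets)  # both members must appear
a_only = slate_support({"A"}, baskets)
b_only = slate_support({"B"}, baskets)
print(full, a_only, b_only)
```

The support of a slate can never exceed that of its least-supported member, so a low slate support on its own does not mean that the individual candidates did poorly.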

Now, to measure the predictive power or accuracy of an association rule, we calculate its **confidence**, defined as:

Confidence(X->Y) = Support(X,Y)/Support(X).

This means that the confidence of a rule *X* -> *Y* is the support of the itemset containing both *X* and *Y* divided by the support of the itemset containing only *X*. In other words, the higher the proportion of transactions containing itemset *X* that also contain itemset *Y*, the higher the value of the confidence will be.
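To make the definition concrete, here is a small sketch that computes confidence on invented baskets. The candidate labels and baskets are hypothetical, and the helper names are mine.

```python
# Hypothetical baskets; not real election data.
baskets = [
    {"A", "B"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(x, y, baskets):
    """Confidence(X -> Y) = Support(X, Y) / Support(X)."""
    return support(set(x) | set(y), baskets) / support(x, baskets)


conf_a_b = confidence({"A"}, {"B"}, baskets)  # 2 of the 3 baskets with A have B
conf_a_c = confidence({"A"}, {"C"}, baskets)  # 1 of the 3 baskets with A has C
conf_c_a = confidence({"C"}, {"A"}, baskets)  # 1 of the 2 baskets with C has A
print(conf_a_b, conf_a_c, conf_c_a)
```

Note that confidence is directional: in this toy data, the rule A -> C has a confidence of one third, while C -> A has a confidence of one half.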

The associations between these candidates may be partly related to coalition-related campaign efforts.

But we note, as well, that high confidence values are, by definition, driven by the popularity of individual candidates (i.e., with high support). The third statistical measure, which we will discuss next, attempts to disentangle this effect.

Like confidence, **lift** measures the predictive power of a rule; unlike confidence, lift accounts for the fact that itemsets that occur more frequently in the dataset will tend to occur more frequently in association rules as well. By its definition, lift allows one to find interesting rules that involve items and itemsets with relatively low support. Lift is defined as:

Lift(X -> Y) = Confidence(X -> Y)/Support(Y) = Support(X,Y)/(Support(X)*Support(Y))

In short, lift measures how much more likely one itemset is to be purchased, relative to its typical rate of purchase, given that another itemset has been purchased.
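The sketch below contrasts the two measures on invented baskets. Candidate A is deliberately made very popular, so rules pointing to A get high confidence almost automatically, while the less popular pair C and D co-occur more often than chance would suggest. All labels and baskets are hypothetical.

```python
# Hypothetical baskets; A is very popular, C and D are rarer but tend to co-occur.
baskets = [
    {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A", "B"},
    {"A", "C", "D"}, {"A", "C", "D"},
    {"B", "C"},
    {"A", "D"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(x, y, baskets):
    """Confidence(X -> Y) = Support(X, Y) / Support(X)."""
    return support(set(x) | set(y), baskets) / support(x, baskets)


def lift(x, y, baskets):
    """Lift(X -> Y) = Support(X, Y) / (Support(X) * Support(Y))."""
    joint = support(set(x) | set(y), baskets)
    return joint / (support(x, baskets) * support(y, baskets))


conf_b_a = confidence({"B"}, {"A"}, baskets)
lift_b_a = lift({"B"}, {"A"}, baskets)
conf_c_d = confidence({"C"}, {"D"}, baskets)
lift_c_d = lift({"C"}, {"D"}, baskets)
print(conf_b_a, lift_b_a)
print(conf_c_d, lift_c_d)
```

Here B -> A has the higher confidence but a lift below 1, while C -> D has the lower confidence but a lift well above 1. Unlike confidence, lift is symmetric: swapping *X* and *Y* in the formula gives the same value.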

Rules with lift values higher than 1 are interesting because these relationships occur more often than expected by chance, given the frequency of occurrence of the itemsets involved. We look for pairs of candidates *X* and *Y* with the highest values for lift(*X* -> *Y*). There are 12 pairings with a lift greater than 1.025, involving 14 candidates. Figure 3 shows the pair connections (indicated by lines and their lift values), which can be grouped into four separate graphs.

The highest lift pairings involve Manny Pacquiao (UNA) and three candidates from other parties: Mark Lapid (Aksyon), Martin Romualdez (Lakas), and Jericho Petilla (LP), with lifts of 11%, 10%, and 7%, respectively (that is, lift values of about 1.11, 1.10, and 1.07). It’s interesting to see associations of candidates with relatively lower popularity: Lapid, Romualdez, and Petilla are in the top 20, but not in the top 12. One may speculate on the reasons behind these associations: for example, does the celebrity status of Pacquiao and Lapid have something to do with their pairing?

Moderately high lifts are associated with two LP candidate pairings, both involving Kiko Pangilinan (LP), who is paired with Leila de Lima (5% lift) and Ralph Recto (3% lift). Two independent candidates are paired in Sergio Osmena III and Juan Miguel Zubiri (5% lift).

The richest graph consists of five candidates connected in six ways: Dick Gordon (IND) appears in the most pairings (4), Vicente Sotto (NPC) appears in three, Sherwin Gatchalian (NPC) and Ping Lacson (IND) both appear in two, and Risa Hontiveros (Akbayan) appears in one. Aside from the Liberal Party pairings, the only other same-party pairing is that of Sotto and Gatchalian.
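One way to arrive at this kind of grouping is to compute the lift of every candidate pair, keep the pairs above a threshold, and then collect candidates that are linked directly or indirectly. The sketch below does this on invented baskets. The 1.025 cut-off is taken from the text, but the data, the function names, and the grouping approach are my own assumptions and not necessarily how the figure was produced.

```python
from itertools import combinations

THRESHOLD = 1.025  # cut-off quoted in the text

# Hypothetical baskets; not real election data.
baskets = [
    {"A", "B", "E", "G"}, {"A", "B", "E", "G"}, {"A", "B", "F", "G"},
    {"C", "D", "E"}, {"C", "D", "F"}, {"C", "D", "E"},
    {"A", "C", "F"}, {"B", "D", "F"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def strong_pairs(baskets, threshold):
    """Return {(x, y): lift} for candidate pairs whose lift exceeds threshold."""
    candidates = sorted(set().union(*baskets))
    pairs = {}
    for x, y in combinations(candidates, 2):
        joint = support({x, y}, baskets)
        value = joint / (support({x}, baskets) * support({y}, baskets))
        if value > threshold:
            pairs[(x, y)] = value
    return pairs


def connected_groups(pairs):
    """Group candidates that are linked, directly or indirectly, by a strong pair."""
    neighbours = {}
    for x, y in pairs:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over linked candidates
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


pairs = strong_pairs(baskets, THRESHOLD)
groups = connected_groups(pairs)
print(pairs)
print(groups)
```

In this toy example, five pairs pass the threshold and fall into two separate groups, one with four candidates and one with two.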

Taken together, these initial observations illustrate how techniques and concepts from market basket analysis can be used for mining voting patterns in a senatorial race. Interesting questions (say, from avid observers of electoral politics and the senatorial campaigns) can be expressed in the language of *itemsets*, *association rules*, and measures of *support*, *lift*, and *confidence*, as we have done here.

In this brief analysis, we found that while party and coalition associations partly drive the market basket selection of voters, there are just as many, if not more, significant cross-party associations that are driven by something else.

This indicates that party association, while important, is not the main driver for candidates being voted for together.

Geography may be a significant factor. To assess the effect of regional associations, one can combine the above with a geospatial analysis, using the location information on the precincts.

That would be another study, for another column.

*[Ed. Note: The scripts used for this analysis are available at the author’s GitHub account: https://github.com/reinareyes.]*

Reina Reyes is a scientist, public speaker, and writer. She obtained her Ph.D. in Astrophysics from Princeton University in 2011. She currently works as a data science consultant in Manila. You may visit her website at: www.reinareyes.com.
