Grouping data

Suppose you are faced with the task of taking a list of things and grouping them by some attribute, like turning the following list country-city pairs…

input=[("USA", "New York"), ("USA", "San Francisco"), ("USA", "Los Angeles"),
      ("UK", "London"), ("UK", "Manchester"), ("UK", "Edinborough"),
      ("Spain", "Madrid"), ("Spain", "Barcelona"), ("Spain", "Granada"),
      ("Finland", "Helsinki"), ("Finland", "Oulu"), ("Finland", "Kuopio")]

into lists of cities grouped by country like this:

output=[("USA", ["New York", "San Francisco", "Los Angeles"]),
	("UK", ["London", "Manchester", "Edinborough"]),
	("Spain", ["Madrid", "Barcelona", "Granada"]),
	("Finland", ["Helsinki", "Oulu", "Kuopio")]

Also suppose that the application that you are working on is about preparing data for reports and that a lot of the application will be dedicated to creating such groupings. Often the data is accumulated by summing numbers instead of building lists. Sometimes groups have to be divided into subgroups, like regions within countries. You had better have some kind of a generic pattern for this grouping algorithm instead of inventing a new way of doing each time, right?

One the worst ways of doing it I have seen is this:

country = None
isFirst = True
output  = []
cities = None

if len(input) > 0:
    for countryCity in input:
	if country != countryCity[0]:
	    if not isFirst:
		output.append((country, cities));

	    isFirst = False
	    country = countryCity[0]
	    cities = [countryCity[1]]
	else:
	    cities.append(countryCity[1])

    output.append((country, cities))

It gets the job done and didn’t take more than 5 minutes to write, but it’s brittle. The group is closed in two different places. It’s not immediately obvious what to change to group by multiple levels. If the processing logic was any longer doing this is like leaving a time bomb in the code base.

What I have found clearest in these cases is a nested loop like this:

while there is more input:
    open new group
    while there is more input and we're in the group:
	accumulate data for the group
	get next input
    close the group

An example in Python for the problem at hand would be something like this:

output = []
inputIter = iter(input)
countryCity = next(inputIter, None)
while countryCity != None:
    # open new group
    country = countryCity[0]
    cities = []

    while countryCity != None and country == countryCity[0]:
	# accumulate data for group
	cities.append(countryCity[1])
	countryCity = next(inputIter,None)

    # close group
    output.append((country, cities))

Here it’s pretty clear where a group is opened, where the data is accumulated, and where it is closed. Also it’s clear what to do if a second level of grouping is needed: the loop that accumulates data has to have this same structure inside. For example:

while countryCity != None:
    country = countryCity[0]
    regions = []
    # group by country
    while countryCity != None and country == countryCity[0]:
	region = countryCity[1]
	cities = []
	# group by region
	while countryCity != None and region == countryCity[1]:
	    cities.append(countryCity[2])
	    countryCity = next(inputIter,None)
	regions.append((region, cities))
    output.append((country, regions))