A NamedTupleReader for CSV files

Python has a csv module that can handle a wide variety of csv formats. Let’s explore this, then see if we can augment it a little.

First, let’s take a simple csv file. This one, eans.csv, relates bar codes (EANs) to their product codes and descriptions.

eans.csv
ean,product_code,description
5123412341234,FOOD123,"Tin o'Beans"
5123412341235,FOOD235,"Basic Soup"
5123412341236,FOOD236,"Basic Soup"
5123412341236,CARS345,"Big Wide Tyres"
5123412341237,CARS346,"Starter Motor"

We can read this file using csv.reader().

import csv

with open('eans.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)

This gives us each row including the header row as a list. So, if we want to access an individual field then we need to use its list index, eg, row[0] would give us the EAN field, which isn’t particularly readable.

Using csv.DictReader() we can get each row as a dictionary whose fields are taken from the fieldnames in the first row. This lets us access individual fields with row[fieldname].

import csv

with open('eans.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print('         EAN {}'.format(row['ean']))
        print('Product Code {}'.format(row['product_code']))
        print(' Description {}'.format(row['description']))

But we can do better. It would be nicer to be able to type row.ean rather than row[‘ean’]. And that’s precisely what we can do with collections.namedtuple.

import csv
from collections import namedtuple

with open('eans.csv') as csvfile:
    reader = csv.reader(csvfile)
    # Read the fieldnames from the first row and use them to construct a namedtuple.
    fields = next(reader)
    EAN = namedtuple('EAN', fields)
    for row in reader:
        record = EAN(*row)
        print('         EAN {}'.format(record.ean))
        print('Product Code {}'.format(record.product_code))
        print(' Description {}'.format(record.description))

Using namedtuple also has some memory advantages as they don’t have per-instance dictionaries.

However, the example above has some flaws. It won’t work if a header field contains spaces, nor if it is a keyword, nor if it is an invalid identifier, as shown in the following csv file, eans2.csv.

ean,product code,class, description
5123412341234,123,FOOD,"Tin o'Beans"
5123412341235,234,FOOD,"Basic Soup"
5123412341236,236,FOOD,"Basic Soup"
5123412341236,345,CARS,"Big Wide Tyres"
5123412341237,346,CARS,"Starter Motor"

We can fix that.

First, let’s get rid of the spaces. Use str.strip() to remove leading and trailing spaces, then use str.replace() to replace spaces between words with underscores.

    # Remove leading and trailing spaces, then replace any remaining spaces with underscores.
    fields = (f.strip() for f in fields)
    fields = (f.replace(' ', '_') for f in fields)

Next, find fieldnames that aren’t valid identifiers using str.isidentifier() and prefix them with a valid identifier.

    # Prefix non-identifiers with 'field_'
    fields = ('field_' + f if not f.isidentifier() else f for f in fields)

Then we look for keywords using keyword.iskeyword() and append an underscore to them.

    # Append '_' to keywords.
    fields = ((f + '_' if keyword.iskeyword(f) else f) for f in fields)

Finally, we have one question remaining, and that’s how do we deal with a fieldname that has fallen through the cracks and is still not a valid identifier. One option, shown below, uses namedtuple‘s rename parameter to replace invalid identifiers with a positional identifier, eg, ‘_0’ for the first field.

    # And the rename=True gives us a fallback.
    self.record_type = namedtuple(self.name, fields, rename=True)

Now we can put it all together in NamedTupleReader. This acts much like csv.DictReader() but each row is a namedtuple rather than a dict.

import csv
import keyword
from collections import namedtuple


class NamedTupleReader:
    def __init__(self, f, name='Record', dialect='excel', *args, **kwargs):
        self.reader = csv.reader(f, dialect, *args, **kwargs)
        self.record_type = None
        self.name = name

    def __iter__(self):
        return self

    def __next__(self):
        # Define the record type based on the first row.
        if self.record_type is None:
            fields = next(self.reader)

            # Remove trailing spaces.
            fields = (f.strip() for f in fields)

            # Remove spaces.
            fields = (f.replace(' ', '_') for f in fields)

            # Prefix non-identifiers with 'field_'.
            fields = ('field_' + f if not f.isidentifier() else f for f in fields)

            # Append '_' to keywords.
            fields = ((f + '_' if keyword.iskeyword(f) else f) for f in fields)

            # And the rename=True gives us a fallback.
            self.record_type = namedtuple(self.name, fields, rename=True)

        # Skip blank rows.
        row = next(self.reader)
        while row == []:
            row = next(self.reader)

        return self.record_type(*row)

We could take this further by parameterising things like fieldname replacement for non-identifiers, or allowing fieldnames to be overridden, but this covers most common uses.

You can find NamedTupleReader in this gist.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s