File input and output

Table of Contents

Reading in files

First, we'll define the file name as a string that specifies where the file exists on the system.

infile = '../data/weights.csv'

Note: In the path of the file name, .. stands for "go one directory up from the current directory". So in the above case, weights.csv is in the data directory that is one directory up.

.
|_ current directory
|   |
|   |_ current python file
|
|_ data
    |
    |_ weights.csv

In order to work with files, a file handle needs to be created. This is done using open.

infh = open(infile)

Once the file handle is created, there are several ways to read in the contents.

read

The first way is read.

infh = open(infile)
file_content = infh.read()
print(file_content)
infh.close()  # File should be closed after reading.
person,weight
John,201
Sue,120
Paul,150

As can be seen in the results, the contents are stored as a single string (with new lines indicated by \n).

We could separate these lines into a list by using split.

infh = open(infile)
file_content = infh.read()
file_lines = file_content.split('\n')
print(file_lines)
infh.close()
['person,weight', 'John,201', 'Sue,120', 'Paul,150', '']

While this works fine, this is such a common operation that there is a built-in method for it.

readlines

To get the file contents into a list separated by new lines, readlines can be used.

infh = open(infile)
file_lines = infh.readlines()
print(file_lines)
infh.close()
['person,weight\n', 'John,201\n', 'Sue,120\n', 'Paul,150\n']

Using a for loop

A common method for reading in file contents is to do it one line at a time. This is particularly useful for large files that are too big to fit into memory.

infh = open(infile)
for line in infh:
    print('current line: ' + line.strip())

infh.close()
current line: person,weight
current line: John,201
current line: Sue,120
current line: Paul,150

Note: strip is used above to remove the new line (\n) at the end of the line

Writing files

Writing files is very similar to reading files, but there are a few differences. Again, a file handle is created using open, but the second argument to open must be w. This tells open the file is to be written to, as opposed to reading. Another difference is that the file that you a writing to does not need to exist on your file system (in fact, it usually does not).

WARNING: If you open a file for writing that already exists, its content will be overwritten.

outfile = 'test-outfile.txt'
outfh = open(outfile, 'w')

for number in range(3):
    outfh.write('this is line {}\n'.format(number))

outfh.close()

The would result in a file (test-outfile.txt) that looks like this:

this is line 0
this is line 1
this is line 2

A better way to close files

In all the examples above, the file needs to be explicitly closed after opening it. Instead, we could've used a with statement, which provides a convenient way to deal with the opening and closing of files.

This is how the first read example would look using a with statement.

with open(infile) as infh:
    file_content = infh.read()

The with takes care of closing the file once the current level (marked by indentation) is exited. If a w is passed to open, the exact same method can be used for writing files.

Tasks

Getting gene coordinates from a gencode file

The file data/gencode-v10-50random.gtf contains 50 random protein-coding genes from gencode version 10. Another file, data/genes-5random.txt, lists the names of 5 genes. Write a python script that reads in these genes of interest and finds the coordinates using the gencode gtf file.

Filter gene coordinates by size

Write a script that prints out all genes in data/gencode-v10-50random.gtf that span more than 100 kb. If this is too easy, use write instead of print to save the gene coordinates to a file. If this gives you 14 genes, you're probably in good shape.

Merge program information

(from Sirisha)

data/program-versions.csv lists the names of a few programs and their current version. data/program-dates.csv lists the names of the same programs and their release date. Write a script that merges this information into a single file that has three columns. It would look something like this:

Program,Date,Version
Firefox,July,12
...

Released under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Created with Emacs 24.4.1 (Org mode 8.3beta)

Validate