File input and output
Reading in files
First, we'll define the file name as a string that specifies where the file exists on the system.
infile = '../data/weights.csv'
Note: In the path of the file name, .. stands for "go one
directory up from the current directory". So in the above case,
weights.csv is in the data directory that is one directory up.
.
|_ current directory
|  |
|  |_ current python file
|
|_ data
   |
   |_ weights.csv
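To check what a relative path like this resolves to, the standard os.path module can help. The following is a minimal sketch; the starting directory /home/user/project/scripts is a made-up example, not a location used by this tutorial.

```python
import os.path

# Hypothetical directory holding the current Python file (an assumption
# made for illustration only).
current_dir = '/home/user/project/scripts'
infile = '../data/weights.csv'

# Join the two paths, then collapse the '..' component.
resolved = os.path.normpath(os.path.join(current_dir, infile))
print(resolved)
```

On Linux or macOS this prints /home/user/project/data/weights.csv: the .. step leaves scripts, and data is then looked up next to it.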
In order to work with files, a file handle needs to be created. This is
done using open.
infh = open(infile)
Once the file handle is created, there are several ways to read in the contents.
read
infh = open(infile)
file_content = infh.read()
print(file_content)
infh.close()  # File should be closed after reading.
person,weight
John,201
Sue,120
Paul,150
As the output shows, the whole file is stored as a single string. The
lines within it are separated by newline characters (\n), which print
displays as line breaks.
We could separate these lines into a list by using split.
infh = open(infile)
file_content = infh.read()
file_lines = file_content.split('\n')
print(file_lines)
infh.close()
['person,weight', 'John,201', 'Sue,120', 'Paul,150', '']
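This works, although the newline at the end of the file leaves an empty string as the last element. Because splitting a file into lines is such a common operation, file handles have a built-in method for it.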
readlines
To get the file contents into a list with one element per line,
readlines can be used. Note that each element keeps its trailing \n.
infh = open(infile)
file_lines = infh.readlines()
print(file_lines)
infh.close()
['person,weight\n', 'John,201\n', 'Sue,120\n', 'Paul,150\n']
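If the trailing \n characters are not wanted, two common options are to strip each element, or to call the string method splitlines on the result of read. The sketch below assumes a temporary stand-in for weights.csv, which it creates itself so that it runs on its own. It uses open in write mode to do so.

```python
import os
import tempfile

# Create a small stand-in for weights.csv so the example is self-contained.
infile = os.path.join(tempfile.mkdtemp(), 'weights.csv')
outfh = open(infile, 'w')
outfh.write('person,weight\nJohn,201\nSue,120\nPaul,150\n')
outfh.close()

# Option 1: readlines, then remove the newline from each element.
infh = open(infile)
stripped_lines = [line.rstrip('\n') for line in infh.readlines()]
infh.close()

# Option 2: read everything, then split on line boundaries.
infh = open(infile)
split_lines = infh.read().splitlines()
infh.close()

print(stripped_lines)
print(split_lines)
```

Both approaches give the same list of four strings, with no \n characters and no empty string at the end.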
Using a for loop
A common method for reading in file contents is to do it one line at a time. This is particularly useful for large files that are too big to fit into memory.
infh = open(infile)
for line in infh:
    print('current line: ' + line.strip())
infh.close()
current line: person,weight
current line: John,201
current line: Sue,120
current line: Paul,150
Note: strip is used above to remove the new line (\n) at the end
of each line. It also removes any other leading or trailing whitespace.
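Reading line by line combines naturally with split to pull out the individual columns. The following is a rough sketch rather than a definitive approach: it assumes a temporary stand-in for weights.csv (created by the example itself) and collects the values in a dictionary named weights, a name chosen here for illustration.

```python
import os
import tempfile

# Create a small stand-in for weights.csv so the example is self-contained.
infile = os.path.join(tempfile.mkdtemp(), 'weights.csv')
outfh = open(infile, 'w')
outfh.write('person,weight\nJohn,201\nSue,120\nPaul,150\n')
outfh.close()

weights = {}
infh = open(infile)
header = infh.readline()  # Read the 'person,weight' header line separately.
for line in infh:
    # Remove the newline, then split the line into its two columns.
    person, weight = line.strip().split(',')
    weights[person] = int(weight)
infh.close()

print(weights)
```

Only one line is held in memory at a time, so the same pattern should also work for files far larger than this one.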
Writing files
Writing files is very similar to reading files, but there are a few
differences. Again, a file handle is created using open, but the
second argument to open must be 'w'. This tells open that the file is to
be written to rather than read from. Another difference is that the file
you are writing to does not need to exist on your file system (in
fact, it usually does not).
WARNING: If you open a file for writing that already exists, its content will be overwritten.
outfile = 'test-outfile.txt'
outfh = open(outfile, 'w')
for number in range(3):
    outfh.write('this is line {}\n'.format(number))
outfh.close()
This would result in a file (test-outfile.txt) that looks like this:
this is line 0
this is line 1
this is line 2
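The overwrite warning above can be checked directly. The sketch below writes to a file in a temporary directory (an assumption made so that nothing on your system is touched). It also uses the 'a' (append) mode of open, which is not covered in this section, to show one way of adding to a file without discarding its content.

```python
import os
import tempfile

# Use a temporary directory so no existing file is affected.
outfile = os.path.join(tempfile.mkdtemp(), 'test-outfile.txt')

outfh = open(outfile, 'w')
outfh.write('first version\n')
outfh.close()

# Opening with 'w' again discards the previous content.
outfh = open(outfile, 'w')
outfh.write('second version\n')
outfh.close()

# Opening with 'a' keeps the content and adds to the end.
outfh = open(outfile, 'a')
outfh.write('appended line\n')
outfh.close()

infh = open(outfile)
final_content = infh.read()
infh.close()
print(final_content)
```

The final file holds "second version" followed by "appended line"; the text "first version" is gone.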
A better way to close files
In all the examples above, the file needs to be explicitly closed
after opening it. Instead, we could've used a with statement, which
provides a convenient way to deal with the opening and closing of
files.
This is how the first read example would look using a
with statement.
with open(infile) as infh:
    file_content = infh.read()
The with statement takes care of closing the file once the indented
block is exited. If 'w' is passed as the second argument to open, the
exact same approach can be used for writing files.
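As a brief illustration, here is how the earlier write example might look with a with statement, followed by reading the file back. The temporary directory is an assumption made so that the sketch runs without touching existing files.

```python
import os
import tempfile

outfile = os.path.join(tempfile.mkdtemp(), 'test-outfile.txt')

# Write three lines; the file is closed when the block ends.
with open(outfile, 'w') as outfh:
    for number in range(3):
        outfh.write('this is line {}\n'.format(number))

# The handle still exists here, but it has already been closed.
print(outfh.closed)

# Read the lines back, again without an explicit close.
with open(outfile) as infh:
    lines = [line.strip() for line in infh]
print(lines)
```

The first print shows True, confirming that no explicit call to close was needed.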
Tasks
Getting gene coordinates from a gencode file
The file data/gencode-v10-50random.gtf contains 50 random protein-coding genes from gencode version 10. Another file, data/genes-5random.txt, lists the names of 5 genes. Write a Python script that reads in these genes of interest and finds their coordinates using the gencode gtf file.
Filter gene coordinates by size
Write a script that prints out all genes in
data/gencode-v10-50random.gtf that span more than 100 kb. If this is too
easy, use write instead of print to save the gene coordinates to a
file. If this gives you 14 genes, you're probably in good shape.
Merge program information
(from Sirisha)
data/program-versions.csv lists the names of a few programs and their current version. data/program-dates.csv lists the names of the same programs and their release date. Write a script that merges this information into a single file that has three columns. It would look something like this:
Program,Date,Version
Firefox,July,12
...