Functions

Table of Contents

The advantage of functions

Functions allow you to created reusable pieces of code. In their purest form, functions take some input, do some calculation, and then return an output.

Let's write a function that, given a gene's start position, returns a set of "promoter" coordinates. We'll coarsely define this as 1,000 bases upstream and 500 bases downstream of the start position.

So, if we have a set of coordinates for a gene of interest (chr7:73,095,248-73,097,781), we can get the start of the promoter by subtracting 1,000 from 73,095,248 and the end the promoter by adding 500 to the same value. But we also need to consider the strand. Because this gene is on the negative strand, we should add 1,000 to 73,097,781 to get the start position of the promoter and subtract 500 from the same value to get the stop position.

In code, this would look like this:

## details for the specific gene
genestart = 73097781
## promoter calculation (for case of negative strand)
promstart = genestart + 1000
promend = genestart - 500

Functions allow us to separate out this calculation logic from the details of a specific case, give a name to it, and then apply it throughout the code by the name.

Defining functions

Let's create a function named get_promoter. This is done using def.

def get_promoter():
    pass  ## Placeholder until we fill in the code

At this point, the function doesn't do anything. It doesn't take any arguments and it doesn't return any values.

A good place to start when defining a function is the docstring, which serves as documentation for what the function does.

def get_promoter():
    """Return promoter of gene."""
    pass  ## Placeholder until we fill in the code

Next, we should declare what arguments the function receives.

def get_promoter(start, strand):
    """Return promoter of gene.

    Parameters
    ----------
    start : int
        Start coordinate of gene
    strand : str
        Strand direction indicated by + or -
    """
    pass

The arguments are specified between the parentheses (everything added between the triple quotes is documentation that, while often helpful, is not required).

Calling functions

We can use the genestart defined above and the strand as input to our new function.

get_promoter(genestart, '-')

We have called the function, but it doesn't do anything yet. Let's add in the calculation.

def get_promoter(start, strand):
    """Return promoter of gene.

    Parameters
    ----------
    start : int
        Start coordinate of gene
    strand : str
        Strand direction indicated by + or -
    """
    if strand == '+':
        promstart = start - 1000
        promend = start + 500
    elif strand == '-':
        promstart = start + 1000
        promend = start - 500
    else:
        print('Strand not recognized.')
        # REV: Why is handling a bad argument like this problematic?
        #      What could be done instead?
    print(promstart, promend)

# Now call it.
promoter = get_promoter(genestart, '-')
print(promoter)
73098781 73097281
None

Looks like it is performing the calculation, but at this point it is only printing out the values. We know this because when we print the return value (promoter), it is None.

To get back the result, we can modify the function to include a return statement.

def get_promoter(start, strand):
    """Return promoter of gene.

    Parameters
    ----------
    start : int
        Start coordinate of gene
    strand : str
        Strand direction indicated by + or -
    """
    if strand == '+':
        promstart = start - 1000
        promend = start + 500
    elif strand == '-':
        promstart = start + 1000
        promend = start - 500
    else:
        print('Strand not recognized.')
        # REV: Why is handling a bad argument like this problematic?
        #      What could be done instead?
    return promstart, promend

# Now call it.
promoter = get_promoter(genestart, '-')
print(promoter)
(73098781, 73097281)

When we call the new function, the result is returned (because we've added the return statement at the end) and is no longer printed out (because we removed the print statement). The variable promoter now contains the returned results.

Note: Due to how the return statement was specified, a tuple has been returned. Any type can be returned. If square brackets were given around the results (return [promstart, promend]), a list would have been returned instead. For now, tuples can be thought of as very similar to lists, though there are some very important differences (particularly in terms of mutability, which we have not discussed).

Keyword arguments

In the above example, we have two required arguments: start and strand. If we call the function without an argument, we get an error.

get_promoter(genestart)
Traceback (most recent call last):
...
TypeError: get_promoter() missing 1 required positional argument: 'strand'

Required arguments are a good thing, but, in some cases, it is useful to allow optional arguments. For example, the 1,000 bases upstream and 500 bases downstream used to define the promoter are parameters that should probably be tunable, so we can provide them as default values that can be overridden when the function is called.

def get_promoter(start, strand, upstream=1000, downstream=500):
    """Return promoter of gene.

    Parameters
    ----------
    start : int
        Start coordinate of gene
    strand : str
        Strand direction indicated by + or -
    upstream, downstream : int
        Number of bases upstream/downstream of `start` used to
        define promoter
    """
    if strand == '+':
        promstart = start - upstream
        promend = start + downstream
    elif strand == '-':
        promstart = start + upstream
        promend = start - downstream
    else:
        print('Strand not recognized.')
        # REV: Why is handling a bad argument like this problematic?
        #      What could be done instead?
    return promstart, promend

Now we can call the function as before and get the same results, as we do in the first call below. If we choose to, we can override the default values. For example, in the second call, we bring the upstream bases down to 500. In the third call, both optional arguments are tweaked at the same time.

print(get_promoter(genestart, '-'))
print(get_promoter(genestart, '-', upstream=500))
print(get_promoter(genestart, '-', upstream=500, downstream=400))
(73098781, 73097281)
(73098281, 73097281)
(73098281, 73097381)

In the above examples, we provide the keywords with the value, but they can also be given without the keywords as long as the order is preserved.

print(get_promoter(genestart, '-', 500, 400))
(73098281, 73097381)

In this case, however, leaving out the keywords makes the function less readable.

Done

We now have a function that is reusable throughout the code. In addition to simplifying our code, this makes it easier to modify the code later if needed because the code only has to be changed in one place. This function can also be shared with other python files. We'll talk about this more when we go over modules.

Tasks

GC content

Write a function that calculates the GC content of a sequence. Then explore how to use assert to demonstrate that your function works as you expect.

In [1]: calculate_gc_ratio('GCAA')
0.5

Reverse complement function

Write a function that takes the reverse complement of a DNA sequence.

It should work something like this:

In [1]: reverse_complement('ATAATCG')
CGATTAT

Amino acid translation function

Write a function that translates a DNA sequence to an amino acid sequence.

In [1]: translate('GTACCC')
VP

It's up to you to decide how you want to deal with stop codons. I've already added a codon to amino acid dictionary in codon_mapping.py. You can access this in your script by importing it.

Write tests (either with plain assert statements or py.test) to show that your code is working properly.

Released under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Created with Emacs 24.4.1 (Org mode 8.3beta)

Validate