Pandas: Cleaning and operating on data
In part 1 we covered reading and viewing data. We'll use the same data set to demonstrate how to carry out calculations in Pandas. If you would prefer an overview of some of the methods we'll use before seeing them in action, jump to Overview of methods for performing calculations and then come back.
Reading in and cleaning the data
First, let's read in the data and set the rows to the well column values.
import pandas as pd

qdat = pd.read_csv('../data/qpcr.csv', na_values='Undetermined')
qdat.set_index('Well', inplace=True)
This is similar to how we read in the data last time, except for the
na_values parameter. This tells Pandas to treat the string
"Undetermined" as an indicator of a missing value. If we didn't do this,
the "Ct" column would be read in as strings because "Undetermined" would
be treated as a string, and all the values of a column must be the same
type.
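The effect of na_values can be seen on a tiny, made-up CSV (the values here are invented for illustration):

```python
from io import StringIO

import pandas as pd

# A hypothetical miniature of the qPCR file, with one "Undetermined" Ct.
text = "Well,Ct\n1,26.44\n2,Undetermined\n"

# Without na_values, the stray string forces the whole column to dtype object.
raw = pd.read_csv(StringIO(text))

# With na_values, "Undetermined" becomes NaN and the column stays float64.
clean = pd.read_csv(StringIO(text), na_values='Undetermined')
```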
To view the rows that have null Ct values, we can use pd.isnull.
qdat[pd.isnull(qdat.Ct)]
Sample_Name Detector_Name Reporter Ct Threshold
Well
337 NTC FENG3_1 SYBR NaN 0.39
338 NTC FENG3_1 SYBR NaN 0.39
339 NTC FENG3_1 SYBR NaN 0.39
340 NTC FENG3_2 SYBR NaN 0.36
341 NTC FENG3_2 SYBR NaN 0.36
342 NTC FENG3_2 SYBR NaN 0.36
343 NTC FENG3_3 SYBR NaN 0.26
344 NTC FENG3_3 SYBR NaN 0.26
345 NTC FENG3_3 SYBR NaN 0.26
109 RMB200_OFC FENG3_4 SYBR NaN 0.37
346 NTC FENG3_4 SYBR NaN 0.37
347 NTC FENG3_4 SYBR NaN 0.37
349 NTC RPL32 SYBR NaN 0.40
350 NTC RPL32 SYBR NaN 0.40
All but one of the null values occur in a "no template control" well.
We can remove the wells using dropna.
qdat = qdat.dropna(how='any')  # drop a row if any of its cells are null
## Could also use: qdat = qdat[pd.notnull(qdat.Ct)]
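On a small, made-up frame, the dropna call and the boolean-mask alternative give the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has a missing Ct value.
df = pd.DataFrame({'Sample': ['a', 'b', 'c'],
                   'Ct': [26.4, np.nan, 27.8]})

dropped = df.dropna(how='any')     # drops the row whose Ct is NaN
kept = df[pd.notnull(df.Ct)]       # equivalent boolean-mask version
```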
Now we have a data frame that looks like this:
qdat.head()
Sample_Name Detector_Name Reporter Ct Threshold
Well
172 RMB200_A1C FENG3_1 SYBR 26.44 0.39
173 RMB200_A1C FENG3_1 SYBR 27.38 0.39
174 RMB200_A1C FENG3_1 SYBR 27.76 0.39
187 RMB200_AMY FENG3_1 SYBR 27.83 0.39
188 RMB200_AMY FENG3_1 SYBR 28.94 0.39
Next, we can remove some columns we won't use. One of these is the "Threshold" column. Another is "Reporter", which does not vary over the rows.
qdat.Reporter.unique()
array(['SYBR'], dtype=object)
To drop columns, the drop method can be used. Setting axis to 1
tells drop that "Reporter" and "Threshold" are column names (axis=0
would indicate that they are row names).
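A small sketch of both spellings (the frame here is invented; newer pandas versions also accept a columns= keyword, which avoids remembering the axis number):

```python
import pandas as pd

df = pd.DataFrame({'Sample': ['a'], 'Reporter': ['SYBR'], 'Ct': [26.4]})

by_axis = df.drop(['Reporter'], axis=1)   # axis=1: treat labels as column names
by_kw = df.drop(columns=['Reporter'])     # newer, more explicit spelling
```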
qdat = qdat.drop(['Reporter', 'Threshold'], axis=1)
qdat.head()
Sample_Name Detector_Name Ct
Well
172 RMB200_A1C FENG3_1 26.44
173 RMB200_A1C FENG3_1 27.38
174 RMB200_A1C FENG3_1 27.76
187 RMB200_AMY FENG3_1 27.83
188 RMB200_AMY FENG3_1 28.94
Initial data exploration
Now that we have a clean data frame, let's get a better feel for how the data looks.
qdat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 240 entries, 172 to 351
Data columns (total 3 columns):
Sample_Name      240 non-null object
Detector_Name    240 non-null object
Ct               240 non-null float64
dtypes: float64(1), object(2)
memory usage: 7.5+ KB
We can use nunique to get the number of unique values in a column and
unique to list the unique values.
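For example, on a made-up series:

```python
import pandas as pd

s = pd.Series(['FENG3_1', 'FENG3_1', 'RPL32'])

n = s.nunique()      # number of distinct values
vals = s.unique()    # the distinct values themselves, as a NumPy array
```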
qdat.Sample_Name.unique()
array(['RMB200_A1C', 'RMB200_AMY', 'RMB200_CBL', 'RMB200_DFC',
'RMB200_HIP', 'RMB200_IPC', 'RMB200_ITC', 'RMB200_M1C',
'RMB200_MFC', 'RMB200_OFC', 'RMB200_S1C', 'RMB200_STC',
'RMB200_STR', 'RMB200_THM', 'RMB200_V1C', 'RMB200_VFC', 'NTC'], dtype=object)
qdat.Detector_Name.unique()
array(['FENG3_1', 'FENG3_2', 'FENG3_3', 'FENG3_4', 'RPL32'], dtype=object)
It would be good to know how many replicates we have for each
sample/detector combination. We can do this by using the groupby
method. We pass groupby the column names that have the variables that
we want to use to group the data. The size method of the groupby
object is used to determine the number of wells in each group.
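The pattern is easiest to see on a tiny, invented frame:

```python
import pandas as pd

df = pd.DataFrame({'Sample_Name': ['A', 'A', 'A', 'B'],
                   'Detector_Name': ['g1', 'g1', 'g2', 'g1']})

# size() returns a Series indexed by (Sample_Name, Detector_Name) pairs,
# giving the number of rows in each group
counts = df.groupby(['Sample_Name', 'Detector_Name']).size()
```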
samp_det_grp = qdat.groupby(['Sample_Name', 'Detector_Name'])
samp_gene_counts = samp_det_grp.size()
samp_gene_counts.head(10)
Sample_Name Detector_Name
NTC FENG3_4 1
RPL32 1
RMB200_A1C FENG3_1 3
FENG3_2 3
FENG3_3 3
FENG3_4 3
RPL32 3
RMB200_AMY FENG3_1 3
FENG3_2 3
FENG3_3 3
dtype: int64
Most of the sample/gene combinations seem to be in triplicate. Let's view all the combinations that aren't.
samp_gene_counts[samp_gene_counts != 3]
Sample_Name Detector_Name
NTC FENG3_4 1
RPL32 1
RMB200_OFC FENG3_4 2
RMB200_THM FENG3_1 2
dtype: int64
Performing calculations on the data frame
Pandas makes it easy to group the data and perform calculations on each group. In our case, we want to calculate ΔΔCt for each sample and gene of interest.
In the first step, we need to normalize by the control detector, RPL32. We can create two data frames, one with RPL32 and one with the genes of interest, which appear to be variants of FENG3.
is_rpl = qdat.Detector_Name == 'RPL32'
rpl = qdat[is_rpl]
fengs = qdat[~is_rpl].copy()  # ~ flips each True/False value
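Boolean masking with ~ works the same way on any frame; here is a minimal invented example:

```python
import pandas as pd

df = pd.DataFrame({'Detector_Name': ['RPL32', 'FENG3_1', 'RPL32']})

mask = df.Detector_Name == 'RPL32'   # boolean Series, one value per row
rpl_rows = df[mask]                  # rows where the mask is True
other_rows = df[~mask]               # ~ inverts each True/False
```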
Each sample has replicates for RPL32 expression, so we want to group by the sample and take the mean Ct value.
rpl_grps = rpl.groupby('Sample_Name')
rpl_means = rpl_grps['Ct'].mean()
## or as one step: rpl_means = rpl.groupby('Sample_Name').Ct.mean()
rpl_means
Sample_Name
NTC           36.600000
RMB200_A1C    20.536667
RMB200_AMY    20.460000
RMB200_CBL    21.780000
RMB200_DFC    21.616667
RMB200_HIP    20.790000
RMB200_IPC    20.583333
RMB200_ITC    20.750000
RMB200_M1C    21.226667
RMB200_MFC    20.683333
RMB200_OFC    20.293333
RMB200_S1C    20.803333
RMB200_STC    20.623333
RMB200_STR    20.556667
RMB200_THM    20.886667
RMB200_V1C    20.290000
RMB200_VFC    20.916667
Name: Ct, dtype: float64
We want to associate each RPL32 mean with the samples in fengs. One
way to do this is to just add the values as a column. A function can be
applied (mapped) to each column value by using map, with the return
values assigned as a new column.
def get_rpl_mean(sample_name):
    return rpl_means[sample_name]

fengs['rpl_mean'] = fengs.Sample_Name.map(get_rpl_mean)
Note: Above I define a simple function to map to all the values of the sample name column, but lambda expressions are often used for this purpose.
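map also accepts a dict or a Series directly, looking values up by key or index, so the lambda (or named function) can often be skipped. A sketch with invented stand-ins for rpl_means and the sample column:

```python
import pandas as pd

means = pd.Series({'A': 20.5, 'B': 21.0})   # stand-in for rpl_means
samples = pd.Series(['A', 'B', 'A'])

via_lambda = samples.map(lambda name: means[name])  # lambda version of get_rpl_mean
via_series = samples.map(means)                     # Series passed to map directly
```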
To calculate ΔCt, we just need to subtract the "rpl_mean" column from the "Ct" column.
fengs['d_Ct'] = fengs.Ct - fengs.rpl_mean
fengs.head()
Sample_Name Detector_Name Ct rpl_mean d_Ct
Well
172 RMB200_A1C FENG3_1 26.44 20.536667 5.903333
173 RMB200_A1C FENG3_1 27.38 20.536667 6.843333
174 RMB200_A1C FENG3_1 27.76 20.536667 7.223333
187 RMB200_AMY FENG3_1 27.83 20.460000 7.370000
188 RMB200_AMY FENG3_1 28.94 20.460000 8.480000
We can use a similar approach to normalize all the ΔCt values by the mean expression of one sample type (we will use "RMB200_A1C"). We limit samples to A1C samples and then take the mean for each gene.
fengs_a1c = fengs[fengs.Sample_Name == 'RMB200_A1C']
a1c_means = fengs_a1c.groupby('Detector_Name')['d_Ct'].mean()

def get_a1c_mean(detector_name):
    return a1c_means[detector_name]

fengs['a1c_mean'] = fengs['Detector_Name'].map(get_a1c_mean)
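As an aside, when the group statistic should come from all rows of each group (rather than from one reference subset, as with the A1C means here), groupby followed by transform broadcasts the statistic back to the original rows in one step. A minimal invented example:

```python
import pandas as pd

df = pd.DataFrame({'Detector_Name': ['g1', 'g1', 'g2'],
                   'd_Ct': [6.0, 8.0, 5.0]})

# transform returns a result aligned with the original rows,
# so each row receives its own group's mean
df['det_mean'] = df.groupby('Detector_Name')['d_Ct'].transform('mean')
```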
To calculate ΔΔCt, we subtract this "a1c_mean" column from the ΔCt values.
fengs['dd_Ct'] = fengs.d_Ct - fengs.a1c_mean
fengs.head()
Sample_Name Detector_Name Ct rpl_mean d_Ct a1c_mean dd_Ct
Well
172 RMB200_A1C FENG3_1 26.44 20.536667 5.903333 6.656667 -0.753333
173 RMB200_A1C FENG3_1 27.38 20.536667 6.843333 6.656667 0.186667
174 RMB200_A1C FENG3_1 27.76 20.536667 7.223333 6.656667 0.566667
187 RMB200_AMY FENG3_1 27.83 20.460000 7.370000 6.656667 0.713333
188 RMB200_AMY FENG3_1 28.94 20.460000 8.480000 6.656667 1.823333
Finally, we can calculate 2^(-ΔΔCt).
fengs['two_dd_Ct'] = 2 ** (-1 * fengs['dd_Ct'])
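The exponentiation is applied elementwise, as with all arithmetic on a series; a quick invented check:

```python
import pandas as pd

dd_ct = pd.Series([-1.0, 0.0, 1.0])

# 2 ** (-x) for each element: negative ΔΔCt means higher relative expression
fold_change = 2 ** (-1 * dd_ct)
```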
fengs.head()
Sample_Name Detector_Name Ct rpl_mean d_Ct a1c_mean \
Well
172 RMB200_A1C FENG3_1 26.44 20.536667 5.903333 6.656667
173 RMB200_A1C FENG3_1 27.38 20.536667 6.843333 6.656667
174 RMB200_A1C FENG3_1 27.76 20.536667 7.223333 6.656667
187 RMB200_AMY FENG3_1 27.83 20.460000 7.370000 6.656667
188 RMB200_AMY FENG3_1 28.94 20.460000 8.480000 6.656667
dd_Ct two_dd_Ct
Well
172 -0.753333 1.685683
173 0.186667 0.878633
174 0.566667 0.675175
187 0.713333 0.609909
188 1.823333 0.282567
Overview of methods for performing calculations
Above we used a few important methods for performing calculations, but I did not completely cover any of these methods, and I left out a few methods entirely. Here is a high-level overview of a few different tools you can use.
- groupby
- Group data by criteria and perform operations on groups. This can include calculations on the data that keep the same shape as the input data set (transformations), those that produce a result for each group (aggregation), and those that remove groups that don't meet certain criteria (filtering).
- map
- Apply a function to each element of a series. The series is
often a column of a data frame. This returns a series of the
same shape. To apply a function to each element of a data
frame,
applymapcan be used. - apply
- Apply a function to each row or column of a data frame. We
did not use this above, but this is important for
calculations that require information for more than one
element of a row or column. This can be combined with
groupby.
Each of these provides a way to efficiently operate on the data. Any time you need to iterate over a data frame or series, you should consider one of these methods.
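The tools above can be seen side by side on a tiny, made-up frame (all names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'],
                   'x': [1.0, 3.0, 10.0]})
g = df.groupby('grp')

# groupby: aggregation (one value per group), transformation (same
# shape as the input), and filtering (drop whole groups)
group_means = g['x'].mean()
centered = df['x'] - g['x'].transform('mean')
big_groups = g.filter(lambda d: d['x'].mean() > 5)

# map: elementwise over a series
doubled = df['x'].map(lambda v: v * 2)

# apply with axis=1: one call per row, seeing all of that row's columns
labels = df.apply(lambda row: row['grp'] + str(row['x']), axis=1)
```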
Task
There is a set of tasks contained in code/array.ipynb that involve analyzing data/array.csv. The best way to get these files is by syncing with the git repo.
To start the notebook, go to that directory in a terminal shell and run ipython notebook.