Pandas: Reading, viewing, and selecting data
Table of Contents
Reading in data
First, import pandas.
import pandas as pd
Pandas has a number of functions for reading in different types of
files. You can see these (in IPython) by typing pd.read<TAB>. Remember
that you can type ? after the name to get more information. The
function we will use is read_csv (pd.read_csv?).
Before doing anything with Pandas, let's look at the file we will be
reading in (below I use the head command to view the first 4 lines of
the file in the terminal, but you could just open up the file in a text
editor).
head -4 ../data/qpcr.csv
Well,Sample_Name,Detector_Name,Reporter,Ct,Threshold 172,RMB200_A1C,FENG3_1,SYBR,26.44,0.39 173,RMB200_A1C,FENG3_1,SYBR,27.38,0.39 174,RMB200_A1C,FENG3_1,SYBR,27.76,0.39
The file is comma-separted, with the first row as the column. To read it
into a data frame, we pass the filename to read_csv.
qdat = pd.read_csv('../data/qpcr.csv')
Now qdat contains all the data.
Viewing data
Pandas provides many convenient ways to view and summarize data. The
info method will give a nice summary of the data frame.
qdat.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 254 entries, 0 to 253 Data columns (total 6 columns): Well 254 non-null int64 Sample_Name 254 non-null object Detector_Name 254 non-null object Reporter 254 non-null object Ct 254 non-null object Threshold 254 non-null float64 dtypes: float64(1), int64(1), object(4) memory usage: 13.9+ KB
This tells us things like the column names, how many rows we have, and the data types present.
You'll notice that Pandas automatically took the first row in the csv
file as the column names. This behavior can be controlled by passing
optional arguments to read_csv.
To view the first few rows of the data frame, use the head method.
qdat.head()
Well Sample_Name Detector_Name Reporter Ct Threshold 0 172 RMB200_A1C FENG3_1 SYBR 26.44 0.39 1 173 RMB200_A1C FENG3_1 SYBR 27.38 0.39 2 174 RMB200_A1C FENG3_1 SYBR 27.76 0.39 3 187 RMB200_AMY FENG3_1 SYBR 27.83 0.39 4 188 RMB200_AMY FENG3_1 SYBR 28.94 0.39
Similary, you can use tail to view the end of the data frame.
qdat.tail()
Well Sample_Name Detector_Name Reporter Ct Threshold
249 125 RMB200_VFC RPL32 SYBR 20.91 0.4
250 126 RMB200_VFC RPL32 SYBR 20.95 0.4
251 349 NTC RPL32 SYBR Undetermined 0.4
252 350 NTC RPL32 SYBR Undetermined 0.4
253 351 NTC RPL32 SYBR 36.60 0.4
Notice the "Undetermined" values. In the next tutorial, we'll cover how to mark these so that Pandas knows that they are missing values.
Selecting data
Selecting columns
There are various ways to select different parts of the data frame.
Let's say we wanted to access the "Sample_Name" column. There are two ways to do this.
qdat['Sample_Name']
0 RMB200_A1C
1 RMB200_A1C
2 RMB200_A1C
3 RMB200_AMY
4 RMB200_AMY
5 RMB200_AMY
6 RMB200_CBL
7 RMB200_CBL
8 RMB200_CBL
9 RMB200_DFC
10 RMB200_DFC
11 RMB200_DFC
12 RMB200_HIP
13 RMB200_HIP
14 RMB200_HIP
15 RMB200_IPC
16 RMB200_IPC
17 RMB200_IPC
18 RMB200_ITC
19 RMB200_ITC
20 RMB200_ITC
21 RMB200_M1C
22 RMB200_M1C
23 RMB200_M1C
24 RMB200_MFC
25 RMB200_MFC
26 RMB200_MFC
27 RMB200_OFC
28 RMB200_OFC
29 RMB200_OFC
...
224 RMB200_M1C
225 RMB200_M1C
226 RMB200_M1C
227 RMB200_MFC
228 RMB200_MFC
229 RMB200_MFC
230 RMB200_OFC
231 RMB200_OFC
232 RMB200_OFC
233 RMB200_S1C
234 RMB200_S1C
235 RMB200_S1C
236 RMB200_STC
237 RMB200_STC
238 RMB200_STC
239 RMB200_STR
240 RMB200_STR
241 RMB200_STR
242 RMB200_THM
243 RMB200_THM
244 RMB200_THM
245 RMB200_V1C
246 RMB200_V1C
247 RMB200_V1C
248 RMB200_VFC
249 RMB200_VFC
250 RMB200_VFC
251 NTC
252 NTC
253 NTC
Name: Sample_Name, dtype: object
Because there are more rows than can be conveniently printed to the screen, Pandas snips out a lot of the values in the middle.
The other way to access a column name is as an attibute to the data frame, assuming that it is a valid attribute name in Python (no spaces, hyphens, or periods).
qdat.Sample_Name
If we want two columns, we use a list of column names.
qdat[['Sample_Name', 'Ct']].head()
Sample_Name Ct 0 RMB200_A1C 26.44 1 RMB200_A1C 27.38 2 RMB200_A1C 27.76 3 RMB200_AMY 27.83 4 RMB200_AMY 28.94
This returns a data frame with two columns (and head is used to view
the first 5 rows).
Selecting both rows and columns
For selections involving both rows and columns, either ix, loc, or
iloc can be used. ix supports both label (name) indexing and
position indexing, while loc supports only labels and iloc supports
only integers. I tend to prefer loc and iloc because they are more
explicit.
Let's first access rows and columns using labels. For the current data
frame, the row labels are same as the row positions, but the row
labels and positions do not need to match. The row labels just have to
be a unique identifier. To make this clear, we'll switch to using the
well number, which is a unique identifier, as the row labels (called
the "index"). (We could have taken care of this when initially reading
in the data with read_csv).
qdat.set_index("Well", inplace=True)
qdat.head()
Sample_Name Detector_Name Reporter Ct Threshold
Well
172 RMB200_A1C FENG3_1 SYBR 26.44 0.39
173 RMB200_A1C FENG3_1 SYBR 27.38 0.39
174 RMB200_A1C FENG3_1 SYBR 27.76 0.39
187 RMB200_AMY FENG3_1 SYBR 27.83 0.39
188 RMB200_AMY FENG3_1 SYBR 28.94 0.39
To select well 172 and 188 with the "Sample_Name" and "Ct" columns, we
can use ix.
qdat.loc[[172, 187], ["Sample_Name", "Ct"]]
Sample_Name Ct
Well
172 RMB200_A1C 26.44
187 RMB200_AMY 27.83
To get back all rows for these two columns, : can be be used.
qdat.loc[:, ["Sample_Name", "Ct"]]
Sample_Name Ct
Well
172 RMB200_A1C 26.44
173 RMB200_A1C 27.38
174 RMB200_A1C 27.76
187 RMB200_AMY 27.83
188 RMB200_AMY 28.94
189 RMB200_AMY 28.91
190 RMB200_CBL 28.04
191 RMB200_CBL 27.95
192 RMB200_CBL 27.89
25 RMB200_DFC 27.90
26 RMB200_DFC 27.90
27 RMB200_DFC 27.29
184 RMB200_HIP 27.40
185 RMB200_HIP 28.04
186 RMB200_HIP 28.50
43 RMB200_IPC 28.08
44 RMB200_IPC 28.18
45 RMB200_IPC 27.08
46 RMB200_ITC 28.39
47 RMB200_ITC 28.33
48 RMB200_ITC 28.20
34 RMB200_M1C 27.97
35 RMB200_M1C 27.28
36 RMB200_M1C 26.93
31 RMB200_MFC 27.94
32 RMB200_MFC 27.90
33 RMB200_MFC 28.31
37 RMB200_OFC 28.08
38 RMB200_OFC 28.29
39 RMB200_OFC 27.68
... ...
130 RMB200_M1C 21.28
131 RMB200_M1C 20.98
132 RMB200_M1C 21.42
127 RMB200_MFC 20.68
128 RMB200_MFC 20.58
129 RMB200_MFC 20.79
133 RMB200_OFC 19.58
134 RMB200_OFC 20.67
135 RMB200_OFC 20.63
136 RMB200_S1C 20.84
137 RMB200_S1C 20.73
138 RMB200_S1C 20.84
265 RMB200_STC 20.11
266 RMB200_STC 20.60
267 RMB200_STC 21.16
274 RMB200_STR 20.76
275 RMB200_STR 20.16
276 RMB200_STR 20.75
277 RMB200_THM 20.89
278 RMB200_THM 21.12
279 RMB200_THM 20.65
271 RMB200_V1C 20.13
272 RMB200_V1C 20.36
273 RMB200_V1C 20.38
124 RMB200_VFC 20.89
125 RMB200_VFC 20.91
126 RMB200_VFC 20.95
349 NTC Undetermined
350 NTC Undetermined
351 NTC 36.60
[254 rows x 2 columns]
If we wanted to select these by positions instead, we should use iloc.
qdat.iloc[[0, 3], [0, 3]] ## [[rows], [cols]]
Sample_Name Ct
Well
172 RMB200_A1C 26.44
187 RMB200_AMY 27.83
If the positions happen to be consecutive, slice notation can be used instead. For example, to get the first 3 rows and the first 2 columns, we could use:
qdat.iloc[:3, :2]
Sample_Name Detector_Name
Well
172 RMB200_A1C FENG3_1
173 RMB200_A1C FENG3_1
174 RMB200_A1C FENG3_1
Selecting rows using a boolean vector
Often you want to select all rows that have a specific value in a column. For example, the "Detector_Name" column has 5 unique values.
qdat.Detector_Name.unique()
array(['FENG3_1', 'FENG3_2', 'FENG3_3', 'FENG3_4', 'RPL32'], dtype=object)
How would we limit the data frame to just the rows that have "RPL32"? First, we create a boolean vector.
rpl_bool = qdat.Detector_Name == 'RPL32'
Now that we have a True/False list for each row, we can use it to
select the rows that have True as their value (corresponding to the
rows that have "RPL32 in the "Detector_Name" column).
rpl = qdat[rpl_bool]
rpl.Detector_Name.unique()
array(['RPL32'], dtype=object)